Exercise 3.11 If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?
By the definition of conditional probability, and because the policy gives $Pr(A_t = a \mid S_t = s) = \pi(a|s)$:

$$
\begin{aligned}
Pr(S_t = s, A_t = a) &= Pr(A_t = a \mid S_t = s) \cdot Pr(S_t = s) \\
&= \pi(a|s) \cdot Pr(S_t = s) \qquad (1)
\end{aligned}
$$

Expand the expectation by marginalizing over the next state $s'$ and the action $a$ (assuming $Pr(S_t = s) > 0$):

$$
\begin{aligned}
\mathbb E(R_{t+1}|S_t = s) &= \sum_{r \in \mathcal R} \bigl[ r \cdot Pr(R_{t+1} = r|S_t = s) \bigr] \\
&= \sum_{r \in \mathcal R} \Bigl[ r \cdot \sum_{s' \in \mathcal S} Pr(R_{t+1} = r, S_{t+1} = s' | S_t = s) \Bigr] \\
&= \sum_{r \in \mathcal R} \Bigl[ r \cdot \sum_{s' \in \mathcal S} \frac{Pr(R_{t+1} = r, S_{t+1} = s', S_t = s)}{Pr(S_t = s)} \Bigr] \\
&= \sum_{r \in \mathcal R} \Bigl[ r \cdot \sum_{s' \in \mathcal S} \frac{\sum_{a \in \mathcal A} Pr(R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a)}{Pr(S_t = s)} \Bigr] \\
&= \sum_{r \in \mathcal R} \Bigl[ r \cdot \sum_{s' \in \mathcal S} \frac{\sum_{a \in \mathcal A} Pr(R_{t+1} = r, S_{t+1} = s' | S_t = s, A_t = a) \cdot Pr(S_t = s, A_t = a)}{Pr(S_t = s)} \Bigr] \qquad (2)
\end{aligned}
$$

Substituting equation (1) into (2) gives:

$$
\begin{aligned}
\mathbb E(R_{t+1}|S_t = s) &= \sum_{r \in \mathcal R} \Bigl[ r \cdot \sum_{s' \in \mathcal S} \frac{\sum_{a \in \mathcal A} Pr(R_{t+1} = r, S_{t+1} = s' | S_t = s, A_t = a) \cdot \pi(a|s) \cdot Pr(S_t = s)}{Pr(S_t = s)} \Bigr] \\
&= \sum_{r \in \mathcal R} \Bigl\{ r \cdot \sum_{s' \in \mathcal S} \sum_{a \in \mathcal A} \bigl[ Pr(R_{t+1} = r, S_{t+1} = s' | S_t = s, A_t = a) \cdot \pi(a|s) \bigr] \Bigr\} \\
&= \sum_{r \in \mathcal R} \Bigl\{ r \cdot \sum_{s' \in \mathcal S} \sum_{a \in \mathcal A} \bigl[ p(r, s' | s, a) \cdot \pi(a|s) \bigr] \Bigr\} \qquad (3)
\end{aligned}
$$

Equation (3) is the result.
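As a quick numerical sanity check, the final expression (the sum of $r \cdot p(r, s'|s,a) \cdot \pi(a|s)$ over all $r$, $s'$ and $a$) can be evaluated directly on a tiny MDP. The sketch below is only illustrative: the state names, action names, probabilities, and the helper `expected_reward` are all made up for this example and do not come from the exercise.

```python
from typing import Dict, Tuple

# p[(s, a)] maps (r, s_next) -> probability, i.e. p(r, s' | s, a).
Dynamics = Dict[Tuple[str, str], Dict[Tuple[float, str], float]]
# pi[s] maps a -> probability, i.e. pi(a | s).
Policy = Dict[str, Dict[str, float]]


def expected_reward(s: str, pi: Policy, p: Dynamics) -> float:
    """Sum r * p(r, s' | s, a) * pi(a | s) over all a, s', r."""
    total = 0.0
    for a, pi_a in pi[s].items():
        for (r, _s_next), prob in p[(s, a)].items():
            total += r * prob * pi_a
    return total


# Hypothetical dynamics and policy for a single state "s0" (assumed values).
p: Dynamics = {
    ("s0", "left"): {(0.0, "s0"): 0.5, (1.0, "s1"): 0.5},
    ("s0", "right"): {(2.0, "s1"): 1.0},
}
pi: Policy = {"s0": {"left": 0.25, "right": 0.75}}

result = expected_reward("s0", pi, p)
print(result)  # 0.25 * (0.5 * 0 + 0.5 * 1) + 0.75 * 2 = 1.625
```

Because the sums are finite, the order of summation does not matter, so the loop iterates over actions first and then over (reward, next state) pairs; this is the same quantity as the triple sum written with $r$ outermost.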