Exercise 3.13 Give an equation for q π q_\pi qπ in terms of v π v_\pi vπ and the four-argument p p p.
First, we need to derive a formula from multiplication formula of probability theory: p ( x ∣ y ) = p ( x , y ) p ( y ) = ∑ z p ( x , y , z ) p ( y ) = ∑ z [ p ( x ∣ y , z ) ⋅ p ( z ∣ y ) ⋅ p ( y ) ] p ( y ) = ∑ z [ p ( x ∣ y , z ) ⋅ p ( z ∣ y ) ] ( 1 ) \begin{aligned} p(x|y) &= \frac {p(x,y)}{p(y)} \\ &= \frac {\sum_z p(x,y,z)}{p(y)} \\ &= \frac {\sum_z \bigl [ p(x |y,z) \cdot p(z|y) \cdot p(y) \bigr ] } { p(y) } \\ &= \sum_z \bigl [ p(x|y,z) \cdot p(z|y) \bigr ] \qquad \qquad{(1)}\\ \end{aligned} p(x∣y)=p(y)p(x,y)=p(y)∑zp(x,y,z)=p(y)∑z[p(x∣y,z)⋅p(z∣y)⋅p(y)]=z∑[p(x∣y,z)⋅p(z∣y)](1) With formula (1), we can calculate q π ( s , a ) q_\pi(s,a) qπ(s,a) as below: q π ( s , a ) = E π ( G t ∣ S t = s , A t = a ) = E π ( R t + 1 + γ G t + 1 ∣ S t = s , A t = a ) = E π ( R t + 1 ∣ S t = s , A t = a ) + γ E π ( G t + 1 ∣ S t = s , A t = a ) = ∑ r r ⋅ P r ( R t + 1 = r ∣ S t = s , A t = a ) + γ ∑ g t + 1 g t + 1 ⋅ P r ( G t + 1 = g t + 1 ∣ S t = s , A t = a ) \begin{aligned} q_\pi(s,a)&= \mathbb E_\pi(G_t|S_t=s,A_t=a) \\ &=\mathbb E_\pi(R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a) \\ &= \mathbb E_\pi(R_{t+1} | S_t = s, A_t = a) + \gamma \mathbb E_\pi(G_{t+1}|S_t = s, A_t = a) \\ &= \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a) \\ &\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=g_{t+1}|S_t=s, A_t=a) \\ \end{aligned} qπ(s,a)=Eπ(Gt∣St=s,At=a)=Eπ(Rt+1+γGt+1∣St=s,At=a)=Eπ(Rt+1∣St=s,At=a)+γEπ(Gt+1∣St=s,At=a)=r∑r⋅Pr(Rt+1=r∣St=s,At=a)+γgt+1∑gt+1⋅Pr(Gt+1=gt+1∣St=s,At=a) Here, according to definition, g t + 1 = ∑ k = 0 ∞ γ k ⋅ r t + 2 + k g_{t+1} = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} gt+1=∑k=0∞γk⋅rt+2+k. And use formula (1), we can derive: q π ( s , a ) = ∑ r r ⋅ P r ( R t + 1 = r ∣ S t = s , A t = a ) + γ ∑ g t + 1 g t + 1 ⋅ P r ( G t + 1 = ∑ k = 0 ∞ γ k ⋅ r t + 2 + k ∣ S t = s , A t = a ) = ∑ r r ⋅ ∑ s ′ P r ( R t + 1 = r ∣ S t = s , A t = a , S t + 1 = s ′ ) ⋅ P r ( S t + 1 = s ′ ∣ S t = s , A t = a ) + γ ∑ g t + 1 g t + 1 ⋅ ∑ s ′ P r ( G t + 1 = ∑ k = 0 ∞ γ k ⋅ r t + 2 + k ∣ S t = s , A t = a , S t + 1 = s ′ ) ⋅ P r ( S t + 1 = s ′ ∣ S t = s , a t = a ) \begin{aligned} q_\pi(s,a) &= \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a) \\ &\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}|S_t=s, A_t=a) \\ &= \sum_r r \cdot \sum_{s'} Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \cdot Pr(S_{t+1} = s' | S_t=s, A_t=a) \\ &\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}|S_t=s, A_t=a, S_{t+1} = s') \cdot Pr(S_{t+1}=s'| S_t=s, a_t = a) \\ \end{aligned} qπ(s,a)=r∑r⋅Pr(Rt+1=r∣St=s,At=a)+γgt+1∑gt+1⋅Pr(Gt+1=k=0∑∞γk⋅rt+2+k∣St=s,At=a)=r∑r⋅s′∑Pr(Rt+1=r∣St=s,At=a,St+1=s′)⋅Pr(St+1=s′∣St=s,At=a)+γgt+1∑gt+1⋅s′∑Pr(Gt+1=k=0∑∞γk⋅rt+2+k∣St=s,At=a,St+1=s′)⋅Pr(St+1=s′∣St=s,at=a) Because in Markov Process, G t + 1 G_{t+1} Gt+1 is the reward of status S t + 1 = s ′ S_{t+1} = s' St+1=s′, the information of S t = s S_t = s St=s and A t = a A_t = a At=a are no effect on G t + 1 G_{t+1} Gt+1. So: q π ( s , a ) = ∑ r r ⋅ ∑ s ′ P r ( R t + 1 = r ∣ S t = s , A t = a , S t + 1 = s ′ ) ⋅ P r ( S t + 1 = s ′ ∣ S t = s , A t = a ) + γ ∑ g t + 1 g t + 1 ⋅ ∑ s ′ P r ( G t + 1 = ∑ k = 0 ∞ γ k ⋅ r t + 2 + k ∣ S t + 1 = s ′ ) ⋅ P r ( S t + 1 = s ′ ∣ S t = s , a t = a ) = ∑ s ′ { P r ( S t + 1 = s ′ ∣ S t = s , A t = a ) ⋅ [ ∑ r r ⋅ P r ( R t + 1 = r ∣ S t = s , A t = a , S t + 1 = s ′ ) + γ ∑ g t + 1 g t + 1 ⋅ P r ( G t + 1 = ∑ k = 0 ∞ γ k ⋅ r t + 2 + k ∣ S t + 1 = s ′ ) ] } = ∑ s ′ { p ( s ′ ∣ s , a ) ⋅ [ E π ( r ∣ s , a , s ′ ) + γ ⋅ E π ( G t + 1 ∣ S t + 1 = s ′ ) ] } = ∑ s ′ { p ( s ′ ∣ s , a ) ⋅ [ E π ( r ∣ s , a , s ′ ) + γ ⋅ v π ( s ′ ) ] } \begin{aligned} q_\pi(s,a) &= \sum_r r \cdot \sum_{s'} Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \cdot Pr(S_{t+1} = s' | S_t=s, A_t=a) \\ &\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} | S_{t+1} = s') \cdot Pr(S_{t+1}=s'| S_t=s, a_t = a) \\ &= \sum_{s'} \biggl \{ Pr(S_{t+1} = s' | S_t=s, A_t=a) \cdot \Bigl [ \sum_r r \cdot Pr(R_{t+1} = r | S_t = s, A_t = a, S_{t+1} = s') \\ &\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot Pr(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} | S_{t+1} = s') \Bigr ] \biggr \}\\ &=\sum_{s'} \biggl \{ p(s'| s,a) \cdot \Bigl [ \mathbb E_\pi(r|s,a,s') + \gamma \cdot \mathbb E_\pi(G_{t+1}|S_{t+1}=s') \Bigr ] \biggr \} \\ &=\sum_{s'} \biggl \{ p(s'| s,a) \cdot \Bigl [ \mathbb E_\pi(r|s,a,s') + \gamma \cdot v_\pi(s') \Bigr ] \biggr \} \\ \end{aligned} qπ(s,a)=r∑r⋅s′∑Pr(Rt+1=r∣St=s,At=a,St+1=s′)⋅Pr(St+1=s′∣St=s,At=a)+γgt+1∑gt+1⋅s′∑Pr(Gt+1=k=0∑∞γk⋅rt+2+k∣St+1=s′)⋅Pr(St+1=s′∣St=s,at=a)=s′∑{Pr(St+1=s′∣St=s,At=a)⋅[r∑r⋅Pr(Rt+1=r∣St=s,At=a,St+1=s′)+γgt+1∑gt+1⋅Pr(Gt+1=k=0∑∞γk⋅rt+2+k∣St+1=s′)]}=s′∑{p(s′∣s,a)⋅[Eπ(r∣s,a,s′)+γ⋅Eπ(Gt+1∣St+1=s′)]}=s′∑{p(s′∣s,a)⋅[Eπ(r∣s,a,s′)+γ⋅vπ(s′)]} Denote p ( s ′ ∣ a , s ) = P s , s ′ a p(s' | a , s ) = P_{s,s'}^a p(s′∣a,s)=Ps,s′a and E π ( r ∣ s , a , s ′ ) = R s , s ′ a \mathbb E_\pi ( r | s, a, s' ) = R_{s,s'}^a Eπ(r∣s,a,s′)=Rs,s′a, then q π ( s , a ) = ∑ s ′ { P s , s ′ a ⋅ [ R s , s ′ a + γ ⋅ v π ( s ′ ) ] } ( 2 ) q_\pi(s,a) = \sum_{s'} \biggl \{ P_{s,s'}^a \cdot \Bigl [ R_{s,s'}^a + \gamma \cdot v_\pi(s') \Bigr ] \biggr \} \qquad{(2)} qπ(s,a)=s′∑{Ps,s′a⋅[Rs,s′a+γ⋅vπ(s′)]}(2) Here, the equation (2) is the result.