Reinforcement Learning Exercise 3.12

    xiaoxiao, 2025-04-11

    Exercise 3.12 Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

    Start from the definition of $v_\pi$, expand the conditional probability, and marginalize over the action $a$ taken in state $s$:

    $$
    \begin{aligned}
    v_\pi(s) &= \mathbb{E}_\pi\bigl[G_t \mid S_t = s\bigr] \\
    &= \sum_{g_t} \bigl[ g_t \cdot p(g_t \mid s) \bigr] \\
    &= \sum_{g_t} \Bigl[ g_t \cdot \frac{p(g_t, s)}{p(s)} \Bigr] \\
    &= \sum_{g_t} \Bigl[ g_t \cdot \frac{\sum_{a \in \mathcal{A}} p(g_t, s, a)}{p(s)} \Bigr] \\
    &= \sum_{g_t} \Bigl\{ g_t \cdot \frac{\sum_{a \in \mathcal{A}} \bigl[ p(g_t \mid s, a) \cdot p(s, a) \bigr]}{p(s)} \Bigr\} \\
    &= \sum_{g_t} \Bigl\{ g_t \cdot \frac{\sum_{a \in \mathcal{A}} \bigl[ p(g_t \mid s, a) \cdot p(a \mid s) \cdot p(s) \bigr]}{p(s)} \Bigr\} \\
    &= \sum_{g_t} \Bigl\{ g_t \cdot \sum_{a \in \mathcal{A}} \bigl[ p(g_t \mid s, a) \cdot p(a \mid s) \bigr] \Bigr\} \\
    &= \sum_{a \in \mathcal{A}} \Bigl\{ p(a \mid s) \sum_{g_t} \bigl[ g_t \cdot p(g_t \mid s, a) \bigr] \Bigr\}
    \end{aligned}
    $$

    By definition, $p(a \mid s) = \pi(a \mid s)$ and $\sum_{g_t} \bigl[ g_t \cdot p(g_t \mid s, a) \bigr] = q_\pi(s, a)$, so:

    $$
    v_\pi(s) = \sum_{a \in \mathcal{A}} \bigl[ \pi(a \mid s) \cdot q_\pi(s, a) \bigr]
    $$
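
    The identity $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$ can also be sanity-checked numerically. The following is only a minimal sketch: the two-state, two-action MDP, its transition table `P`, the policy `PI`, and the discount `GAMMA` are all made-up values for illustration and do not come from the exercise. It estimates $v_\pi$ and $q_\pi$ separately by iterating their own Bellman expectation equations, then compares $v_\pi$ with the policy-weighted sum of $q_\pi$.

```python
# Hypothetical 2-state, 2-action MDP; every number here is an assumption
# chosen for illustration, not taken from the exercise.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# P[s][a] = list of (probability, next_state, reward)
P = {
    0: {0: [(0.7, 0, 1.0), (0.3, 1, 0.0)],
        1: [(1.0, 1, 2.0)]},
    1: {0: [(0.4, 0, -1.0), (0.6, 1, 0.5)],
        1: [(1.0, 0, 0.0)]},
}

# PI[s][a] = pi(a | s); each row sums to 1
PI = {0: [0.25, 0.75], 1: [0.6, 0.4]}


def evaluate_v(sweeps=2000):
    """Iterate the Bellman expectation equation for v_pi."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {
            s: sum(
                PI[s][a] * sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])
                for a in ACTIONS
            )
            for s in STATES
        }
    return v


def evaluate_q(sweeps=2000):
    """Iterate the Bellman expectation equation for q_pi, without using v."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        q = {
            (s, a): sum(
                p * (r + GAMMA * sum(PI[s2][a2] * q[(s2, a2)] for a2 in ACTIONS))
                for p, s2, r in P[s][a]
            )
            for s in STATES
            for a in ACTIONS
        }
    return q


def v_from_q(q):
    """Apply the result of the exercise: v(s) = sum_a pi(a|s) * q(s, a)."""
    return {s: sum(PI[s][a] * q[(s, a)] for a in ACTIONS) for s in STATES}


v_direct = evaluate_v()
v_via_q = v_from_q(evaluate_q())
max_gap = max(abs(v_direct[s] - v_via_q[s]) for s in STATES)
print("largest |v_pi(s) - sum_a pi(a|s) q_pi(s,a)|:", max_gap)
```

    If the identity holds, the printed gap should be at the level of floating-point rounding. Keeping `evaluate_q` independent of `evaluate_v` is a deliberate choice here, so that the comparison is not trivially true by construction.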
