Reinforcement Learning Exercise 3.11

    xiaoxiao, 2023-11-25

    Exercise 3.11 If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?

    By the definition of conditional probability, and because actions are selected according to $\pi$:

    $$
    \begin{aligned}
    Pr(S_t = s, A_t = a) &= Pr(A_t = a \mid S_t = s) \cdot Pr(S_t = s) \\
    &= \pi(a \mid s) \cdot Pr(S_t = s) \qquad (1)
    \end{aligned}
    $$

    Expanding the expectation, marginalizing over the next state $s'$ and the action $a$:

    $$
    \begin{aligned}
    \mathbb E(R_{t+1} \mid S_t = s) &= \sum_{r \in \mathcal R} \bigl[ r \cdot Pr(R_{t+1} = r \mid S_t = s) \bigr] \\
    &= \sum_{r \in \mathcal R} \Bigl[ r \cdot \sum_{s' \in \mathcal S} Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s) \Bigr] \\
    &= \sum_{r \in \mathcal R} \Bigl[ r \cdot \sum_{s' \in \mathcal S} \frac{Pr(R_{t+1} = r, S_{t+1} = s', S_t = s)}{Pr(S_t = s)} \Bigr] \\
    &= \sum_{r \in \mathcal R} \Bigl[ r \cdot \sum_{s' \in \mathcal S} \frac{\sum_{a \in \mathcal A} Pr(R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a)}{Pr(S_t = s)} \Bigr] \\
    &= \sum_{r \in \mathcal R} \Bigl[ r \cdot \sum_{s' \in \mathcal S} \frac{\sum_{a \in \mathcal A} Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \cdot Pr(S_t = s, A_t = a)}{Pr(S_t = s)} \Bigr] \qquad (2)
    \end{aligned}
    $$

    Substituting equation (1) into (2) gives:

    $$
    \begin{aligned}
    \mathbb E(R_{t+1} \mid S_t = s) &= \sum_{r \in \mathcal R} \Bigl[ r \cdot \sum_{s' \in \mathcal S} \frac{\sum_{a \in \mathcal A} Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \cdot \pi(a \mid s) \cdot Pr(S_t = s)}{Pr(S_t = s)} \Bigr] \\
    &= \sum_{r \in \mathcal R} \Bigl\{ r \cdot \sum_{s' \in \mathcal S} \sum_{a \in \mathcal A} \bigl[ Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a) \cdot \pi(a \mid s) \bigr] \Bigr\} \\
    &= \sum_{r \in \mathcal R} \Bigl\{ r \cdot \sum_{s' \in \mathcal S} \sum_{a \in \mathcal A} \bigl[ p(r, s' \mid s, a) \cdot \pi(a \mid s) \bigr] \Bigr\} \qquad (3)
    \end{aligned}
    $$

    Equation (3) is the result: the expectation of $R_{t+1}$ expressed in terms of $\pi$ and the four-argument function $p$.
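As a quick numerical check of equation (3), here is a minimal Python sketch. The tiny MDP, the policy values, and the names `p`, `pi`, and `expected_reward` are illustrative assumptions and not part of the exercise; the function simply evaluates the triple sum over $r$, $s'$, and $a$.

```python
# Hypothetical dynamics, invented for illustration only.
# p[(s, a)] maps (r, s_next) -> probability, i.e. p(r, s' | s, a).
p = {
    ("s0", "left"): {(0.0, "s0"): 0.5, (1.0, "s1"): 0.5},
    ("s0", "right"): {(2.0, "s1"): 1.0},
}

# pi[s][a] = pi(a | s): a made-up stochastic policy for state "s0".
pi = {"s0": {"left": 0.25, "right": 0.75}}


def expected_reward(s, pi, p):
    """Evaluate equation (3): sum over r, s', a of r * p(r, s' | s, a) * pi(a | s)."""
    total = 0.0
    for a, prob_a in pi[s].items():
        # Each key of p[(s, a)] is one (r, s') pair, so this loop covers
        # both the sum over r and the sum over s'.
        for (r, _s_next), prob in p[(s, a)].items():
            total += r * prob * prob_a
    return total


result = expected_reward("s0", pi, p)
print(result)  # 0.25 * (0.5 * 1.0) + 0.75 * 2.0 = 1.625
```

Under these assumed numbers, the action "left" contributes 0.125 and "right" contributes 1.5, so the expected reward is the policy-weighted average of the per-action expected rewards.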
