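get

$$
\begin{aligned}
\nabla_\theta \mathbb{E}_\tau[R(\tau)]
&= \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} r_t \sum_{t'=0}^{t} \nabla_\theta \log \pi(a_{t'} \mid s_{t'}, \theta)\right] \\
&= \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \sum_{t'=t}^{T-1} r_{t'}\right]. \qquad (1)
\end{aligned}
$$

The second formula (Equation (1)) results from the first formula by reordering the summation: in the first form each reward $r_t$ multiplies the score terms of all actions up to time $t$, whereas in the second form each score term multiplies the sum of all rewards from time $t$ onward. We will mostly work with the second formula, as it is more convenient for numerical implementation.

We can further reduce the variance of the policy gradient estimator by using a baseline: that is, we subtract a function $b(s_t)$ from the empirical returns, giving us the following formula for the policy gradient:

$$
\nabla_\theta \mathbb{E}_\tau[R(\tau)]
= \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]. \qquad (2)
$$

This equality holds for arbitrary baseline functions $b$. To derive it, we'll show that the added terms $b(s_t)$ have no effect on the expectation, i.e., that $\mathbb{E}_\tau[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)] = 0$. To show this, split up the expectation over whole trajectories $\mathbb{E}_\tau[\dots]$ into an expectation over all variables before $a_t$, and all variables after and including it.

$$
\begin{aligned}
\mathbb{E}_\tau[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)]
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\left[\mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}\left[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)\right]\right] && \text{(break up expectation)} \\
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\left[b(s_t)\, \mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}\left[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\right]\right] && \text{(pull baseline term out)} \\
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\left[b(s_t)\, \mathbb{E}_{a_t}\left[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\right]\right] && \text{(remove irrelevant vars.)} \\
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\left[b(s_t) \cdot 0\right]
\end{aligned}
$$

The last equation follows because $\mathbb{E}_{a_t}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)] = \nabla_\theta \mathbb{E}_{a_t}[1] = 0$ by the definition of the score function gradient estimator.

A near-optimal choice of baseline is the state-value function,

$$
V^\pi(s) = \mathbb{E}\left[r_t + r_{t+1} + \dots + r_{T-1} \mid s_t = s,\ a_{t:(T-1)} \sim \pi\right].
$$

See [GBB04] for a discussion of the choice of baseline that optimally reduces variance of the policy gradient estimator. So in practice, we will generally choose the baseline to approximate the value function, $b(s) \approx V^\pi(s)$.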
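The estimator in Equation (2) can be illustrated with a small numerical sketch. The code below is not taken from the thesis; it assumes a tabular softmax policy with parameters `theta` of shape `(n_states, n_actions)`, and every name in it (`rewards_to_go`, `score`, `policy_gradient_estimate`, the toy trajectory, the `values` table) is hypothetical and chosen only for illustration. It computes the single-trajectory estimate from the second form of Equation (1), optionally subtracting a baseline, and checks numerically the two facts used in the text: that both orderings of the double sum agree, and that the expected score under the policy is zero.

```python
import numpy as np


def rewards_to_go(rewards):
    """Suffix sums: entry t is r_t + r_{t+1} + ... + r_{T-1}."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards[::-1])[::-1]


def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def score(theta, state, action):
    """Gradient of log pi(action | state, theta) for a tabular softmax policy.

    Assumption: pi(. | s, theta) = softmax(theta[s]). Only row `state`
    of the gradient is nonzero, and it equals one_hot(action) - pi(. | s).
    """
    grad = np.zeros_like(theta)
    grad[state] = -softmax(theta[state])
    grad[state, action] += 1.0
    return grad


def policy_gradient_estimate(theta, states, actions, rewards, baseline=None):
    """Single-trajectory estimate: sum_t score_t * (reward_to_go_t - b(s_t))."""
    advantages = rewards_to_go(rewards)
    if baseline is not None:
        advantages = advantages - np.array([baseline(s) for s in states])
    scores = np.stack([score(theta, s, a) for s, a in zip(states, actions)])
    # Contract the time axis: result has the same shape as theta.
    return np.tensordot(advantages, scores, axes=1)


# Toy data (all hypothetical): one short trajectory in a 3-state, 2-action problem.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
theta = rng.normal(size=(n_states, n_actions))
states = [0, 2, 1, 2]
actions = [1, 0, 0, 1]
rewards = [1.0, 0.0, -0.5, 2.0]

# First form of Equation (1): each reward r_t times the scores up to time t.
scores = np.stack([score(theta, s, a) for s, a in zip(states, actions)])
first_form = sum(r * scores[: t + 1].sum(axis=0) for t, r in enumerate(rewards))
# Second form: each score times the rewards from time t onward.
second_form = policy_gradient_estimate(theta, states, actions, rewards)

# Exact expectation of the score over a ~ pi(. | s), one row per state.
expected_score = np.stack([
    sum(softmax(theta[s])[a] * score(theta, s, a) for a in range(n_actions))
    for s in range(n_states)
])

# A made-up value table standing in for an approximation of V^pi.
values = np.array([0.3, -0.1, 0.8])
grad_with_baseline = policy_gradient_estimate(
    theta, states, actions, rewards, baseline=lambda s: values[s]
)
```

On this toy trajectory `first_form` and `second_form` agree up to floating-point error, and `expected_score` is zero for every state, which is the property that lets the baseline term drop out in expectation. The baseline does change the value of any single-trajectory estimate such as `grad_with_baseline`; only the expectation over trajectories is unaffected.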
OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS
Original file name: thesis-optimizing-deep-learning.pdf