OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

2.6 Policy Gradients

…get

\[
\nabla_\theta \mathbb{E}_\tau[R(\tau)]
  = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} r_t \sum_{t'=0}^{t} \nabla_\theta \log \pi(a_{t'} \mid s_{t'}, \theta)\right]
  = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \sum_{t'=t}^{T-1} r_{t'}\right]. \tag{1}
\]

The second formula (Equation (1)) results from the first formula by reordering the summation. We will mostly work with the second formula, as it is more convenient for numerical implementation.

We can further reduce the variance of the policy gradient estimator by using a baseline: that is, we subtract a function $b(s_t)$ from the empirical returns, giving us the following formula for the policy gradient:

\[
\nabla_\theta \mathbb{E}_\tau[R(\tau)]
  = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]. \tag{2}
\]

This equality holds for arbitrary baseline functions $b$. To derive it, we'll show that the added terms $b(s_t)$ have no effect on the expectation, i.e., that $\mathbb{E}_\tau[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)] = 0$. To show this, split up the expectation over whole trajectories $\mathbb{E}_\tau[\dots]$ into an expectation over all variables before $a_t$, and all variables after and including it:

\[
\begin{aligned}
\mathbb{E}_\tau[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)]
  &= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[\mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}\!\left[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)\right]\right] && \text{(break up expectation)} \\
  &= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t)\, \mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}\!\left[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\right]\right] && \text{(pull baseline term out)} \\
  &= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t)\, \mathbb{E}_{a_t}\!\left[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\right]\right] && \text{(remove irrelevant vars.)} \\
  &= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t) \cdot 0\right].
\end{aligned}
\]

The last equation follows because $\mathbb{E}_{a_t}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)] = \nabla_\theta \mathbb{E}_{a_t}[1] = 0$ by the definition of the score function gradient estimator.

A near-optimal choice of baseline is the state-value function,

\[
V^\pi(s) = \mathbb{E}\!\left[r_t + r_{t+1} + \dots + r_{T-1} \mid s_t = s,\ a_{t:(T-1)} \sim \pi\right].
\]

See [GBB04] for a discussion of the choice of baseline that optimally reduces the variance of the policy gradient estimator. So in practice, we will generally choose the baseline to approximate the value function, $b(s) \approx V^\pi(s)$.
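To make Equation (2) concrete, here is a minimal NumPy sketch (not from the thesis) of the single-trajectory estimator for a softmax policy $\pi(a \mid s, \theta) = \mathrm{softmax}(\theta^\top s)$ over a discrete action set, using reward-to-go returns and a baseline $b(s_t)$. The policy class, trajectory data, and the zero baseline at the end are illustrative placeholders; a real implementation would average over many trajectories and fit the baseline to approximate $V^\pi$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax_policy(theta, s, a):
    """Gradient of log pi(a | s, theta) for pi(. | s) = softmax(theta.T @ s).

    theta has shape (state_dim, n_actions); the gradient has the same shape.
    """
    probs = softmax(theta.T @ s)                 # pi(. | s, theta), shape (n_actions,)
    one_hot = np.zeros_like(probs)
    one_hot[a] = 1.0
    # d/d theta of log pi(a | s, theta) = s (1{a} - pi)^T
    return np.outer(s, one_hot - probs)

def policy_gradient_estimate(theta, states, actions, rewards, baseline):
    """Single-trajectory estimate of Equation (2):
    sum_t grad_theta log pi(a_t | s_t, theta) * (sum_{t' >= t} r_{t'} - b(s_t)).
    """
    T = len(rewards)
    reward_to_go = np.cumsum(rewards[::-1])[::-1]   # R_t = r_t + r_{t+1} + ... + r_{T-1}
    grad = np.zeros_like(theta)
    for t in range(T):
        advantage = reward_to_go[t] - baseline(states[t])
        grad += grad_log_softmax_policy(theta, states[t], actions[t]) * advantage
    return grad

# Toy usage with random data (hypothetical dimensions, zero baseline).
rng = np.random.default_rng(0)
state_dim, n_actions, T = 4, 3, 10
theta = rng.normal(size=(state_dim, n_actions))
states = [rng.normal(size=state_dim) for _ in range(T)]
actions = [rng.integers(n_actions) for _ in range(T)]
rewards = rng.normal(size=T)
g = policy_gradient_estimate(theta, states, actions, rewards, baseline=lambda s: 0.0)
print(g.shape)  # (4, 3), same shape as theta
```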

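The key step in the derivation, $\mathbb{E}_{a_t}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)] = 0$, can also be checked exactly for a finite action set by summing $\pi(a \mid s, \theta)\, \nabla_\theta \log \pi(a \mid s, \theta)$ over all actions. A small self-contained NumPy check for a softmax policy (the policy class and dimensions are illustrative, not from the thesis):

```python
import numpy as np

# Exact check that the score function has zero mean under the policy:
# for a finite action set, E_a[ grad_theta log pi(a | s, theta) ]
#   = sum_a pi(a | s, theta) * grad_theta log pi(a | s, theta) = 0.
rng = np.random.default_rng(1)
state_dim, n_actions = 4, 3
theta = rng.normal(size=(state_dim, n_actions))
s = rng.normal(size=state_dim)

logits = theta.T @ s
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # pi(. | s, theta), softmax policy

expected_score = np.zeros_like(theta)
for a in range(n_actions):
    one_hot = np.zeros(n_actions)
    one_hot[a] = 1.0
    grad_log_pi = np.outer(s, one_hot - probs)   # grad_theta log pi(a | s, theta)
    expected_score += probs[a] * grad_log_pi

print(np.allclose(expected_score, 0.0))      # True (up to floating point)
```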