OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS


4.3 Advantage Function Estimation

…since the term $\gamma^k V(s_{t+k})$ becomes more heavily discounted, and the term $-V(s_t)$ does not affect the bias. Taking $k \to \infty$, we get

\[
\hat{A}_t^{(\infty)} = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^{V} = -V(s_t) + \sum_{l=0}^{\infty} \gamma^l r_{t+l},
\]

which is simply the empirical returns minus the value function baseline.

The generalized advantage estimator GAE($\gamma$, $\lambda$) is defined as the exponentially-weighted average of these $k$-step estimators:

\begin{align*}
\hat{A}_t^{\text{GAE}(\gamma,\lambda)} &:= (1-\lambda)\bigl(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \dots\bigr) \\
&= (1-\lambda)\bigl(\delta_t^{V} + \lambda(\delta_t^{V} + \gamma\delta_{t+1}^{V}) + \lambda^2(\delta_t^{V} + \gamma\delta_{t+1}^{V} + \gamma^2\delta_{t+2}^{V}) + \dots\bigr) \\
&= (1-\lambda)\bigl(\delta_t^{V}(1+\lambda+\lambda^2+\dots) + \gamma\delta_{t+1}^{V}(\lambda+\lambda^2+\lambda^3+\dots) \\
&\qquad\qquad + \gamma^2\delta_{t+2}^{V}(\lambda^2+\lambda^3+\lambda^4+\dots) + \dots\bigr) \\
&= (1-\lambda)\left(\delta_t^{V}\frac{1}{1-\lambda} + \gamma\delta_{t+1}^{V}\frac{\lambda}{1-\lambda} + \gamma^2\delta_{t+2}^{V}\frac{\lambda^2}{1-\lambda} + \dots\right) \\
&= \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^{V} \tag{26}
\end{align*}

From Equation (26), we see that the advantage estimator has a remarkably simple formula involving a discounted sum of Bellman residual terms. Section 4.4 discusses an interpretation of this formula as the returns in an MDP with a modified reward function. The construction we used above is closely analogous to the one used to define TD($\lambda$) [SB98]; however, TD($\lambda$) is an estimator of the value function, whereas here we are estimating the advantage function.

There are two notable special cases of this formula, obtained by setting $\lambda = 0$ and $\lambda = 1$:

\begin{align*}
\text{GAE}(\gamma, 0):&\quad \hat{A}_t := \delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t) \tag{27} \\
\text{GAE}(\gamma, 1):&\quad \hat{A}_t := \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^{V} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t) \tag{28}
\end{align*}

GAE($\gamma$, 1) is $\gamma$-just regardless of the accuracy of $V$, but it has high variance due to the sum of terms. GAE($\gamma$, 0) is $\gamma$-just for $V = V^{\pi,\gamma}$ and otherwise induces bias, but it typically has much lower variance. The generalized advantage estimator for $0 < \lambda < 1$ makes a compromise between bias and variance, controlled by the parameter $\lambda$.
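Equation (26) also admits a simple linear-time implementation: because the discounted sum satisfies the backward recursion $\hat{A}_t = \delta_t^{V} + \gamma\lambda\,\hat{A}_{t+1}$, all advantages along a trajectory can be computed in a single reverse pass. The sketch below illustrates this; the function name gae_advantages, the NumPy interface, and the truncation of the infinite sum at the end of a finite trajectory (bootstrapping with a final value estimate $V(s_T)$) are our assumptions, not part of the thesis.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Sketch of GAE(gamma, lambda) per Equation (26).

    Uses the backward recursion A_t = delta_t + gamma*lam*A_{t+1},
    an equivalent rollup of the discounted sum of Bellman residuals.

    rewards: r_t for t = 0..T-1 (one finite trajectory)
    values:  V(s_t) for t = 0..T, including a bootstrap value V(s_T)
             for the state after the last reward (an assumption for
             truncation; the derivation above is over an infinite horizon)
    """
    T = len(rewards)
    advantages = np.empty(T)
    running = 0.0
    for t in reversed(range(T)):
        # Bellman residual: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sanity check of the special cases on a toy trajectory:
rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.4, 0.3, 0.0])  # last entry is the bootstrap V(s_T)

a0 = gae_advantages(rewards, values, gamma=0.9, lam=0.0)
# a0[t] = r_t + 0.9*V(s_{t+1}) - V(s_t), the one-step residual of Eq. (27)

a1 = gae_advantages(rewards, values, gamma=0.9, lam=1.0)
# a1[0] = r_0 + 0.9*r_1 + 0.81*r_2 + 0.729*V(s_3) - V(s_0) = 2.12, Eq. (28)
```

The same routine thus covers both special cases: lam=0 collapses each estimate to the one-step Bellman residual of Equation (27), while lam=1 telescopes to the empirical discounted return minus the baseline of Equation (28), matching the bias-variance trade-off discussed above.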
