OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

4.3 advantage function estimation

This section will be concerned with producing an accurate estimate $\hat{A}_t$ of the discounted advantage function $A^{\pi,\gamma}(s_t, a_t)$, which will then be used to construct a policy gradient estimator of the following form:

$$\hat{g} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{\infty} \hat{A}^n_t \, \nabla_\theta \log \pi_\theta(a^n_t \mid s^n_t), \qquad (25)$$

where $n$ indexes over a batch of episodes.

Let $V$ be an approximate value function. Define $\delta^V_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, i.e., the TD residual of $V$ with discount $\gamma$ [SB98]. Note that $\delta^V_t$ can be considered as an estimate of the advantage of the action $a_t$. In fact, if we have the correct value function $V = V^{\pi,\gamma}$, then it is a $\gamma$-just advantage estimator, and in fact, an unbiased estimator of $A^{\pi,\gamma}$:

$$\mathbb{E}_{s_{t+1}}\!\left[\delta^{V^{\pi,\gamma}}_t\right] = \mathbb{E}_{s_{t+1}}\!\left[r_t + \gamma V^{\pi,\gamma}(s_{t+1}) - V^{\pi,\gamma}(s_t)\right] = \mathbb{E}_{s_{t+1}}\!\left[Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t)\right] = A^{\pi,\gamma}(s_t, a_t).$$

However, this estimator is only $\gamma$-just for $V = V^{\pi,\gamma}$; otherwise it will yield biased policy gradient estimates.

Next, let us consider taking the sum of $k$ of these $\delta$ terms, which we will denote by $\hat{A}^{(k)}_t$:

$$\hat{A}^{(1)}_t := \delta^V_t = -V(s_t) + r_t + \gamma V(s_{t+1})$$

$$\hat{A}^{(2)}_t := \delta^V_t + \gamma \delta^V_{t+1} = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$$

$$\hat{A}^{(3)}_t := \delta^V_t + \gamma \delta^V_{t+1} + \gamma^2 \delta^V_{t+2} = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V(s_{t+3})$$

$$\hat{A}^{(k)}_t := \sum_{l=0}^{k-1} \gamma^l \delta^V_{t+l} = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})$$

These equations result from a telescoping sum, and we see that $\hat{A}^{(k)}_t$ involves a $k$-step estimate of the returns, minus a baseline term $V(s_t)$. Analogously to the case of $\delta^V_t = \hat{A}^{(1)}_t$, we can consider $\hat{A}^{(k)}_t$ to be an estimator of the advantage function, which is only $\gamma$-just when $V = V^{\pi,\gamma}$. However, note that the bias generally becomes smaller as $k \to \infty$, since the term $\gamma^k V(s_{t+k})$ becomes more heavily discounted.
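To make the telescoping identity concrete, here is a minimal NumPy sketch (not code from the thesis) that computes the TD residuals $\delta^V_t$ for a toy trajectory and checks numerically that $\sum_{l=0}^{k-1} \gamma^l \delta^V_{t+l}$ equals the telescoped form $-V(s_t) + \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k V(s_{t+k})$. The helper names (`td_residuals`, `k_step_advantage`) and the random trajectory are illustrative assumptions.

```python
import numpy as np

def td_residuals(rewards, values, gamma):
    """TD residuals delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `rewards` has length T; `values` has length T + 1, holding V(s_0), ..., V(s_T)."""
    return rewards + gamma * values[1:] - values[:-1]

def k_step_advantage(rewards, values, gamma, t, k):
    """A_hat^{(k)}_t = sum_{l=0}^{k-1} gamma^l * delta_{t+l}  (requires t + k <= T)."""
    deltas = td_residuals(rewards, values, gamma)
    return sum(gamma**l * deltas[t + l] for l in range(k))

# Toy trajectory: random rewards and an arbitrary approximate value function V.
rng = np.random.default_rng(0)
T, gamma = 10, 0.99
rewards = rng.normal(size=T)
values = rng.normal(size=T + 1)

# Check the telescoping identity for one choice of (t, k).
t, k = 2, 5
lhs = k_step_advantage(rewards, values, gamma, t, k)
rhs = (-values[t]
       + sum(gamma**l * rewards[t + l] for l in range(k))
       + gamma**k * values[t + k])
assert np.isclose(lhs, rhs)
```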
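And, for completeness, a hedged sketch of the batch policy-gradient estimator in Eq. (25). To keep it self-contained it is specialized to a linear-softmax policy, for which $\nabla_\theta \log \pi_\theta(a \mid s)$ has the well-known closed form $(\mathbf{1}_a - \pi(\cdot \mid s))\,\phi(s)^\top$; the feature representation, the weight matrix `W`, and the episode container are assumptions for the sketch, not a parameterization prescribed by the thesis.

```python
import numpy as np

def grad_log_softmax_policy(W, phi_s, a):
    """grad_W log pi_W(a|s) for a linear-softmax policy with logits = W @ phi(s):
    equals outer(one_hot(a) - pi(.|s), phi(s))."""
    logits = W @ phi_s
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    indicator = np.zeros_like(probs)
    indicator[a] = 1.0
    return np.outer(indicator - probs, phi_s)

def policy_gradient_estimate(W, episodes):
    """Eq. (25): g_hat = (1/N) * sum_n sum_t A_hat[n,t] * grad log pi(a[n,t] | s[n,t]).
    `episodes` is a list of (features, actions, advantages) triples, one per episode n."""
    g_hat = np.zeros_like(W)
    for features, actions, advantages in episodes:
        for phi_s, a, adv in zip(features, actions, advantages):
            g_hat += adv * grad_log_softmax_policy(W, phi_s, a)
    return g_hat / len(episodes)

# Toy usage: 2 episodes of length 5, 4 actions, 3-dim features (random placeholders).
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
episodes = [(rng.normal(size=(5, 3)), rng.integers(0, 4, size=5), rng.normal(size=5))
            for _ in range(2)]
print(policy_gradient_estimate(W, episodes))
```

In practice the advantages fed to this estimator would come from an estimator such as $\hat{A}^{(k)}_t$ above, computed from the same batch of episodes.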
