
OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS


Text from PDF Page: 056

4.2 Preliminaries

The following section discusses how to obtain biased (but not too biased) estimators for $A^{\pi,\gamma}$, giving us noisy estimates of the discounted policy gradient in Equation (23). Before proceeding, we will introduce the notion of a γ-just estimator of the advantage function, which is an estimator that does not introduce bias when we use it in place of $A^{\pi,\gamma}$ (which is not known and must be estimated) in Equation (23) to estimate $g^\gamma$.[1] Consider an advantage estimator $\hat{A}_t(s_{0:\infty}, a_{0:\infty})$, which may in general be a function of the entire trajectory.

Definition 2. The estimator $\hat{A}_t$ is γ-just if

$$\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[ \hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] = \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[ A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right].$$

It follows immediately that if $\hat{A}_t$ is γ-just for all $t$, then

$$\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[ \sum_{t=0}^{\infty} \hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] = g^\gamma \qquad (24)$$

One sufficient condition for $\hat{A}_t$ to be γ-just is that $\hat{A}_t$ decomposes as the difference between two functions $Q_t$ and $b_t$, where $Q_t$ can depend on any trajectory variables but gives an unbiased estimator of the γ-discounted Q-function, and $b_t$ is an arbitrary function of the states and actions sampled before $a_t$.

Proposition 2. Suppose that $\hat{A}_t$ can be written in the form $\hat{A}_t(s_{0:\infty}, a_{0:\infty}) = Q_t(s_{0:\infty}, a_{0:\infty}) - b_t(s_{0:t}, a_{0:t-1})$ such that for all $(s_t, a_t)$, $\mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty} \mid s_t, a_t}\left[ Q_t(s_{t:\infty}, a_{t:\infty}) \right] = Q^{\pi,\gamma}(s_t, a_t)$. Then $\hat{A}_t$ is γ-just.

The proof is provided in Section 4.9. It is easy to verify that the following expressions are γ-just advantage estimators for $\hat{A}_t$:

• $\sum_{l=0}^{\infty} \gamma^l r_{t+l}$
• $A^{\pi,\gamma}(s_t, a_t)$
• $Q^{\pi,\gamma}(s_t, a_t)$
• $r_t + \gamma V^{\pi,\gamma}(s_{t+1}) - V^{\pi,\gamma}(s_t)$

[1] Note that we have already introduced bias by using $A^{\pi,\gamma}$ in place of $A^{\pi}$; here we are concerned with obtaining an unbiased estimate of $g^\gamma$, which is itself a biased estimate of the policy gradient of the undiscounted MDP.
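As a concrete illustration of the estimators listed above, here is a minimal sketch (Python with NumPy, assumed for illustration; this code is not part of the thesis) that computes two of them from a sampled finite-horizon trajectory: the empirical discounted return $\sum_{l \ge 0} \gamma^l r_{t+l}$ and the one-step TD residual $r_t + \gamma V(s_{t+1}) - V(s_t)$. The function names and the toy numbers are hypothetical. Note that the TD-residual form is exactly γ-just only when the supplied value function equals $V^{\pi,\gamma}$; using an approximate value function introduces the kind of bias discussed in the following section.

import numpy as np

def discounted_return_estimator(rewards, gamma):
    # Gamma-just estimator: A_hat_t = sum_{l>=0} gamma^l * r_{t+l},
    # i.e. the empirical discounted return from time t onward (baseline b_t = 0).
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def td_residual_estimator(rewards, values, gamma):
    # Gamma-just estimator: A_hat_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    # `values` holds V(s_0), ..., V(s_T); this is exactly gamma-just only if
    # V is the true discounted value function V^{pi,gamma}.
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    return rewards + gamma * values[1:] - values[:-1]

# Toy 4-step trajectory (numbers are arbitrary, for illustration only).
rewards = [1.0, 0.0, -0.5, 2.0]
values = [0.8, 0.6, 0.4, 1.5, 0.0]  # V(s_0), ..., V(s_4); terminal value 0
print(discounted_return_estimator(rewards, gamma=0.99))
print(td_residual_estimator(rewards, values, gamma=0.99))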
