OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS (PDF page 36)
… of the gradients. That is, we estimate $A_{ij}$ as

$$\frac{1}{N}\sum_{n=1}^{N}\frac{\partial^{2}}{\partial\theta_{i}\,\partial\theta_{j}}\,D_{\mathrm{KL}}\bigl(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_{n})\,\big\|\,\pi_{\theta}(\cdot \mid s_{n})\bigr),$$

rather than

$$\frac{1}{N}\sum_{n=1}^{N}\frac{\partial}{\partial\theta_{i}}\log\pi_{\theta}(a_{n}\mid s_{n})\,\frac{\partial}{\partial\theta_{j}}\log\pi_{\theta}(a_{n}\mid s_{n}).$$

The analytic estimator integrates over the action at each state $s_{n}$, and does not depend on the action $a_{n}$ that was sampled. As described in Section 3.12, this analytic estimator has computational benefits in the large-scale setting, since it removes the need to store a dense Hessian or all policy gradients from a batch of trajectories. The rate of improvement in the policy is similar to the empirical FIM, as shown in the experiments.

Let us briefly summarize the relationship between the theory from Section 3.3 and the practical algorithm we have described:

• The theory justifies optimizing a surrogate objective with a penalty on KL divergence. However, the large penalty coefficient $2\epsilon\gamma/(1-\gamma)^{2}$ leads to prohibitively small steps, so we would like to decrease this coefficient. Empirically, it is hard to robustly choose the penalty coefficient, so we use a hard constraint instead of a penalty, with parameter $\delta$ (the bound on KL divergence).

• The constraint on $D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}},\theta)$ is hard for numerical optimization and estimation, so instead we constrain $\overline{D}_{\mathrm{KL}}(\theta_{\mathrm{old}},\theta)$.

• Our theory ignores estimation error for the advantage function. Kakade and Langford [KL02] consider this error in their derivation, and the same arguments would hold in the setting of this chapter, but we omit them for simplicity.

3.7 connections with prior work

As mentioned in Section 3.4, our derivation results in a policy update that is related to several prior methods, providing a unifying perspective on a number of policy update schemes.
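For intuition about the two estimators of $A_{ij}$ above, consider a toy single-state softmax policy, where the analytic KL Hessian and the expectation of the outer product of score vectors can both be written in closed form (for a softmax over logits, both equal $\mathrm{diag}(p) - pp^{T}$). The following sketch is illustrative only, not from the thesis; the toy setup (3 actions, made-up logits) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: one state, 3 actions, softmax policy over logits theta.
theta = np.array([0.5, -0.2, 0.1])
p = np.exp(theta) / np.exp(theta).sum()          # pi_theta(a)

# Analytic FIM: Hessian of KL(pi_old || pi_theta) at theta = theta_old.
# For the softmax parameterization this Hessian is diag(p) - p p^T,
# and it does not depend on which actions were sampled.
F_analytic = np.diag(p) - np.outer(p, p)

# Empirical FIM: average outer product of score vectors over sampled actions.
# grad_theta log pi(a | s) = e_a - p for the softmax parameterization.
N = 200_000
actions = rng.choice(3, size=N, p=p)
scores = np.eye(3)[actions] - p                  # one score vector per sample
F_empirical = scores.T @ scores / N

# The two estimators agree up to sampling noise in the empirical one.
print(np.abs(F_analytic - F_empirical).max())
```

This also illustrates the storage point made above: the analytic form never needs the $N$ individual score vectors, only the per-state action distribution.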
The natural policy gradient [Kak02] can be obtained as a special case of the update in Equation (14) by using a linear approximation to $L$ and a quadratic approximation to the $\overline{D}_{\mathrm{KL}}$ constraint, resulting in the following problem:

$$\begin{aligned}
\underset{\theta}{\text{maximize}}\quad & \bigl[\nabla_{\theta}L_{\theta_{\mathrm{old}}}(\theta)\big|_{\theta=\theta_{\mathrm{old}}}\bigr]\cdot(\theta-\theta_{\mathrm{old}})\\
\text{subject to}\quad & \tfrac{1}{2}(\theta_{\mathrm{old}}-\theta)^{T}A(\theta_{\mathrm{old}})(\theta_{\mathrm{old}}-\theta)\le\delta, \qquad (17)
\end{aligned}$$

where

$$A(\theta_{\mathrm{old}})_{ij}=\frac{\partial}{\partial\theta_{i}}\frac{\partial}{\partial\theta_{j}}\,\mathbb{E}_{s\sim\rho_{\pi}}\bigl[D_{\mathrm{KL}}\bigl(\pi(\cdot\mid s,\theta_{\mathrm{old}})\,\big\|\,\pi(\cdot\mid s,\theta)\bigr)\bigr]\Big|_{\theta=\theta_{\mathrm{old}}}.$$
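Because the objective in this problem is linear and the constraint quadratic, the maximizer has a closed form: step along the natural gradient direction $A(\theta_{\mathrm{old}})^{-1}g$ and rescale to saturate the KL bound. A minimal NumPy sketch of that solve; the values of $g$, $A$, and $\delta$ here are made-up placeholders, not quantities from the thesis:

```python
import numpy as np

# Hypothetical inputs: policy-gradient estimate g, KL-Hessian A (the matrix
# A(theta_old) from Equation (17), assumed positive definite), and KL bound delta.
g = np.array([0.3, -0.1, 0.2])
A = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.5, 0.1],
              [0.0, 0.1, 1.0]])
delta = 0.01

# Natural gradient direction: solve A s = g rather than forming A^{-1} explicitly.
direction = np.linalg.solve(A, g)

# Rescale so the step exactly saturates the quadratic constraint
# 0.5 * step^T A step = delta.
step = np.sqrt(2.0 * delta / (g @ direction)) * direction

# Constraint value at the chosen step (equals delta by construction).
print(0.5 * step @ A @ step)
```

In large-scale settings one would replace the dense `np.linalg.solve` with conjugate-gradient iterations using Hessian-vector products, which is what makes the analytic FIM estimator discussed earlier practical.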