OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS (PDF page 36)
… of the gradients. That is, we estimate $A_{ij}$ as

$$\frac{1}{N}\sum_{n=1}^{N}\frac{\partial^{2}}{\partial\theta_{i}\,\partial\theta_{j}}\,D_{\mathrm{KL}}\bigl(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_{n})\,\big\|\,\pi_{\theta}(\cdot \mid s_{n})\bigr),$$

rather than

$$\frac{1}{N}\sum_{n=1}^{N}\frac{\partial}{\partial\theta_{i}}\log\pi_{\theta}(a_{n}\mid s_{n})\,\frac{\partial}{\partial\theta_{j}}\log\pi_{\theta}(a_{n}\mid s_{n}).$$

The analytic estimator integrates over the action at each state $s_{n}$, and does not depend on the action $a_{n}$ that was sampled. As described in Section 3.12, this analytic estimator has computational benefits in the large-scale setting, since it removes the need to store a dense Hessian or all policy gradients from a batch of trajectories. The rate of improvement in the policy is similar to the empirical FIM, as shown in the experiments.

Let us briefly summarize the relationship between the theory from Section 3.3 and the practical algorithm we have described:

• The theory justifies optimizing a surrogate objective with a penalty on KL divergence. However, the large penalty coefficient $2\epsilon\gamma/(1-\gamma)^{2}$ leads to prohibitively small steps, so we would like to decrease this coefficient. Empirically, it is hard to robustly choose the penalty coefficient, so we use a hard constraint instead of a penalty, with parameter $\delta$ (the bound on KL divergence).

• The constraint on $D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}},\theta)$ is hard for numerical optimization and estimation, so instead we constrain $\overline{D}_{\mathrm{KL}}(\theta_{\mathrm{old}},\theta)$.

• Our theory ignores estimation error for the advantage function. Kakade and Langford [KL02] consider this error in their derivation, and the same arguments would hold in the setting of this chapter, but we omit them for simplicity.

3.7 connections with prior work

As mentioned in Section 3.4, our derivation results in a policy update that is related to several prior methods, providing a unifying perspective on a number of policy update schemes.
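For intuition about the two estimators of $A_{ij}$ above, consider a toy single-state softmax policy, where the analytic KL Hessian and the expectation of the outer product of score vectors can both be written in closed form (for a softmax over logits, both equal $\mathrm{diag}(p) - pp^{T}$). The following sketch is illustrative only, not from the thesis; the toy setup (3 actions, made-up logits) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: one state, 3 actions, softmax policy over logits theta.
theta = np.array([0.5, -0.2, 0.1])
p = np.exp(theta) / np.exp(theta).sum()          # pi_theta(a)

# Analytic FIM: Hessian of KL(pi_old || pi_theta) at theta = theta_old.
# For the softmax parameterization this Hessian is diag(p) - p p^T,
# and it does not depend on which actions were sampled.
F_analytic = np.diag(p) - np.outer(p, p)

# Empirical FIM: average outer product of score vectors over sampled actions.
# grad_theta log pi(a | s) = e_a - p for the softmax parameterization.
N = 200_000
actions = rng.choice(3, size=N, p=p)
scores = np.eye(3)[actions] - p                  # one score vector per sample
F_empirical = scores.T @ scores / N

# The two estimators agree up to sampling noise in the empirical one.
print(np.abs(F_analytic - F_empirical).max())
```

This also illustrates the storage point made above: the analytic form never needs the $N$ individual score vectors, only the per-state action distribution.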
The natural policy gradient [Kak02] can be obtained as a special case of the update in Equation (14) by using a linear approximation to $L$ and a quadratic approximation to the $\overline{D}_{\mathrm{KL}}$ constraint, resulting in the following problem:

$$\begin{aligned}
\underset{\theta}{\text{maximize}}\quad & \bigl[\nabla_{\theta}L_{\theta_{\mathrm{old}}}(\theta)\big|_{\theta=\theta_{\mathrm{old}}}\bigr]\cdot(\theta-\theta_{\mathrm{old}})\\
\text{subject to}\quad & \tfrac{1}{2}(\theta_{\mathrm{old}}-\theta)^{T}A(\theta_{\mathrm{old}})(\theta_{\mathrm{old}}-\theta)\le\delta, \qquad (17)
\end{aligned}$$

where

$$A(\theta_{\mathrm{old}})_{ij}=\frac{\partial}{\partial\theta_{i}}\frac{\partial}{\partial\theta_{j}}\,\mathbb{E}_{s\sim\rho_{\pi}}\bigl[D_{\mathrm{KL}}\bigl(\pi(\cdot\mid s,\theta_{\mathrm{old}})\,\big\|\,\pi(\cdot\mid s,\theta)\bigr)\bigr]\Big|_{\theta=\theta_{\mathrm{old}}}.$$
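Because the objective in this problem is linear and the constraint quadratic, the maximizer has a closed form: step along the natural gradient direction $A(\theta_{\mathrm{old}})^{-1}g$ and rescale to saturate the KL bound. A minimal NumPy sketch of that solve; the values of $g$, $A$, and $\delta$ here are made-up placeholders, not quantities from the thesis:

```python
import numpy as np

# Hypothetical inputs: policy-gradient estimate g, KL-Hessian A (the matrix
# A(theta_old) from Equation (17), assumed positive definite), and KL bound delta.
g = np.array([0.3, -0.1, 0.2])
A = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.5, 0.1],
              [0.0, 0.1, 1.0]])
delta = 0.01

# Natural gradient direction: solve A s = g rather than forming A^{-1} explicitly.
direction = np.linalg.solve(A, g)

# Rescale so the step exactly saturates the quadratic constraint
# 0.5 * step^T A step = delta.
step = np.sqrt(2.0 * delta / (g @ direction)) * direction

# Constraint value at the chosen step (equals delta by construction).
print(0.5 * step @ A @ step)
```

In large-scale settings one would replace the dense `np.linalg.solve` with conjugate-gradient iterations using Hessian-vector products, which is what makes the analytic FIM estimator discussed earlier practical.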