OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

Text from PDF Page: 063

4.6 experiments

2. Can generalized advantage estimation, along with trust region algorithms for policy and value function optimization, be used to optimize large neural network policies for challenging control problems?

4.6.1 Policy Optimization Algorithm

While generalized advantage estimation can be used along with a variety of different policy gradient methods, for these experiments, we performed the policy updates using trust region policy optimization (TRPO) [Sch+15c]. TRPO updates the policy by approximately solving the following constrained optimization problem each iteration:

\begin{align*}
\underset{\theta}{\text{minimize}}\quad & L_{\theta_{\text{old}}}(\theta) \\
\text{subject to}\quad & \bar{D}_{\mathrm{KL}}^{\theta_{\text{old}}}(\pi_{\theta_{\text{old}}}, \pi_\theta) \le \epsilon
\end{align*}

where

\begin{align}
L_{\theta_{\text{old}}}(\theta) &= \frac{1}{N} \sum_{n=1}^{N} \frac{\pi_\theta(a_n \mid s_n)}{\pi_{\theta_{\text{old}}}(a_n \mid s_n)}\, \hat{A}_n \notag \\
\bar{D}_{\mathrm{KL}}^{\theta_{\text{old}}}(\pi_{\theta_{\text{old}}}, \pi_\theta) &= \frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_n) \,\big\|\, \pi_\theta(\cdot \mid s_n)\big) \tag{35}
\end{align}

As described in [Sch+15c], we approximately solve this problem by linearizing the objective and quadraticizing the constraint, which yields a step in the direction θ − θ_old ∝ −F^{−1}g, where F is the average Fisher information matrix and g is a policy gradient estimate. This policy update yields the same step direction as the natural policy gradient [Kak01a] and natural actor-critic [PS08]; however, it uses a different stepsize determination scheme and numerical procedure for computing the step.

Since prior work [Sch+15c] compared TRPO to a variety of different policy optimization algorithms, we will not repeat these comparisons; rather, we will focus on varying the γ, λ parameters of the policy gradient estimator while keeping the underlying algorithm fixed.

For completeness, the whole algorithm for iteratively updating the policy and value function is given below (the algorithm box is not reproduced in this extract; a sketch of the iteration follows at the end of this section). Note that the policy update θ_i → θ_{i+1} is performed using the value function V_{φ_i} for advantage estimation, not V_{φ_{i+1}}. Additional bias would have been introduced if we updated the value function first. To see this, consider the extreme case where we overfit the value function, and the Bellman residual r_t + γV(s_{t+1}) − V(s_t) becomes zero at all timesteps: the policy gradient estimate would be zero.
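To make the "linearizing the objective and quadraticizing the constraint" step concrete, the following short derivation shows where the −F^{−1}g direction comes from. It uses the g and F defined above; the explicit stepsize formula is the standard closed-form solution of the approximate problem, given here for illustration rather than quoted from the thesis (in practice TRPO computes F^{−1}g approximately, e.g. with conjugate gradient, and follows the step with a line search).

\begin{align*}
&\text{Let }\; \Delta \equiv \theta - \theta_{\text{old}},\qquad
 g \equiv \nabla_\theta L_{\theta_{\text{old}}}(\theta)\big|_{\theta=\theta_{\text{old}}},\qquad
 F \equiv \nabla_\theta^2\, \bar{D}_{\mathrm{KL}}^{\theta_{\text{old}}}(\pi_{\theta_{\text{old}}},\pi_\theta)\big|_{\theta=\theta_{\text{old}}}. \\[4pt]
&\text{Approximate problem:}\qquad
 \underset{\Delta}{\text{minimize}}\ \; g^{\top}\Delta
 \qquad\text{subject to}\qquad \tfrac{1}{2}\,\Delta^{\top} F\, \Delta \le \epsilon. \\[4pt]
&\text{Stationarity of the Lagrangian, } g + \beta F \Delta = 0,\ \text{gives the direction } \Delta \propto -F^{-1}g, \\[4pt]
&\text{and making the constraint tight yields the step }\;
 \Delta = -\sqrt{\frac{2\epsilon}{g^{\top}F^{-1}g}}\; F^{-1}g.
\end{align*}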
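Since the algorithm box referenced above is not part of this extract, here is a minimal Python sketch of the iteration as described in this section: advantages are computed from the residuals δ_t = r_t + γV(s_{t+1}) − V(s_t) using the current value function V_{φ_i}, the policy is updated first, and only afterwards is the value function refit. The helpers collect_trajectories, trpo_update, and fit_value_function are hypothetical placeholders, not code from the thesis.

import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimates for one trajectory.

    rewards: r_0, ..., r_{T-1}
    values:  V(s_0), ..., V(s_T) from the current value function V_{phi_i}
             (length T+1; the last entry is the bootstrap value, or 0 if terminal).
    """
    T = len(rewards)
    # Bellman residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    acc = 0.0
    # A_hat_t = sum_{l >= 0} (gamma * lam)^l * delta_{t+l}, accumulated backwards
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages

# Schematic outer loop. The ordering is the point made above: advantages for the
# policy update are computed with V_{phi_i}, and the value function is refit only
# after the policy step. collect_trajectories, trpo_update, and fit_value_function
# are hypothetical placeholders, not functions defined in the thesis.
def train(policy, value_fn, n_iterations, gamma, lam):
    for i in range(n_iterations):
        trajs = collect_trajectories(policy)                  # simulate pi_{theta_i}
        for traj in trajs:
            values = value_fn.predict(traj["observations"])   # V_{phi_i}(s_0 .. s_T)
            traj["advantages"] = gae_advantages(traj["rewards"], values, gamma, lam)
        policy = trpo_update(policy, trajs)                    # theta_i -> theta_{i+1}
        value_fn = fit_value_function(value_fn, trajs)         # phi_i -> phi_{i+1}
    return policy, value_fn

For a trajectory ending in a terminal state, the last entry of values would be set to zero; otherwise V_{φ_i}(s_T) serves as the bootstrap value for the truncated sum.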
