OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

The update is $\theta_{\text{new}} = \theta_{\text{old}} + \frac{1}{\lambda} A(\theta_{\text{old}})^{-1} \nabla_\theta L(\theta)\big|_{\theta=\theta_{\text{old}}}$, where the stepsize $\frac{1}{\lambda}$ is typically treated as an algorithm parameter. This differs from our approach, which enforces the constraint at each update. Though this difference might seem subtle, our experiments demonstrate that it significantly improves the algorithm's performance on larger problems.

We can also obtain the standard policy gradient update by using an $\ell_2$ constraint or penalty:

$$\operatorname*{maximize}_{\theta}\ \nabla_\theta L_{\theta_{\text{old}}}(\theta)\big|_{\theta=\theta_{\text{old}}} \cdot (\theta - \theta_{\text{old}}) \qquad \text{subject to}\ \ \tfrac{1}{2}\|\theta - \theta_{\text{old}}\|^2 \le \delta. \tag{18}$$

The policy iteration update can also be obtained by solving the unconstrained problem $\operatorname*{maximize}_{\pi} L_{\pi_{\text{old}}}(\pi)$, with $L$ as defined in Equation (5).

Several other methods employ an update similar to Equation (14). Relative entropy policy search (REPS) [PMA10] constrains the state-action marginals $p(s, a)$, while TRPO constrains the conditionals $p(a \mid s)$. Unlike REPS, our approach does not require a costly nonlinear optimization in the inner loop. Levine and Abbeel [LA14] also use a KL divergence constraint, but its purpose is to encourage the policy not to stray from regions where the estimated dynamics model is valid, while we do not attempt to estimate the system dynamics explicitly. Pirotta et al. [Pir+13] also build on and generalize Kakade and Langford's results, and they derive different algorithms from the ones here.

3.8 experiments

We designed our experiments to investigate the following questions:

1. What are the performance characteristics of the single path and vine sampling procedures?

2. TRPO is related to prior methods (e.g. the natural policy gradient) but makes several changes, most notably by using a fixed KL divergence constraint rather than a fixed penalty coefficient. How does this affect the performance of the algorithm?

3. Can TRPO be used to solve challenging large-scale problems? How does TRPO compare with other methods when applied to large-scale problems, with regard to final performance, computation time, and sample complexity?

To answer (1) and (2), we compare the performance of the single path and vine variants of TRPO, several ablated variants, and a number of prior policy optimization algorithms. With regard to (3), we show that both the single path and vine algorithms can obtain high-
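To make the contrast above concrete, the following minimal numpy sketch (not code from the thesis) compares the three updates discussed in this excerpt on a toy problem: a natural policy gradient step with a fixed stepsize 1/lambda, a TRPO-style step that rescales the same direction so the quadratic approximation of the KL constraint is satisfied with equality at each update, and the l2-constrained linearized objective of Equation (18), which steps along the plain gradient. The diagonal Fisher matrix A, the gradient g, and the values of delta and lambda are illustrative assumptions; in the actual algorithm these quantities are estimated from samples and the linear system is solved approximately with the conjugate gradient method.

import numpy as np

rng = np.random.default_rng(0)
dim = 5
g = rng.normal(size=dim)                       # gradient of the surrogate L at theta_old (assumed given)
A = np.diag(rng.uniform(0.5, 2.0, size=dim))   # stand-in for the Fisher information matrix
theta_old = np.zeros(dim)
delta = 0.01                                   # trust-region size (KL or l2 bound)
lam = 10.0                                     # fixed penalty coefficient, i.e. stepsize 1/lam

# 1) Natural policy gradient: fixed stepsize 1/lam; the KL constraint is not enforced.
theta_npg = theta_old + (1.0 / lam) * np.linalg.solve(A, g)

# 2) TRPO-style step: rescale the natural-gradient direction s = A^{-1} g so that the
#    quadratic approximation of the KL divergence, (1/2) s^T A s, equals delta exactly.
s = np.linalg.solve(A, g)
beta = np.sqrt(2.0 * delta / (s @ A @ s))
theta_trpo = theta_old + beta * s

# 3) l2-constrained linearized objective (Equation (18)): the maximizer steps along g
#    itself, recovering the standard ("vanilla") policy gradient direction.
theta_pg = theta_old + np.sqrt(2.0 * delta / (g @ g)) * g

def approx_kl(theta):
    # Quadratic approximation of the KL divergence between old and new policies.
    d = theta - theta_old
    return 0.5 * d @ A @ d

print("approx KL, natural gradient:", approx_kl(theta_npg))
print("approx KL, TRPO-style step :", approx_kl(theta_trpo))   # equals delta by construction
print("approx KL, l2 / vanilla PG :", approx_kl(theta_pg))

Running the sketch shows that only the TRPO-style step saturates the (approximate) KL bound, whereas the fixed-stepsize natural gradient step can take an arbitrarily small or large KL step depending on lambda. In the full algorithm the step is additionally validated with a backtracking line search on the exact surrogate objective and KL divergence; the sketch only shows the scaling that enforces the constraint on the quadratic approximation.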
