In practice, if we used the penalty coefficient C recommended by the theory above, the step sizes would be very small. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:

$$
\begin{aligned}
\underset{\theta}{\text{maximize}}\quad & L_{\theta_{\mathrm{old}}}(\theta) \\
\text{subject to}\quad & D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta) \le \delta. \qquad (13)
\end{aligned}
$$

This problem imposes a constraint that the KL divergence is bounded at every point in the state space. While it is motivated by the theory, this problem is impractical to solve due to the large number of constraints. Instead, we can use a heuristic approximation which considers the average KL divergence:

$$
\bar{D}_{\mathrm{KL}}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho}\left[ D_{\mathrm{KL}}\big(\pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s)\big) \right].
$$

We therefore propose solving the following optimization problem to generate a policy update:

$$
\begin{aligned}
\underset{\theta}{\text{maximize}}\quad & L_{\theta_{\mathrm{old}}}(\theta) \\
\text{subject to}\quad & \bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \le \delta. \qquad (14)
\end{aligned}
$$

Similar policy updates have been proposed in prior work [BS03; PS08; PMA10], and we compare our approach to prior methods in Section 3.7 and in the experiments in Section 3.8. Our experiments also show that this type of constrained update has empirical performance similar to that of the maximum-KL-divergence constraint in Equation (13).

3.5 sample-based estimation of the objective and constraint

The previous section proposed a constrained optimization problem on the policy parameters (Equation (14)), which optimizes an estimate of the expected total reward η subject to a constraint on the change in the policy at each update. This section describes how the objective and constraint functions can be approximated using Monte Carlo simulation.

We seek to solve the following optimization problem, obtained by expanding $L_{\theta_{\mathrm{old}}}$ in Equation (14):

$$
\begin{aligned}
\underset{\theta}{\text{maximize}}\quad & \sum_{s} \rho_{\theta_{\mathrm{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s)\, A_{\theta_{\mathrm{old}}}(s, a) \\
\text{subject to}\quad & \bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \le \delta. \qquad (15)
\end{aligned}
$$
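To make the two quantities in Equation (15) concrete, the following is a minimal sketch of their Monte Carlo estimators for discrete action spaces. It assumes the policies are represented as arrays of action probabilities at sampled states; the function names (`mean_kl`, `surrogate_objective`) and array layouts are illustrative choices, not part of the original text.

```python
import numpy as np

def mean_kl(pi_old, pi_new):
    """Sample estimate of the average KL constraint in Eq. (14)/(15).

    pi_old, pi_new: arrays of shape (n_states, n_actions) giving action
    probabilities under the old and new policies at states sampled from
    rho_{theta_old}. Returns the mean over states of
    D_KL(pi_old(.|s) || pi_new(.|s)).
    """
    kl_per_state = np.sum(pi_old * (np.log(pi_old) - np.log(pi_new)), axis=1)
    return kl_per_state.mean()

def surrogate_objective(pi_new_probs, pi_old_probs, advantages):
    """Importance-sampled estimate of the objective L_{theta_old}(theta).

    Each array has one entry per sampled (s, a) pair: the probability of
    the sampled action under the new and old policies, and the advantage
    estimate A_{theta_old}(s, a). The likelihood ratio reweights samples
    drawn from the old policy.
    """
    ratio = pi_new_probs / pi_old_probs
    return np.mean(ratio * advantages)
```

When the new policy equals the old one, the ratio is 1 everywhere, so the surrogate reduces to the mean advantage and the KL estimate is zero, which is the expected behavior at the start of each update.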