OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS


have not shown good performance across as wide a variety of tasks as policy gradient methods have; however, when they work, they tend to be more sample-efficient than policy gradient methods. While the approach of this thesis simplifies the problem of reinforcement learning by reducing it to a more well-understood kind of optimization with stochastic gradients, two sources of difficulty still arise, motivating the work of this thesis.

1. Most prior applications of deep learning involve an objective where we have access to the loss function and how it depends on the parameters of our function approximator. Reinforcement learning, on the other hand, involves a dynamics model that is unknown and possibly nondifferentiable. We can still obtain gradient estimates, but they have high variance, which leads to slow learning (a minimal sketch of such an estimator is given below).

2. In the typical supervised learning setting, the input data does not depend on the current predictor; in reinforcement learning, by contrast, the input data strongly depends on the current policy. This dependence of the state distribution on the policy makes it harder to devise stable reinforcement learning algorithms.

1.6 contributions of this thesis

This thesis develops policy optimization methods that are more stable and sample-efficient than their predecessors and that work effectively when using neural networks as function approximators.

First, we study the following question: after collecting a batch of data using the current policy, how should we update the policy? In a theoretical analysis, we show that there is a certain loss function that provides a local approximation of the policy performance, and the accuracy of this approximation is bounded in terms of the KL divergence between the old policy (used to collect the data) and the new policy (the policy after the update). This theory justifies a policy updating scheme that is guaranteed to monotonically improve the policy (ignoring sampling error). This contrasts with previous analyses of policy gradient methods (such as [JJS94]), which did not specify what finite-sized stepsizes would guarantee policy improvement. By making some practically motivated approximations to this scheme, we develop an algorithm called trust region policy optimization (TRPO). This algorithm is shown to yield strong empirical results in two domains: simulated robotic locomotion, and Atari games using images as input. TRPO is closely related to natural gradient methods [Kak02; BS03; PS08]; however, some changes are introduced that make the algorithm more scalable and robust. Furthermore, the derivation of TRPO motivates a new class of policy gradient methods that controls the size of the
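For the local approximation and KL-divergence bound described above, one standard form is the surrogate-objective bound from the TRPO line of work; the constant and the notation (eta for expected discounted return, A_pi for the advantage function, rho_pi for the discounted state-visitation distribution, gamma for the discount factor) are recalled here from that work rather than defined on this page:

    \eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
    \quad \text{where} \quad
    L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a),
    \quad
    C = \frac{4\,\epsilon\,\gamma}{(1-\gamma)^{2}}, \;\; \epsilon = \max_{s,a} \bigl| A_{\pi}(s, a) \bigr|.

Maximizing the right-hand side at each update guarantees that the true performance eta is non-decreasing, which is the monotonic-improvement scheme referred to above; in practice, TRPO replaces the KL penalty with a constraint on the average KL divergence to obtain a usable stepsize.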
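The high-variance gradient estimates referred to in the first difficulty above are typically score-function (likelihood-ratio, or REINFORCE) estimates, which require only samples of rewards rather than a differentiable model. The sketch below illustrates the idea on a toy bandit-style problem; the environment, the linear-softmax policy, and the helper names (policy_probs, reward, grad_log_prob) are illustrative assumptions, not objects from the thesis.

    # Minimal sketch of a score-function (REINFORCE) policy-gradient estimate.
    # The bandit-style "environment" and linear-softmax policy are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    n_actions, n_features = 4, 8
    theta = np.zeros((n_features, n_actions))   # policy parameters

    def policy_probs(x, theta):
        logits = x @ theta
        z = np.exp(logits - logits.max())
        return z / z.sum()

    def reward(a):
        # Stands in for unknown, possibly nondifferentiable dynamics/reward:
        # the estimator only needs sampled reward values, not gradients of them.
        return float(a == 2) + rng.normal(scale=1.0)

    def grad_log_prob(x, a, theta):
        # Gradient of log pi(a | x) for the linear-softmax policy.
        probs = policy_probs(x, theta)
        g = -np.outer(x, probs)
        g[:, a] += x
        return g

    # Monte Carlo estimate: average of reward * grad log pi(a | x).
    x = rng.normal(size=n_features)
    grads = []
    for _ in range(100):
        a = rng.choice(n_actions, p=policy_probs(x, theta))
        grads.append(reward(a) * grad_log_prob(x, a, theta))
    grad_estimate = np.mean(grads, axis=0)

Each sampled term is an unbiased estimate of the gradient of the expected reward, but individual terms are noisy, so many samples are needed for an accurate average; this is the high-variance, slow-learning issue that variance-reduction techniques aim to mitigate.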
