OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS
have not shown good performance across as wide a variety of tasks as policy gradient methods have; however, when they work, they tend to be more sample-efficient than policy gradient methods.

While the approach of this thesis simplifies the problem of reinforcement learning by reducing it to a more well-understood kind of optimization with stochastic gradients, two sources of difficulty still arise, motivating the work of this thesis.

1. Most prior applications of deep learning involve an objective where we have access to the loss function and how it depends on the parameters of our function approximator. Reinforcement learning, on the other hand, involves a dynamics model that is unknown and possibly nondifferentiable. We can still obtain gradient estimates, but they have high variance, which leads to slow learning.

2. In the typical supervised learning setting, the input data doesn't depend on the current predictor; in reinforcement learning, by contrast, the input data strongly depends on the current policy. This dependence of the state distribution on the policy makes it harder to devise stable reinforcement learning algorithms.

1.6 contributions of this thesis

This thesis develops policy optimization methods that are more stable and sample-efficient than their predecessors and that work effectively when using neural networks as function approximators.

First, we study the following question: after collecting a batch of data using the current policy, how should we update the policy? In a theoretical analysis, we show that there is a certain loss function that provides a local approximation of the policy performance, and that the accuracy of this approximation is bounded in terms of the KL divergence between the old policy (used to collect the data) and the new policy (the policy after the update).
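As a concrete illustration of these two quantities (this is not code from the thesis; the policies, advantages, and sample sizes below are hypothetical), the importance-sampled surrogate objective and the KL divergence that bounds its accuracy can be sketched for a discrete-action policy as follows:

```python
import numpy as np

# Toy sketch: for a discrete-action policy, the local surrogate objective is
# the importance-weighted advantage,
#   L(pi_new) = E_{s,a ~ pi_old}[ pi_new(a|s) / pi_old(a|s) * A(s,a) ],
# and its accuracy as an approximation of true policy performance is
# controlled by KL(pi_old || pi_new) over the sampled states.

def surrogate_loss(p_old, p_new, actions, advantages):
    """Importance-sampled surrogate objective from samples drawn under p_old."""
    idx = np.arange(len(actions))
    ratios = p_new[idx, actions] / p_old[idx, actions]
    return np.mean(ratios * advantages)

def kl_divergence(p_old, p_new):
    """Mean KL(p_old || p_new) over sampled states, for categorical policies."""
    return np.mean(np.sum(p_old * np.log(p_old / p_new), axis=1))

# Hypothetical data: 4 sampled states, 3 actions.
rng = np.random.default_rng(0)
p_old = rng.dirichlet(np.ones(3), size=4)       # action probs under old policy
p_new = rng.dirichlet(np.ones(3), size=4)       # action probs under new policy
actions = rng.integers(0, 3, size=4)            # actions sampled under p_old
advantages = rng.normal(size=4)                 # advantage estimates

L = surrogate_loss(p_old, p_new, actions, advantages)
kl = kl_divergence(p_old, p_new)
```

The KL term is what the theory turns into a constraint: as long as it stays small, improving the surrogate improves (a lower bound on) the true policy performance.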
This theory justifies a policy updating scheme that is guaranteed to monotonically improve the policy (ignoring sampling error). This contrasts with previous analyses of policy gradient methods (such as [JJS94]), which did not specify what finite-sized stepsizes would guarantee policy improvement. By making some practically-motivated approximations to this scheme, we develop an algorithm called trust region policy optimization (TRPO). This algorithm is shown to yield strong empirical results in two domains: simulated robotic locomotion, and Atari games using images as input. TRPO is closely related to natural gradient methods [Kak02; BS03; PS08]; however, some changes are introduced that make the algorithm more scalable and robust. Furthermore, the derivation of TRPO motivates a new class of policy gradient methods that controls the size of the
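To make the trust-region idea concrete, the following minimal sketch shows one simplified ingredient of such an update: backtracking a proposed step until the KL constraint is satisfied. This is an illustration only, not the thesis's implementation; the logits, the ascent direction, and the limit `delta` are all hypothetical stand-ins.

```python
import numpy as np

# Simplified trust-region step for a categorical policy parameterized by
# per-state logits: shrink a proposed ascent step geometrically until the
# mean KL between the old and the updated policy is within a limit delta.

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_kl(p_old, p_new):
    """Mean KL(p_old || p_new) over sampled states."""
    return np.mean(np.sum(p_old * np.log(p_old / p_new), axis=1))

def backtracking_update(logits, direction, p_old, delta=0.01,
                        backtrack=0.5, max_tries=20):
    """Backtrack the step size until KL(old, new) <= delta; else reject."""
    step = 1.0
    for _ in range(max_tries):
        candidate = logits + step * direction
        if mean_kl(p_old, softmax(candidate)) <= delta:
            return candidate, step
        step *= backtrack
    return logits, 0.0  # no acceptable step found; keep the old policy

# Hypothetical setup: 4 states, 3 actions, random ascent direction.
rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 3))
p_old = softmax(logits)
direction = rng.normal(size=(4, 3))  # stand-in for a computed ascent direction
new_logits, used_step = backtracking_update(logits, direction, p_old)
```

TRPO itself computes the step direction from the surrogate gradient and a KL-based curvature estimate; the backtracking line search above is only the final safeguard that keeps the update inside the trust region.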