OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

Finally, there are actor-critic methods that combine elements from both policy optimization and dynamic programming. These methods optimize a policy, but they use value functions to speed up this optimization, and often use ideas from approximate dynamic programming to fit the value functions. The method described in Chapter 4, along with deterministic policy gradient methods [Lil+15; Hee+15], are examples of actor-critic methods.

1.5 optimizing stochastic policies

This thesis focuses on a particular branch in the family tree of RL algorithms from the previous section: methods that optimize a stochastic policy using gradient-based methods. Why stochastic policies (defining π(a | s) as the probability of action a given state s) rather than deterministic policies (a = π(s))? Stochastic policies have several advantages:

• Even with a discrete action space, it's possible to make an infinitesimal change to a stochastic policy. That enables policy gradient methods, which estimate the gradient of performance with respect to the policy parameters. For a deterministic policy, policy gradients do not make sense with a discrete action space, because the chosen action cannot change infinitesimally.

• We can use the score function gradient estimator, which tries to make good actions more probable. This estimator, and its alternative, the pathwise derivative estimator, will be discussed in Chapter 5. The score function estimator is better at dealing with systems that contain discrete-valued or discontinuous components.

• The randomness inherent in the policy leads to exploration, which is crucial for most learning problems. In other RL methods that aren't based on stochastic policies, randomness usually needs to be added in some other way. On the other hand, stochastic policies explore poorly in many problems, and policy gradient methods often converge to suboptimal solutions.

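
The first two points above can be illustrated with a minimal sketch, which is not taken from the thesis. It assumes a hypothetical one-step problem with three discrete actions and fixed rewards, and a softmax policy whose logits `theta` are the policy parameters. All names here (`softmax`, `exact_gradient`, `score_function_gradient`, the reward values, the step size) are illustrative assumptions. For this policy, the gradient of log π(a) with respect to the logits is onehot(a) − π, so the score function estimator averages reward × (onehot(a) − π) over sampled actions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def exact_gradient(theta, rewards):
    # Gradient of J(theta) = sum_a pi(a) * r(a) for a softmax policy:
    # dJ/dtheta_k = pi_k * (r_k - J).
    pi = softmax(theta)
    return pi * (rewards - pi @ rewards)

def score_function_gradient(theta, rewards, n_samples, rng):
    # Monte Carlo estimate of E_a[ r(a) * grad_theta log pi(a) ].
    pi = softmax(theta)
    actions = rng.choice(len(pi), size=n_samples, p=pi)
    scores = np.eye(len(pi))[actions] - pi  # grad log pi(a) = onehot(a) - pi
    return (scores * rewards[actions][:, None]).mean(axis=0)

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.0, 2.0])  # hypothetical reward for each action
theta = np.zeros(3)                  # uniform initial policy

# At theta = 0 the exact gradient is [0, -1/3, 1/3].
g_exact = exact_gradient(theta, rewards)
g_est = score_function_gradient(theta, rewards, 200_000, rng)

# Plug noisy gradient estimates into stochastic gradient ascent.
for _ in range(500):
    theta = theta + 0.1 * score_function_gradient(theta, rewards, 64, rng)
final_policy = softmax(theta)
```

Under these assumptions, `g_est` should agree with `g_exact` up to sampling noise, and `final_policy` should place most of its probability on the highest-reward action. Note that the actions are discrete, yet the policy parameters still change smoothly.
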
The approach taken in this thesis, optimizing stochastic policies using gradient-based methods, makes reinforcement learning much more like other domains where deep learning is used. Namely, we repeatedly compute a noisy estimate of the gradient of performance, and plug that into a stochastic gradient descent algorithm. This situation contrasts with methods that use function approximation along with dynamic programming methods like value iteration and policy iteration: there, we can also formulate optimization problems; however, we are not directly optimizing the expected performance. While there has been success using neural networks in value iteration [Mni+13], this sort of algorithm is hard to analyze because it is not clear how errors in the dynamic programming updates will accumulate or affect the performance; thus, these methods
