OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS



1.4 what to learn, what to approximate

[Figure: a taxonomy of reinforcement learning algorithms, with Policy Optimization (DFO / Evolution, Policy Gradients) on one side, Dynamic Programming (Policy Iteration, Value Iteration, modified policy iteration, Q-Learning) on the other, and Actor-Critic Methods bridging the two.]

Policy optimization methods are centered around the policy, the function that maps the agent's state to its next action. These methods view reinforcement learning as a numerical optimization problem where we optimize the expected reward with respect to the policy's parameters. There are two ways to optimize a policy. First, there are derivative-free optimization (DFO) algorithms, including evolutionary algorithms. These algorithms work by perturbing the policy parameters in many different ways, measuring the performance, and then moving in the direction of good performance. They are simple to implement and work very well for policies with a small number of parameters, but they scale poorly with the number of parameters. Some DFO algorithms used for policy optimization include the cross-entropy method [SL06], covariance matrix adaptation [WP09], and natural evolution strategies [Wie+08] (these three use Gaussian distributions); and HyperNEAT, which also evolves the network topology [Hau+12]. Second, there are policy gradient methods [Wil92; Sut+99; JJS94; Kak02]. These algorithms can estimate the policy improvement direction by using various quantities that were measured by the agent; unlike DFO algorithms, they don't need to perturb the parameters to measure the improvement direction. Policy gradient methods are a bit more complicated to implement, and they have some difficulty optimizing behaviors that unfold over a very long timescale, but they are capable of optimizing much larger policies than DFO algorithms.

The second approach for deriving RL algorithms is through approximate dynamic programming (ADP). These methods focus on learning value functions, which predict how much reward the agent is going to receive. The true value functions obey certain consistency equations, and ADP algorithms work by trying to satisfy these equations. There are two well-known algorithms for exactly solving RL problems that have a finite number of states and actions: policy iteration and value iteration. (Both of these algorithms are special cases of a general algorithm called modified policy iteration.) These algorithms can be combined with function approximation in a variety of different ways; currently, the leading descendants of value iteration work by approximating Q-functions (e.g., [Mni+15]).
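To make the derivative-free approach concrete, here is a minimal sketch of a cross-entropy-method loop in Python. It is not taken from the thesis: the function evaluate_policy (assumed to return the total reward of a rollout with the given parameter vector) and all hyperparameter values are illustrative assumptions.

import numpy as np

def cross_entropy_method(evaluate_policy, dim, n_iters=50,
                         pop_size=100, elite_frac=0.2, init_std=1.0):
    # Derivative-free policy search: perturb the parameters in many
    # directions, measure performance, and refit a Gaussian to the
    # best-performing ("elite") samples each iteration.
    mean = np.zeros(dim)
    std = np.full(dim, init_std)
    n_elite = int(pop_size * elite_frac)
    for _ in range(n_iters):
        thetas = mean + std * np.random.randn(pop_size, dim)
        returns = np.array([evaluate_policy(th) for th in thetas])
        elite = thetas[np.argsort(returns)[-n_elite:]]   # top performers
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

Because the improvement direction is inferred only from sampled perturbations, the number of rollouts needed grows with the size of the parameter vector, which reflects the poor scaling with parameter count noted above.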
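The policy gradient idea can be illustrated with the classic score-function (REINFORCE) estimator [Wil92]: the improvement direction is estimated from quantities measured along sampled trajectories, with no explicit parameter perturbation. The linear-softmax policy and the sample_trajectory helper below are assumptions for illustration only, not the thesis's own implementation.

import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_gradient(theta, sample_trajectory, n_episodes=100):
    # Estimates grad E[R] as the average over episodes of
    #   R * sum_t grad_theta log pi(a_t | s_t),
    # for a linear-softmax policy pi(a | s) = softmax(theta @ phi(s)).
    # sample_trajectory(policy) is assumed to return a list of
    # (features, action, reward) tuples for one episode.
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        traj = sample_trajectory(lambda phi: softmax(theta @ phi))
        episode_return = sum(r for _, _, r in traj)
        for phi, a, _ in traj:
            probs = softmax(theta @ phi)
            grad_logp = -np.outer(probs, phi)   # d log pi / d theta for k != a
            grad_logp[a] += phi                 # = (1[k = a] - pi(k|s)) * phi
            grad += episode_return * grad_logp
    return grad / n_episodes

A gradient-ascent step, theta += step_size * reinforce_gradient(theta, sample_trajectory), then moves the policy in the estimated improvement direction.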
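The consistency equations mentioned above are the Bellman equations; for a finite MDP they can be satisfied exactly by value iteration, which repeatedly applies the backup V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]. The sketch below is a standard textbook version (not from the thesis) and assumes the transition model P and reward table R are known; Q-learning and its deep descendants such as [Mni+15] instead estimate Q-functions from sampled transitions with function approximation.

import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    # P: (S, A, S) array, P[s, a, s2] = probability of moving to s2.
    # R: (S, A) array of expected immediate rewards.
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)        # Bellman backup, shape (S, A)
        V_new = Q.max(axis=1)          # greedy value of each state
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # values and greedy policy
        V = V_new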
