OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS


the formula has been proposed in prior work [KK98; Waw09], our analysis is novel and enables GAE to be applied with a more general set of algorithms, including the batch trust-region algorithm we use for our experiments.

2. We propose the use of a trust region optimization method for the value function, which we find is a robust and efficient way to train neural network value functions with thousands of parameters.

3. By combining (1) and (2) above, we obtain an algorithm that is empirically effective at learning neural network policies for challenging control tasks. The results extend the state of the art in using reinforcement learning for high-dimensional continuous control. Videos are available at https://sites.google.com/site/gaepapersupp.

4.2 preliminaries

In this chapter, we consider an undiscounted formulation of the policy optimization problem. The initial state s0 is sampled from distribution ρ0. A trajectory (s0, a0, s1, a1, . . . ) is generated by sampling actions according to the policy at ∼ π(at | st) and sampling the states according to the dynamics st+1 ∼ P(st+1 | st, at), until a terminal (absorbing) state is reached. A reward rt = r(st, at, st+1) is received at each timestep. The goal is to maximize the expected total reward $\sum_{t=0}^{\infty} r_t$, which is assumed to be finite for all policies. Note that we are not using a discount as part of the problem specification; it will appear below as an algorithm parameter that adjusts a bias-variance tradeoff. But the discounted problem (maximizing $\sum_{t=0}^{\infty} \gamma^t r_t$) can be handled as an instance of the undiscounted problem in which we absorb the discount factor into the reward function, making it time-dependent.

Policy gradient methods maximize the expected total reward by repeatedly estimating the gradient $g := \nabla_\theta \mathbb{E}\!\left[\sum_{t=0}^{\infty} r_t\right]$. There are several different related expressions for the policy gradient, which have the form

$$g = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right], \qquad (22)$$

where $\Psi_t$ may be one of the following:

1. $\sum_{t=0}^{\infty} r_t$: total reward of the trajectory.
2. $\sum_{t'=t}^{\infty} r_{t'}$: reward following action $a_t$.
3. $\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$: baselined version of the previous formula.
4. $Q^{\pi}(s_t, a_t)$: state-action value function.
5. $A^{\pi}(s_t, a_t)$: advantage function.
6. $r_t + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$: TD residual.
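To make the estimator in Equation (22) concrete, the following is a minimal NumPy sketch, not part of the thesis, of a single-trajectory policy gradient estimate using choices (2) and (3) for Ψt. The function names, argument shapes, and the reward_to_go helper are illustrative assumptions.

```python
import numpy as np

def reward_to_go(rewards):
    """Psi_t choice (2): undiscounted sum of rewards from time t onward."""
    # Cumulative sum taken backwards over the trajectory.
    return np.cumsum(rewards[::-1])[::-1]

def policy_gradient_estimate(grad_log_probs, rewards, baseline=None):
    """Single-trajectory estimate of g = E[sum_t Psi_t * grad_theta log pi_theta(a_t | s_t)].

    grad_log_probs: (T, n_params) array of score vectors grad_theta log pi_theta(a_t | s_t).
    rewards:        (T,) array of rewards r_t.
    baseline:       optional (T,) array b(s_t); subtracting it gives Psi_t choice (3).
    (Shapes and names are assumptions for illustration, not the thesis's notation.)
    """
    psi = reward_to_go(np.asarray(rewards, dtype=float))
    if baseline is not None:
        psi = psi - np.asarray(baseline, dtype=float)
    # Weight each score vector by Psi_t and sum over the trajectory.
    return (psi[:, None] * np.asarray(grad_log_probs)).sum(axis=0)
```

For choice (6), the TD residual, one would instead set psi = rewards + values[1:] - values[:-1], given value-function estimates at every visited state including the terminal one; this trades the variance of the full return for the bias of the value estimate, which is the tradeoff GAE interpolates.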
