OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS


3.5 Sample-Based Estimation of the Objective and Constraint

Figure 1: Left: illustration of the single path procedure. Here, we generate a set of trajectories via simulation of the policy and incorporate all state-action pairs (s_n, a_n) into the objective. Right: illustration of the vine procedure. We generate a set of "trunk" trajectories, and then generate "branch" rollouts from a subset of the reached states. For each of these states s_n, we perform multiple actions (a_1 and a_2 here) and perform a rollout after each action, using common random numbers (CRN) to reduce the variance.

denoted s_1, s_2, . . . , s_N, which we call the "rollout set". For each state s_n in the rollout set, we sample K actions according to a_{n,k} ~ q(· | s_n). Any choice of q(· | s_n) whose support includes the support of π_{θ_i}(· | s_n) will produce a consistent estimator. In practice, we found that q(· | s_n) = π_{θ_i}(· | s_n) works well on continuous problems, such as robotic locomotion, while the uniform distribution works well on discrete tasks, such as the Atari games, where it can sometimes achieve better exploration.

For each action a_{n,k} sampled at each state s_n, we estimate \hat{Q}_{θ_i}(s_n, a_{n,k}) by performing a rollout (i.e., a short trajectory) starting with state s_n and action a_{n,k}. We can greatly reduce the variance of the Q-value differences between rollouts by using the same random number sequence for the noise in each of the K rollouts, i.e., common random numbers. See [Ber05] for additional discussion on Monte Carlo estimation of Q-values and [NJ00] for a discussion of common random numbers in reinforcement learning.

In small, finite action spaces, we can generate a rollout for every possible action from a given state. The contribution to L_{θ_old} from a single state s_n is as follows:

    L_n(θ) = Σ_{k=1}^{K} π_θ(a_k | s_n) \hat{Q}(s_n, a_k),

where the action space is A = {a_1, a_2, . . . , a_K}. In large or continuous action spaces, we can construct an estimator of the surrogate objective using importance sampling. The self-normalized estimator (Owen [Owe13], Chapter 8) of L_{θ_old} obtained at a single state is

    L_n(θ) =
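To make these per-state estimates concrete, here is a minimal numpy sketch (not taken from the thesis): rollout-based Q estimates that reuse a common random number stream across the K rollouts, the exact sum over a small finite action space, and a generic self-normalized importance-sampling estimate of the kind cited from Owen [Owe13], with weights π_θ(a_{n,k} | s_n) / q(a_{n,k} | s_n). The simulator interface (step_fn, policy_fn) and all helper names are hypothetical placeholders, and the estimator shown is the textbook self-normalized form rather than the thesis's exact expression.

import numpy as np


def rollout_q_estimates_crn(step_fn, policy_fn, s_n, actions, horizon, seed, gamma=0.99):
    """Estimate Q(s_n, a) for each candidate action a via a short rollout,
    reusing the SAME random number stream (common random numbers) for every
    action, so that differences between the K estimates have lower variance.

    step_fn(s, a, noise) -> (next_state, reward) and policy_fn(s, rng) -> action
    are hypothetical interfaces for a simulator whose randomness is exposed
    through an explicit noise argument; they are not part of the thesis.
    """
    q_hat = []
    for a in actions:
        rng = np.random.default_rng(seed)        # same seed => common random numbers
        noise = rng.standard_normal(horizon)     # shared noise sequence across rollouts
        s, ret, discount = s_n, 0.0, 1.0
        s, r = step_fn(s, a, noise[0])           # first step uses the candidate action
        ret += discount * r
        for t in range(1, horizon):
            discount *= gamma
            s, r = step_fn(s, policy_fn(s, rng), noise[t])
            ret += discount * r
        q_hat.append(ret)
    return np.array(q_hat)


def finite_action_surrogate(pi_theta, q_hat):
    """Per-state contribution for a small, finite action space:
    L_n(theta) = sum_k pi_theta(a_k | s_n) * Q_hat(s_n, a_k)."""
    return float(np.dot(pi_theta, q_hat))


def self_normalized_surrogate(pi_theta, q_sample, q_hat):
    """Textbook self-normalized importance-sampling estimate for sampled actions
    a_{n,k} ~ q(. | s_n): weight each rollout estimate by
    w_k = pi_theta(a_{n,k} | s_n) / q(a_{n,k} | s_n) and normalize by sum_k w_k."""
    w = pi_theta / q_sample
    return float(np.sum(w * q_hat) / np.sum(w))


# Toy usage for one rollout-set state s_n with K = 3 sampled actions.
pi_new = np.array([0.5, 0.3, 0.2])    # pi_theta(a_k | s_n) under the new policy
pi_old = np.array([0.4, 0.4, 0.2])    # sampling distribution q = pi_theta_i
q_est = np.array([1.2, 0.7, -0.1])    # CRN rollout estimates of Q(s_n, a_k)

print(finite_action_surrogate(pi_new, q_est))            # enumerated-action case
print(self_normalized_surrogate(pi_new, pi_old, q_est))  # importance-sampled case

Note that sharing the random stream does not change the distribution of any individual estimate \hat{Q}(s_n, a_k); it only correlates the K estimates so that their differences, which are what the surrogate objective is sensitive to, have lower variance.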
