OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS
Text from PDF page 034
3.5 Sample-Based Estimation of the Objective and Constraint

[Figure 1 image omitted; two panels, the left labeled "sampling trajectories", the right labeled "rollout set" and "two rollouts using CRN".]

Figure 1: Left: illustration of the single path procedure. Here, we generate a set of trajectories via simulation of the policy and incorporate all state-action pairs (s_n, a_n) into the objective. Right: illustration of the vine procedure. We generate a set of "trunk" trajectories, and then generate "branch" rollouts from a subset of the reached states. For each of these states s_n, we perform multiple actions (a_1 and a_2 here) and perform a rollout after each action, using common random numbers (CRN) to reduce the variance.

...denoted s_1, s_2, . . . , s_N, which we call the "rollout set". For each state s_n in the rollout set, we sample K actions according to a_{n,k} ~ q(· | s_n). Any choice of q(· | s_n) whose support includes the support of π_{θ_i}(· | s_n) will produce a consistent estimator. In practice, we found that q(· | s_n) = π_{θ_i}(· | s_n) works well on continuous problems, such as robotic locomotion, while the uniform distribution works well on discrete tasks, such as the Atari games, where it can sometimes achieve better exploration.

For each action a_{n,k} sampled at each state s_n, we estimate Q̂_{θ_i}(s_n, a_{n,k}) by performing a rollout (i.e., a short trajectory) starting with state s_n and action a_{n,k}. We can greatly reduce the variance of the Q-value differences between rollouts by using the same random number sequence for the noise in each of the K rollouts, i.e., common random numbers. See [Ber05] for additional discussion of Monte Carlo estimation of Q-values and [NJ00] for a discussion of common random numbers in reinforcement learning.

In small, finite action spaces, we can generate a rollout for every possible action from a given state. The contribution to L_{θ_old} from a single state s_n is as follows:

    L_n(θ) = Σ_{k=1}^{K} π_θ(a_k | s_n) Q̂(s_n, a_k),

where the action space is A = {a_1, a_2, . . .
, a_K}. In large or continuous action spaces, we can construct an estimator of the surrogate objective using importance sampling. The self-normalized estimator (Owen [Owe13], Chapter 8) of L_{θ_old} obtained at a single state s_n is

    L_n(θ) = ( Σ_{k=1}^{K} [π_θ(a_{n,k} | s_n) / q(a_{n,k} | s_n)] Q̂(s_n, a_{n,k}) ) / ( Σ_{k=1}^{K} π_θ(a_{n,k} | s_n) / q(a_{n,k} | s_n) ),

i.e., the importance weights π_θ/q are normalized to sum to one.
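The two estimation steps above can be sketched in code. This is a minimal illustration, not the thesis's implementation: it assumes a hypothetical simulator `step(s, a, eps)` that takes its noise `eps` as an explicit argument (so the same noise sequence can be replayed across branch rollouts) and a hypothetical `policy(s)` that continues each rollout after the first action; all names and signatures here are illustrative.

```python
import numpy as np

def q_rollout(step, policy, s0, a0, noise, gamma=0.99):
    """Monte Carlo Q-value estimate: one 'branch' rollout from (s0, a0).
    `step(s, a, eps) -> (s_next, reward)` is a hypothetical deterministic
    simulator that consumes its randomness `eps` explicitly, and
    `policy(s)` chooses every action after the first (both assumptions)."""
    s, a, ret, disc = s0, a0, 0.0, 1.0
    for eps in noise:
        s, r = step(s, a, eps)
        ret += disc * r
        disc *= gamma
        a = policy(s)
    return ret

def q_with_crn(step, policy, s_n, actions, horizon=20, seed=0):
    """Estimate Q_hat(s_n, a) for each of the K candidate actions,
    replaying the SAME noise sequence (common random numbers) in every
    rollout, so that differences between Q-values have low variance."""
    noise = np.random.default_rng(seed).standard_normal(horizon)
    return np.array([q_rollout(step, policy, s_n, a, noise) for a in actions])

def l_n_self_normalized(pi_theta, q_probs, q_hat):
    """Self-normalized importance-sampling estimate of one state's
    contribution L_n(theta): weights w_k = pi_theta / q, and the
    estimate is sum(w * Q_hat) / sum(w)."""
    w = np.asarray(pi_theta, dtype=float) / np.asarray(q_probs, dtype=float)
    return float(np.sum(w * np.asarray(q_hat)) / np.sum(w))
```

When q = π_{θ_old}, the weights reduce to the usual likelihood ratios π_θ/π_{θ_old}; when all weights are equal, the estimator collapses to the plain sample mean of the Q̂ values, and the self-normalization makes it insensitive to any common scaling of the probabilities.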