OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS


3.5 sample-based estimation of the objective and constraint

We first replace $\sum_s \rho_{\theta_{\text{old}}}(s)[\dots]$ in the objective by the expectation $\frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}[\dots]$. Next, we replace the advantage values $A_{\theta_{\text{old}}}$ by the Q-values $Q_{\theta_{\text{old}}}$ in Equation (15), which only changes the objective by a constant. Last, we replace the sum over the actions by an importance sampling estimator. Using $q$ to denote the sampling distribution, the contribution of a single $s_n$ to the loss function is

$$\sum_a \pi_\theta(a \mid s_n)\, A_{\theta_{\text{old}}}(s_n, a) = \mathbb{E}_{a \sim q}\!\left[ \frac{\pi_\theta(a \mid s_n)}{q(a \mid s_n)}\, A_{\theta_{\text{old}}}(s_n, a) \right].$$

Our optimization problem in Equation (15) is exactly equivalent to the following one, written in terms of expectations:

$$\operatorname*{maximize}_{\theta} \;\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim q}\!\left[ \frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\text{old}}}(s, a) \right] \tag{16}$$
$$\text{subject to} \;\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\!\left[ D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\big\|\, \pi_\theta(\cdot \mid s)\big) \right] \le \delta.$$

All that remains is to replace the expectations by sample averages and replace the Q-value by an empirical estimate. The following sections describe two different schemes for performing this estimation.

The first sampling scheme, which we call single path, is the one that is typically used for policy gradient estimation [BB11], and is based on sampling individual trajectories. The second scheme, which we call vine, involves constructing a rollout set and then performing multiple actions from each state in the rollout set. This method has mostly been explored in the context of policy iteration methods [LP03; GGS13].

3.5.1 Single Path

In this estimation procedure, we collect a sequence of states by sampling $s_0 \sim \rho_0$ and then simulating the policy $\pi_{\theta_{\text{old}}}$ for some number of timesteps to generate a trajectory $s_0, a_0, s_1, a_1, \dots, s_{T-1}, a_{T-1}, s_T$. Hence, $q(a \mid s) = \pi_{\theta_{\text{old}}}(a \mid s)$. $Q_{\theta_{\text{old}}}(s, a)$ is computed at each state-action pair $(s_t, a_t)$ by taking the discounted sum of future rewards along the trajectory.

3.5.2 Vine

In this estimation procedure, we first sample $s_0 \sim \rho_0$ and simulate the policy $\pi_{\theta_i}$ to generate a number of trajectories. We then choose a subset of $N$ states along these trajectories,
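To make the sample-average form of Equation (16) concrete, the following is a minimal sketch (not the thesis's implementation) of the surrogate objective and KL constraint for a discrete action space, assuming the action probabilities of $\pi_\theta$ and $\pi_{\theta_{\text{old}}}$ have already been evaluated at the sampled states. The function name surrogate_and_kl and its arguments are illustrative.

    # Minimal sketch: sample-average estimates of Equation (16).
    # Names and array layouts are illustrative assumptions, not from the thesis.
    import numpy as np

    def surrogate_and_kl(pi_new, pi_old, actions, q_values):
        """
        pi_new   : (N, A) action probabilities under the candidate policy pi_theta
        pi_old   : (N, A) action probabilities under pi_theta_old (which is the
                   sampling distribution q in the single-path scheme)
        actions  : (N,)   indices of the sampled actions a_n ~ q(. | s_n)
        q_values : (N,)   empirical estimates of Q_theta_old(s_n, a_n)
        """
        idx = np.arange(len(actions))
        # Surrogate: mean of pi_theta(a|s) / q(a|s) * Q_theta_old(s, a)
        ratios = pi_new[idx, actions] / pi_old[idx, actions]
        surrogate = np.mean(ratios * q_values)
        # Constraint: mean over states of D_KL(pi_theta_old(.|s) || pi_theta(.|s))
        mean_kl = np.mean(np.sum(pi_old * (np.log(pi_old) - np.log(pi_new)), axis=1))
        return surrogate, mean_kl

In the single-path scheme $q(a \mid s) = \pi_{\theta_{\text{old}}}(a \mid s)$, so pi_old plays the role of $q$ in the importance weights; the optimization step would then maximize surrogate subject to mean_kl $\le \delta$.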
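The single-path Q-value estimate described in Section 3.5.1 is simply the discounted sum of future rewards along the sampled trajectory; a minimal sketch, with an illustrative helper name:

    # Minimal sketch of the single-path Q estimate: for each (s_t, a_t) along one
    # trajectory, Q_theta_old(s_t, a_t) is taken to be the discounted sum of rewards
    # from time t onward.
    import numpy as np

    def discounted_future_rewards(rewards, gamma):
        """rewards: sequence r_0, ..., r_{T-1} from one trajectory; gamma: discount factor."""
        rewards = np.asarray(rewards, dtype=float)
        q_est = np.zeros_like(rewards)
        running = 0.0
        # Sweep backwards so each entry accumulates the gamma-discounted future rewards.
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            q_est[t] = running
        return q_est

For example, discounted_future_rewards([0.0, 0.0, 1.0], gamma=0.99) returns [0.9801, 0.99, 1.0]: each entry is the return seen from that timestep onward.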
