$s_n$ is

$$L_n(\theta) = \frac{\sum_{k=1}^{K} \frac{\pi_\theta(a_{n,k} \mid s_n)}{\pi_{\theta_{\mathrm{old}}}(a_{n,k} \mid s_n)} \hat{Q}(s_n, a_{n,k})}{\sum_{k=1}^{K} \frac{\pi_\theta(a_{n,k} \mid s_n)}{\pi_{\theta_{\mathrm{old}}}(a_{n,k} \mid s_n)}},$$

assuming that we performed $K$ actions $a_{n,1}, a_{n,2}, \ldots, a_{n,K}$ from state $s_n$. This self-normalized estimator removes the need to use a baseline for the Q-values (note that the gradient is unchanged by adding a constant to the Q-values). Averaging over $s_n \sim \rho(\pi)$, we obtain an estimator for $L_{\theta_{\mathrm{old}}}$, as well as its gradient.

The vine and single path methods are illustrated in Figure 1. We use the term vine, since the trajectories used for sampling can be likened to the stems of vines, which branch at various points (the rollout set) into several short offshoots (the rollout trajectories).

The benefit of the vine method over the single path method is that our local estimate of the objective has much lower variance given the same number of Q-value samples in the surrogate objective. That is, the vine method gives much better estimates of the advantage values. The downside of the vine method is that we must perform far more calls to the simulator for each of these advantage estimates. Furthermore, the vine method requires us to generate multiple trajectories from each state in the rollout set, which limits this algorithm to settings where the system can be reset to an arbitrary state. In contrast, the single path algorithm requires no state resets and can be directly implemented on a physical system [PS08].

3.6 practical algorithm

Here we present two practical policy optimization algorithms based on the ideas above, which use either the single path or vine sampling scheme from the preceding section. The algorithms repeatedly perform the following steps:

1. Use the single path or vine procedures to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.
2. By averaging over samples, construct the estimated objective and constraint in Equation (16).
3. Approximately solve this constrained optimization problem to update the policy's parameter vector $\theta$. We use the conjugate gradient algorithm followed by a line search, which is altogether only slightly more expensive than computing the gradient itself. See Section 3.12 for details.

With regard to (3), we construct the Fisher information matrix (FIM) by analytically computing the Hessian of the KL divergence, rather than using the covariance matrix
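As an illustration of the self-normalized estimator $L_n(\theta)$ for a single rollout state, here is a minimal Python sketch. The function name `vine_surrogate` and its list-of-log-probabilities interface are assumptions made for this example and do not come from the text.

```python
import math


def vine_surrogate(logp_new, logp_old, q_hat):
    """Self-normalized importance-sampling estimate at one state s_n.

    logp_new[k]: log pi_theta(a_k | s_n)       (hypothetical input layout)
    logp_old[k]: log pi_theta_old(a_k | s_n)
    q_hat[k]:    Monte Carlo Q-value estimate for action a_k
    """
    if not q_hat or not (len(logp_new) == len(logp_old) == len(q_hat)):
        raise ValueError("expected K >= 1 entries in each argument")
    # Importance ratios pi_theta / pi_theta_old, formed from log-probabilities.
    weights = [math.exp(ln - lo) for ln, lo in zip(logp_new, logp_old)]
    # Ratio-weighted Q-values divided by the sum of the ratios.
    weighted_q = sum(w * q for w, q in zip(weights, q_hat))
    return weighted_q / sum(weights)
```

Because the weights are normalized to sum to one, adding a constant c to every Q-value shifts the estimate by exactly c for every $\theta$, which is why the gradient is unaffected and no baseline is needed.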
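Step 3 above (conjugate gradient followed by a line search) can be sketched roughly as follows. This is a generic illustration and not the procedure of Section 3.12: `matvec` stands for a hypothetical Fisher-vector product routine, and the backtracking rule, shrink factor, and function names are assumptions made for the example.

```python
def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b for symmetric positive-definite A,
    using only matrix-vector products matvec(v) = A v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x starting at zero
    p = list(b)  # current search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(iters):
        if rs_old < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def backtracking_line_search(objective, constraint, theta, step,
                             max_constraint, shrink=0.5, max_tries=10):
    """Shrink the step until the constraint holds and the objective improves.

    Returns the unchanged parameters if no acceptable step is found.
    """
    base = objective(theta)
    for i in range(max_tries):
        scale = shrink ** i
        candidate = [t + scale * s for t, s in zip(theta, step)]
        if constraint(candidate) <= max_constraint and objective(candidate) > base:
            return candidate
    return list(theta)
```

Accessing the matrix only through `matvec` means it never has to be formed or inverted explicitly; each conjugate gradient iteration costs one matrix-vector product.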
OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS