OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS


For the experiments in this work, we used a trust region method to optimize the value function in each iteration of a batch optimization procedure. The trust region helps us to avoid overfitting to the most recent batch of data. To formulate the trust region problem, we first compute σ² = (1/N) Σ_{n=1}^N ‖V_{φ_old}(s_n) − V̂_n‖², where φ_old is the parameter vector before optimization. Then we solve the following constrained optimization problem:

\begin{aligned}
\operatorname*{minimize}_{\phi} \quad & \sum_{n=1}^{N} \lVert V_{\phi}(s_n) - \hat{V}_n \rVert^2 \\
\text{subject to} \quad & \frac{1}{N} \sum_{n=1}^{N} \frac{\lVert V_{\phi}(s_n) - V_{\phi_{\mathrm{old}}}(s_n) \rVert^2}{2\sigma^2} \le \epsilon.
\end{aligned} \tag{33}

This constraint is equivalent to constraining the average KL divergence between the previous value function and the new value function to be smaller than ε, where the value function is taken to parameterize a conditional Gaussian distribution with mean V_φ(s) and variance σ².

We compute an approximate solution to the trust region problem using the conjugate gradient algorithm [WN99]. Specifically, we are solving the quadratic program

\begin{aligned}
\operatorname*{minimize}_{\phi} \quad & g^{T}(\phi - \phi_{\mathrm{old}}) \\
\text{subject to} \quad & \frac{1}{N} \sum_{n=1}^{N} (\phi - \phi_{\mathrm{old}})^{T} H (\phi - \phi_{\mathrm{old}}) \le \epsilon,
\end{aligned} \tag{34}

where g is the gradient of the objective, and H = (1/N) Σ_n j_n j_n^T, where j_n = ∇_φ V_φ(s_n). Note that H is the "Gauss-Newton" approximation of the Hessian of the objective, and it is (up to a σ² factor) the Fisher information matrix when interpreting the value function as a conditional probability distribution. Using matrix-vector products v → Hv to implement the conjugate gradient algorithm, we compute a step direction s ≈ −H⁻¹g. Then we rescale s → αs such that (1/2)(αs)^T H(αs) = ε and take φ = φ_old + αs. This procedure is analogous to the procedure we use for updating the policy, which is described further in Section 4.6 and based on [Sch+15c].

4.6 experiments

We designed a set of experiments to investigate the following questions:

1. What is the empirical effect of varying λ ∈ [0, 1] and γ ∈ [0, 1] when optimizing episodic total reward using generalized advantage estimation?
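The equivalence between the constraint in Equation (33) and an average KL divergence, under the Gaussian parameterization described above, follows from the closed form of the KL divergence between two Gaussians with the same variance; a minimal derivation:

% KL divergence between two Gaussians that differ only in their means:
D_{\mathrm{KL}}\big(\mathcal{N}(V_{\phi_{\mathrm{old}}}(s_n), \sigma^2) \,\big\|\, \mathcal{N}(V_{\phi}(s_n), \sigma^2)\big)
  = \frac{\lVert V_{\phi}(s_n) - V_{\phi_{\mathrm{old}}}(s_n) \rVert^2}{2\sigma^2},

% so averaging over the N sampled states recovers the constraint in Equation (33):
\frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\big(\mathcal{N}(V_{\phi_{\mathrm{old}}}(s_n), \sigma^2) \,\big\|\, \mathcal{N}(V_{\phi}(s_n), \sigma^2)\big) \le \epsilon.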

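To make the conjugate-gradient trust-region step described before Section 4.6 concrete, the following is a minimal sketch, assuming a linear value function V_φ(s) = φ·f(s) (so the per-sample Jacobian j_n is simply the feature vector f(s_n)) and synthetic data. The damping term, step size ε, and all variable names are illustrative assumptions; this is not the implementation used in the thesis, which fits a neural-network value function.

# Minimal sketch of the trust-region value-function update, assuming a
# linear value function V_phi(s) = phi . f(s); the per-sample Jacobian
# j_n is then just the feature vector f(s_n).
import numpy as np

def conjugate_gradient(hvp, b, iters=10, tol=1e-10):
    """Approximately solve H x = b using only matrix-vector products v -> H v."""
    x = np.zeros_like(b)
    r = b.copy()              # residual b - H x (x = 0 initially)
    p = r.copy()
    rdotr = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rdotr / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        new_rdotr = r @ r
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

def trust_region_value_step(feats, targets, phi_old, eps=0.01, damping=1e-3):
    """One trust-region step for fitting V_phi(s) = phi . f(s) to targets V_hat."""
    N = feats.shape[0]
    pred_old = feats @ phi_old
    # Gradient g of the squared-error objective sum_n ||V_phi(s_n) - V_hat_n||^2 at phi_old.
    g = 2.0 * feats.T @ (pred_old - targets)
    # Gauss-Newton / Fisher matrix-vector product H v = (1/N) sum_n j_n (j_n . v),
    # with a small damping term added for numerical stability (an assumption of this sketch).
    def hvp(v):
        return feats.T @ (feats @ v) / N + damping * v
    # Step direction s ~= -H^{-1} g via conjugate gradient.
    s = conjugate_gradient(hvp, -g)
    # Rescale s -> alpha s so that 0.5 * (alpha s)^T H (alpha s) = eps, then take the step.
    alpha = np.sqrt(2.0 * eps / (s @ hvp(s)))
    return phi_old + alpha * s

# Toy usage: fit a linear value function to noisy synthetic targets.
rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 8))
true_phi = rng.normal(size=8)
targets = feats @ true_phi + 0.1 * rng.normal(size=256)
phi = np.zeros(8)
for _ in range(50):
    phi = trust_region_value_step(feats, targets, phi, eps=0.01)

The point of using the products v → Hv inside conjugate gradient is that H is never formed explicitly, which is what makes the same procedure practical when V_φ is a neural network with many parameters.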