OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

Next, let us revisit the discount factor $\gamma$ and the approximation we are making by using $A^{\pi,\gamma}$ rather than $A^{\pi,1}$. The discounted policy gradient estimator from Equation (23) has a sum of terms of the form

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi,\gamma}(s_t, a_t) = \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{l=0}^{\infty} \gamma^l \chi(l; s_t, a_t).$$

Using a discount $\gamma < 1$ corresponds to dropping the terms with $l \gg 1/(1-\gamma)$. Thus, the error introduced by this approximation will be small if $\chi$ rapidly decays as $l$ increases, i.e., if the effect of an action on rewards is "forgotten" after $\approx 1/(1-\gamma)$ timesteps.

If the reward function $\tilde{r}$ were obtained using $\Phi = V^{\pi,\gamma}$, we would have $\mathbb{E}[\tilde{r}_{t+l} \mid s_t, a_t] = \mathbb{E}[\tilde{r}_{t+l} \mid s_t] = 0$ for $l > 0$, i.e., the response function would only be nonzero at $l = 0$. Therefore, this shaping transformation would turn a temporally extended response into an immediate response. Given that $V^{\pi,\gamma}$ completely reduces the temporal spread of the response function, we can hope that a good approximation $V \approx V^{\pi,\gamma}$ partially reduces it. This observation suggests an interpretation of Equation (26): reshape the rewards using $V$ to shrink the temporal extent of the response function, and then introduce a "steeper" discount $\gamma\lambda$ to cut off the noise arising from long delays, i.e., ignore terms $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta^V_{t+l}$ where $l \gg 1/(1-\gamma\lambda)$.

4.5 value function estimation

A variety of different methods can be used to estimate the value function (see, e.g., [Ber12]). When using a nonlinear function approximator to represent the value function, the simplest approach is to solve a nonlinear regression problem:

$$\underset{\phi}{\text{minimize}} \;\; \sum_{n=1}^{N} \lVert V_\phi(s_n) - \hat{V}_n \rVert^2, \qquad (32)$$

where $\hat{V}_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$ is the discounted sum of rewards, and $n$ indexes over all timesteps in a batch of trajectories. This is sometimes called the Monte Carlo or TD(1) approach for estimating the value function [SB98].²

² Another natural choice is to compute target values with an estimator based on the TD(λ) backup [Ber12; SB98], mirroring the expression we use for policy gradient estimation: $\hat{V}^\lambda_t = V_{\phi_{\text{old}}}(s_t) + \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$. While we experimented with this choice, we did not notice a difference in performance from the $\lambda = 1$ estimator in Equation (32).
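To make the two estimators above concrete, here is a minimal numpy sketch, not from the thesis, assuming a single finite-horizon trajectory that ends in a terminal state (so the value after the last step is bootstrapped as zero). The function names `discounted_returns` and `gae_advantages`, the arrays `rewards` and `values`, and the settings `gamma=0.99`, `lam=0.95` are illustrative choices, not values used in the thesis.

```python
# Sketch: Monte Carlo value targets (as in Eq. 32) and the shaped-reward terms
# delta^V_t accumulated with the "steeper" discount gamma*lambda.
import numpy as np

def discounted_returns(rewards, gamma):
    """V_hat_t = sum_{l>=0} gamma^l r_{t+l}: the Monte Carlo / TD(1) regression target."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def gae_advantages(rewards, values, gamma, lam):
    """A_hat_t = sum_{l>=0} (gamma*lam)^l delta^V_{t+l},
    where delta^V_t = r_t + gamma V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    next_values = np.append(values[1:], 0.0)          # bootstrap 0 after the terminal step
    deltas = rewards + gamma * next_values - values   # shaped (transformed) rewards
    advantages = np.zeros(T, dtype=np.float64)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running   # cut off long delays via gamma*lambda
        advantages[t] = running
    return advantages

# Example on a 5-step episode with arbitrary rewards and a stand-in V_phi(s_t).
rewards = np.array([1.0, 0.0, 0.5, 0.0, 1.0])
values  = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
targets = discounted_returns(rewards, gamma=0.99)                 # lambda = 1 targets, Eq. (32)
advs    = gae_advantages(rewards, values, gamma=0.99, lam=0.95)
```

Under these assumptions, the TD(λ) targets mentioned in the footnote are simply `values + advs`, while `targets` corresponds to the $\lambda = 1$ regression targets in Equation (32).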
