OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

We've described an advantage estimator with two separate parameters, γ and λ, both of which contribute to the bias-variance tradeoff when using an approximate value function. However, they serve different purposes and work best with different ranges of values. γ most importantly determines the scale of the value function V^{π,γ}, which does not depend on λ. Taking γ < 1 introduces bias into the policy gradient estimate, regardless of the value function's accuracy. On the other hand, λ < 1 introduces bias only when the value function is inaccurate. Empirically, we find that the best value of λ is much lower than the best value of γ, likely because λ introduces far less bias than γ for a reasonably accurate value function.

Using the generalized advantage estimator, we can construct a biased estimator of g^γ, the discounted policy gradient from Equation (23):

\[
g^\gamma \approx \mathbb{E}\left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} \right]
= \mathbb{E}\left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta^V_{t+l} \right], \tag{29}
\]

where equality holds when λ = 1.

4.4 interpretation as reward shaping

In this section, we discuss how one can interpret λ as an extra discount factor applied after performing a reward shaping transformation on the MDP. We also introduce the notion of a response function to help understand the bias introduced by γ and λ.

Reward shaping [NHR99] refers to the following transformation of the reward function of an MDP: let Φ : S → ℝ be an arbitrary scalar-valued function on the state space, and define the transformed reward function r̃ by

\[
\tilde{r}(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s), \tag{30}
\]

which in turn defines a transformed MDP. This transformation leaves the discounted advantage function A^{π,γ} unchanged for any policy π. To see this, consider the discounted sum of rewards of a trajectory starting with state s_t:

\[
\sum_{l=0}^{\infty} \gamma^l \tilde{r}(s_{t+l}, a_{t+l}, s_{t+l+1}) = \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l}, s_{t+l+1}) - \Phi(s_t). \tag{31}
\]

Letting Q̃^{π,γ}, Ṽ^{π,γ}, and Ã^{π,γ} be the value and advantage functions of the transformed MDP, it follows from Equation (31) that Q̃^{π,γ}(s, a) = Q^{π,γ}(s, a) − Φ(s), Ṽ^{π,γ}(s) = V^{π,γ}(s) − Φ(s), and therefore Ã^{π,γ}(s, a) = A^{π,γ}(s, a).
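To make the estimator in Equation (29) concrete, the following is a minimal Python sketch of the inner sum Â_t^{GAE(γ,λ)} = Σ_l (γλ)^l δ^V_{t+l} for a single finite-length trajectory, computed with the backward recursion Â_t = δ_t + γλ Â_{t+1}. The function name gae_advantages, the NumPy array layout, and the convention that values has one more entry than rewards are assumptions made for this sketch, not notation from the thesis.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Sketch: A_hat_t = sum_{l >= 0} (gamma * lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards: length-T array of rewards r_t
    values:  length-(T+1) array of value estimates V(s_t); the last entry
             is the value of the state after the final step (0.0 if terminal)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t^V
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

The policy gradient estimate in Equation (29) would then weight ∇_θ log π_θ(a_t | s_t) by these advantages. With lam=1.0 the recursion reduces to the (bootstrapped) discounted return minus the baseline V(s_t); with lam=0.0 it reduces to the one-step TD residual.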
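The shaping transformation in Equation (30) and the identity in Equation (31) can also be checked numerically. Below is a small sketch, assuming a finite trajectory with Φ of the final state set to zero so the telescoping sum in Equation (31) holds exactly at the truncation point; the helper name shaped_rewards and the array conventions are assumptions for this sketch.

```python
import numpy as np

def shaped_rewards(rewards, phi_along_traj, gamma):
    """Apply r_tilde(s_t, a_t, s_{t+1}) = r_t + gamma * Phi(s_{t+1}) - Phi(s_t)
    along a trajectory.

    rewards:        length-T array of original rewards r_t
    phi_along_traj: length-(T+1) array of Phi(s_t) at the visited states
    """
    r = np.asarray(rewards, dtype=np.float64)
    phi = np.asarray(phi_along_traj, dtype=np.float64)
    return r + gamma * phi[1:] - phi[:-1]

# Numeric check of Equation (31) on a short trajectory (Phi of the last state is 0,
# so the telescoping terms cancel exactly despite the truncated horizon):
gamma = 0.9
r = np.array([1.0, 0.0, 2.0])
phi = np.array([0.3, -0.1, 0.7, 0.0])   # Phi(s_0), ..., Phi(s_3)
disc = gamma ** np.arange(len(r))       # 1, gamma, gamma^2
lhs = np.sum(disc * shaped_rewards(r, phi, gamma))
rhs = np.sum(disc * r) - phi[0]
print(np.isclose(lhs, rhs))             # True
```

Because the shift −Φ(s_t) depends on the state but not the action, it cancels when forming Q̃ − Ṽ, consistent with the claim above that A^{π,γ} is invariant under shaping.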
