OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

4.4 Interpretation as Reward Shaping

one obtains from the definitions of these quantities that

\tilde{Q}^{\pi,\gamma}(s,a) = Q^{\pi,\gamma}(s,a) - \Phi(s)
\tilde{V}^{\pi,\gamma}(s) = V^{\pi,\gamma}(s) - \Phi(s)
\tilde{A}^{\pi,\gamma}(s,a) = (Q^{\pi,\gamma}(s,a) - \Phi(s)) - (V^{\pi,\gamma}(s) - \Phi(s)) = A^{\pi,\gamma}(s,a).

Note that if \Phi happens to be the state-value function V^{\pi,\gamma} of the original MDP, then the transformed MDP has the interesting property that \tilde{V}^{\pi,\gamma}(s) is zero at every state.

Note that [NHR99] showed that the reward shaping transformation leaves the policy gradient and the optimal policy unchanged when the objective is to maximize the discounted sum of rewards \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}). In contrast, this chapter is concerned with maximizing the undiscounted sum of rewards, where the discount \gamma is used as a variance-reduction parameter.

Having reviewed the idea of reward shaping, let us consider how we could use it to obtain a policy gradient estimate. The most natural approach is to construct policy gradient estimators that use discounted sums of the shaped rewards \tilde{r}. However, Equation (31) shows that this yields the discounted sum of the original MDP's rewards r minus a baseline term. Next, let us consider using a "steeper" discount \gamma\lambda, where 0 \le \lambda \le 1. It is easy to see that the shaped reward \tilde{r} equals the Bellman residual term \delta^V introduced in Section 4.3, where we set \Phi = V. Letting \Phi = V, we see that

\sum_{l=0}^{\infty} (\gamma\lambda)^l \tilde{r}(s_{t+l}, a_{t+l}, s_{t+l+1}) = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta^V_{t+l} = \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}.

Hence, by considering the \gamma\lambda-discounted sum of shaped rewards, we exactly obtain the generalized advantage estimators of Section 4.3. As shown previously, \lambda = 1 gives an unbiased estimate of g^{\gamma}, whereas \lambda < 1 gives a biased estimate.

To further analyze the effect of this shaping transformation and of the parameters \gamma and \lambda, it will be useful to introduce the notion of a response function \chi, which we define as follows:

\chi(l; s_t, a_t) = \mathbb{E}[r_{t+l} \mid s_t, a_t] - \mathbb{E}[r_{t+l} \mid s_t].

Note that A^{\pi,\gamma}(s,a) = \sum_{l=0}^{\infty} \gamma^l \chi(l; s, a); hence the response function decomposes the advantage function across timesteps. The response function lets us quantify the temporal credit assignment problem: long-range dependencies between actions and rewards correspond to nonzero values of the response function for l \gg 0.
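To make the identity \sum_{l} (\gamma\lambda)^l \tilde{r}_{t+l} = \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} concrete, the following minimal NumPy sketch (not from the thesis; the function names and the random trajectory are purely illustrative) forms the shaped rewards \tilde{r}_t = r_t + \gamma V(s_{t+1}) - V(s_t) with \Phi = V and checks, on a finite trajectory, that their \gamma\lambda-discounted sums coincide with the usual backward GAE recursion \hat{A}_t = \delta^V_t + \gamma\lambda \hat{A}_{t+1}.

import numpy as np

def gae_via_shaped_rewards(rewards, values, gamma, lam):
    """gamma*lam-discounted sums of shaped rewards r~_t = r_t + gamma*V(s_{t+1}) - V(s_t).

    rewards: array of length T (r_0, ..., r_{T-1})
    values:  array of length T+1 (V(s_0), ..., V(s_T)); V(s_T) covers the truncated tail.
    """
    T = len(rewards)
    # Shaped reward with potential Phi = V, i.e. the Bellman residual delta^V_t.
    shaped = rewards + gamma * values[1:] - values[:-1]
    discounts = (gamma * lam) ** np.arange(T)
    # Advantage estimate at each t: sum_{l >= 0} (gamma*lam)^l * shaped[t + l].
    return np.array([np.sum(discounts[: T - t] * shaped[t:]) for t in range(T)])

def gae_backward_recursion(rewards, values, gamma, lam):
    """Standard backward recursion A_t = delta_t + gamma*lam*A_{t+1}."""
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, gamma, lam = 50, 0.99, 0.95
    rewards = rng.normal(size=T)
    values = rng.normal(size=T + 1)   # any value-function guess works for the identity
    a1 = gae_via_shaped_rewards(rewards, values, gamma, lam)
    a2 = gae_backward_recursion(rewards, values, gamma, lam)
    print(np.allclose(a1, a2))        # True: the two formulations coincide

On a finite trajectory both sides truncate the infinite sum at the last Bellman residual, so the match is exact; the thesis states the identity for the infinite-horizon sum.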

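The shaping identities at the top of this page can also be verified directly. The sketch below uses a randomly generated tabular MDP (chosen here purely for illustration, not an example from the thesis) to evaluate a fixed policy exactly under the original rewards and under the shaped rewards \tilde{r}(s,a,s') = r(s,a,s') + \gamma\Phi(s') - \Phi(s), confirming that \tilde{Q} = Q - \Phi, \tilde{V} = V - \Phi, and \tilde{A} = A.

import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9

P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] transition probabilities
R = rng.normal(size=(S, A, S))               # r(s, a, s')
pi = rng.dirichlet(np.ones(A), size=S)       # fixed stochastic policy pi[s, a]
phi = rng.normal(size=S)                     # arbitrary potential function Phi(s)

def evaluate(P, R, pi, gamma):
    """Exact policy evaluation: returns Q^{pi,gamma}, V^{pi,gamma}, A^{pi,gamma}."""
    r_sa = np.einsum("sat,sat->sa", P, R)     # expected one-step reward for (s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)     # state-to-state transition under pi
    r_pi = np.einsum("sa,sa->s", pi, r_sa)    # expected one-step reward under pi
    V = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
    Q = r_sa + gamma * P @ V                  # Q(s, a) = E[r + gamma * V(s')]
    return Q, V, Q - V[:, None]

# Shaped rewards: r~(s, a, s') = r(s, a, s') + gamma * Phi(s') - Phi(s)
R_shaped = R + gamma * phi[None, None, :] - phi[:, None, None]

Q, V, Adv = evaluate(P, R, pi, gamma)
Qs, Vs, Advs = evaluate(P, R_shaped, pi, gamma)

print(np.allclose(Qs, Q - phi[:, None]))   # True: Q~ = Q - Phi
print(np.allclose(Vs, V - phi))            # True: V~ = V - Phi
print(np.allclose(Advs, Adv))              # True: A~ = A

Because \Phi is subtracted identically from Q and V, the advantage, and hence any estimator built from it, is unchanged by the shaping transformation, which is the property this section exploits.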