OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

3.2 Preliminaries

where s0 ∼ ρ0 and the actions are chosen according to π. We can rewrite Equation (3) with a sum over states instead of timesteps:

\[
\begin{aligned}
\eta(\tilde\pi) &= \eta(\pi) + \sum_{t=0}^{\infty}\sum_{s} P(s_t = s \mid \tilde\pi)\sum_{a}\tilde\pi(a \mid s)\,\gamma^{t} A_{\pi}(s,a) \\
&= \eta(\pi) + \sum_{s}\sum_{t=0}^{\infty}\gamma^{t} P(s_t = s \mid \tilde\pi)\sum_{a}\tilde\pi(a \mid s)\, A_{\pi}(s,a) \\
&= \eta(\pi) + \sum_{s}\rho_{\tilde\pi}(s)\sum_{a}\tilde\pi(a \mid s)\, A_{\pi}(s,a). && (4)
\end{aligned}
\]

This equation implies that any policy update π → π̃ that has a nonnegative expected advantage at every state s, i.e., ∑_a π̃(a|s) A_π(s,a) ≥ 0, is guaranteed to increase the policy performance η, or to leave it constant in the case that the expected advantage is zero everywhere. This implies the classic result that the update performed by exact policy iteration, which uses the deterministic policy π̃(s) = argmax_a A_π(s,a), improves the policy if there is at least one state-action pair with a positive advantage value and nonzero state visitation probability; otherwise the algorithm has converged to the optimal policy. However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation error, that there will be some states s for which the expected advantage is negative, that is, ∑_a π̃(a|s) A_π(s,a) < 0. The complex dependency of ρ_π̃(s) on π̃ makes Equation (4) difficult to optimize directly. Instead, we introduce the following local approximation to η:

\[
L_{\pi}(\tilde\pi) = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\tilde\pi(a \mid s)\, A_{\pi}(s,a). \qquad (5)
\]

Note that L_π uses the visitation frequency ρ_π rather than ρ_π̃, ignoring changes in state visitation density due to changes in the policy. However, if we have a parameterized policy π_θ, where π_θ(a | s) is a differentiable function of the parameter vector θ, then L_π matches η to first order (see Kakade and Langford [KL02]). That is, for any parameter value θ0,

\[
\begin{aligned}
L_{\pi_{\theta_0}}(\pi_{\theta_0}) &= \eta(\pi_{\theta_0}), \\
\nabla_{\theta} L_{\pi_{\theta_0}}(\pi_{\theta})\big|_{\theta = \theta_0} &= \nabla_{\theta}\,\eta(\pi_{\theta})\big|_{\theta = \theta_0}. && (6)
\end{aligned}
\]

Equation (6) implies that a sufficiently small step π_{θ0} → π̃ that improves L_{π_{θold}} will also improve η, but does not give us any guidance on how big of a step to take.
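The page cites Kakade and Langford [KL02] for Equation (6) rather than proving it; a brief sketch of why it holds, using the policy-gradient theorem with the discounted visitation ρ_π (standard, but not shown on this page), goes as follows. The value matching is immediate because the expected advantage of π_θ0 under its own action distribution is zero at every state:

\[
L_{\pi_{\theta_0}}(\pi_{\theta_0}) - \eta(\pi_{\theta_0})
= \sum_{s}\rho_{\pi_{\theta_0}}(s)\sum_{a}\pi_{\theta_0}(a \mid s)\, A_{\pi_{\theta_0}}(s,a) = 0,
\]

since ∑_a π(a|s) A_π(s,a) = E_{a∼π}[Q_π(s,a)] − V_π(s) = 0 for every s. For the gradient matching, only the factor π_θ(a|s) in Equation (5) depends on θ, so

\[
\nabla_{\theta} L_{\pi_{\theta_0}}(\pi_{\theta})\big|_{\theta_0}
= \sum_{s}\rho_{\pi_{\theta_0}}(s)\sum_{a}\nabla_{\theta}\pi_{\theta}(a \mid s)\big|_{\theta_0}\, A_{\pi_{\theta_0}}(s,a),
\]

which coincides with the policy-gradient expression for ∇_θ η(π_θ)|_{θ0}; using A_π instead of Q_π changes nothing because ∑_a ∇_θ π_θ(a|s) V_π(s) = V_π(s) ∇_θ ∑_a π_θ(a|s) = 0.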

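Equations (4)–(6) can also be checked numerically on a small tabular MDP, where η(π), A_π, and ρ_π are computable in closed form. The following is a minimal sketch, assuming NumPy; the random MDP, the helper names (evaluate, softmax, nS, nA), and the softmax tabular parameterization are illustrative choices, not from the thesis. It verifies the identity (4) exactly and checks the first-order matching (6) with finite differences.

    import numpy as np

    rng = np.random.default_rng(0)
    nS, nA, gamma = 5, 3, 0.9                       # small random MDP (illustrative sizes)
    P = rng.random((nS, nA, nS))
    P /= P.sum(axis=2, keepdims=True)               # transition probabilities P(s' | s, a)
    R = rng.random((nS, nA))                        # reward r(s, a)
    rho0 = np.full(nS, 1.0 / nS)                    # initial state distribution

    def evaluate(pi):
        """Return eta(pi), A_pi, and the discounted visitation rho_pi(s) = sum_t gamma^t P(s_t = s)."""
        P_pi = np.einsum('sa,sat->st', pi, P)       # state-to-state transition matrix under pi
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, (pi * R).sum(axis=1))
        A = R + gamma * P @ V - V[:, None]          # advantage A_pi(s, a) = Q_pi(s, a) - V_pi(s)
        rho = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)
        return rho0 @ V, A, rho

    def softmax(theta):
        z = np.exp(theta - theta.max(axis=1, keepdims=True))
        return z / z.sum(axis=1, keepdims=True)

    # Identity (4): eta(pi_new) = eta(pi) + sum_s rho_pi_new(s) sum_a pi_new(a|s) A_pi(s, a)
    theta0, theta1 = rng.normal(size=(nS, nA)), rng.normal(size=(nS, nA))
    pi, pi_new = softmax(theta0), softmax(theta1)
    eta_old, A_old, rho_old = evaluate(pi)
    eta_new, _, rho_new = evaluate(pi_new)
    print(eta_new, eta_old + rho_new @ (pi_new * A_old).sum(axis=1))  # equal up to solver precision

    def L_local(theta):
        # Local approximation (5): state visitation frozen at rho_pi of the current policy.
        return eta_old + rho_old @ (softmax(theta) * A_old).sum(axis=1)

    def eta_of(theta):
        return evaluate(softmax(theta))[0]

    # First-order matching (6): directional derivatives of L and eta agree at theta0.
    eps, d = 1e-5, rng.normal(size=(nS, nA))        # random direction in parameter space
    print((eta_of(theta0 + eps * d) - eta_of(theta0 - eps * d)) / (2 * eps),
          (L_local(theta0 + eps * d) - L_local(theta0 - eps * d)) / (2 * eps))

The first printed pair matches to machine precision, since (4) is an exact identity; the second pair matches only approximately, reflecting that L_π agrees with η to first order at θ0 but diverges from it away from θ0.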