3.2 Preliminaries

where s0 ∼ ρ0 and the actions are chosen according to π. We can rewrite Equation (3) with a sum over states instead of timesteps:

\[
\begin{aligned}
\eta(\tilde{\pi}) &= \eta(\pi) + \sum_{t=0}^{\infty} \sum_{s} P(s_t = s \mid \tilde{\pi}) \sum_{a} \tilde{\pi}(a \mid s)\, \gamma^t A_{\pi}(s, a) \\
&= \eta(\pi) + \sum_{s} \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi}) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a) \\
&= \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a). \qquad (4)
\end{aligned}
\]

This equation implies that any policy update π → π̃ that has a nonnegative expected advantage at every state s, i.e., \(\sum_a \tilde{\pi}(a \mid s) A_{\pi}(s, a) \geq 0\), is guaranteed to increase the policy performance η, or leave it constant in the case that the expected advantage is zero everywhere. This implies the classic result that the update performed by exact policy iteration, which uses the deterministic policy \(\tilde{\pi}(s) = \arg\max_a A_{\pi}(s, a)\), improves the policy if there is at least one state-action pair with a positive advantage value and nonzero state visitation probability; otherwise the algorithm has converged to the optimal policy. However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation error, that there will be some states s for which the expected advantage is negative, that is, \(\sum_a \tilde{\pi}(a \mid s) A_{\pi}(s, a) < 0\). The complex dependency of \(\rho_{\tilde{\pi}}(s)\) on π̃ makes Equation (4) difficult to optimize directly. Instead, we introduce the following local approximation to η:

\[
L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a). \qquad (5)
\]

Note that \(L_{\pi}\) uses the visitation frequency \(\rho_{\pi}\) rather than \(\rho_{\tilde{\pi}}\), ignoring changes in state visitation density due to changes in the policy. However, if we have a parameterized policy \(\pi_{\theta}\), where \(\pi_{\theta}(a \mid s)\) is a differentiable function of the parameter vector θ, then \(L_{\pi}\) matches η to first order (see Kakade and Langford [KL02]). That is, for any parameter value \(\theta_0\),

\[
L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad
\nabla_{\theta} L_{\pi_{\theta_0}}(\pi_{\theta})\big|_{\theta = \theta_0} = \nabla_{\theta}\, \eta(\pi_{\theta})\big|_{\theta = \theta_0}. \qquad (6)
\]

Equation (6) implies that a sufficiently small step \(\pi_{\theta_0} \to \tilde{\pi}\) that improves \(L_{\pi_{\theta_0}}\) will also improve η, but does not give us any guidance on how big of a step to take.
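The identities above can be checked numerically on a small finite MDP: Equation (4) is an exact identity, while \(L_{\pi}\) (Equation (5)) only agrees with η to first order around π (Equation (6)). The sketch below is illustrative, not from the thesis; the random MDP, the mixture-policy step, and all variable names are assumptions made for the demonstration.

```python
# Numerical check of the policy-performance identity (Eq. 4) and the
# local approximation L_pi (Eq. 5) on a small random MDP.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # P(s' | s, a)
R = rng.random((nS, nA))                                          # r(s, a)
rho0 = np.full(nS, 1.0 / nS)                                      # start-state dist.

def random_policy():
    pi = rng.random((nS, nA))
    return pi / pi.sum(axis=1, keepdims=True)

def value(pi):
    """Solve V = r_pi + gamma * P_pi V for the policy's value function."""
    r_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

def eta(pi):
    """Expected discounted return eta(pi) = E_{s0 ~ rho0}[V_pi(s0)]."""
    return rho0 @ value(pi)

def advantage(pi):
    V = value(pi)
    Q = R + gamma * np.einsum("sat,t->sa", P, V)
    return Q - V[:, None]

def visitation(pi):
    """Unnormalized discounted visitation rho_pi(s) = sum_t gamma^t P(s_t = s)."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)

pi, pi_new = random_policy(), random_policy()
A = advantage(pi)

# Eq. (4): exact identity, using the *new* policy's visitation rho_{pi_new}.
rhs = eta(pi) + visitation(pi_new) @ (pi_new * A).sum(axis=1)
assert np.isclose(eta(pi_new), rhs)

# Eq. (5): local approximation, using the *old* policy's visitation rho_pi.
L = eta(pi) + visitation(pi) @ (pi_new * A).sum(axis=1)
print(f"eta(pi_new) = {eta(pi_new):.4f}, L_pi(pi_new) = {L:.4f}")

# Eq. (6): L_pi matches eta to first order around pi, so along a small step
# of size alpha toward pi_new the gap |eta - L_pi| shrinks like alpha^2.
def gap(alpha):
    pi_a = (1 - alpha) * pi + alpha * pi_new
    return abs(eta(pi_a) - (eta(pi) + visitation(pi) @ (pi_a * A).sum(axis=1)))

assert gap(1e-3) < 1e-2 * gap(1e-1)
```

The mixture step \((1-\alpha)\pi + \alpha\tilde{\pi}\) is one convenient differentiable parameterization with \(\theta = \alpha\); any smooth parameterization would exhibit the same first-order agreement.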
Source: OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS (thesis-optimizing-deep-learning.pdf)