OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS
To address this issue, Kakade and Langford [KL02] proposed a policy updating scheme called conservative policy iteration, for which they could provide explicit lower bounds on the improvement of η. To define the conservative policy iteration update, let πold denote the current policy, and let π′ = argmax_π′ Lπold(π′). The new policy πnew was defined to be the following mixture:

    πnew(a|s) = (1 − α) πold(a|s) + α π′(a|s).    (7)

Kakade and Langford proved the following result for this update:

    η(πnew) ≥ Lπold(πnew) − 2εγα² / ((1 − γ(1 − α))(1 − γ)),
    where ε = max_s |E_{a∼π′(a|s)}[Aπ(s, a)]|.    (8)

Since α, γ ∈ [0, 1], Equation (8) implies the following simpler bound, which we refer to in the next section:

    η(πnew) ≥ Lπold(πnew) − 2εγα² / (1 − γ)².    (9)

The simpler bound is only slightly weaker when α ≪ 1, which is typically the case in the conservative policy iteration method of Kakade and Langford [KL02]. Note, however, that so far this bound only applies to mixture policies generated by Equation (7). This policy class is unwieldy and restrictive in practice, and it is desirable for a practical policy update scheme to be applicable to all general stochastic policy classes.

3.3 monotonic improvement guarantee for general stochastic policies

Equation (8), which applies to conservative policy iteration, implies that a policy update that improves the right-hand side is guaranteed to improve the true performance η. Our principal theoretical result is that the policy improvement bound in Equation (8) can be extended to general stochastic policies, rather than just mixture policies, by replacing α with a distance measure between π and π̃. Since mixture policies are rarely used in practice, this result is crucial for extending the improvement guarantee to practical problems.
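As a concrete illustration, the mixture update of Equation (7) and the simplified bound of Equation (9) can be sketched numerically for a tabular policy over discrete states and actions. The function and variable names below are illustrative, not from the text:

```python
import numpy as np

def mixture_policy(pi_old, pi_prime, alpha):
    """Conservative policy iteration update, Equation (7):
    pi_new(a|s) = (1 - alpha) * pi_old(a|s) + alpha * pi_prime(a|s).
    pi_old, pi_prime: arrays of shape (num_states, num_actions) whose
    rows are probability distributions over actions."""
    return (1.0 - alpha) * pi_old + alpha * pi_prime

def improvement_lower_bound(surrogate, eps, gamma, alpha):
    """Simplified lower bound of Equation (9):
    eta(pi_new) >= L_piold(pi_new) - 2*eps*gamma*alpha**2 / (1 - gamma)**2,
    where `surrogate` stands for L_piold(pi_new) and `eps` for the
    maximal expected advantage magnitude."""
    return surrogate - 2.0 * eps * gamma * alpha ** 2 / (1.0 - gamma) ** 2

# Example: a single-state, two-action policy mixed with alpha = 0.2.
pi_old = np.array([[0.5, 0.5]])
pi_prime = np.array([[1.0, 0.0]])
pi_new = mixture_policy(pi_old, pi_prime, 0.2)  # rows still sum to 1
```

Note that because the penalty term scales as α², a small mixing coefficient keeps the guaranteed performance close to the surrogate value, which is why the bound is useful despite the 1/(1 − γ)² constant.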
The particular distance measure we use is the total variation divergence, which is defined by

    D_TV(p ∥ q) = (1/2) Σ_i |p_i − q_i|

for discrete probability distributions p, q.¹

¹ Our result is straightforward to extend to continuous states and actions by replacing the sums with integrals.
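A minimal sketch of this definition in code, assuming finite discrete distributions represented as arrays (the helper name is hypothetical):

```python
import numpy as np

def tv_divergence(p, q):
    """Total variation divergence for discrete distributions:
    D_TV(p || q) = (1/2) * sum_i |p_i - q_i|."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# Distributions with disjoint support attain the maximum value of 1.
d = tv_divergence([1.0, 0.0], [0.0, 1.0])  # -> 1.0
```

The factor of 1/2 normalizes the divergence to lie in [0, 1], so identical distributions give 0 and disjoint ones give 1.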