OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

To address this issue, Kakade and Langford [KL02] proposed a policy updating scheme called conservative policy iteration, for which they could provide explicit lower bounds on the improvement of η. To define the conservative policy iteration update, let π_old denote the current policy, and let π′ = argmax_{π′} L_{π_old}(π′). The new policy π_new was defined to be the following mixture:

\pi_{\mathrm{new}}(a \mid s) = (1 - \alpha)\,\pi_{\mathrm{old}}(a \mid s) + \alpha\,\pi'(a \mid s).    (7)

Kakade and Langford proved the following result for this update:

\eta(\pi_{\mathrm{new}}) \geq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{2\varepsilon\gamma}{(1 - \gamma(1 - \alpha))(1 - \gamma)}\,\alpha^2,
\quad \text{where } \varepsilon = \max_s \left| \mathbb{E}_{a \sim \pi'(a \mid s)}\!\left[ A_{\pi}(s, a) \right] \right|.    (8)

Since α, γ ∈ [0, 1], Equation (8) implies the following simpler bound, which we refer to in the next section:

\eta(\pi_{\mathrm{new}}) \geq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{2\varepsilon\gamma}{(1 - \gamma)^2}\,\alpha^2.    (9)

The simpler bound is only slightly weaker when α ≪ 1, which is typically the case in the conservative policy iteration method of Kakade and Langford [KL02]. Note, however, that so far this bound only applies to mixture policies generated by Equation (7). This policy class is unwieldy and restrictive in practice, and it is desirable for a practical policy update scheme to be applicable to all general stochastic policy classes.

3.3 monotonic improvement guarantee for general stochastic policies

Equation (8), which applies to conservative policy iteration, implies that a policy update that improves the right-hand side is guaranteed to improve the true performance η. Our principal theoretical result is that the policy improvement bound in Equation (8) can be extended to general stochastic policies, rather than just mixture policies, by replacing α with a distance measure between π and π̃. Since mixture policies are rarely used in practice, this result is crucial for extending the improvement guarantee to practical problems. The particular distance measure we use is the total variation divergence, which is defined by D_TV(p ∥ q) = \frac{1}{2} \sum_i |p_i - q_i| for discrete probability distributions p, q.¹ Define

¹ Our result is straightforward to extend to continuous states and actions by replacing the sums with integrals.
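To make the quantities above concrete, the following is a minimal numerical sketch in Python (not part of the thesis; the function names and all numeric values are hypothetical placeholders) of the mixture update of Equation (7), the simpler lower bound of Equation (9), and the total variation divergence, for a single state with three discrete actions.

import numpy as np

def mixture_policy(pi_old, pi_prime, alpha):
    # Equation (7): pi_new(a|s) = (1 - alpha) * pi_old(a|s) + alpha * pi_prime(a|s)
    return (1.0 - alpha) * pi_old + alpha * pi_prime

def simple_lower_bound(L_value, epsilon, gamma, alpha):
    # Equation (9): eta(pi_new) >= L_piold(pi_new) - 2*eps*gamma/(1-gamma)^2 * alpha^2
    return L_value - 2.0 * epsilon * gamma / (1.0 - gamma) ** 2 * alpha ** 2

def tv_divergence(p, q):
    # Total variation divergence: D_TV(p || q) = 0.5 * sum_i |p_i - q_i|
    return 0.5 * np.abs(p - q).sum()

# Placeholder action distributions at a single state s.
pi_old   = np.array([0.5, 0.3, 0.2])   # current policy pi_old(.|s)
pi_prime = np.array([0.1, 0.2, 0.7])   # policy maximizing the surrogate L_piold
alpha, gamma = 0.1, 0.99

pi_new = mixture_policy(pi_old, pi_prime, alpha)
print("pi_new:", pi_new)
print("D_TV(pi_old || pi_new):", tv_divergence(pi_old, pi_new))

# With placeholder estimates of the surrogate value and epsilon, Equation (9)
# gives a guaranteed lower bound on the true performance eta(pi_new).
L_value, epsilon = 1.0, 0.5
print("lower bound on eta(pi_new):", simple_lower_bound(L_value, epsilon, gamma, alpha))

Note that for the small α used here the mixture policy stays close to π_old, which is why the penalty term, proportional to α², is small and the bound is only slightly weaker than the surrogate value itself.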
