OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

3.3 Monotonic improvement guarantee for general stochastic policies

... we define $D^{\max}_{\mathrm{TV}}(\pi, \tilde{\pi})$ as

$$D^{\max}_{\mathrm{TV}}(\pi, \tilde{\pi}) = \max_s D_{\mathrm{TV}}\big(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\big). \tag{10}$$

Proposition 1. Let $\alpha = D^{\max}_{\mathrm{TV}}(\pi_{\mathrm{old}}, \pi_{\mathrm{new}})$. Then Equation (9) holds.

We provide two proofs in the appendix. The first proof extends Kakade and Langford's result, using the fact that random variables from two distributions with total variation divergence less than $\alpha$ can be coupled so that they are equal with probability $1 - \alpha$. The second proof uses perturbation theory to prove a slightly stronger version of Equation (9), with a more favorable definition of $\varepsilon$ that depends on $\tilde{\pi}$.

Next, we note the following relationship between the total variation divergence and the KL divergence (Pollard [Pol00], Ch. 3): $D_{\mathrm{TV}}(p \,\|\, q)^2 \leq D_{\mathrm{KL}}(p \,\|\, q)$. Let $D^{\max}_{\mathrm{KL}}(\pi, \tilde{\pi}) = \max_s D_{\mathrm{KL}}(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s))$. The following bound then follows directly from Equation (9):

$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C\, D^{\max}_{\mathrm{KL}}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{2\varepsilon\gamma}{(1-\gamma)^2}. \tag{11}$$

Algorithm 3 describes an approximate policy iteration scheme based on the policy improvement bound in Equation (11). Note that for now, we assume exact evaluation of the advantage values $A^{\pi}$. Algorithm 3 uses a constant $\varepsilon' \leq \varepsilon$ that is simpler to describe in terms of measurable quantities.

It follows from Equation (11) that Algorithm 3 is guaranteed to generate a monotonically improving sequence of policies $\eta(\pi_0) \leq \eta(\pi_1) \leq \eta(\pi_2) \leq \dots$. To see this, let $M_i(\pi) = L_{\pi_i}(\pi) - C\, D^{\max}_{\mathrm{KL}}(\pi_i, \pi)$. Then

$$\eta(\pi_{i+1}) \geq M_i(\pi_{i+1}) \quad \text{by Equation (11)},$$
$$\eta(\pi_i) = M_i(\pi_i), \quad \text{therefore,}$$
$$\eta(\pi_{i+1}) - \eta(\pi_i) \geq M_i(\pi_{i+1}) - M_i(\pi_i). \tag{12}$$

Thus, by maximizing $M_i$ at each iteration, we guarantee that the true objective $\eta$ is non-decreasing. This algorithm is a type of minorization-maximization (MM) algorithm [HL04], a class of methods that also includes expectation maximization. In the terminology of MM algorithms, $M_i$ is the surrogate function that minorizes $\eta$, with equality at $\pi_i$.
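As a quick numerical illustration of the inequality $D_{\mathrm{TV}}(p \,\|\, q)^2 \leq D_{\mathrm{KL}}(p \,\|\, q)$ and of the penalized lower bound in Equation (11), the following minimal Python sketch (not part of the thesis; the sampled action distributions, the placeholder surrogate value L_pi_tilde, and the constants gamma and eps are illustrative assumptions) computes both divergences for a few discrete distributions and evaluates $L_{\pi}(\tilde{\pi}) - C\, D^{\max}_{\mathrm{KL}}(\pi, \tilde{\pi})$.

# Sketch (not from the thesis): check D_TV(p||q)^2 <= D_KL(p||q) for discrete
# action distributions and evaluate the lower bound of Equation (11).
import numpy as np

def tv_divergence(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    # KL divergence D_KL(p || q); assumes q > 0 wherever p > 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical per-state action distributions for pi and pi_tilde (3 states, 4 actions).
rng = np.random.default_rng(0)
pi       = rng.dirichlet(np.ones(4), size=3)
pi_tilde = rng.dirichlet(np.ones(4), size=3)

for p, q in zip(pi, pi_tilde):
    # Relationship between TV and KL divergence (Pollard [Pol00], Ch. 3).
    assert tv_divergence(p, q) ** 2 <= kl_divergence(p, q) + 1e-12

# Lower bound of Equation (11): eta(pi_tilde) >= L_pi(pi_tilde) - C * D_KL^max,
# with C = 2 * eps * gamma / (1 - gamma)^2.  gamma, eps, L_pi_tilde are placeholders.
gamma, eps = 0.99, 1.0
L_pi_tilde = 0.5
d_kl_max = max(kl_divergence(p, q) for p, q in zip(pi, pi_tilde))
C = 2 * eps * gamma / (1 - gamma) ** 2
lower_bound = L_pi_tilde - C * d_kl_max
print(f"D_KL^max = {d_kl_max:.4f}, lower bound on eta(pi_tilde) = {lower_bound:.4f}")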
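To make the MM viewpoint concrete, here is a schematic sketch of the surrogate-maximization loop. It is not the thesis's Algorithm 3 itself; the callables surrogate_L, max_kl_divergence, and maximize are hypothetical placeholders for the exact advantage-based surrogate, the maximum KL divergence over states, and an exact optimizer over the policy class. Each iteration maximizes $M_i(\pi) = L_{\pi_i}(\pi) - C\, D^{\max}_{\mathrm{KL}}(\pi_i, \pi)$, which by Equation (12) cannot decrease $\eta$.

# Schematic of the surrogate-maximization loop (hypothetical sketch; the functions
# surrogate_L, max_kl_divergence, and maximize are placeholders, not thesis code).
def policy_iteration_mm(pi0, num_iters, C, surrogate_L, max_kl_divergence, maximize):
    """Minorization-maximization scheme in outline.

    surrogate_L(pi_i, pi):        local approximation L_{pi_i}(pi) to eta(pi)
    max_kl_divergence(pi_i, pi):  D_KL^max(pi_i, pi), the maximum over states
    maximize(f):                  returns an exact maximizer of f over the policy class
    """
    pi_i = pi0
    for i in range(num_iters):
        # Surrogate M_i minorizes eta: M_i(pi) <= eta(pi), with M_i(pi_i) = eta(pi_i).
        def M_i(pi):
            return surrogate_L(pi_i, pi) - C * max_kl_divergence(pi_i, pi)
        # Maximizing M_i cannot decrease the true objective eta (Equation (12)).
        pi_i = maximize(M_i)
    return pi_i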
