OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

Text from PDF Page: 046

3.11 Perturbation Theory Proof of Policy Improvement Bound

where

$$\epsilon = \max_s \frac{\left|\sum_a \big(\tilde\pi(a \mid s)\, Q^\pi(s,a) - \pi(a \mid s)\, Q^\pi(s,a)\big)\right|}{\sum_a \big|\tilde\pi(a \mid s) - \pi(a \mid s)\big|} \tag{20}$$

Note that the $\epsilon$ defined in Equation (20) is less than or equal to the $\epsilon$ defined in Proposition 1, so Proposition 1a is slightly stronger.

Proof. Let $G = 1 + \gamma P_\pi + (\gamma P_\pi)^2 + \dots = (1 - \gamma P_\pi)^{-1}$, and similarly let $\tilde G = 1 + \gamma P_{\tilde\pi} + (\gamma P_{\tilde\pi})^2 + \dots = (1 - \gamma P_{\tilde\pi})^{-1}$. We will use the convention that $\rho$ (a density on state space) is a vector and $r$ (a reward function on state space) is a dual vector (i.e., a linear functional on vectors), so that $r\rho$ is a scalar meaning the expected reward under density $\rho$. Note that $\eta(\pi) = r G \rho_0$ and $\eta(\tilde\pi) = r \tilde G \rho_0$. Let $\Delta = P_{\tilde\pi} - P_\pi$. We want to bound

$$\eta(\tilde\pi) - \eta(\pi) = r(\tilde G - G)\rho_0.$$

We start with some standard perturbation theory manipulations:

$$G^{-1} - \tilde G^{-1} = (1 - \gamma P_\pi) - (1 - \gamma P_{\tilde\pi}) = \gamma\Delta.$$

Left-multiplying by $G$ and right-multiplying by $\tilde G$ gives

$$\tilde G - G = \gamma G \Delta \tilde G, \qquad \text{i.e.,} \qquad \tilde G = G + \gamma G \Delta \tilde G.$$

Substituting the right-hand side into $\tilde G$ once more gives

$$\tilde G = G + \gamma G \Delta G + \gamma^2 G \Delta G \Delta \tilde G.$$

So we have

$$\eta(\tilde\pi) - \eta(\pi) = r(\tilde G - G)\rho_0 = \gamma\, r G \Delta G \rho_0 + \gamma^2\, r G \Delta G \Delta \tilde G \rho_0.$$

Let us first consider the leading term $\gamma\, r G \Delta G \rho_0$. Note that $rG = v$, i.e., the infinite-horizon state-value function (as a dual vector). Also note that $G\rho_0 = \rho_\pi$, the discounted state visitation density. Thus we can write $\gamma\, r G \Delta G \rho_0 = \gamma\, v \Delta \rho_\pi$. We will show that this expression equals the expected advantage $L_\pi(\tilde\pi) - L_\pi(\pi)$:

$$\begin{aligned}
L_\pi(\tilde\pi) - L_\pi(\pi)
&= \sum_s \rho_\pi(s) \sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) A^\pi(s,a) \\
&= \sum_s \rho_\pi(s) \sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) \Big[ r(s) + \sum_{s'} p(s' \mid s,a)\,\gamma v(s') - v(s) \Big] \\
&= \sum_s \rho_\pi(s) \sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) \sum_{s'} p(s' \mid s,a)\,\gamma v(s') \\
&= \sum_s \rho_\pi(s) \sum_{s'} \big(p_{\tilde\pi}(s' \mid s) - p_\pi(s' \mid s)\big)\,\gamma v(s') \\
&= \gamma\, v \Delta \rho_\pi,
\end{aligned}$$

where the third line uses the fact that $\sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) = 0$, so the terms $r(s)$ and $-v(s)$, which do not depend on $a$, drop out.
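The algebra above can be checked numerically on a small randomly generated tabular MDP. The following sketch (all names hypothetical; it assumes tabular dynamics $p(s' \mid s, a)$ and a state-only reward $r(s)$, matching the conventions in the text) verifies the identity $\tilde G - G = \gamma G \Delta \tilde G$, the resulting decomposition of $\eta(\tilde\pi) - \eta(\pi)$, and the equality $\gamma v \Delta \rho_\pi = L_\pi(\tilde\pi) - L_\pi(\pi)$:

```python
import numpy as np

# Hypothetical toy MDP: n states, m actions, random dynamics and rewards.
rng = np.random.default_rng(0)
n, m, gamma = 5, 3, 0.9

p = rng.random((n, m, n)); p /= p.sum(axis=2, keepdims=True)  # p(s'|s,a)
r = rng.random(n)              # state reward r(s), treated as a dual (row) vector
rho0 = np.ones(n) / n          # initial state density rho_0

pi   = rng.random((n, m)); pi   /= pi.sum(axis=1, keepdims=True)
pi_t = rng.random((n, m)); pi_t /= pi_t.sum(axis=1, keepdims=True)  # pi~

def P_of(policy):
    # Transition operator on densities: P[s', s] = sum_a policy(a|s) p(s'|s,a)
    return np.einsum('sa,san->ns', policy, p)

P, P_t = P_of(pi), P_of(pi_t)
G   = np.linalg.inv(np.eye(n) - gamma * P)    # G = (1 - gamma P_pi)^-1
G_t = np.linalg.inv(np.eye(n) - gamma * P_t)  # G~
Delta = P_t - P

# Identity: G~ - G = gamma G Delta G~
assert np.allclose(G_t - G, gamma * G @ Delta @ G_t)

# eta(pi~) - eta(pi) = gamma r G Delta G rho0 + gamma^2 r G Delta G Delta G~ rho0
lhs = r @ (G_t - G) @ rho0
rhs = (gamma * r @ G @ Delta @ G @ rho0
       + gamma**2 * r @ G @ Delta @ G @ Delta @ G_t @ rho0)
assert np.allclose(lhs, rhs)

# Leading term equals the expected advantage: gamma v Delta rho_pi = L(pi~) - L(pi)
v = r @ G                   # state-value function v = rG (dual vector)
rho_pi = G @ rho0           # discounted visitation density rho_pi = G rho0
A = r[:, None] + gamma * np.einsum('san,n->sa', p, v) - v[:, None]  # A_pi(s,a)
lead = gamma * v @ Delta @ rho_pi
adv  = np.sum(rho_pi[:, None] * (pi_t - pi) * A)
assert np.allclose(lead, adv)
print("all identities verified")
```

The checks hold to machine precision for any random seed, since each step in the derivation is an exact matrix identity rather than an approximation; only the final bound on the second-order term is an inequality.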
