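Proof.

\[
\begin{aligned}
\eta(\tilde\pi) - L_\pi(\tilde\pi)
  &= \mathbb{E}_{\tau\sim\tilde\pi}\left[\sum_{t=0}^{\infty} \gamma^t \bar{A}^{\pi,\tilde\pi}(s_t)\right]
   - \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty} \gamma^t \bar{A}^{\pi,\tilde\pi}(s_t)\right] \\
  &= \sum_{t=0}^{\infty} \gamma^t \left( \mathbb{E}_{s_t\sim\tilde\pi}\left[\bar{A}^{\pi,\tilde\pi}(s_t)\right]
   - \mathbb{E}_{s_t\sim\pi}\left[\bar{A}^{\pi,\tilde\pi}(s_t)\right] \right) \\
\left|\eta(\tilde\pi) - L_\pi(\tilde\pi)\right|
  &\le \sum_{t=0}^{\infty} \gamma^t \left| \mathbb{E}_{s_t\sim\tilde\pi}\left[\bar{A}^{\pi,\tilde\pi}(s_t)\right]
   - \mathbb{E}_{s_t\sim\pi}\left[\bar{A}^{\pi,\tilde\pi}(s_t)\right] \right| \\
  &\le \sum_{t=0}^{\infty} \gamma^t \cdot 2\epsilon \cdot \left(1 - (1-\alpha)^t\right) \\
  &= 2\epsilon \left( \frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha)} \right) \\
  &= \frac{2\epsilon\gamma\alpha}{(1-\gamma)\left(1-\gamma(1-\alpha)\right)}
\end{aligned}
\]

The second-to-last line sums the two geometric series $\sum_t \gamma^t$ and $\sum_t \left(\gamma(1-\alpha)\right)^t$.

Last, we need to use the correspondence between total variation divergence and coupled random variables: Suppose $p_X$ and $p_Y$ are distributions with $D_{TV}(p_X \,\|\, p_Y) = \alpha$. Then there exists a joint distribution $(X, Y)$ whose marginals are $p_X, p_Y$, for which $X = Y$ with probability $1 - \alpha$. See [LPW09], Proposition 4.7.

It follows that if we have two policies $\pi$ and $\tilde\pi$ such that $\max_s D_{TV}(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s)) \le \alpha$, then we can define an $\alpha$-coupled policy pair $(\pi, \tilde\pi)$ with appropriate marginals. Proposition 1 follows.

3.11 Perturbation theory proof of policy improvement bound

We also provide a different proof of Proposition 1 using perturbation theory. This method makes it possible to provide slightly stronger bounds.

Proposition 1a. Let $\alpha$ denote the maximum total variation divergence between stochastic policies $\pi$ and $\tilde\pi$, as defined in Equation (10), and let $L$ be defined as in Equation (5). Then

\[
\eta(\tilde\pi) \ge L(\tilde\pi) - \alpha^2 \, \frac{2\gamma\epsilon}{(1-\gamma)^2}
\]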
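The coupling fact used at the end of the proof of Proposition 1 can be made concrete for finite distributions. The following is a minimal sketch, not taken from the thesis: it builds one standard coupling (shared mass on the diagonal, leftover mass spread as a product) and checks the marginals and the probability that X = Y. The function names `total_variation` and `maximal_coupling` and the example distributions are illustrative assumptions.

```python
from fractions import Fraction


def total_variation(p, q):
    """Total variation divergence: half the L1 distance between p and q."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2


def maximal_coupling(p, q):
    """Return a joint table J[i][j] with row marginals p, column marginals q,
    and diagonal mass sum_i min(p[i], q[i]) = 1 - D_TV(p, q)."""
    n = len(p)
    overlap = [min(a, b) for a, b in zip(p, q)]
    alpha = 1 - sum(overlap)  # equals D_TV(p, q)
    joint = [[Fraction(0)] * n for _ in range(n)]
    # Put the shared mass on the diagonal, where X = Y.
    for i in range(n):
        joint[i][i] = overlap[i]
    # Spread the leftover mass as a product of the normalized residuals.
    # The residuals never overlap, so this adds nothing to the diagonal.
    if alpha > 0:
        for i in range(n):
            for j in range(n):
                joint[i][j] += (p[i] - overlap[i]) * (q[j] - overlap[j]) / alpha
    return joint


# Illustrative distributions (assumed, not from the source).
p_x = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
p_y = [Fraction(3, 10), Fraction(3, 10), Fraction(2, 5)]

alpha = total_variation(p_x, p_y)
joint = maximal_coupling(p_x, p_y)
row_marginals = [sum(row) for row in joint]
col_marginals = [sum(col) for col in zip(*joint)]
prob_equal = sum(joint[i][i] for i in range(len(p_x)))

print("alpha =", alpha)
print("P(X = Y) =", prob_equal)
```

Exact rational arithmetic via `fractions.Fraction` is used so that the marginal and probability checks hold exactly rather than up to floating-point error.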
Source: OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS