3.10 Proof of Policy Improvement Bound

Proof. First note that $A_\pi(s, a) = \mathbb{E}_{s' \sim P(s' \mid s, a)}\left[ r(s) + \gamma V_\pi(s') - V_\pi(s) \right]$. Therefore,

$$
\begin{aligned}
\mathbb{E}_{\tau \mid \tilde{\pi}}\left[ \sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t) \right]
&= \mathbb{E}_{\tau \mid \tilde{\pi}}\left[ \sum_{t=0}^{\infty} \gamma^t \bigl( r(s_t) + \gamma V_\pi(s_{t+1}) - V_\pi(s_t) \bigr) \right] \\
&= \mathbb{E}_{\tau \mid \tilde{\pi}}\left[ -V_\pi(s_0) + \sum_{t=0}^{\infty} \gamma^t r(s_t) \right] \\
&= -\mathbb{E}_{s_0}\left[ V_\pi(s_0) \right] + \mathbb{E}_{\tau \mid \tilde{\pi}}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t) \right] \\
&= -\eta(\pi) + \eta(\tilde{\pi}),
\end{aligned}
$$

where the second equality holds because the value terms telescope. Rearranging, the result follows.

Define $\bar{A}_{\pi, \tilde{\pi}}(s)$ to be the expected advantage of $\tilde{\pi}$ over $\pi$ at state $s$:

$$\bar{A}_{\pi, \tilde{\pi}}(s) = \mathbb{E}_{a \sim \tilde{\pi}(\cdot \mid s)}\left[ A_\pi(s, a) \right].$$

Now Lemma 1 can be written as follows:

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[ \sum_{t=0}^{\infty} \gamma^t \bar{A}_{\pi, \tilde{\pi}}(s_t) \right].$$

Note that $L_\pi$ can be written as

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t \bar{A}_{\pi, \tilde{\pi}}(s_t) \right]. \tag{19}$$

The difference in these equations is whether the states are sampled using $\pi$ or $\tilde{\pi}$. To bound the difference between $\eta(\tilde{\pi})$ and $L_\pi(\tilde{\pi})$, we will bound the difference arising from each timestep. To do this, we first need to introduce a measure of how much $\pi$ and $\tilde{\pi}$ agree. Specifically, we'll couple the policies, so that they define a joint distribution over pairs of actions.
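As a concrete sanity check (not part of the original thesis), the sketch below verifies Lemma 1 numerically on a small randomly generated MDP. All names (P, r, mu, value_and_kernel, ...) are illustrative assumptions. It checks the identity $\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\bigl[\sum_t \gamma^t \bar{A}_{\pi,\tilde{\pi}}(s_t)\bigr]$, and the last line also evaluates $L_\pi(\tilde{\pi})$ from Equation (19) to show how it differs when states are sampled from $\pi$ instead of $\tilde{\pi}$.

```python
# Numerical check of Lemma 1 on a small random MDP.
# Everything here is an illustrative sketch, not code from the thesis.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # transition kernel P(s' | s, a)
r = rng.random(S)                   # state-dependent reward r(s), as in the proof
mu = np.full(S, 1.0 / S)            # start-state distribution

def random_policy():
    pi = rng.random((S, A))
    return pi / pi.sum(axis=1, keepdims=True)

pi, pi_tilde = random_policy(), random_policy()

def value_and_kernel(policy):
    # State-to-state kernel under the policy, then solve V = r + gamma * P_pol @ V.
    P_pol = np.einsum('sa,sat->st', policy, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pol, r)
    return V, P_pol

V_pi, P_pi = value_and_kernel(pi)
V_tilde, P_tilde = value_and_kernel(pi_tilde)

Q_pi = r[:, None] + gamma * np.einsum('sat,t->sa', P, V_pi)  # Q_pi(s, a)
A_pi = Q_pi - V_pi[:, None]                                  # advantage A_pi(s, a)
Abar = (pi_tilde * A_pi).sum(axis=1)                         # expected advantage of pi~ over pi

eta, eta_tilde = mu @ V_pi, mu @ V_tilde

# Discounted state-visitation frequencies: rho(s) = sum_t gamma^t Pr(s_t = s).
rho_tilde = np.linalg.solve(np.eye(S) - gamma * P_tilde.T, mu)
rho_pi = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)

print("eta(pi~)              =", eta_tilde)
print("eta(pi) + E_pi~[Abar] =", eta + rho_tilde @ Abar)  # Lemma 1: matches exactly
print("L_pi(pi~)             =", eta + rho_pi @ Abar)     # Eq. (19): generally differs
```

The first two printed values agree to numerical precision, while $L_\pi(\tilde{\pi})$ generally differs because its states are visited under $\pi$; bounding that discrepancy is exactly what the remainder of the proof does.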
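The coupling of $\pi$ and $\tilde{\pi}$ described above can be made concrete with a maximal coupling of the two action distributions at a state: sample the same action from the overlap of $\pi(\cdot \mid s)$ and $\tilde{\pi}(\cdot \mid s)$ when possible, and otherwise sample each action from the leftover mass, in which case the actions necessarily differ. Under this construction the actions disagree with probability exactly the total variation distance between the two distributions. The sketch below is an assumed illustration of such a coupling, not a construction taken from the thesis.

```python
# Maximal coupling of two discrete action distributions p = pi(.|s), q = pi~(.|s):
# the sampled pair (a, a_tilde) has the correct marginals and disagrees with
# probability exactly D_TV(p, q). Illustrative sketch, not thesis code.
import numpy as np

rng = np.random.default_rng(1)

def maximal_coupling(p, q):
    m = np.minimum(p, q)          # overlapping probability mass
    overlap = m.sum()             # = 1 - D_TV(p, q)
    if rng.random() < overlap:
        a = rng.choice(len(p), p=m / overlap)
        return a, a               # agree: both actions drawn from the overlap
    # The residuals (p - m) and (q - m) have disjoint supports, so a != a_tilde.
    a = rng.choice(len(p), p=(p - m) / (1.0 - overlap))
    a_tilde = rng.choice(len(q), p=(q - m) / (1.0 - overlap))
    return a, a_tilde

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.2, 0.4])
tv = 0.5 * np.abs(p - q).sum()
pairs = [maximal_coupling(p, q) for _ in range(100_000)]
disagree = np.mean([a != b for a, b in pairs])
print(f"D_TV = {tv:.3f}, empirical P(a != a_tilde) = {disagree:.3f}")  # ~equal
```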