OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

3.2 Preliminaries

Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution, $r : \mathcal{S} \to \mathbb{R}$ is the reward function, $\rho_0 : \mathcal{S} \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor. Note that this setup differs from that of Chapter 2 due to the discount, which is necessary for the theoretical analysis.

Let $\pi$ denote a stochastic policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, and let $\eta(\pi)$ denote its expected discounted reward:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \dots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right], \quad \text{where } s_0 \sim \rho_0(s_0),\ a_t \sim \pi(a_t \mid s_t),\ s_{t+1} \sim P(s_{t+1} \mid s_t, a_t).$$

We will use the following standard definitions of the state-action value function $Q_\pi$, the value function $V_\pi$, and the advantage function $A_\pi$:

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right],$$

$$V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \dots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right],$$

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s),$$

where $a_t \sim \pi(a_t \mid s_t)$, $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$ for $t \ge 0$.

The following useful identity expresses the expected return of another policy $\tilde{\pi}$ in terms of the advantage over $\pi$, accumulated over timesteps (see Kakade and Langford [KL02] or Appendix 3.10 for proof):

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \dots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right] \tag{3}$$

where the notation $\mathbb{E}_{s_0, a_0, \dots \sim \tilde{\pi}}[\dots]$ indicates that actions are sampled $a_t \sim \tilde{\pi}(\cdot \mid s_t)$. Let $\rho_\pi$ be the (unnormalized) discounted visitation frequencies

$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \dots,$$
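As a rough numerical illustration of these definitions, the sketch below evaluates η, V, Q, A, and ρ on a tiny tabular MDP and checks identity (3). It is not code from the thesis. The two-state, two-action MDP, every number in it, and all function names are assumptions invented for the example. The check groups the expectation in (3) by state, Σ_s ρ_π̃(s) Σ_a π̃(a | s) A_π(s, a), which follows from the definition of the discounted visitation frequencies. Infinite sums are approximated by a long finite iteration, which is reasonable here because γ^t decays geometrically.

```python
# Hypothetical 2-state, 2-action MDP; every number here is made up for illustration.
GAMMA = 0.9
STATES = range(2)
ACTIONS = range(2)
P = [  # P[s][a][s2] = P(s2 | s, a)
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
R = [0.0, 1.0]     # r(s): reward depends on the state only, as in the text
RHO0 = [1.0, 0.0]  # rho_0(s): initial state distribution


def value_function(pi, iters=2000):
    """V_pi via fixed-point iteration of V(s) = r(s) + gamma * E[V(s')]."""
    v = [0.0 for _ in STATES]
    for _ in range(iters):
        v = [
            R[s] + GAMMA * sum(pi[s][a] * P[s][a][s2] * v[s2]
                               for a in ACTIONS for s2 in STATES)
            for s in STATES
        ]
    return v


def advantages(pi):
    """A_pi(s, a) = Q_pi(s, a) - V_pi(s)."""
    v = value_function(pi)
    q = [[R[s] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES)
          for a in ACTIONS] for s in STATES]
    return [[q[s][a] - v[s] for a in ACTIONS] for s in STATES]


def eta(pi):
    """Expected discounted reward: eta(pi) = sum_s rho_0(s) * V_pi(s)."""
    v = value_function(pi)
    return sum(RHO0[s] * v[s] for s in STATES)


def visitation(pi, iters=2000):
    """rho_pi(s) = sum_t gamma^t P(s_t = s), truncated after `iters` steps."""
    dist = list(RHO0)            # distribution of s_t, starting at t = 0
    rho = [0.0 for _ in STATES]
    discount = 1.0
    for _ in range(iters):
        rho = [rho[s] + discount * dist[s] for s in STATES]
        dist = [sum(dist[s] * pi[s][a] * P[s][a][s2]
                    for s in STATES for a in ACTIONS) for s2 in STATES]
        discount *= GAMMA
    return rho


def expected_advantage(pi_new, pi_old):
    """E over trajectories of pi_new of sum_t gamma^t A_{pi_old}(s_t, a_t)."""
    rho = visitation(pi_new)
    adv = advantages(pi_old)
    return sum(rho[s] * pi_new[s][a] * adv[s][a]
               for s in STATES for a in ACTIONS)


pi_old = [[0.5, 0.5], [0.5, 0.5]]  # pi[s][a] = pi(a | s)
pi_new = [[0.1, 0.9], [0.2, 0.8]]

lhs = eta(pi_new)
rhs = eta(pi_old) + expected_advantage(pi_new, pi_old)
print(lhs, rhs)  # the two sides of identity (3) should agree up to rounding
```

Two sanity checks fall out of the same helpers: the expected advantage of a policy against itself is zero, and the unnormalized visitation frequencies sum to 1/(1 − γ).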
