OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS


Algorithm 3: Approximate policy iteration algorithm guaranteeing non-decreasing expected return $\eta$.

    Initialize $\pi_0$.
    for $i = 0, 1, 2, \dots$ until convergence do
        Compute all advantage values $A_{\pi_i}(s, a)$.
        Solve the constrained optimization problem
            $\pi_{i+1} = \arg\max_{\pi} \left[ L_{\pi_i}(\pi) - \tfrac{2\epsilon'\gamma}{(1-\gamma)^2}\, D_{\mathrm{KL}}^{\max}(\pi_i, \pi) \right]$
            where $\epsilon' = \max_s \max_a |A_{\pi}(s, a)|$
            and $L_{\pi_i}(\pi) = \eta(\pi_i) + \sum_s \rho_{\pi_i}(s) \sum_a \pi(a \mid s)\, A_{\pi_i}(s, a)$
    end for

The objective maximized at each iteration is a lower bound on the true performance $\eta$, with equality at $\pi_i$, so each update is guaranteed not to decrease the expected return. This algorithm is also reminiscent of proximal gradient methods and mirror descent. Trust region policy optimization, which we propose in the following section, is an approximation to Algorithm 3, which uses a constraint on the KL divergence rather than a penalty to robustly allow large updates.

3.4 optimization of parameterized policies

In the previous section, we considered the policy optimization problem independently of the parameterization of $\pi$ and under the assumption that the policy can be evaluated at all states. We now describe how to derive a practical algorithm from these theoretical foundations, under finite sample counts and arbitrary parameterizations.

Since we consider parameterized policies $\pi_\theta(a \mid s)$ with parameter vector $\theta$, we will overload our previous notation to use functions of $\theta$ rather than $\pi$, e.g. $\eta(\theta) := \eta(\pi_\theta)$, $L_\theta(\tilde{\theta}) := L_{\pi_\theta}(\pi_{\tilde{\theta}})$, and $D_{\mathrm{KL}}(\theta \,\|\, \tilde{\theta}) := D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\tilde{\theta}})$. We will use $\theta_{\mathrm{old}}$ to denote the previous policy parameters that we want to improve upon.

The preceding section showed that $\eta(\theta) \geq L_{\theta_{\mathrm{old}}}(\theta) - C\, D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta)$, with equality at $\theta = \theta_{\mathrm{old}}$. Thus, by performing the following maximization, we are guaranteed to improve the true objective $\eta$:

    $\mathrm{maximize}_{\theta}\; \left[ L_{\theta_{\mathrm{old}}}(\theta) - C\, D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta) \right].$
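As a concrete illustration of the penalized objective in Algorithm 3 and of the maximization over $\theta$ above, the following Python/NumPy sketch builds a small synthetic tabular problem and ascends $L_{\theta_{\mathrm{old}}}(\theta) - C\, D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta)$ for a softmax-parameterized policy. Everything here is an assumption made for illustration only: the problem sizes, the randomly generated visitation frequencies and advantages, the softmax parameterization, and the finite-difference optimizer are not prescribed by the text, and a practical implementation would estimate $\rho_{\pi_i}$ and the advantages from samples.

import numpy as np

# A minimal sketch, assuming a small synthetic MDP. All quantities below
# (state/action counts, discount, visitation frequencies, advantages) are
# hypothetical stand-ins; in Algorithm 3 they would come from evaluating
# the current policy pi_i.
n_states, n_actions = 4, 3
gamma = 0.9
rng = np.random.default_rng(0)
rho_old = rng.random(n_states)                     # rho_{pi_i}(s), discounted visitation frequencies
adv = rng.standard_normal((n_states, n_actions))   # A_{pi_i}(s, a)
adv -= adv.mean(axis=1, keepdims=True)             # advantages average to zero under the (uniform) old policy
eta_old = 0.0                                      # eta(pi_i); an additive constant in the objective
pi_old = np.full((n_states, n_actions), 1.0 / n_actions)  # pi_i(a|s), here a uniform policy

eps_prime = np.max(np.abs(adv))                    # eps' = max_s max_a |A(s, a)|
C = 2.0 * eps_prime * gamma / (1.0 - gamma) ** 2   # penalty coefficient from the bound

def surrogate_L(pi):
    # L_{pi_i}(pi) = eta(pi_i) + sum_s rho_{pi_i}(s) sum_a pi(a|s) A_{pi_i}(s, a)
    return eta_old + np.sum(rho_old[:, None] * pi * adv)

def max_kl(pi):
    # D_KL^max(pi_i, pi) = max_s KL(pi_i(.|s) || pi(.|s))
    return np.max(np.sum(pi_old * np.log(pi_old / pi), axis=1))

def lower_bound(pi):
    # Penalized surrogate: a lower bound on eta(pi), with equality at pi = pi_i
    return surrogate_L(pi) - C * max_kl(pi)

def policy_from(theta):
    # Assumed parameterization: pi_theta(a|s) = softmax(theta[s, :])
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def objective(theta):
    return lower_bound(policy_from(theta))

# maximize_theta [ L_{theta_old}(theta) - C * D_KL^max(theta_old, theta) ]
# by naive finite-difference gradient ascent, starting from theta_old.
theta_old = np.zeros((n_states, n_actions))        # softmax(0) gives the uniform policy pi_i
theta = theta_old.copy()
step, h = 1e-2, 1e-5
for _ in range(200):
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):           # sketch only: one objective evaluation per parameter
        bumped = theta.copy()
        bumped[idx] += h
        grad[idx] = (objective(bumped) - objective(theta)) / h
    theta += step * grad

print("lower bound at theta_old :", objective(theta_old))   # equals eta(pi_i) = 0 here
print("lower bound after ascent :", objective(theta))       # any increase is a guaranteed improvement in eta

Note that with the theoretically motivated coefficient $C = 2\epsilon'\gamma/(1-\gamma)^2$ the KL penalty dominates and the ascent moves only a short distance from $\theta_{\mathrm{old}}$; this is consistent with the motivation given above for trust region policy optimization, which replaces the penalty with a KL constraint so as to robustly allow larger updates.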
