OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

2.6 policy gradients

Policy gradient methods are a class of reinforcement learning algorithms that work by repeatedly estimating the gradient of the policy's performance with respect to its parameters. The simplest way to derive them is to use the score function gradient estimator, a general method for estimating gradients of expectations. Suppose that x is a random variable with probability density p(x | θ), f is a scalar-valued function (say, the reward), and we are interested in computing ∇θ Ex[f(x)]. Then we have the following equality:

∇θ Ex[f(x)] = Ex[∇θ log p(x | θ) f(x)].

This equation can be derived by writing the expectation as an integral and using the identity ∇θ p(x | θ) = p(x | θ) ∇θ log p(x | θ):

∇θ Ex[f(x)] = ∇θ ∫ dx p(x | θ) f(x)
            = ∫ dx ∇θ p(x | θ) f(x)
            = ∫ dx p(x | θ) ∇θ log p(x | θ) f(x)
            = Ex[f(x) ∇θ log p(x | θ)].

To use this estimator, we sample values x ∼ p(x | θ) and compute the right-hand side of the equation above, averaged over N samples, to get an estimate of the gradient, which becomes increasingly accurate as N → ∞. That is, we take x1, x2, ..., xN ∼ p(x | θ), and then take our gradient estimate ĝ to be

ĝ = (1/N) Σ_{i=1}^{N} ∇θ log p(xi | θ) f(xi).

To use this idea in reinforcement learning, we will need to use a stochastic policy. That means that at each state s, our policy gives us a probability distribution over actions, which will be denoted π(a | s). Since the policy also has a parameter vector θ, we'll write πθ(a | s) or π(a | s, θ). In the following discussion, a trajectory τ will refer to a sequence of states and actions, τ ≡ (s0, a0, s1, a1, ..., sT). Let p(τ | θ) denote the probability of the entire trajectory τ under policy parameters θ, and let R(τ) denote the total reward of the trajectory. The derivation of the score function gradient estimator tells us that

∇θ Eτ[R(τ)] = Eτ[∇θ log p(τ | θ) R(τ)].

Next, we need to expand the quantity log p(τ | θ) to derive a practical formula. Using the chain rule of probability, we obtain

p(τ | θ) = μ(s0) π(a0 | s0, θ) P(s1, r0 | s0, a0) π(a1 | s1, θ) P(s2, r1 | s1, a1) ... π(aT−1 | sT−1, θ) P(sT, rT−1 | sT−1, aT−1),
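As a concrete check of the estimator, here is a minimal NumPy sketch (not from the thesis). It assumes, purely for illustration, that p(x | θ) is a unit-variance Gaussian with mean θ and that f(x) = x²; for this choice ∇θ log p(x | θ) = x − θ and Ex[f(x)] = θ² + 1, so the true gradient is 2θ and the Monte Carlo estimate can be compared against a known answer.

```python
# Minimal sketch of the score function gradient estimator.
# Assumptions (illustrative, not from the text): x ~ N(theta, 1), f(x) = x^2.
import numpy as np

def score_function_gradient(theta, f, n_samples=100_000, rng=None):
    """Estimate d/dtheta E_x[f(x)] as (1/N) sum_i grad_theta log p(x_i | theta) * f(x_i)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)  # x_i ~ p(x | theta)
    score = x - theta   # grad_theta log N(x | theta, 1) = x - theta
    return np.mean(score * f(x))

theta = 1.5
g_hat = score_function_gradient(theta, f=lambda x: x**2)
print(g_hat)  # should be close to the analytic gradient 2 * theta = 3.0
```

Because ĝ is a sample mean, its error shrinks at the usual Monte Carlo rate of roughly 1/√N.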

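Since the initial state distribution μ(s0) and the transition probabilities P above do not depend on θ, taking the logarithm of this factorization and differentiating leaves only the policy terms, so ∇θ log p(τ | θ) = Σt ∇θ log π(at | st, θ). The sketch below turns this into a trajectory-based estimator, ĝ = (1/N) Σ over sampled trajectories of [Σt ∇θ log π(at | st, θ)] R(τ). The tabular softmax policy and the trajectory data format are assumptions made for illustration; they are not specified on this page.

```python
# Sketch of the trajectory-based policy gradient estimator.
# Assumptions (illustrative): discrete states/actions and a tabular softmax
# policy pi(a | s, theta) = softmax(theta[s])[a], with theta of shape (n_states, n_actions).
import numpy as np

def grad_log_pi(theta, s, a):
    """grad_theta log pi(a | s, theta) for the tabular softmax policy.
    Only row s is nonzero: it equals e_a - pi(. | s)."""
    logits = theta[s]
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    g = np.zeros_like(theta)
    g[s] = -pi
    g[s, a] += 1.0
    return g

def policy_gradient_estimate(theta, trajectories):
    """ghat = (1/N) * sum over trajectories of [sum_t grad log pi(a_t | s_t)] * R(tau).
    Each trajectory is (list of (s_t, a_t) pairs, total reward R(tau))."""
    g_hat = np.zeros_like(theta)
    for state_actions, total_reward in trajectories:
        score = sum(grad_log_pi(theta, s, a) for s, a in state_actions)
        g_hat += score * total_reward
    return g_hat / len(trajectories)
```

A gradient ascent step θ ← θ + α ĝ then increases Eτ[R(τ)] in expectation; as written the estimator is unbiased but typically high-variance.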