OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

4.8.2 Why Don't You Just Use a Q-Function?

Previous actor-critic methods, e.g., in [KT03], use a Q-function to obtain potentially low-variance policy gradient estimates. Recent papers, including [Hee+15; Lil+15], have shown that a neural network Q-function approximator can be used effectively in a policy gradient method. However, there are several advantages to using a state-value function in the manner of this paper. First, the state-value function has a lower-dimensional input and is thus easier to learn than a state-action value function. Second, the method of this paper allows us to smoothly interpolate between the high-bias estimator ($\lambda = 0$) and the low-bias estimator ($\lambda = 1$). On the other hand, using a parameterized Q-function only allows us to use a high-bias estimator. We have found that the bias is prohibitively large when using a one-step estimate of the returns, i.e., the $\lambda = 0$ estimator, $\hat{A}_t = \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$. We expect that a similar difficulty would be encountered when using an advantage estimator involving a parameterized Q-function, $\hat{A}_t = Q(s, a) - V(s)$. There is an interesting space of possible algorithms that would use a parameterized Q-function and attempt to reduce bias; however, an exploration of these possibilities is beyond the scope of this work.

4.9 Proofs

Proof of Proposition 1: First we can split the expectation into terms involving $Q$ and $b$,

\begin{align*}
&\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(Q_t(s_{0:\infty}, a_{0:\infty}) + b_t(s_{0:t}, a_{0:t-1})\bigr)\right] \\
&\quad= \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_t(s_{0:\infty}, a_{0:\infty})\right]
+ \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b_t(s_{0:t}, a_{0:t-1})\right]
\end{align*}

We'll consider the terms with $Q$ and $b$ in turn.

\begin{align*}
&\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_t(s_{0:\infty}, a_{0:\infty})\right] \\
&\quad= \mathbb{E}_{s_{0:t},\, a_{0:t}}\!\left[\mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_t(s_{0:\infty}, a_{0:\infty})\right]\right] \\
&\quad= \mathbb{E}_{s_{0:t},\, a_{0:t}}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}}\!\left[Q_t(s_{0:\infty}, a_{0:\infty})\right]\right] \\
&\quad= \mathbb{E}_{s_{0:t},\, a_{0:t}}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t)\right]
\end{align*}
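The companion term with $b_t$ vanishes by the usual score-function (baseline) argument: since $b_t(s_{0:t}, a_{0:t-1})$ does not depend on $a_t$, it factors out of the inner expectation over $a_t$, and $\mathbb{E}_{a_t \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = \nabla_\theta \int \pi_\theta(a_t \mid s_t)\, \mathrm{d}a_t = \nabla_\theta 1 = 0$. A sketch of this step in the same notation:

\begin{align*}
&\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b_t(s_{0:t}, a_{0:t-1})\right] \\
&\quad= \mathbb{E}_{s_{0:t},\, a_{0:t-1}}\!\left[ b_t(s_{0:t}, a_{0:t-1})\, \mathbb{E}_{a_t}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\right] \\
&\quad= \mathbb{E}_{s_{0:t},\, a_{0:t-1}}\!\left[ b_t(s_{0:t}, a_{0:t-1}) \cdot 0 \right] = 0
\end{align*}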

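To make the interpolation in Section 4.8.2 concrete, the following is a minimal sketch of the lambda-weighted advantage estimator built from a learned state-value function, computed over a single rollout. The function name, the array layout (a length-T reward array and a length-(T+1) value array whose last entry bootstraps the tail), and the default hyperparameters are illustrative choices, not taken from the thesis; setting lam = 0 recovers the one-step estimator delta_t^V = r_t + gamma*V(s_{t+1}) - V(s_t), while lam = 1 gives the low-bias, high-variance estimator.

import numpy as np

def lambda_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Lambda-weighted advantage estimates: A_hat_t = sum_l (gamma*lam)^l * delta_{t+l}^V.

    rewards: length-T array (r_0, ..., r_{T-1})
    values:  length-(T+1) array (V(s_0), ..., V(s_T)); the last entry bootstraps the tail.
    lam = 0 gives the one-step estimate delta_t^V; lam = 1 gives the discounted return
    minus the value baseline.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD residuals
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                          # backward recursion over the rollout
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

For a rollout that ends at a true terminal state, the final bootstrap entry values[T] would simply be set to zero rather than to a learned estimate.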