OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS
4.8.2 Why Don’t You Just Use a Q-Function? Previous actor critic methods, e.g. in [KT03], use a Q-function to obtain potentially low-variance policy gradient estimates. Recent papers, including [Hee+15; Lil+15], have shown that a neural network Q-function approximator can used effectively in a policy gradient method. However, there are several advantages to using a state-value function in the manner of this paper. First, the state-value function has a lower-dimensional input and is thus easier to learn than a state-action value function. Second, the method of this paper allows us to smoothly interpolate between the high-bias estimator (λ = 0) and the low-bias estimator (λ = 1). On the other hand, using a parameterized Q-function only allows us to use a high-bias estimator. We have found that the bias is prohibitively large when using a one-step estimate of the returns, i.e., the λ = 0 estimator, Aˆ t = δVt = rt + γV(st+1) − V(st). We expect that similar difficulty would be encountered when us- ing an advantage estimator involving a parameterized Q-function, Aˆ t = Q(s, a) − V (s). There is an interesting space of possible algorithms that would use a parameterized Q-function and attempt to reduce bias, however, an exploration of these possibilities is beyond the scope of this work. 4.9 proofs Proof of Proposition 1: First we can split the expectation into terms involving Q and b, Es0:∞,a0:∞ [∇θ log πθ(at | st)(Qt(s0:∞, a0:∞) + bt(s0:t, a0:t−1))] = Es0:∞,a0:∞ [∇θ log πθ(at | st)(Qt(s0:∞, a0:∞))] + Es0:∞,a0:∞ [∇θ log πθ(at | st)(bt(s0:t, a0:t−1))] We’ll consider the terms with Q and b in turn. Es0:∞,a0:∞ [∇θ log πθ(at | st)Qt(s0:∞, a0:∞)] = Es ,a 0:t 0:t = Es ,a 0:t 0:t Es ,a [∇θ log πθ(at | st)Qt(s0:∞, a0:∞)] t+1:∞ t+1:∞ ∇θ log πθ(at | st)Es ,a [Qt(s0:∞, a0:∞)] t+1:∞ t+1:∞ = Es0:t,a0:t−1 [∇θ log πθ(at | st)Aπ(st, at)] 4.9 proofs 62PDF Image | OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS