OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS
approach could allow the value function and policy representations to share useful features of the input, resulting in even faster learning.

In concurrent work, researchers have been developing policy gradient methods that involve differentiation with respect to the continuous-valued action [Lil+15; Hee+15]. While we found empirically that the one-step return (λ = 0) leads to excessive bias and poor performance, those papers show that such methods can work when tuned appropriately. However, note that those papers consider control problems with substantially lower-dimensional state and action spaces than the ones considered here. A comparison between both classes of approach would be useful for future work.

4.8 frequently asked questions

4.8.1 What's the Relationship with Compatible Features?

Compatible features are often mentioned in relation to policy gradient algorithms that make use of a value function, and the idea was proposed in the paper On Actor-Critic Algorithms by Konda and Tsitsiklis [KT03]. These authors pointed out that, due to the limited representation power of the policy, the policy gradient only depends on a certain subspace of the space of advantage functions. This subspace is spanned by the compatible features ∇_θi log πθ(a_t | s_t), where i ∈ {1, 2, . . . , dim θ}. This theory of compatible features provides no guidance on how to exploit the temporal structure of the problem to obtain better estimates of the advantage function, making it mostly orthogonal to the ideas in this chapter.

The idea of compatible features motivates an elegant method for computing the natural policy gradient [Kak01a; PS08]. Given an empirical estimate of the advantage function Â_t at each timestep, we can project it onto the subspace of compatible features by solving the following least squares problem:

    minimize_r  Σ_t ∥ r · ∇θ log πθ(a_t | s_t) − Â_t ∥²

If Â is γ-just, the least squares solution is the natural policy gradient [Kak01a].
Note that any estimator of the advantage function can be substituted into this formula, including the ones we derive in this chapter. For our experiments, we also compute natural policy gradient steps, but we use the more computationally efficient numerical procedure from [Sch+15c], as discussed in Section 4.6.
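The projection above reduces to an ordinary least squares solve once the per-timestep score vectors are stacked into a matrix. The following is a minimal NumPy sketch of that computation; the function name and array layout are illustrative, not from the thesis.

```python
import numpy as np


def natural_gradient_via_compatible_features(grad_log_pi, advantages):
    """Project advantage estimates onto the compatible-feature subspace.

    grad_log_pi: (T, d) array whose row t is grad_theta log pi(a_t | s_t)
    advantages:  (T,) array of advantage estimates A_hat_t
    Returns r (d,), the solution of
        min_r sum_t || r . grad_log_pi[t] - advantages[t] ||^2,
    which equals the natural policy gradient when the advantage
    estimator is gamma-just [Kak01a].
    """
    # Least squares solve: grad_log_pi @ r ~= advantages
    r, *_ = np.linalg.lstsq(grad_log_pi, advantages, rcond=None)
    return r
```

In practice this dense solve is what [Sch+15c]'s procedure avoids: for high-dimensional θ, forming and solving the full system is expensive, which is why the experiments in this chapter use the more efficient numerical method discussed in Section 4.6.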