OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS
approach could allow the value function and policy representations to share useful features of the input, resulting in even faster learning.

In concurrent work, researchers have been developing policy gradient methods that involve differentiation with respect to the continuous-valued action [Lil+15; Hee+15]. While we found empirically that the one-step return (λ = 0) leads to excessive bias and poor performance, those papers show that such methods can work when tuned appropriately. However, note that those papers consider control problems with substantially lower-dimensional state and action spaces than the ones considered here. A comparison between both classes of approach would be useful for future work.

4.8 frequently asked questions

4.8.1 What's the Relationship with Compatible Features?

Compatible features are often mentioned in relation to policy gradient algorithms that make use of a value function, and the idea was proposed in the paper On Actor-Critic Algorithms by Konda and Tsitsiklis [KT03]. These authors pointed out that, due to the limited representation power of the policy, the policy gradient only depends on a certain subspace of the space of advantage functions. This subspace is spanned by the compatible features ∇_θi log πθ(a_t | s_t), where i ∈ {1, 2, . . . , dim θ}. This theory of compatible features provides no guidance on how to exploit the temporal structure of the problem to obtain better estimates of the advantage function, making it mostly orthogonal to the ideas in this chapter.

The idea of compatible features motivates an elegant method for computing the natural policy gradient [Kak01a; PS08]. Given an empirical estimate of the advantage function Â_t at each timestep, we can project it onto the subspace of compatible features by solving the following least squares problem:

    minimize_r  Σ_t ∥ r · ∇θ log πθ(a_t | s_t) − Â_t ∥²

If Â is γ-just, the least squares solution is the natural policy gradient [Kak01a].
Note that any estimator of the advantage function can be substituted into this formula, including the ones we derive in this chapter. For our experiments, we also compute natural policy gradient steps, but we use the more computationally efficient numerical procedure from [Sch+15c], as discussed in Section 4.6.
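The projection above reduces to an ordinary least squares solve once the per-timestep score vectors are stacked into a matrix. The following is a minimal NumPy sketch of that computation; the function name and array layout are illustrative, not from the thesis.

```python
import numpy as np


def natural_gradient_via_compatible_features(grad_log_pi, advantages):
    """Project advantage estimates onto the compatible-feature subspace.

    grad_log_pi: (T, d) array whose row t is grad_theta log pi(a_t | s_t)
    advantages:  (T,) array of advantage estimates A_hat_t
    Returns r (d,), the solution of
        min_r sum_t || r . grad_log_pi[t] - advantages[t] ||^2,
    which equals the natural policy gradient when the advantage
    estimator is gamma-just [Kak01a].
    """
    # Least squares solve: grad_log_pi @ r ~= advantages
    r, *_ = np.linalg.lstsq(grad_log_pi, advantages, rcond=None)
    return r
```

In practice this dense solve is what [Sch+15c]'s procedure avoids: for high-dimensional θ, forming and solving the full system is expensive, which is why the experiments in this chapter use the more efficient numerical method discussed in Section 4.6.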