OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS


Text from PDF Page: 082

…line³ (see [GBB04] for a more thorough discussion of baselines and their variance-reduction properties).

We can make a general statement for the case of stochastic computation graphs: we can add a baseline to every stochastic node, which depends on all of the nodes it does not influence. Let NonInfluenced(v) := {w | v ⊀ w}.

theorem 2.

$$\frac{\partial}{\partial\theta}\,\mathbb{E}\Big[\sum_{c\in C} c\Big] = \mathbb{E}\Bigg[\sum_{\substack{v\in S \\ v\succ\theta}} \Big(\frac{\partial}{\partial\theta}\log p(v \mid \mathrm{parents}_v)\Big)\big(\hat{Q}_v - b(\mathrm{NonInfluenced}(v))\big) \;+\; \sum_{\substack{c\in C \\ c\succeq\theta}} \frac{\partial}{\partial\theta} c\Bigg]$$

Proof: See Section 5.8.

5.5 algorithms

As shown in Section 5.3, the gradient estimator can be obtained by differentiating a surrogate objective function L. Hence, this derivative can be computed by performing the backpropagation algorithm on L. That is likely to be the most practical and efficient method, and can be facilitated by automatic differentiation software. Algorithm 4 shows explicitly how to compute the gradient estimator in a backwards pass through the stochastic computation graph. The algorithm recursively computes

$$g_v := \frac{\partial}{\partial v}\,\mathbb{E}\Big[\sum_{\substack{c\in C \\ v\prec c}} c\Big]$$

at every deterministic and input node v.

5.6 related work

As discussed in Section 5.2, the score function and pathwise derivative estimators have been used in a variety of different fields, under different names. See [Fu06] for a review of gradient estimation, mostly from the simulation optimization literature. Glasserman's textbook provides an extensive treatment of various gradient estimators and Monte Carlo estimators in general. Griewank and Walther's textbook [GW08] is a comprehensive reference on computation graphs and automatic differentiation (of deterministic programs). The notation and nomenclature we use is inspired by Bayes nets and influence diagrams

³ The optimal baseline for scalar θ is in fact the weighted expectation $b^* = \mathbb{E}_x[f(x)\,s(x)^2]\,/\,\mathbb{E}_x[s(x)^2]$, where $s(x) = \frac{\partial}{\partial\theta}\log p(x;\theta)$.
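As a minimal numerical sketch of the baselined score-function term in Theorem 2 (this toy problem is our own, not from the thesis): take a single stochastic node $x \sim \mathcal{N}(\theta, 1)$ with cost $c(x) = x^2$, so the true gradient of $\mathbb{E}[c]$ is $2\theta$. We compare the estimator with no baseline, a mean-cost baseline, and the optimal scalar baseline $b^* = \mathbb{E}[c\,s^2]/\mathbb{E}[s^2]$ from the footnote:

```python
import numpy as np

# Toy problem (illustrative assumption): x ~ N(theta, 1), cost c(x) = x^2,
# so E[c] = theta^2 + 1 and the true gradient is 2*theta.
rng = np.random.default_rng(0)
theta = 1.0
x = rng.normal(theta, 1.0, size=200_000)

score = x - theta    # d/dtheta log p(x; theta) for a unit-variance Gaussian
cost = x ** 2

# Baselines: none, mean cost, and the optimal scalar baseline
# b* = E[c s^2] / E[s^2] from footnote 3.
baselines = {
    "none": 0.0,
    "mean cost": cost.mean(),
    "optimal": (cost * score**2).mean() / (score**2).mean(),
}
for name, b in baselines.items():
    grad_est = (score * (cost - b)).mean()
    print(f"{name:9s}  b={b:6.3f}  grad~{grad_est:.3f}  (true 2.000)")
```

All three estimators are unbiased; the baseline only changes the variance, which is why Theorem 2 is free to subtract any function of the non-influenced nodes.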
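The surrogate-objective idea of Section 5.5 can also be checked numerically on the same toy Gaussian problem (our own sketch, not Algorithm 4 itself): holding the samples and the weight $(c - b)$ fixed, the ordinary derivative of $L(\theta) = \tfrac{1}{N}\sum_i \log p(x_i;\theta)\,(c_i - b)$ equals the score-function estimator, so any autodiff tool applied to L yields the gradient estimate. Here a central finite difference stands in for backpropagation:

```python
import numpy as np

# Sketch: the score-function estimator equals dL/dtheta for the surrogate
# L(theta) = mean( log p(x; theta) * (cost - b) ), with samples and
# (cost - b) treated as constants. We verify with a finite difference.
rng = np.random.default_rng(1)
theta = 0.5
x = rng.normal(theta, 1.0, size=100_000)
cost = x ** 2
b = cost.mean()                      # any baseline works; mean cost here

def surrogate(th):
    # log-density of the *fixed* samples x under N(th, 1)
    logp = -0.5 * (x - th) ** 2 - 0.5 * np.log(2 * np.pi)
    return (logp * (cost - b)).mean()

eps = 1e-5
fd = (surrogate(theta + eps) - surrogate(theta - eps)) / (2 * eps)
sf = ((x - theta) * (cost - b)).mean()   # score-function estimator
print(fd, sf)   # agree up to finite-difference error
```

In practice one would build L inside an autodiff framework and call backpropagation on it, as the text suggests; the finite difference here only demonstrates that the two quantities coincide.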
