OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

• x may be sampled from a parameterized probability distribution x ∼ p(·; θ), in which case we can use the score function (SF) estimator [Fu06]:

$$
\frac{\partial}{\partial\theta}\,\mathbb{E}_x\left[f(x)\right]
= \mathbb{E}_x\!\left[f(x)\,\frac{\partial}{\partial\theta}\log p(x;\theta)\right]. \tag{36}
$$

This classic equation is derived as follows:

$$
\begin{aligned}
\frac{\partial}{\partial\theta}\,\mathbb{E}_x\left[f(x)\right]
&= \frac{\partial}{\partial\theta}\int \mathrm{d}x\; p(x;\theta)\,f(x)
= \int \mathrm{d}x\; \frac{\partial}{\partial\theta}\,p(x;\theta)\,f(x) \\
&= \int \mathrm{d}x\; p(x;\theta)\,\frac{\partial}{\partial\theta}\log p(x;\theta)\,f(x)
= \mathbb{E}_x\!\left[f(x)\,\frac{\partial}{\partial\theta}\log p(x;\theta)\right].
\end{aligned} \tag{37}
$$

This equation is valid if and only if p(x; θ) is a continuous function of θ; however, it does not need to be a continuous function of x [Gla03].

• x may be a deterministic, differentiable function of θ and another random variable z, i.e., we can write x(z, θ). Then we can use the pathwise derivative (PD) estimator, defined as follows:

$$
\frac{\partial}{\partial\theta}\,\mathbb{E}_z\left[f(x(z,\theta))\right]
= \mathbb{E}_z\!\left[\frac{\partial}{\partial\theta}\,f(x(z,\theta))\right].
$$

This equation, which merely swaps the derivative and the expectation, is valid if and only if f(x(z, θ)) is a continuous function of θ for all z [Gla03].¹ That is not true if, for example, f is a step function.

• Finally, θ might appear both in the probability distribution and inside the expectation, e.g., in ∂/∂θ E_{z∼p(·; θ)}[f(x(z, θ))]. Then the gradient estimator has two terms:

$$
\frac{\partial}{\partial\theta}\,\mathbb{E}_{z\sim p(\cdot;\,\theta)}\left[f(x(z,\theta))\right]
= \mathbb{E}_{z\sim p(\cdot;\,\theta)}\!\left[\frac{\partial}{\partial\theta}\,f(x(z,\theta))
+ \left(\frac{\partial}{\partial\theta}\log p(z;\theta)\right) f(x(z,\theta))\right].
$$

This formula can be derived by writing the expectation as an integral and differentiating, as in Equation (37).

In some cases, it is possible to reparameterize a probabilistic model, moving θ from the distribution to inside the expectation or vice versa. See [Fu06] for a general discussion, and see [KW13; RMW14] for a recent application of this idea to variational inference.

The SF and PD estimators are applicable in different scenarios and have different properties; the numerical sketches below illustrate the contrast on toy problems.

¹ Note that for the pathwise derivative estimator, f(x(z, θ)) merely needs to be a continuous function of θ; it is sufficient that this function is almost-everywhere differentiable. A similar statement can be made about p(x; θ) and the score function estimator. See Glasserman [Gla03] for a detailed discussion of the technical requirements for these gradient estimators to be valid.
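To make the contrast concrete, here is a minimal Monte Carlo sketch (an illustration added here, not code from the thesis) comparing the SF and PD estimators on a toy problem: x ∼ N(θ, 1), reparameterized for the PD case as x = θ + z with z ∼ N(0, 1), and f(x) = x². Since E[x²] = θ² + 1, the true gradient is 2θ. The variable names and the choice of f are assumptions made for the example.

```python
# Toy comparison of the score function (SF) and pathwise derivative (PD)
# estimators for d/dtheta E[f(x)], with x ~ N(theta, 1) and f(x) = x^2.
# True value: E[x^2] = theta^2 + 1, so the true gradient is 2*theta.
# Illustrative sketch only; not code from the thesis.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
n = 200_000

# SF estimator: E[f(x) * d/dtheta log p(x; theta)].
# For N(theta, 1), d/dtheta log p(x; theta) = (x - theta).
x = rng.normal(theta, 1.0, size=n)
sf_samples = x**2 * (x - theta)

# PD estimator: write x(z, theta) = theta + z with z ~ N(0, 1), then
# differentiate inside the expectation: d/dtheta f(theta + z) = 2*(theta + z).
z = rng.normal(0.0, 1.0, size=n)
pd_samples = 2.0 * (theta + z)

print("true gradient:", 2 * theta)
print("SF estimate: %.3f (std of mean %.4f)"
      % (sf_samples.mean(), sf_samples.std() / np.sqrt(n)))
print("PD estimate: %.3f (std of mean %.4f)"
      % (pd_samples.mean(), pd_samples.std() / np.sqrt(n)))
```

Both estimates converge to 2θ = 3, but on this problem the PD estimator typically has much lower variance, one instance of the "different properties" noted above. The substitution x = θ + z is itself an example of reparameterization, moving θ from the distribution into the expectation.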

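The two-term estimator from the last bullet can be checked numerically in the same way. The following sketch (again an illustration with assumed names, not from the thesis) uses z ∼ N(θ, 1) and x(z, θ) = θz with f(x) = x², so that E[f] = θ²(θ² + 1) and the true gradient is 4θ³ + 2θ.

```python
# Sketch of the two-term estimator when theta appears both in the sampling
# distribution and inside the expectation: z ~ N(theta, 1), x(z, theta) = theta*z,
# f(x) = x^2. Then E[f] = theta^2 * (theta^2 + 1), so the true gradient is
# 4*theta^3 + 2*theta. Illustrative assumptions; not code from the thesis.
import numpy as np

rng = np.random.default_rng(1)
theta = 0.8
n = 500_000

z = rng.normal(theta, 1.0, size=n)

# Pathwise term: d/dtheta f(x(z, theta)) = d/dtheta (theta*z)^2 = 2*theta*z^2,
# holding the sampled z fixed.
pathwise_term = 2.0 * theta * z**2

# Score function term: (d/dtheta log p(z; theta)) * f(x(z, theta)),
# with d/dtheta log p(z; theta) = (z - theta) for N(theta, 1).
score_term = (z - theta) * (theta * z)**2

estimate = (pathwise_term + score_term).mean()
print("true gradient:", 4 * theta**3 + 2 * theta)
print("two-term estimate: %.3f" % estimate)
```

Dropping either term gives a biased estimate here, which is an easy way to see that both contributions in the two-term formula are needed when θ enters through both the distribution and the function.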