OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

5.2 Preliminaries

1. SF is valid under more permissive mathematical conditions than PD. SF can be used if f is discontinuous, or if x is a discrete random variable.

2. SF only requires sample values f(x), whereas PD requires the derivatives f′(x). In the context of control (reinforcement learning), SF can be used to obtain unbiased policy gradient estimators in the "model-free" setting, where we have no model of the dynamics and only have access to sample trajectories.

3. SF tends to have higher variance than PD when both estimators are applicable (see for instance [Fu06; RMW14]). The variance of SF increases (often linearly) with the dimensionality of the sampled variables. Hence, PD is usually preferable when x is high-dimensional. On the other hand, PD has high variance if the function f is rough, which occurs in many time-series problems due to an "exploding gradient problem" / "butterfly effect".

4. PD allows for a deterministic limit, whereas SF does not. This idea is exploited by the deterministic policy gradient algorithm [Sil+14]. (A small numerical sketch comparing the two estimators appears at the end of this excerpt.)

Nomenclature. The methods of estimating gradients of expectations have been independently proposed in several different fields, which use differing terminology. What we call the score function estimator (via [Fu06]) is alternatively called the likelihood ratio estimator [Gly90] and REINFORCE [Wil92]. We chose this term because the score function is a well-known object in statistics. What we call the pathwise derivative estimator (from the mathematical finance literature [Gla03] and reinforcement learning [Mun06]) is alternatively called infinitesimal perturbation analysis and stochastic backpropagation [RMW14]. We chose this term because pathwise derivative is evocative of propagating a derivative through a sample path.

5.2.2 Stochastic Computation Graphs

The results of this chapter will apply to stochastic computation graphs, which are defined as follows:

Definition 3 (Stochastic Computation Graph). A directed, acyclic graph, with three types of nodes:

1. Input nodes, which are set externally, including the parameters we differentiate with respect to.

2. Deterministic nodes, which are functions of their parents.
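To make the SF/PD comparison above concrete, the following short Python sketch (not part of the thesis; the toy problem, sample size, and function names are illustrative assumptions) estimates the same gradient, d/dθ E_{x ~ N(θ,1)}[f(x)] with f(x) = x², using both estimators. The exact gradient is 2θ, since E[x²] = θ² + 1.

import numpy as np

# Toy problem (assumed for illustration): estimate d/dtheta E_{x ~ N(theta,1)}[f(x)]
# with f(x) = x^2. The exact value is 2*theta, since E[x^2] = theta^2 + 1.

def f(x):
    return x ** 2

def f_prime(x):
    return 2.0 * x

theta = 1.5
n = 200_000
rng = np.random.default_rng(0)

# Score function (SF) estimator: only needs sample values f(x).
# For N(theta, 1), the score is d/dtheta log p(x; theta) = (x - theta).
x = rng.normal(theta, 1.0, size=n)
sf_grad = np.mean(f(x) * (x - theta))

# Pathwise derivative (PD) estimator: reparameterize x = theta + eps with
# eps ~ N(0, 1) and differentiate along the sample path, which requires f'(x).
eps = rng.normal(0.0, 1.0, size=n)
pd_grad = np.mean(f_prime(theta + eps))

print("exact gradient:", 2 * theta)   # 3.0
print("SF estimate:   ", sf_grad)     # near 3.0, but noisier
print("PD estimate:   ", pd_grad)     # near 3.0, lower variance

With the same number of samples, the PD estimate typically clusters much more tightly around 2θ, matching the variance behavior described in point 3; the SF estimate, by contrast, never used f′, which is why it remains applicable when f is discontinuous or x is discrete (points 1 and 2).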
