OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS


3.12 efficiently solving the trust-region constrained optimization problem

The method we will describe involves two steps: (1) compute a search direction, using a linear approximation to the objective and a quadratic approximation to the constraint; and (2) perform a line search in that direction, ensuring that we improve the nonlinear objective while satisfying the nonlinear constraint.

The search direction is computed by approximately solving the equation $Ax = g$, where $A$ is the Fisher information matrix, i.e., the quadratic approximation to the KL divergence constraint: $D_{\mathrm{KL}}(\theta_{\mathrm{old}}, \theta) \approx \frac{1}{2}(\theta - \theta_{\mathrm{old}})^T A (\theta - \theta_{\mathrm{old}})$, where $A_{ij} = \frac{\partial}{\partial \theta_i}\frac{\partial}{\partial \theta_j} D_{\mathrm{KL}}(\theta_{\mathrm{old}}, \theta)$. In large-scale problems, it is prohibitively costly (with respect to computation and memory) to form the full matrix $A$ (or $A^{-1}$). However, the conjugate gradient algorithm allows us to approximately solve the equation $Ax = b$ without forming this full matrix, when we merely have access to a function that computes matrix-vector products $y \mapsto Ay$. Section 3.12.1 describes the most efficient way to compute matrix-vector products with the Fisher information matrix. For additional exposition on the use of Hessian-vector products for optimizing neural network objectives, see [MS12] and [PB13].

Having computed the search direction $s \approx A^{-1}g$, we next need to compute the maximal step length $\beta$ such that $\theta + \beta s$ will satisfy the KL divergence constraint. To do this, let $\delta = D_{\mathrm{KL}} \approx \frac{1}{2}(\beta s)^T A (\beta s) = \frac{1}{2}\beta^2 s^T A s$. From this, we obtain $\beta = \sqrt{2\delta / s^T A s}$, where $\delta$ is the desired KL divergence. The term $s^T A s$ can be computed through a single Hessian-vector product, and it is also an intermediate result produced by the conjugate gradient algorithm.

Last, we use a line search to ensure improvement of the surrogate objective and satisfaction of the KL divergence constraint, both of which are nonlinear in the parameter vector $\theta$ (and thus depart from the linear and quadratic approximations used to compute the step). We perform the line search on the objective $L_{\theta_{\mathrm{old}}}(\theta) - \mathcal{X}[D_{\mathrm{KL}}(\theta_{\mathrm{old}}, \theta) \leq \delta]$, where $\mathcal{X}[\dots]$ equals zero when its argument is true and $+\infty$ when it is false. Starting with the maximal value of the step length $\beta$ computed in the previous paragraph, we shrink $\beta$ exponentially until the objective improves. Without this line search, the algorithm occasionally computes large steps that cause a catastrophic degradation of performance.

3.12.1 Computing the Fisher-Vector Product

Here we will describe how to compute the matrix-vector product between the averaged Fisher information matrix and arbitrary vectors; this calculation is also described in other references such as [PB13], but we include it here for self-containedness. This matrix-vector product enables us to perform the conjugate gradient algorithm. Suppose that the parameterized policy maps from the input $x$ to "distribution parameter" vector $\mu_\theta(x)$,
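The analytic construction of this Fisher-vector product continues beyond the text excerpted above. As a generic alternative in the spirit of the Hessian-vector products of [PB13], the product $Ay$ can be obtained by differentiating a sampled-average KL divergence twice, without ever forming $A$. Below is a minimal sketch in JAX, not the thesis's own method: the names (`make_fvp`, `kl_fn`) and the damping term are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def make_fvp(kl_fn, theta_old, damping=1e-2):
    """Fisher-vector product as a KL Hessian-vector product.

    kl_fn(theta) is assumed to compute the average D_KL(theta_old, theta)
    over a batch of sampled states; its Hessian at theta = theta_old equals
    the averaged Fisher matrix A, so A y is a Hessian-vector product.
    """
    grad_kl = jax.grad(kl_fn)

    def fvp(v):
        # Forward-over-reverse: directional derivative of grad(KL) along v.
        _, hv = jax.jvp(grad_kl, (theta_old,), (v,))
        # Damping is a common practical addition to keep A well-conditioned;
        # it is an assumption here, not part of the thesis text.
        return hv + damping * v

    return fvp
```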
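Given such a matrix-vector product routine, the search direction $s \approx A^{-1}g$ from step (1) can be computed with a short conjugate gradient loop, followed by the maximal step length $\beta = \sqrt{2\delta / s^T A s}$. A minimal sketch, assuming `fvp` is a callback as above; the iteration count and tolerance are illustrative choices:

```python
def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve A x = g given only the map y -> A y (`fvp`)."""
    x = jnp.zeros_like(g)
    r = g                      # residual g - A x, since x = 0 initially
    p = r                      # search direction
    rdotr = jnp.dot(r, r)
    for _ in range(iters):
        Ap = fvp(p)
        alpha = rdotr / jnp.dot(p, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        new_rdotr = jnp.dot(r, r)
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

def max_step_length(s, fvp, delta):
    """beta = sqrt(2 * delta / s^T A s), for the KL trust region of size delta."""
    # One extra Fisher-vector product; implementations often instead reuse
    # the intermediate s^T A s produced by the conjugate gradient loop.
    sAs = jnp.dot(s, fvp(s))
    return jnp.sqrt(2.0 * delta / sAs)
```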
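Finally, the backtracking line search of step (2) can be sketched as follows; `surrogate` stands for the nonlinear objective $L_{\theta_{\mathrm{old}}}$ and `kl` for the exact KL divergence, and the shrink factor and backtrack limit are illustrative choices. Because the penalty $\mathcal{X}[\cdot]$ is either zero or $+\infty$, a step that violates the constraint is simply rejected:

```python
def backtracking_line_search(surrogate, kl, theta_old, s, beta_max, delta,
                             shrink=0.5, max_backtracks=10):
    """Shrink beta exponentially until the surrogate improves and KL <= delta."""
    f_old = surrogate(theta_old)
    beta = beta_max
    for _ in range(max_backtracks):
        theta = theta_old + beta * s
        # Accept only if the nonlinear objective improves AND the nonlinear
        # KL constraint holds (the 0/+infinity indicator penalty in the text).
        if surrogate(theta) > f_old and kl(theta) <= delta:
            return theta
        beta *= shrink
    return theta_old  # no acceptable step found; keep the old parameters
```

Putting these sketches together, one policy update would read: `fvp = make_fvp(kl_fn, theta_old)`, then `s = conjugate_gradient(fvp, g)`, `beta = max_step_length(s, fvp, delta)`, and finally `theta_new = backtracking_line_search(surrogate, kl_fn, theta_old, s, beta, delta)`.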
