OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS
3.12 efficiently solving the trust-region constrained optimization problem

The method we will describe involves two steps: (1) compute a search direction, using a linear approximation to the objective and a quadratic approximation to the constraint; and (2) perform a line search in that direction, ensuring that we improve the nonlinear objective while satisfying the nonlinear constraint.

The search direction is computed by approximately solving the equation Ax = g, where A is the Fisher information matrix, i.e., the quadratic approximation to the KL divergence constraint: D_KL(θ_old, θ) ≈ (1/2)(θ − θ_old)^T A (θ − θ_old), where A_ij = ∂²/∂θ_i ∂θ_j D_KL(θ_old, θ). In large-scale problems, it is prohibitively costly (with respect to computation and memory) to form the full matrix A (or A^−1). However, the conjugate gradient algorithm allows us to approximately solve the equation Ax = b without forming this full matrix, when we merely have access to a function that computes matrix-vector products y → Ay. Section 3.12.1 describes the most efficient way to compute matrix-vector products with the Fisher information matrix. For additional exposition on the use of Hessian-vector products for optimizing neural network objectives, see [MS12] and [PB13].

Having computed the search direction s ≈ A^−1 g, we next need to compute the maximal step length β such that θ + βs will satisfy the KL divergence constraint. To do this, let δ = D_KL ≈ (1/2)(βs)^T A (βs) = (1/2)β² s^T A s. From this, we obtain β = √(2δ / s^T A s), where δ is the desired KL divergence. The term s^T A s can be computed through a single Hessian-vector product, and it is also an intermediate result produced by the conjugate gradient algorithm.

Last, we use a line search to ensure improvement of the surrogate objective and satisfaction of the KL divergence constraint, both of which are nonlinear in the parameter vector θ (and thus depart from the linear and quadratic approximations used to compute the step).
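To make the two computations above concrete, the following is a minimal NumPy sketch (our own illustration, not the thesis's code) of solving Ax = g by conjugate gradient using only a matrix-vector-product closure, followed by the maximal step length β = √(2δ / s^T A s). Here the matvec is a plain matrix multiply on a toy 2×2 system; in practice it would call the Fisher-vector-product routine of Section 3.12.1, and the names `conjugate_gradient`, `matvec`, and `delta` are our own.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b given only the map y -> A y."""
    x = np.zeros_like(b)
    r = b.copy()                      # residual b - A x (x starts at 0)
    p = r.copy()                      # conjugate search direction
    rr = r @ r
    for _ in range(iters):
        Ap = matvec(p)                # the only access to A
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# Toy usage: A symmetric positive definite, g an arbitrary gradient.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
s = conjugate_gradient(lambda v: A @ v, g)   # search direction s ~ A^-1 g

# Maximal step length: delta = (1/2) beta^2 s^T A s  =>  beta = sqrt(2 delta / s^T A s)
delta = 0.01                         # desired KL divergence
sAs = s @ A @ s
beta = np.sqrt(2 * delta / sAs)
```

Note that `matvec(p)` is evaluated once per iteration, so the cost of the solve is a small number of Fisher-vector products rather than any operation on the full matrix.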
We perform the line search on the objective L_θ_old(θ) − X[D_KL(θ_old, θ) ≤ δ], where X[...] equals zero when its argument is true and +∞ when it is false. Starting with the maximal value of the step length β computed in the previous paragraph, we shrink β exponentially until the objective improves. Without this line search, the algorithm occasionally computes large steps that cause a catastrophic degradation of performance.

3.12.1 Computing the Fisher-Vector Product

Here we will describe how to compute the matrix-vector product between the averaged Fisher information matrix and arbitrary vectors; this calculation is also described in other references such as [PB13], but we include it here for self-containedness. This matrix-vector product enables us to perform the conjugate gradient algorithm. Suppose that the parameterized policy maps from the input x to the "distribution parameter" vector μ_θ(x),
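Although the derivation continues beyond this excerpt, the structure of such a matrix-vector product can be illustrated on a toy policy of our own construction (not the thesis's code): a Gaussian policy with mean μ = Wx and fixed unit variance, for which the Fisher matrix of the distribution with respect to μ is the identity, so the averaged Fisher matrix with respect to the flattened parameters acts as A v = E_x[J^T (M (J v))] with J = ∂μ/∂vec(W) and M = I. The sketch below (hypothetical names throughout) computes A v with one forward pass and one outer product per sample, never forming A.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_mu, dim_x, n = 2, 3, 5
X = rng.standard_normal((n, dim_x))   # batch of policy inputs x

def fisher_vector_product(v):
    """Return A v for the averaged Fisher matrix A = E_x[J^T M J],
    where mu = W x, J = dmu/dvec(W), and M = I for a unit-variance
    Gaussian, without ever forming A explicitly."""
    V = v.reshape(dim_mu, dim_x)      # unflatten the input vector
    out = np.zeros_like(V)
    for x in X:
        Jv = V @ x                    # forward: J v = V x
        MJv = Jv                      # M = I for this toy policy
        out += np.outer(MJv, x)       # backward: accumulate J^T (M J v)
    return (out / n).ravel()

v = rng.standard_normal(dim_mu * dim_x)
Av = fisher_vector_product(v)
```

For this linear toy case the product can be checked against the closed form A v = vec(V · (X^T X / n)); for a neural network policy, the forward and backward passes would be carried out by automatic differentiation instead.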