OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS
3.12 efficiently solving the trust-region constrained optimization problem

which parameterizes the distribution π(u | x). (For example, for a Gaussian distribution, μ could be the concatenation of the mean and the standard deviation; for a categorical distribution, it could be the vector of probabilities or log-probabilities.) Now the KL divergence for a given input x can be written as follows:

DKL(πθold(· | x) ∥ πθ(· | x)) = kl(μθ(x), μold(x)),

where kl is the KL divergence between the distributions corresponding to the two mean parameter vectors. Let us assume we can compute kl analytically in terms of its arguments. Differentiating kl twice with respect to θ, we obtain

\[
\underbrace{\frac{\partial \mu_a(x)}{\partial \theta_i}\,
\mathrm{kl}''_{ab}(\mu_\theta(x), \mu_{\mathrm{old}}(x))\,
\frac{\partial \mu_b(x)}{\partial \theta_j}}_{(J^{T} M J)_{ij}}
\;+\;
\underbrace{\frac{\partial^2 \mu_a(x)}{\partial \theta_i\,\partial \theta_j}\,
\mathrm{kl}'_a(\mu_\theta(x), \mu_{\mathrm{old}}(x))}_{=\,0\ \text{at}\ \mu_\theta = \mu_{\mathrm{old}}}
\tag{21}
\]

where the primes (′) indicate differentiation with respect to the first argument, and there is an implied summation over the indices a, b. The second term vanishes because the KL divergence is minimized at μθ = μold, and the derivative is zero at a minimum. Letting J_{ai} := ∂μ_a(x)/∂θ_i denote the Jacobian, the Fisher information matrix can be written in matrix form as JᵀMJ, where M_{ab} = kl″_{ab}(μθ(x), μold(x)) is the Fisher information matrix of the distribution in terms of the mean parameter μ (as opposed to the parameter θ). M has a simple form for most parameterized distributions of interest.

The Fisher-vector product can now be written as a function y → JᵀMJy. Multiplication by Jᵀ and J can be performed by automatic differentiation software such as Theano [Ber+10], and the matrix M (the Fisher matrix with respect to μ) can be computed analytically for the distribution of interest. Note that multiplication by Jᵀ is the well-known backpropagation operation, whereas multiplication by J is tangent propagation [Gri+89], or the R-Op in Theano.

There is a simpler but slightly less efficient way to calculate the Fisher-vector products, using only reverse-mode automatic differentiation.
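As a concrete illustration of the y → JᵀMJy recipe above, here is a minimal sketch using JAX in place of Theano: `jax.jvp` plays the role of tangent propagation (multiplication by J), and `jax.vjp` plays the role of backpropagation (multiplication by Jᵀ). The two-layer network `mu` and the fixed-standard-deviation Gaussian, for which M = I/σ², are hypothetical choices made for illustration, not taken from the text:

```python
import jax
import jax.numpy as jnp

def mu(theta, x):
    """Hypothetical mean network: maps input x to the mean parameter mu."""
    W1, b1, W2, b2 = theta
    h = jnp.tanh(W1 @ x + b1)
    return W2 @ h + b2

def fisher_vector_product(theta, x, y, sigma=1.0):
    """Compute (J^T M J) y, where J = dmu/dtheta and, for a Gaussian
    with fixed standard deviation sigma, M = I / sigma**2."""
    f = lambda th: mu(th, x)
    # J y: forward-mode differentiation (tangent propagation / R-Op).
    _, Jy = jax.jvp(f, (theta,), (y,))
    # M (J y): M is diagonal for the fixed-sigma Gaussian assumed here.
    MJy = Jy / sigma**2
    # J^T (M J y): reverse-mode differentiation (backpropagation).
    _, vjp_fn = jax.vjp(f, theta)
    return vjp_fn(MJy)[0]
```

In a conjugate-gradient solver this function would be called repeatedly with different vectors y, and averaging over a batch of inputs x yields the averaged Fisher matrix used in the trust-region step.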
This reverse-mode technique is described in [WN99], chapter 8. Let f(θ) = kl(μθ(x), μold); we want to compute the Hessian-vector product Hy, where y is a vector and H is the Hessian of f(θ). We first form the expression for the gradient-vector product ∇θf(θ) · y, then differentiate this expression with respect to θ to obtain the Hessian-vector product. This method is slightly less efficient than the one above, since it does not exploit the fact that the second derivatives of μ(x) (i.e., the second term in Equation (21)) can be ignored, but it may be substantially easier to implement.

We have described a procedure for computing the Fisher-vector product y → Ay, where the Fisher information matrix is averaged over a set of inputs to the function μ.
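The Hessian-vector-product approach can likewise be sketched in a few lines of JAX: differentiate the scalar ∇θf(θ) · y once more with reverse mode. The fixed-σ Gaussian objective and toy network below are again hypothetical illustrations; note that at θ = θold the gradient of f is zero and Hy coincides with the Fisher-vector product, since the second term of Equation (21) vanishes there:

```python
import jax
import jax.numpy as jnp

def mu(theta, x):
    """Hypothetical mean network (any differentiable mu(theta, x) works)."""
    W, b = theta
    return jnp.tanh(W @ x) + b

def kl_objective(theta, x, mu_old, sigma=1.0):
    """f(theta) = kl(mu_theta(x), mu_old) for a fixed-sigma Gaussian."""
    diff = mu(theta, x) - mu_old
    return jnp.sum(diff**2) / (2.0 * sigma**2)

def hessian_vector_product(f, theta, y):
    """H y, computed by differentiating the gradient-vector product
    grad f(theta) . y once more (two passes of reverse mode)."""
    def grad_dot_y(th):
        g = jax.grad(f)(th)
        return sum(jnp.vdot(gi, yi)
                   for gi, yi in zip(jax.tree_util.tree_leaves(g),
                                     jax.tree_util.tree_leaves(y)))
    return jax.grad(grad_dot_y)(theta)
```

Because the Hessian of this objective at the minimizer θold equals JᵀMJ, the resulting Hy can be plugged into the same conjugate-gradient routine as the explicit Fisher-vector product.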