OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

Variational Autoencoder, Deep Latent Gaussian Models and Reparameterization. Here we note that in some cases, the stochastic computation graph can be transformed to give the same probability distribution for the observed variables, but one obtains a different gradient estimator. Kingma and Welling [KW13] and Rezende et al. [RMW14] consider a model that is similar to the one proposed by Mnih et al. [MG14] but with continuous latent variables, and they re-parameterize their inference network to enable the use of the PD estimator. The original objective, the variational lower bound, is

\[
\mathcal{L}_{\mathrm{orig}}(\theta, \phi) = \mathbb{E}_{h \sim q_\phi}\!\left[ \log \frac{p_\theta(x \mid h)\, p_\theta(h)}{q_\phi(h \mid x)} \right].
\]

The second term, the entropy of $q_\phi$, can be computed analytically for the parametric forms of $q$ considered in the paper (Gaussians). For $q_\phi$ being conditionally Gaussian, i.e. $q_\phi(h \mid x) = \mathcal{N}(h \mid \mu_\phi(x), \sigma_\phi(x))$, re-parameterizing leads to $h = h_\phi(\epsilon; x) = \mu_\phi(x) + \epsilon \sigma_\phi(x)$, giving

\[
\mathcal{L}_{\mathrm{re}}(\theta, \phi) = \mathbb{E}_{\epsilon \sim \rho}\!\left[ \log p_\theta(x \mid h_\phi(\epsilon, x)) + \log p_\theta(h_\phi(\epsilon, x)) \right] + H[q_\phi(\cdot \mid x)].
\]

The stochastic computation graph before and after reparameterization is shown in Figure 13. Given $\epsilon \sim \rho$, an estimate of the gradient is obtained as

\[
\frac{\partial \mathcal{L}_{\mathrm{re}}}{\partial \theta} \approx \frac{\partial}{\partial \theta}\left[ \log p_\theta(x \mid h_\phi(\epsilon, x)) + \log p_\theta(h_\phi(\epsilon, x)) \right],
\]
\[
\frac{\partial \mathcal{L}_{\mathrm{re}}}{\partial \phi} \approx \left[ \frac{\partial}{\partial h} \log p_\theta(x \mid h_\phi(\epsilon, x)) + \frac{\partial}{\partial h} \log p_\theta(h_\phi(\epsilon, x)) \right] \frac{\partial h}{\partial \phi} + \frac{\partial}{\partial \phi} H[q_\phi(\cdot \mid x)].
\]

A minimal code sketch of this estimator appears at the end of this section.

5.10.2 Policy Gradients in Reinforcement Learning. In reinforcement learning, an agent interacts with an environment according to its policy $\pi$, and the goal is to maximize the expected sum of rewards, called the return. Policy gradient methods seek to directly estimate the gradient of expected return with respect to the policy parameters [Wil92; BB01; Sut+99]. In reinforcement learning, we typically assume that the environment dynamics are not available analytically and can only be sampled. Below we distinguish two important cases: the Markov decision process (MDP) and the partially observable Markov decision process (POMDP).
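To make the reparameterized (PD) estimator above concrete, the following is a minimal sketch, not taken from the thesis, of a single-sample gradient estimate for a conditionally Gaussian $q_\phi(h \mid x)$, written with JAX so that the chain rule appearing in the $\phi$-gradient is handled by automatic differentiation. The affine encoder and decoder, the unit-variance likelihood, and all parameter names (w_mu, b_mu, w_s, b_s, w, b) are assumptions made purely for illustration; they are not the models of [KW13] or [RMW14].

# Minimal sketch of the reparameterized (pathwise derivative) gradient estimator
# for a conditionally Gaussian q_phi(h|x). Toy affine model; illustration only.
import jax
import jax.numpy as jnp

def log_normal(z, mean, std):
    # Log density of a diagonal Gaussian N(mean, std^2), summed over dimensions.
    return jnp.sum(-0.5 * ((z - mean) / std) ** 2
                   - jnp.log(std) - 0.5 * jnp.log(2.0 * jnp.pi))

def surrogate(theta, phi, eps, x):
    # Single-sample surrogate for L_re(theta, phi) at a fixed noise draw eps.
    mu = phi["w_mu"] * x + phi["b_mu"]
    log_sigma = phi["w_s"] * x + phi["b_s"]
    sigma = jnp.exp(log_sigma)
    h = mu + eps * sigma                                   # h = h_phi(eps; x)
    log_lik = log_normal(x, theta["w"] * h + theta["b"], 1.0)   # log p_theta(x|h)
    log_prior = log_normal(h, 0.0, 1.0)                         # log p_theta(h)
    entropy = jnp.sum(log_sigma + 0.5 * jnp.log(2.0 * jnp.pi * jnp.e))  # H[q_phi(.|x)]
    return log_lik + log_prior + entropy

# Unbiased single-sample estimates of dL_re/dtheta and dL_re/dphi:
key = jax.random.PRNGKey(0)
x = jnp.array([0.5, -1.2])
theta = {"w": jnp.ones(2), "b": jnp.zeros(2)}
phi = {"w_mu": jnp.ones(2), "b_mu": jnp.zeros(2),
       "w_s": jnp.zeros(2), "b_s": jnp.zeros(2)}
eps = jax.random.normal(key, x.shape)                      # eps ~ rho = N(0, I)
grad_theta, grad_phi = jax.grad(surrogate, argnums=(0, 1))(theta, phi, eps, x)

Because the noise is sampled outside the graph and $h$ depends on $\phi$ deterministically, differentiating the surrogate directly reproduces both gradient expressions above; averaging over several draws of eps reduces variance.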
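As a companion sketch for the policy-gradient setting just introduced, the score-function (REINFORCE) estimator of [Wil92] can be written as follows, again in JAX. The tabular softmax policy, the absence of discounting and of a baseline, and the assumption that trajectories have already been sampled from the environment are simplifications for illustration only.

# Minimal sketch of the score-function (REINFORCE) policy gradient estimator:
# grad E[return] is estimated by averaging sum_t grad log pi(a_t|s_t) times the
# episode return over sampled trajectories. Illustration only.
import jax
import jax.numpy as jnp

def log_pi(theta, s, a):
    # Log probability of action a in state s under a tabular softmax policy,
    # where theta has shape (n_states, n_actions).
    return jax.nn.log_softmax(theta[s])[a]

def reinforce_gradient(theta, trajectories):
    # trajectories: list of (states, actions, rewards) integer/float arrays,
    # one tuple per episode, sampled by running the current policy.
    def surrogate(th):
        total = 0.0
        for states, actions, rewards in trajectories:
            ret = jnp.sum(rewards)                                   # episode return R
            logps = jax.vmap(lambda s, a: log_pi(th, s, a))(states, actions)
            total += jnp.sum(logps) * ret                            # sum_t log pi(a_t|s_t) * R
        return total / len(trajectories)
    return jax.grad(surrogate)(theta)          # estimate of grad_theta E[return]

A gradient-ascent update would then be theta + step_size * reinforce_gradient(theta, trajectories). Subtracting a baseline and discounting rewards, as in [Wil92; BB01], reduce the variance of this estimator but are omitted here for brevity.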
