OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

Computing the Fisher-vector product is typically about as expensive as computing the gradient of an objective that depends on μ(x) [WN99]. Furthermore, we need to compute k of these Fisher-vector products per gradient, where k is the number of iterations of the conjugate gradient algorithm we perform. We found k = 10 to be quite effective, and using higher k did not result in faster policy improvement. Hence, a naïve implementation would spend more than 90% of the computational effort on these Fisher-vector products. However, we can greatly reduce this burden by subsampling the data used to compute the Fisher-vector products. Since the Fisher information matrix merely acts as a metric, it can be computed on a subset of the data without severely degrading the quality of the final step. Hence, we can compute it on 10% of the data, and the total cost of the Fisher-vector products will be about the same as computing the gradient. With this optimization, the computation of a natural gradient step A^{-1} g does not incur a significant extra computational cost beyond computing the gradient g.

3.13 approximating factored policies with neural networks

The policy, which is a conditional probability distribution πθ(a | s), can be parameterized with a neural network. The most straightforward way to do so is to have the neural network map (deterministically) from the state vector s to a vector μ that specifies a distribution over the action space. Then we can compute the likelihood p(a | μ) and sample a ∼ p(a | μ).

For our experiments with continuous state and action spaces, we used a Gaussian distribution whose covariance matrix was diagonal and independent of the state. A neural network with several fully-connected (dense) layers maps from the input features to the mean of the Gaussian distribution, and a separate set of parameters specifies the log standard deviation of each action component. More concretely, the parameters consist of the weights and biases of the neural network computing the mean, {W_i, b_i}_{i=1}^L, and a vector r of log standard deviations with the same dimension as a. Then the policy is defined by the normal distribution N(mean = NeuralNet(s; {W_i, b_i}_{i=1}^L), stdev = exp(r)). Here, μ = [mean, stdev].

For the experiments with discrete actions (Atari), we use a factored discrete action space, where each factor is parameterized as a categorical distribution. These factors correspond to the action components (left, no-op, right), (up, no-op, down), and (fire, no-fire). Thus, the neural network outputs a vector of dimension 3 + 3 + 2 = 8, where each factor's components are normalized to form a probability distribution. The process for computing the factored probability distribution is shown in Figure 5.
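To make the Gaussian parameterization above concrete, here is a minimal numpy sketch. The two-hidden-layer tanh architecture, the layer sizes, and the helper names mlp_mean and sample_gaussian_policy are illustrative assumptions rather than the exact network used in the thesis; the only structural commitments taken from the text are a deterministic mean network and a state-independent vector r of log standard deviations.

```python
import numpy as np

def mlp_mean(s, weights, biases):
    """Map state s to the mean of the action distribution with a
    fully-connected tanh network (hypothetical 2-hidden-layer example)."""
    h = s
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)
    return h @ weights[-1] + biases[-1]   # final layer is linear

def sample_gaussian_policy(s, weights, biases, log_std, rng):
    """Sample a ~ N(mean(s), diag(exp(log_std))^2); the covariance is
    diagonal and independent of the state, as described above."""
    mean = mlp_mean(s, weights, biases)
    std = np.exp(log_std)                 # r = vector of log standard deviations
    return mean + std * rng.standard_normal(mean.shape)

# Example usage with made-up sizes: 10-dim state, 3-dim action.
rng = np.random.default_rng(0)
sizes = [10, 64, 64, 3]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
log_std = np.zeros(3)                     # the vector r, learned alongside the weights
a = sample_gaussian_policy(rng.standard_normal(10), weights, biases, log_std, rng)
```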
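Similarly, for the factored discrete (Atari) policy, the sketch below splits an 8-dimensional network output into the three factors of sizes 3 + 3 + 2, normalizes each factor, and samples one index per factor. Treating the 8 outputs as raw logits normalized by a per-factor softmax, along with the helper names, is an assumption made for illustration; the text only states that each factor's components are normalized.

```python
import numpy as np

FACTOR_SIZES = [3, 3, 2]   # (left, no-op, right), (up, no-op, down), (fire, no-fire)

def softmax(z):
    z = z - z.max()                        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def factored_action_probs(logits):
    """Split the 8-dim network output into factors and normalize each one."""
    probs, start = [], 0
    for size in FACTOR_SIZES:
        probs.append(softmax(logits[start:start + size]))
        start += size
    return probs                           # list of three categorical distributions

def sample_factored_action(logits, rng):
    """Sample one index per factor; the joint action is the tuple of indices."""
    return tuple(rng.choice(len(p), p=p) for p in factored_action_probs(logits))

# Example usage with an arbitrary logit vector standing in for the network output.
rng = np.random.default_rng(0)
action = sample_factored_action(rng.standard_normal(sum(FACTOR_SIZES)), rng)
```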
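Returning to the natural gradient computation discussed at the start of this page, the sketch below illustrates the cost structure being described: k = 10 conjugate gradient iterations, each consuming one Fisher-vector product evaluated on a 10% subsample of the data. The explicit per-sample outer-product Fisher matrix, the damping term, and the function names are stand-ins chosen for illustration; in practice the product would be formed implicitly (e.g., via automatic differentiation) rather than by materializing any matrix.

```python
import numpy as np

def conjugate_gradient(fvp, g, k=10):
    """Approximately solve A x = g with k CG iterations, where A is only
    accessed through the Fisher-vector product callback fvp(v)."""
    x = np.zeros_like(g)
    r = g.copy()              # residual g - A x (x starts at zero)
    p = g.copy()
    rr = r @ r
    for _ in range(k):
        Ap = fvp(p)           # one Fisher-vector product per CG iteration
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x                  # approximate natural gradient direction A^{-1} g

def make_subsampled_fvp(u, frac=0.1, damping=1e-3, rng=None):
    """Build an FVP callback on a fixed 10% subsample of the per-sample
    vectors u, exploiting the fact that the Fisher matrix only acts as a metric."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(u), size=max(1, int(frac * len(u))), replace=False)
    U = u[idx]
    return lambda v: U.T @ (U @ v) / len(idx) + damping * v

# Toy example: Fisher matrix A = mean_i u_i u_i^T over n = 1000 samples.
rng = np.random.default_rng(0)
u = rng.standard_normal((1000, 5))
g = rng.standard_normal(5)
step_direction = conjugate_gradient(make_subsampled_fvp(u, rng=rng), g, k=10)
```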
