OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS
2.6 Policy Gradients

where μ is the initial state distribution. When we take the logarithm, the product turns into a sum, and when we differentiate with respect to θ, the P(s_t | s_{t−1}, a_{t−1}) terms drop out, as does μ(s_0). We obtain

$$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, R(\tau)\right]$$

It is somewhat remarkable that we are able to compute the policy gradient without knowing anything about the system dynamics, which are encoded in the transition probabilities P. The intuitive interpretation is that we collect a trajectory, and then increase its log-probability proportionally to its goodness. That is, if the reward R(τ) is very high, we ought to move in the direction in parameter space that increases log p(τ | θ).

Note: trajectory lengths and time-dependence. Here, we are considering trajectories with fixed length T, whereas the definitions of MDPs and POMDPs above assumed variable or infinite length, and stationary (time-independent) dynamics. The derivations in policy gradient methods are much easier to analyze with fixed-length trajectories; otherwise we end up with infinite sums. The fixed-length case can be made to mostly subsume the variable-length case by making T very large and, instead of trajectories ending, having the system go into a sink state with zero reward. As a result of using finite-length trajectories, certain quantities become time-dependent, because the problem is no longer stationary. However, we can include time in the state so that we don't need to separately account for the dependence on time. Thus, we will omit the time-dependence of various quantities below, such as the state-value function V^π.

We can derive versions of this formula that eliminate terms to reduce variance. This calculation is provided in much more generality in Chapter 5 on stochastic computation graphs, but we'll include it here because the concrete setting of this chapter will be easier to understand.
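The score-function identity above can be checked numerically. The following is a minimal sketch, not code from the thesis: it assumes a toy problem with a single state, two actions, a softmax policy over two logits θ, and a per-action reward, so that the Monte Carlo estimate of the policy gradient can be compared against intuition (the logit of the better action should receive a positive gradient).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration, not from the thesis):
# one state, two actions, softmax policy, horizon T, reward 1 for action 1.
T = 5
theta = np.array([0.0, 0.0])
rewards = np.array([0.0, 1.0])

def policy_probs(theta):
    """Softmax over the two logits theta."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def sample_trajectory(theta):
    """Sample T actions from pi(. | theta); return actions and R(tau)."""
    p = policy_probs(theta)
    actions = rng.choice(2, size=T, p=p)
    return actions, rewards[actions].sum()

def grad_log_pi(theta, a):
    """grad_theta log pi(a | theta) for a softmax policy: one_hot(a) - p."""
    p = policy_probs(theta)
    g = -p
    g[a] += 1.0
    return g

def policy_gradient_estimate(theta, n_samples=2000):
    """Monte Carlo estimate of grad_theta E[R(tau)] via the score function:
    average over samples of (sum_t grad log pi(a_t | theta)) * R(tau)."""
    g = np.zeros_like(theta)
    for _ in range(n_samples):
        actions, R = sample_trajectory(theta)
        score = sum(grad_log_pi(theta, a) for a in actions)
        g += score * R
    return g / n_samples

g = policy_gradient_estimate(theta)
# The estimate should push the logit of the rewarded action (action 1) up.
```

Note that the estimator never touches transition probabilities: only the sampled actions, the policy's own log-probabilities, and the observed return enter the computation, which is exactly the point made above.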
First, we can apply the above argument to compute the gradient for a single reward term:

$$\nabla_\theta \mathbb{E}_\tau[r_t] = \mathbb{E}_\tau\!\left[\sum_{t'=0}^{t} \nabla_\theta \log \pi(a_{t'} \mid s_{t'}, \theta)\, r_t\right]$$

Note that the sum goes up to t, because the expectation over r_t can be written in terms of actions a_{t'} with t' ≤ t. Summing over time (taking $\sum_{t=0}^{T-1}$ of the above equation), we obtain

$$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \sum_{t'=t}^{T-1} r_{t'}\right]$$
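The variance reduction promised by this per-timestep argument can also be seen empirically. The sketch below reuses the same assumed toy setup (single state, two actions, softmax policy, per-step reward equal to the chosen action); it is illustrative, not code from the thesis. It compares, sample by sample, the estimator that multiplies every score term by the full return R(τ) against the one where the score at time t is multiplied only by rewards from time t onward.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same assumed toy setup as before: T i.i.d. actions under a softmax
# policy over two logits; per-step reward r_t = a_t.
T = 5
theta = np.array([0.0, 0.0])
step_reward = np.array([0.0, 1.0])

def policy_probs(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """grad_theta log pi(a | theta) for a softmax policy."""
    p = policy_probs(theta)
    g = -p
    g[a] += 1.0
    return g

def per_sample_grads(theta, n=5000):
    """Per-sample gradient estimates under two estimators:
    (a) 'full':  every score term multiplied by the whole return R(tau);
    (b) 'togo':  the score at time t multiplied by sum_{t' >= t} r_{t'}."""
    p = policy_probs(theta)
    full, togo = [], []
    for _ in range(n):
        actions = rng.choice(2, size=T, p=p)
        r = step_reward[actions]
        scores = np.array([grad_log_pi(theta, a) for a in actions])
        full.append(scores.sum(axis=0) * r.sum())
        # Reward-to-go: reversed cumulative sum gives sum_{t' >= t} r_{t'}.
        rtg = np.cumsum(r[::-1])[::-1]
        togo.append((scores * rtg[:, None]).sum(axis=0))
    return np.array(full), np.array(togo)

full, togo = per_sample_grads(theta)
# Both estimators agree in expectation; dropping the zero-mean terms
# (scores multiplied by past rewards) lowers the per-sample variance.
```

The design point is that the terms removed, where a score at time t multiplies rewards earned before t, have zero expectation but nonzero variance, so discarding them leaves the estimator unbiased while making it strictly less noisy in this setting.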