
OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

2.6 Policy Gradients

… a factor of γ^(t′−t). By adding up a series with coefficients 1, γ, γ², …, we are effectively including 1/(1 − γ) timesteps in the sum. The policy gradient formulas given above can be used in a practical algorithm for optimizing policies.

Algorithm 2: "Vanilla" policy gradient algorithm

  Initialize policy parameter θ, baseline b
  for iteration = 1, 2, … do
      Collect a set of trajectories by executing the current policy
      At each timestep in each trajectory, compute the return R_t = Σ_{t′=t}^{T−1} γ^(t′−t) r_{t′} and the advantage estimate Â_t = R_t − b(s_t)
      Re-fit the baseline by minimizing ∥b(s_t) − R_t∥², summed over all trajectories and timesteps
      Update the policy using a policy gradient estimate ĝ, which is a sum of terms ∇_θ log π(a_t | s_t, θ) Â_t
  end for

In the algorithm above, the policy update can be performed with stochastic gradient ascent, θ → θ + ε ĝ, or one can use a more sophisticated method such as Adam [KB14]. To numerically compute the policy gradient estimate using automatic differentiation software, we swap the sum with the expectation in the policy gradient estimator:

  ĝ = Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t | s_t) ( Σ_{t′=t}^{T−1} γ^(t′−t) r_{t′} − b(s_t) )
    = ∇_θ Σ_{t=0}^{T−1} log π_θ(a_t | s_t) Â_t

Hence, one can construct the scalar quantity Σ_t log π_θ(a_t | s_t) Â_t and differentiate it to obtain the policy gradient.

The vanilla policy gradient method described above has been well known for a long time; some early papers include [Wil92; Sut+99; JJS94]. It was considered to be a poor choice on most problems because of its high sample complexity. A couple of other practical difficulties are that (1) it is hard to choose a stepsize that works for the entire course of the optimization, especially because the statistics of the states and rewards change; …
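The surrogate-loss construction maps directly onto automatic differentiation software: build the scalar Σ_t log π_θ(a_t | s_t) Â_t and let the framework differentiate it. Below is a minimal sketch of one iteration of Algorithm 2, assuming PyTorch (not part of the thesis), a small discrete-action policy network, a hypothetical `baseline` module already re-fit to the returns, and trajectory dictionaries with keys "obs", "actions", and "rewards"; all of these names and shapes are illustrative assumptions, not the author's implementation.

```python
# Sketch of the "vanilla" policy gradient update via a surrogate loss (PyTorch assumed).
import torch
import torch.nn as nn

gamma = 0.99
# Hypothetical policy: 4-dimensional observations, 2 discrete actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def discounted_returns(rewards, gamma):
    """Compute R_t = sum_{t'>=t} gamma^(t'-t) r_{t'} for one trajectory."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def policy_gradient_step(trajectories, baseline):
    """One policy update on a batch of trajectories.

    Each trajectory is a dict with tensors 'obs' (T x 4) and 'actions' (T,),
    and a list 'rewards' of length T. `baseline` maps states to scalar value
    predictions b(s_t); here it is assumed to have been re-fit separately.
    """
    surrogate_terms = []
    for traj in trajectories:
        returns = torch.tensor(discounted_returns(traj["rewards"], gamma))
        # Â_t = R_t - b(s_t); detach so the baseline receives no gradient from this loss.
        advantages = returns - baseline(traj["obs"]).detach().squeeze(-1)
        logits = policy(traj["obs"])
        logp = torch.distributions.Categorical(logits=logits).log_prob(traj["actions"])
        # Scalar surrogate sum_t log pi_theta(a_t | s_t) * Â_t; its gradient is ĝ.
        surrogate_terms.append((logp * advantages).sum())
    surrogate = torch.stack(surrogate_terms).sum()
    optimizer.zero_grad()
    (-surrogate).backward()  # ascend the surrogate by descending its negative
    optimizer.step()
```

The negative sign appears only because standard optimizers minimize, while the policy gradient ascends the expected return; detaching the baseline keeps the two steps of Algorithm 2 separate, with b re-fit by regression onto R_t rather than through the policy's surrogate loss.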
