OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

2.6 policy gradients

We can intuitively justify the choice b(s) ≈ V^π(s) as follows. Suppose we collect a trajectory and compute a noisy gradient estimate

    \hat{g} = \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \sum_{t'=t}^{T-1} r_{t'},

which we will use to update our policy, θ → θ + ε ĝ. This update increases the log-probability of a_t proportionally to the sum of rewards r_t + r_{t+1} + ... + r_{T−1} following that action. In other words, if the sum of rewards is high, then the action was probably good, so we increase its probability. To get a better estimate of whether the action was good, we should check whether the returns were better than expected. Before taking the action, the expected returns were V^π(s_t). Thus, the difference \sum_{t'=t}^{T-1} r_{t'} - b(s_t) is an approximate estimate of the goodness of action a_t; Chapter 4 discusses in a more precise way how it is an estimate of the advantage function. Including the baseline in our policy gradient estimator, we get

    \hat{g} = \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \left( \sum_{t'=t}^{T-1} r_{t'} - b(s_t) \right),

which increases the probability of the actions that we infer to be good, meaning those for which the estimated advantage \hat{A}_t = \sum_{t'=t}^{T-1} r_{t'} - b(s_t) is positive.

If the trajectories are very long (i.e., T is large), then the preceding formula will have excessive variance. Thus, practitioners generally use a discount factor, which reduces variance at the cost of some bias. The following expression gives a biased estimator of the policy gradient:

    \hat{g} = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'} - b(s_t) \right).

To reduce variance in this biased estimator, we should choose b(s_t) to optimally estimate the discounted sum of rewards,

    b(s) ≈ \hat{V}^{\pi,\gamma}(s) = \mathbb{E}\!\left[ \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'} \;\middle|\; s_t = s;\ a_{t:(T-1)} \sim \pi \right].

Intuitively, the discount makes us pretend that the action a_t has no effect on the reward r_{t'} for t' sufficiently far in the future; i.e., we are downweighting these delayed effects by a factor of γ^{t'−t}.
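
The sketch below (not from the thesis) illustrates the last estimator numerically: given per-timestep score gradients ∇_θ log π_θ(a_t | s_t), rewards, and baseline values b(s_t) for a single trajectory, it computes the discounted reward-to-go, subtracts the baseline to form Â_t, and sums the advantage-weighted score terms. The function and variable names are invented for this illustration, and the random inputs stand in for quantities a real implementation would obtain from a policy rollout and a fitted value function.

import numpy as np

def policy_gradient_estimate(grad_log_pi, rewards, baseline, gamma=0.99):
    """Return ĝ = Σ_t ∇_θ log π_θ(a_t|s_t) (Σ_{t'≥t} γ^{t'−t} r_{t'} − b(s_t))."""
    T = len(rewards)
    returns = np.zeros(T)
    running = 0.0
    # Discounted reward-to-go, computed by a backward sweep over the trajectory.
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - baseline  # Â_t = discounted return-to-go minus b(s_t)
    # Weight each score-function term by its advantage and sum over timesteps.
    return (grad_log_pi * advantages[:, None]).sum(axis=0)

# Toy usage: T = 5 timesteps, 3 policy parameters, zero baseline.
rng = np.random.default_rng(0)
g_hat = policy_gradient_estimate(rng.normal(size=(5, 3)),  # ∇_θ log π_θ(a_t|s_t)
                                 rng.normal(size=5),        # r_t
                                 np.zeros(5))               # b(s_t)
print(g_hat)  # one sample of the (biased, discounted) policy-gradient estimate

Averaging this estimate over many trajectories, and fitting the baseline to the discounted returns as in the last equation above, yields the standard policy-gradient update with a value-function baseline.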
