OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

3.8 Experiments

[Figure 4 (image): learning curves for the Cartpole, Swimmer, Hopper, and Walker tasks, plotting reward (or cost, -velocity + ctrl) against the number of policy iterations for the Vine, Single Path, Natural Gradient, Max KL, Empirical FIM, CEM, CMA, and RWR methods.]

Figure 4: Learning curves for locomotion tasks, averaged across five runs of each algorithm with random initializations. Note that for the hopper and walker, a score of -1 is achievable without any forward velocity, indicating a policy that simply learned balanced standing, but not walking.

… maximum KL divergence. Videos of the policies learned by TRPO may be viewed on the project website: http://sites.google.com/site/trpopaper.

Note that TRPO learned all of the gaits with general-purpose policies and simple reward functions, using minimal prior knowledge. This is in contrast with most prior methods for learning locomotion, which typically rely on hand-architected policy classes that explicitly encode notions of balance and stepping [TZS04; GPW06; WP09].

3.8.2 Playing Games from Images

To evaluate TRPO on a task with high-dimensional observations, we trained policies for playing Atari games, using raw images as input. The games require learning a variety of behaviors, such as dodging bullets and hitting balls with paddles. Aside from the high dimensionality, challenging elements of these games include delayed rewards (no immediate penalty is incurred when a life is lost in Breakout or Space Invaders); complex sequences of behavior (Q*bert requires a character to hop on 21 different platforms); and non-stationary image statistics (Enduro involves a changing and flickering background). We tested our algorithms on the same seven games reported on in [Mni+13] and [Guo+14], which are made available through the Arcade Learning Environment [Bel+13]. The images were preprocessed following the protocol in Mnih et al. [Mni+13], and the policy was represented by the convolutional neural network shown in Figure 3, with two …
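Since the preprocessing protocol is only referenced above, the following is a minimal sketch of the kind of frame preparation described in Mnih et al. [Mni+13]: convert raw RGB frames to grayscale, downsample them to a small square image, and stack the most recent frames so the policy can infer motion. The class name, the 84x84 output size, the 4-frame stack, and the crude strided downsampling are illustrative assumptions, not the thesis's implementation.

    import numpy as np
    from collections import deque

    class FramePreprocessor:
        """Hypothetical Atari-style frame preprocessor (illustrative only)."""

        def __init__(self, num_stack=4, out_size=84):
            self.num_stack = num_stack
            self.out_size = out_size
            self.frames = deque(maxlen=num_stack)

        def _to_gray(self, rgb_frame):
            # Luminance-weighted grayscale; rgb_frame is an HxWx3 uint8 array.
            return rgb_frame @ np.array([0.299, 0.587, 0.114])

        def _downsample(self, gray):
            # Crude strided downsample to out_size x out_size; a real pipeline
            # would use proper image resizing (e.g. bilinear interpolation).
            h, w = gray.shape
            small = gray[::max(1, h // self.out_size), ::max(1, w // self.out_size)]
            return small[:self.out_size, :self.out_size]

        def reset(self, first_frame):
            # Fill the stack with copies of the first frame of an episode.
            small = self._downsample(self._to_gray(first_frame))
            for _ in range(self.num_stack):
                self.frames.append(small)
            return np.stack(self.frames, axis=0)

        def step(self, frame):
            # Append the newest frame; output shape is (num_stack, out_size, out_size).
            self.frames.append(self._downsample(self._to_gray(frame)))
            return np.stack(self.frames, axis=0)

    # Example: a raw 210x160 RGB Atari frame becomes a (4, 84, 84) observation
    # that a convolutional policy network could take as input.
    obs = FramePreprocessor().reset(np.zeros((210, 160, 3), dtype=np.uint8))

Stacking several consecutive grayscale frames is a common way to make the observation approximately Markovian (velocities are not visible in a single frame); the exact frame count and image size are design choices assumed here for illustration.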
