OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

2 BACKGROUND

2.1 Markov decision processes

A Markov Decision Process (MDP) is a mathematical object that describes an agent interacting with a stochastic environment. It is defined by the following components:

• S: state space, a set of states of the environment.
• A: action space, a set of actions, which the agent selects from at each timestep.
• P(r, s′ | s, a): a transition probability distribution. For each state s and action a, P specifies the probability that the environment will emit reward r and transition to state s′.

In certain problem settings, we will also be concerned with an initial state distribution μ(s), which is the probability distribution that the initial state s0 is sampled from.

Various different definitions of MDP are used throughout the literature. Sometimes, the reward is defined as a deterministic function R(s), R(s, a), or R(s, a, s′). These formulations are equivalent in expressive power. That is, given a deterministic-reward formulation, we can simulate a stochastic reward by lumping the reward into the state.

The end goal is to find a policy π, which maps states to actions. We will mostly consider stochastic policies, which are conditional distributions π(a | s), though elsewhere in the literature, one frequently sees deterministic policies a = π(s).

2.2 The episodic reinforcement learning problem

This thesis will be focused on the episodic setting of reinforcement learning, where the agent’s experience is broken up into a series of episodes: sequences with a finite number of states, actions and rewards. Episodic reinforcement learning in the fully-observed setting is defined by the following process. Each episode begins by sampling an initial state of the environment, s0, from distribution μ(s0). Each timestep t = 0, 1, 2, ..., the
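
As a rough illustration of the components listed in Section 2.1, the sketch below encodes a small tabular MDP in Python. Everything concrete in it is an assumption made for illustration and does not come from the thesis: the two states, the two actions, all probabilities and rewards, and the helper names (sample_categorical, sample_initial_state, sample_action, sample_transition, is_valid_distribution). It shows one plausible way to store P(r, s′ | s, a), μ(s) and a stochastic policy π(a | s) as tables and to draw samples from them.

```python
import random

# Hypothetical two-state, two-action MDP, invented purely for illustration.
STATES = ["s_a", "s_b"]
ACTIONS = ["stay", "move"]

# P[(s, a)] lists (probability, reward, next_state) triples,
# i.e. a tabular form of the joint distribution P(r, s' | s, a).
P = {
    ("s_a", "stay"): [(0.9, 0.0, "s_a"), (0.1, 1.0, "s_b")],
    ("s_a", "move"): [(0.2, 0.0, "s_a"), (0.8, 1.0, "s_b")],
    ("s_b", "stay"): [(1.0, 0.5, "s_b")],
    ("s_b", "move"): [(0.7, 0.0, "s_a"), (0.3, 2.0, "s_b")],
}

# Initial state distribution mu(s); here every episode starts in "s_a".
MU = {"s_a": 1.0, "s_b": 0.0}

# Stochastic policy pi(a | s): one distribution over actions per state.
PI = {
    "s_a": {"stay": 0.5, "move": 0.5},
    "s_b": {"stay": 0.25, "move": 0.75},
}


def is_valid_distribution(probs, tol=1e-9):
    """Check that the probabilities are non-negative and sum to one."""
    probs = list(probs)
    return all(p >= 0.0 for p in probs) and abs(sum(probs) - 1.0) < tol


def sample_categorical(dist, rng):
    """Draw one key of `dist` with probability equal to its value."""
    outcomes = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_initial_state(rng):
    """Sample s0 from mu."""
    return sample_categorical(MU, rng)


def sample_action(state, rng):
    """Sample an action from pi(. | state)."""
    return sample_categorical(PI[state], rng)


def sample_transition(state, action, rng):
    """Sample (reward, next_state) jointly from P(r, s' | state, action)."""
    triples = P[(state, action)]
    weights = [prob for prob, _, _ in triples]
    _, reward, next_state = rng.choices(triples, weights=weights, k=1)[0]
    return reward, next_state


if __name__ == "__main__":
    rng = random.Random(0)
    s0 = sample_initial_state(rng)
    a0 = sample_action(s0, rng)
    r0, s1 = sample_transition(s0, a0, rng)
    print(s0, a0, r0, s1)
```

Storing the reward and next state together in each triple mirrors the joint form P(r, s′ | s, a) used above. A deterministic-reward variant R(s, a, s′) would instead keep the reward in a separate lookup table keyed by (s, a, s′).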
