OPTIMIZING EXPECTATIONS: FROM DEEP REINFORCEMENT LEARNING TO STOCHASTIC COMPUTATION GRAPHS

...should combine information from many previous timesteps, so the action a_t depends on the preceding history h_t = (y_0, a_0, y_1, a_1, ..., y_{t-1}, a_{t-1}, y_t). The data-generating process is given by the following equations and the figure below.

    s_0, y_0 ∼ μ_0
    a_0 ∼ π(a_0 | h_0)
    s_1, y_1, r_0 ∼ P(s_1, y_1, r_0 | s_0, a_0)
    a_1 ∼ π(a_1 | h_1)
    s_2, y_2, r_1 ∼ P(s_2, y_2, r_1 | s_1, a_1)
    ...
    a_{T-1} ∼ π(a_{T-1} | h_{T-1})
    s_T, y_T, r_{T-1} ∼ P(s_T, y_T, r_{T-1} | s_{T-1}, a_{T-1})

This process is called a partially observed Markov decision process (POMDP). The partially observed setting is equivalent to the fully observed setting because we can treat the observation history h_t as the state of the system; that is, a POMDP can be written as an MDP (with an infinite state space). When using function approximation, the partially observed setting is not much different conceptually from the fully observed setting.

[Figure: agent-environment interaction in a POMDP. The agent's policy π maps histories h_0, h_1, ..., h_{T-1} to actions a_0, a_1, ..., a_{T-1}; the environment, with initial distribution μ_0 and transition distribution P, produces states s_0, ..., s_T, observations y_0, ..., y_T, and rewards r_0, ..., r_{T-1}.]

2.4 policies

We'll typically use parameterized stochastic policies, which we'll write as π_θ(a | s). Here, θ ∈ R^d is a parameter vector that specifies the policy. For example, if the policy is a neural network, θ would correspond to the flattened weights and biases of the network.

The parameterization of the policy will depend on the action space of the MDP, and whether it is a discrete set or a continuous space. The following are sensible choices (but not the only choices) for how to define deterministic and stochastic neural network policies. With a discrete action space, we'll use a neural network that outputs action probabilities, i.e., the final layer is a softmax layer. With a continuous action space, we'll use a neural network that outputs the mean of a Gaussian distribution, with a separate set of parameters specifying a diagonal covariance matrix. Since the optimal policy in an MDP or POMDP is deterministic, we don't lose much by using a simple action distribution (e.g., a diagonal covariance matrix rather than a full covariance matrix or a more complicated multi-modal distribution).
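To make the two parameterizations above concrete, here is a minimal sketch of a categorical (softmax) policy for a discrete action space and a diagonal-Gaussian policy for a continuous action space. The use of PyTorch, the hidden-layer size, the state-independent log-standard-deviation parameter, and all class and variable names are choices made for this illustration, not details taken from the text.

    import torch
    import torch.nn as nn

    class CategoricalPolicy(nn.Module):
        # pi_theta(a | s) for a discrete action space: the final layer produces
        # logits, i.e. action probabilities after a softmax.
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, obs):
            return torch.distributions.Categorical(logits=self.net(obs))

    class DiagonalGaussianPolicy(nn.Module):
        # pi_theta(a | s) for a continuous action space: the network outputs the
        # mean; a separate parameter vector gives a diagonal covariance via log std.
        def __init__(self, obs_dim, act_dim, hidden=64):
            super().__init__()
            self.mean_net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, act_dim),
            )
            self.log_std = nn.Parameter(torch.zeros(act_dim))

        def forward(self, obs):
            mean = self.mean_net(obs)
            return torch.distributions.Normal(mean, self.log_std.exp())

    # theta corresponds to the flattened weights and biases (plus log_std) of the network.
    policy = DiagonalGaussianPolicy(obs_dim=8, act_dim=2)
    obs = torch.randn(8)
    dist = policy(obs)
    action = dist.sample()                  # a ~ pi_theta(. | s)
    logp = dist.log_prob(action).sum()      # sum over the independent action dimensions

The state-independent, diagonal log-std mirrors the simplification discussed above: since the optimal policy is deterministic, little is lost by using a simple action distribution rather than a full covariance matrix or a multi-modal distribution.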
