A summary of Reinforcement Learning
(1) Policy iteration:
First evaluate the current policy by the Bellman equation. In particular, compute the state-value function of the current policy $\pi$ by iterating the Bellman expectation equation, $v_\pi(s) = \sum_a \pi(a\mid s) \sum_{s',r} p(s',r\mid s,a)\,[\,r + \gamma v_\pi(s')\,]$, until it converges.
The value function for the current policy then helps find a better policy, namely policy improvement. At any state $s$, the improved policy acts greedily with respect to $v_\pi$: $\pi'(s) = \arg\max_a \sum_{s',r} p(s',r\mid s,a)\,[\,r + \gamma v_\pi(s')\,]$.
The detailed algorithm is presented in the pseudocode.
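Since the pseudocode itself is not reproduced here, the following is a minimal tabular sketch of policy iteration, assuming the MDP is given as hypothetical arrays `P[s, a, s']` of transition probabilities and `R[s, a]` of expected rewards:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Tabular policy iteration.

    P: transition probabilities, shape (S, A, S); R: expected rewards, shape (S, A).
    """
    n_states, _, _ = P.shape
    pi = np.zeros(n_states, dtype=int)           # deterministic policy: state -> action
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: sweep until the Bellman expectation backup converges.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_old = V[s]
                V[s] = R[s, pi[s]] + gamma * P[s, pi[s]] @ V
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated value function.
        stable = True
        for s in range(n_states):
            q = R[s] + gamma * P[s] @ V          # action values for state s, shape (A,)
            best = int(np.argmax(q))
            if best != pi[s]:
                stable = False
            pi[s] = best
        if stable:                               # stop once the greedy policy no longer changes
            return pi, V
```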
(2) Value iteration: In policy iteration, policy evaluation involves multiple sweeps through the state set, and policy improvement occurs only after the evaluation converges.
In fact, policy evaluation can be truncated after just one update of each state without losing the convergence guarantee. The resulting algorithm, which combines one sweep of evaluation and policy improvement into a single backup, $v_{k+1}(s) = \max_a \sum_{s',r} p(s',r\mid s,a)\,[\,r + \gamma v_k(s')\,]$, is called value iteration.
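A corresponding sketch of value iteration under the same assumed `P`/`R` representation; each sweep applies the Bellman optimality backup directly:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Tabular value iteration: truncated evaluation and improvement in one backup."""
    n_states, _, _ = P.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            V[s] = np.max(R[s] + gamma * P[s] @ V)   # Bellman optimality backup
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    # Extract a greedy policy from the converged value function.
    pi = np.array([int(np.argmax(R[s] + gamma * P[s] @ V)) for s in range(n_states)])
    return pi, V
```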
In dynamic programming, it is required that a model of the environment's dynamics is known, which limits its application. Monte Carlo (MC) methods instead learn the value function from experience in the form of sample episodes.
The first MC method, MC control with exploring starts, is presented below. One of the greatest difficulties in RL is the fundamental dilemma of exploration versus exploitation.
The exploring-starts assumption in that algorithm seems unrealistic. One way to ensure sufficient exploration without it is to use a stochastic policy, namely an $\varepsilon$-soft (e.g., $\varepsilon$-greedy) policy that selects every action with non-zero probability.
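As a rough sketch of on-policy first-visit MC control with an $\varepsilon$-greedy policy (not the original pseudocode; the gym-style interface `reset()` / `step(a) -> (next_state, reward, done)` and `env.action_space.n` are assumptions):

```python
import numpy as np
from collections import defaultdict

def mc_control_epsilon_soft(env, num_episodes, gamma=0.99, epsilon=0.1):
    """On-policy first-visit MC control with an epsilon-greedy (epsilon-soft) policy."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))

    def behaviour(state):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        # Generate an episode by following the epsilon-soft policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = behaviour(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # First-visit MC: update Q(s, a) towards the return following the first visit.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]   # incremental sample average
    return Q
```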
The previous two algorithms are on-policy methods. Off-policy MC control is shown below: the agent follows a behaviour policy while learning about and improving a separate target policy. Although the target policy may be deterministic, the stochastic behaviour policy ensures that all actions continue to be sampled.
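A corresponding sketch of off-policy MC control with weighted importance sampling, under the same assumed environment interface; the behaviour policy is $\varepsilon$-greedy with respect to Q, while the target policy is greedy:

```python
import numpy as np
from collections import defaultdict

def off_policy_mc_control(env, num_episodes, gamma=0.99, epsilon=0.1):
    """Off-policy MC control with weighted importance sampling."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    C = defaultdict(lambda: np.zeros(n_actions))   # cumulative importance-sampling weights

    for _ in range(num_episodes):
        # Generate an episode with the soft behaviour policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Work backwards, updating Q towards the greedy target policy.
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[s][a] += W
            Q[s][a] += (W / C[s][a]) * (G - Q[s][a])
            if a != int(np.argmax(Q[s])):
                break                              # target policy never takes a: weight is zero
            # Behaviour probability of the greedy action under the epsilon-greedy policy.
            W /= 1 - epsilon + epsilon / n_actions
    return Q
```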
TD learning combines ideas of MC methods and DP methods. Like MC methods, it does not require a model of the environment's dynamics and learns values from experience. Like DP methods, it updates estimates based on other estimates (bootstrapping) and does not have to wait until the end of the episode for the return.
In MC methods, the value of a state is updated as follows: $V(S_t) \leftarrow V(S_t) + \alpha\,[\,G_t - V(S_t)\,]$, where $G_t$ is the return following time $t$ and $\alpha$ is a constant step-size parameter.
In TD methods, the agent only needs to wait until the next time step. The simplest such method, TD(0), updates $V(S_t) \leftarrow V(S_t) + \alpha\,[\,R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\,]$.
Since the agent does not need to wait until the episode ends, TD methods can be implemented in an online, fully incremental fashion.
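A minimal sketch of tabular TD(0) prediction under the same assumed environment interface; `policy` is a hypothetical callable that returns the action to take in a state:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """Tabular TD(0) evaluation of a given policy."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target: immediate reward plus discounted estimate of the next state.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```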
One on-policy TD control method is Sarsa. The target action value is the immediate reward plus the discounted action value of the next state-action pair: $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\,[\,R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\,]$.
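A Sarsa sketch under the same assumptions; note that the bootstrapped target uses the action actually chosen by the current $\varepsilon$-greedy policy:

```python
import numpy as np
from collections import defaultdict

def sarsa(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control (Sarsa) with an epsilon-greedy policy."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))

    def epsilon_greedy(state):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = epsilon_greedy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)
            # On-policy target: uses the action the policy will actually take next.
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```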
Q-learning is an off-policy TD control method. The learned action-value function directly approximates the optimal action-value function $q_*$, independently of the policy being followed: $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\,[\,R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\,]$.
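A matching Q-learning sketch; the only change from Sarsa is that the target maximizes over the next actions, regardless of the action the behaviour policy takes:

```python
import numpy as np
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy TD control (Q-learning) with an epsilon-greedy behaviour policy."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy behaviour policy.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Off-policy target: greedy with respect to the current estimates.
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```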
Unlike tabular and traditional non-parametric methods such as Q-learning, the Deep Q-Network (DQN) can deal efficiently with the curse of dimensionality. It approximates the action-value function with a deep neural network, $Q(s,a;\theta) \approx q_*(s,a)$.
Reinforcement learning can be unstable or even diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (Q) function. The causes are the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore the data distribution, and the correlations between the action values $Q$ and the target values $r + \gamma \max_{a'} Q(s',a')$.
Two key ideas are used to address these issues:
(1) A replay buffer that randomizes over the data, removing correlations in the observation sequence and smoothing over changes in the data distribution.
(2) A target network whose parameters are updated only periodically, instead of at every iteration, which reduces correlations with the target values.
More specifically, the DQN parameterizes an approximate value function $Q(s,a;\theta_i)$ and minimizes the loss $L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\big[\,\big(r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i)\big)^2\,\big]$, where $D$ is the replay buffer, $\theta_i$ are the parameters of the Q-network at iteration $i$, and $\theta_i^-$ are the parameters of the target network, which are updated only periodically.
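A minimal PyTorch sketch of one gradient step on this loss; `q_net`, `target_net`, `optimizer`, and the layout of `batch` are assumptions for illustration, not the original implementation:

```python
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the DQN loss with a periodically-synced target network.

    batch: tensors (states, actions, rewards, next_states, dones) sampled uniformly
    from the replay buffer; actions is a LongTensor of action indices, dones is 0/1.
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target r + gamma * max_a' Q(s', a'; theta^-), computed with the frozen target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(q_net, target_net):
    """Periodically copy the online weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```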
Deep Q-learning is (1) model-free: it does not explicitly estimate the reward and transition dynamics; and (2) off-policy: it learns about the greedy policy while selecting actions with an $\varepsilon$-greedy behaviour policy.
Advantages of the replay buffer: (1) each step of experience is potentially used in many updates, which allows for greater data efficiency; (2) it breaks the correlations between samples; (3) the behaviour distribution is averaged over many previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
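A simple replay buffer along these lines (a sketch; a fixed-capacity deque with uniform sampling is all that is needed):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions; uniform sampling breaks temporal correlations."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```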
The methods introduced above are all categorized as action-value methods, which select actions based on action-value estimates. Another class of methods learns a parameterized policy that enables actions to be chosen without consulting action-value estimates. A value function may still be used to learn the policy parameter, e.g., in REINFORCE with baseline, but it is not required for action selection.
The advantages of policy-based methods are that (1) they can learn appropriate levels of exploration and approach deterministic policies asymptotically; (2) they can learn specific probabilities for taking the actions; (3) they can naturally handle continuous action spaces; and (4) the action probabilities change smoothly as a function of the learned parameters.
The policy gradient theorem below gives an exact formula for how performance is affected by the policy parameter that does not involve the derivative of the state distribution: $\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\,\nabla \pi(a\mid s,\boldsymbol{\theta})$, where $\mu$ is the on-policy state distribution under $\pi$ and $q_\pi$ is the action-value function of $\pi$. This provides a theoretical foundation for all policy gradient methods, which update the policy parameter on each step in the direction of an estimate of the gradient of performance with respect to the policy parameter.
Now we need to obtain samples such that the expectation of the sample gradient is proportional to the actual gradient of the performance measure. This is based on the equations below:
$$\nabla J(\boldsymbol{\theta}) \propto \mathbb{E}_\pi\Big[\sum_a q_\pi(S_t,a)\,\nabla \pi(a\mid S_t,\boldsymbol{\theta})\Big] = \mathbb{E}_\pi\Big[q_\pi(S_t,A_t)\,\frac{\nabla \pi(A_t\mid S_t,\boldsymbol{\theta})}{\pi(A_t\mid S_t,\boldsymbol{\theta})}\Big] = \mathbb{E}_\pi\big[G_t\,\nabla \ln \pi(A_t\mid S_t,\boldsymbol{\theta})\big].$$
The final expression is a quantity that can be sampled on each time step and whose expectation is proportional to the gradient. Therefore the update rule is
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\, G_t\, \nabla \ln \pi(A_t\mid S_t,\boldsymbol{\theta}_t),$$
where $G_t$ is the return from time $t$ and $\alpha$ is the step size.
The update rule moves the parameter most in the directions that favor actions yielding the highest return. The detailed algorithm is presented below:
This method, REINFORCE, is a Monte Carlo algorithm because it relies on the complete return $G_t$, which is only available at the end of the episode.
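A PyTorch sketch of REINFORCE (not the original pseudocode); `policy` is assumed to map a state tensor to a `torch.distributions.Categorical`, and the environment interface is the same gym-like assumption as above:

```python
import torch

def reinforce_episode(policy, optimizer, env, gamma=0.99):
    """One REINFORCE update computed from a single complete episode."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        dist = policy(torch.as_tensor(state, dtype=torch.float32))
        action = dist.sample()
        state, reward, done = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    # Compute the return G_t for every time step of the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # Gradient ascent on E[G_t * ln pi(A_t | S_t)]: minimise the negative objective.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```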
One way to reduce the variance is to include a comparison of the return to an arbitrary baseline $b(S_t)$: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\,\big(G_t - b(S_t)\big)\,\nabla \ln \pi(A_t\mid S_t,\boldsymbol{\theta}_t)$. It is natural to choose an estimate of the state value, $\hat{v}(S_t,\mathbf{w})$, learned with a separate weight vector $\mathbf{w}$, as the baseline.
The previous two methods are unbiased but tend to learn slowly because they produce gradient estimates of high variance. They are also inconvenient to implement online or for continuing problems.
As in TD methods, we can use bootstrapping and replace the full return in the REINFORCE target with the one-step return. The update rule (with a learned state-value function as baseline) is $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\,\big(R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\big)\,\nabla \ln \pi(A_t\mid S_t,\boldsymbol{\theta}_t)$.
Although bootstrapping introduces bias, it reduces the variance substantially, and learning thus becomes faster.
The pseudocode is presented below. This is a fully online, incremental algorithm.
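As the pseudocode is not reproduced here, the following is a sketch of the one-step actor-critic update; `policy` and `value_fn` are assumed to be small networks with their own optimizers (hypothetical names), and the environment interface is gym-like as before:

```python
import torch

def actor_critic_episode(policy, value_fn, pi_optimizer, v_optimizer, env, gamma=0.99):
    """One-step actor-critic: the TD error from the critic drives both updates."""
    state, done, I = env.reset(), False, 1.0      # I accumulates gamma^t
    while not done:
        s = torch.as_tensor(state, dtype=torch.float32)
        dist = policy(s)                          # torch.distributions.Categorical (assumed)
        action = dist.sample()
        next_state, reward, done = env.step(action.item())
        s_next = torch.as_tensor(next_state, dtype=torch.float32)

        # One-step TD error: delta = R + gamma * v(S') - v(S), with v(S') = 0 at the terminal state.
        v_s = value_fn(s)
        v_next = torch.zeros(()) if done else value_fn(s_next).detach()
        delta = reward + gamma * v_next - v_s

        # Critic update: move v(S) towards the bootstrapped target (squared TD error).
        v_loss = delta.pow(2)
        v_optimizer.zero_grad()
        v_loss.backward()
        v_optimizer.step()

        # Actor update: policy-gradient step weighted by the TD error.
        pi_loss = -I * delta.detach() * dist.log_prob(action)
        pi_optimizer.zero_grad()
        pi_loss.backward()
        pi_optimizer.step()

        state, I = next_state, I * gamma
```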
This algorithm is called actor-critic because the state-value function (the critic) is used to provide feedback to the policy (the actor). Formally, REINFORCE with baseline is not an actor-critic method, because its value function is used only as a baseline and not for bootstrapping; the baseline neither introduces bias nor changes the expected value of the update.
Policy-based methods are able to deal with large action spaces, even continuous spaces with an infinite number of actions. Instead of learning a probability for each individual action, we can learn the parameters of a probability distribution over each entry of the action vector, e.g., the mean and variance of a Gaussian distribution.
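A sketch of such a parameterization in PyTorch: the network outputs the mean of each action dimension and a learned log standard deviation defines the variance (all names here are illustrative):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Gaussian policy for a continuous action vector: per-dimension mean and variance."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent variance

    def forward(self, obs):
        mu = self.mean(obs)
        std = self.log_std.exp()
        return torch.distributions.Normal(mu, std)

# Usage in a policy-gradient update (sketch):
#   dist = policy(obs); a = dist.sample(); logp = dist.log_prob(a).sum(-1)
```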
The methods in the previous section are in fact stochastic policy gradient methods: the policy is represented by a parametric probability distribution $\pi_{\boldsymbol{\theta}}(a\mid s) = \mathbb{P}[a \mid s; \boldsymbol{\theta}]$ that stochastically selects action $a$ in state $s$.
Silver et al. instead consider deterministic policies of the form $a = \mu_{\boldsymbol{\theta}}(s)$ and derive a deterministic policy gradient (DPG).
In more detail, the general performance objective is the cumulative (discounted) reward from the start state. For stochastic PG, that is $J(\pi_{\boldsymbol{\theta}}) = \mathbb{E}_{s\sim\rho^\pi,\,a\sim\pi_{\boldsymbol{\theta}}}[\,r(s,a)\,]$.
For deterministic PG, the performance objective is $J(\mu_{\boldsymbol{\theta}}) = \mathbb{E}_{s\sim\rho^\mu}[\,r(s,\mu_{\boldsymbol{\theta}}(s))\,]$, where $\rho^\mu$ denotes the (discounted) state distribution induced by the deterministic policy $\mu_{\boldsymbol{\theta}}$.
As a result, the performance gradient in the stochastic case is given by the stochastic policy gradient theorem, $\nabla_{\boldsymbol{\theta}} J(\pi_{\boldsymbol{\theta}}) = \mathbb{E}_{s\sim\rho^\pi,\,a\sim\pi_{\boldsymbol{\theta}}}[\,\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a\mid s)\, Q^\pi(s,a)\,]$.
The paper proposes the deterministic policy gradient theorem: $\nabla_{\boldsymbol{\theta}} J(\mu_{\boldsymbol{\theta}}) = \mathbb{E}_{s\sim\rho^\mu}\big[\,\nabla_{\boldsymbol{\theta}} \mu_{\boldsymbol{\theta}}(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_{\boldsymbol{\theta}}(s)}\,\big]$.
A stochastic policy is usually needed to explore the full state and action space. To ensure sufficient exploration, an off-policy actor-critic algorithm is derived from the theorem: the agent takes actions according to a stochastic behaviour policy $\beta(a\mid s)$ while learning the deterministic target policy $\mu_{\boldsymbol{\theta}}(s)$.
The DDPG paper combines the deterministic policy gradient with ideas from DQN, using deep neural networks to represent the actor and the critic.
Similar to DQN, a replay buffer is used to address the correlations in the sequence of observations. The transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the buffer, and the networks are updated from minibatches sampled uniformly from it.
DDPG uses soft target updates instead of directly copying the weights as in DQN. The weights of the target networks slowly track the learned networks: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ with $\tau \ll 1$.
DDPG is off-policy, so the behaviour policy can differ from the learned actor policy. A noise process $\mathcal{N}$ is added to the actor's output to construct the exploration behaviour policy: $\mu'(s_t) = \mu(s_t;\theta^\mu) + \mathcal{N}$.
The pseudocode is presented below:
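Since the pseudocode is not reproduced here, the following PyTorch sketch shows the core DDPG update: critic regression against targets computed with the target networks, the deterministic policy gradient step for the actor, and soft target updates. The network interfaces, batch layout, and the simple Gaussian exploration noise (the paper uses an Ornstein-Uhlenbeck process) are assumptions:

```python
import torch
import torch.nn as nn

def soft_update(target, source, tau=0.005):
    """theta_target <- tau * theta + (1 - tau) * theta_target."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG update from a replay-buffer minibatch.

    critic(s, a) is assumed to return Q-values of shape (B, 1); actor(s) returns actions.
    """
    states, actions, rewards, next_states, dones = batch

    # Critic: regress Q(s, a) towards r + gamma * Q'(s', mu'(s')) from the target networks.
    with torch.no_grad():
        next_q = target_critic(next_states, target_actor(next_states)).squeeze(-1)
        targets = rewards + gamma * (1.0 - dones) * next_q
    q = critic(states, actions).squeeze(-1)
    critic_loss = nn.functional.mse_loss(q, targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. ascend Q(s, mu(s)) w.r.t. the actor parameters.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target updates: the target networks slowly track the learned networks.
    soft_update(target_actor, actor, tau)
    soft_update(target_critic, critic, tau)

def explore(actor, state, noise_std=0.1):
    """Behaviour policy: deterministic actor output plus exploration noise."""
    with torch.no_grad():
        action = actor(state)
    return action + noise_std * torch.randn_like(action)
```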
on-policy methods: The agent commits to always exploring and tries to find the best policy that still explores.
off-policy methods: The agent also explores, but learns a deterministic optimal policy that may be unrelated to the policy followed. It is usually based on some form of importance sampling, i.e., on weighting returns by the ratio of the probabilities of taking the observed actions under the two policies, thereby transforming their expectations from the behavior policy to the target policy.
action-value methods: learn the values of actions and select actions based on the estimated action values, e.g., Sarsa and Q-learning.
Bootstrapping: updating the estimate of a state value (or action value) based on the estimated value of the successor state (or state-action pair).
actor-critic: methods that learn approximations to both the policy and the value function. The actor refers to the learned policy and the critic refers to the learned value function.