guide:C8c80a2ae8: Difference between revisions

From Stochiki
No edit summary
mNo edit summary
Line 1: Line 1:
<div class="d-none"><math>
<div class="d-none"><math>
\newcommand*{\rom}[1]{\expandafter\@slowromancap\romannumeral #1@}
\newcommand*{\rom}[1]{\expandafter\@slowromancap\romannumeral #1@}
\newcommand{\vertiii}[1]{{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert #1
    \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}}
\DeclareMathOperator*{\dprime}{\prime \prime}
\DeclareMathOperator*{\dprime}{\prime \prime}
\DeclareMathOperator{\Tr}{Tr}
\DeclareMathOperator{\Tr}{Tr}
Line 41: Line 39:
\DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))}
\DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))}
\newcommand{\mathds}{\mathbb}</math></div>
\newcommand{\mathds}{\mathbb}</math></div>
label{sec:rl_basics}
 
Reinforcement learning is  
Reinforcement learning is an approach to understanding and automating goal-directed learning and decision-making. It leads naturally to algorithms for such problems and is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with {their} environment, without relying on exemplary supervision or complete models of the environment.
an approach to understanding and automating goal-directed learning and decision-making. It leads naturally to algorithms for such problems and is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with {their} environment, without relying on exemplary supervision or complete models of the environment.
RL uses a formal framework defining the interaction between a learning agent and {their} environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and non-determinism, and the existence of explicit goals.
RL uses a formal framework defining the interaction between a learning agent and {their} environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and non-determinism, and the existence of explicit goals.
The history of RL has two main threads that were pursued independently before intertwining in modern RL. One thread concerns learning by trial and error and started in the psychology of animal learning. The other thread concerns the problem of optimal control for an evolving system and its solution using value functions and dynamic programming.  Although this thread did not involve learning directly, the Bellman equation developed from this line of literature is viewed as the foundation of many important modern RL algorithms such as <math>Q</math>-learning and Actor-Critic.
The history of RL has two main threads that were pursued independently before intertwining in modern RL. One thread concerns learning by trial and error and started in the psychology of animal learning. The other thread concerns the problem of optimal control for an evolving system and its solution using value functions and dynamic programming.  Although this thread did not involve learning directly, the Bellman equation developed from this line of literature is viewed as the foundation of many important modern RL algorithms such as <math>Q</math>-learning and Actor-Critic.
Line 51: Line 48:
Combining this framework with deep learning <ref name="goodfellow2016deep"/> was popularized by the Deep <math>Q</math>-learning algorithm, introduced in <ref name="mnih2013playing"/>, which was able to play any of 57 Atari console games without tweaking the network architecture or algorithm hyperparameters. This novel approach was extensively researched and significantly improved over the following years <ref name="van2016deep"/><ref name="gu2016continuous"/><ref name="fan2020theoretical"/>.
Combining this framework with deep learning <ref name="goodfellow2016deep"/> was popularized by the Deep <math>Q</math>-learning algorithm, introduced in <ref name="mnih2013playing"/>, which was able to play any of 57 Atari console games without tweaking the network architecture or algorithm hyperparameters. This novel approach was extensively researched and significantly improved over the following years <ref name="van2016deep"/><ref name="gu2016continuous"/><ref name="fan2020theoretical"/>.


Our aim in this section is to introduce the foundations of reinforcement learning. We start with the setup for MDP in [[#sec:set_up |Section]] with both an infinite time horizon and a finite time horizon, as there are financial applications of both settings in the literature. In [[#sec:MDP2learning |Section]] we then introduce the learning procedure when the reward and transition dynamics of the MDPs are unknown to the decision-maker. In particular, we introduce various criteria to measure the performance of learning algorithms in different settings. Then we focus on two major classes of model-free reinforcement learning algorithms, value-based methods in [[#sec:value_based_methods |Section]] and policy-based methods in Section~[[#subsec:policy_based_methods |subsec:policy_based_methods]], with an infinite-time horizon. Finally, Section~[[#sec:discussion |sec:discussion]] contains a general discussion of (model-based and model-free) RL with a finite-time horizon, model-based RL with an infinite-time horizon, and regularization methods for RL.  
Our aim in this section is to introduce the foundations of reinforcement learning. We start with the setup for MDP in [[#sec:set_up |Section]] with both an infinite time horizon and a finite time horizon, as there are financial applications of both settings in the literature. In [[#sec:MDP2learning |Section]] we then introduce the learning procedure when the reward and transition dynamics of the MDPs are unknown to the decision-maker. In particular, we introduce various criteria to measure the performance of learning algorithms in different settings. Then we focus on two major classes of model-free reinforcement learning algorithms, value-based methods in section [[#sec:value_based_methods |value based methods]] and policy-based methods in section [[#subsec:policy_based_methods |policy based methods]], with an infinite-time horizon. Finally, section [[#sec:discussion |discussion]] contains a general discussion of (model-based and model-free) RL with a finite-time horizon, model-based RL with an infinite-time horizon, and regularization methods for RL.  


===<span id="sec:set_up"></span>Setup: Markov Decision Processes===
===<span id="sec:set_up"></span>Setup: Markov Decision Processes===
Line 107: Line 104:
optimal (stationary) policy <math>\pi^*(s, a)</math> (if it exists) from  <math>Q(s, a)</math>, in that  <math>\pi^*(s, a) \in \arg\max_{a \in \Ac} Q(s, a)</math>.  
optimal (stationary) policy <math>\pi^*(s, a)</math> (if it exists) from  <math>Q(s, a)</math>, in that  <math>\pi^*(s, a) \in \arg\max_{a \in \Ac} Q(s, a)</math>.  
An alternative setting over an infinite time horizon is the case of average reward, which is also referred to as an ergodic reward. This ergodic setting is not as relevant to financial applications and hence will not be covered in this review paper. We refer  the reader to <ref name="abbasi2019politex"/><ref name="wei2020model"/> for a more detailed discussion of RL with infinite-time horizon and average reward.
An alternative setting over an infinite time horizon is the case of average reward, which is also referred to as an ergodic reward. This ergodic setting is not as relevant to financial applications and hence will not be covered in this review paper. We refer  the reader to <ref name="abbasi2019politex"/><ref name="wei2020model"/> for a more detailed discussion of RL with infinite-time horizon and average reward.
\iffalse
'''Infinite Time Horizon and Average Reward.''' Another setting over an infinite time horizon is to consider the average reward, which is also referred to as the ergodic reward,
<math display="block">
\begin{eqnarray} \label{equ: singlev_er}
V(s) = \sup_{\Pi} V^{\Pi}(s):=
\sup_{\Pi}\lim_{T\rightarrow\infty}\E^{\Pi}\biggl[\frac{1}{T}\sum_{t=0}^{T-1} r(s_t, a_t) \biggl| s_0 = s\biggl],
\end{eqnarray}
</math>
subject to


<math display="block">
\begin{eqnarray} \label{equ: singledynamics}
s_{t + 1} \sim P(s_t, a_t), \;\;\; a_t \sim \pi_t({s_t}).
\end{eqnarray}
</math>
where <math>h_t :=\{s_0,a_0,\cdots,s_{t-1},a_{t-1},s_t\} </math> is the history up to time <math>t</math>;  <math>s_t \in \Sc</math> and <math>a_t \in \Ac</math> are the state and the action at time <math>t</math>; <math>P:  \Sc \times \Ac \to \Pc(\Sc)</math> is the transition matrix of the underlying Markov system; the reward  <math>r(s, a)</math> is random valued in <math>\R</math> for each pair <math>(s, a)\in \Sc \times \Ac</math>; and  <math>\Pi =\{\pi_t\}_{t=0}^{\infty}</math> is the policy adopted by the agent.
\fi


'''Finite Time Horizon.''' When there is a finite time horizon <math>T < \infty</math> we no longer discount future values and have a terminal reward. The MDP problem with finite time horizon can be expressed as
'''Finite Time Horizon.''' When there is a finite time horizon <math>T < \infty</math> we no longer discount future values and have a terminal reward. The MDP problem with finite time horizon can be expressed as
Line 172: Line 151:
</math>
</math>
with the terminal condition <math>Q^\star_T(s,a) = \mathbb{E}[r_T(s)]</math> for all <math>a\in \mathcal{A}</math>. We note that since financial time series data are typically non-stationary, time-varying transition kernels and reward functions in \eqref{equ: singlev_fh}-\eqref{equ: singledynamics_fh} are particularly important for financial applications.
with the terminal condition <math>Q^\star_T(s,a) = \mathbb{E}[r_T(s)]</math> for all <math>a\in \mathcal{A}</math>. We note that since financial time series data are typically non-stationary, time-varying transition kernels and reward functions in \eqref{equ: singlev_fh}-\eqref{equ: singledynamics_fh} are particularly important for financial applications.
\iffalse
'''Policy Class and Optimal Policies.'''
We consider the following types of policies for <math>\Pi = \{\pi_t\}_{t=0}^{T}</math> with <math>T < \infty</math> for the finite time horizon setting and <math>T=\infty</math> for the infinite time horizon setting:
<ul style{{=}}"list-style-type:lower-roman"><li>History-dependent Randomized Policy (HR): <math>\pi_t: (\Sc\times \Ac)^t\times\Sc \to \Pc(\Ac)</math>,</li>
<li>History-dependent Deterministic Policy (HD): <math>\pi_t: (\Sc\times \Ac)^t\times\Sc \to \Ac</math>,</li>
<li>Markov Randomized Policy (MR): <math>\pi_t: \Sc \to \Pc(\Ac)</math>,</li>
<li>Markov Deterministic Policy (MD): <math>\pi_t: \Sc \to \Ac</math>.</li>
</ul>
It is easy to observe that MD <math>\subset</math> MR <math>\subset</math> HR. Recall that in \eqref{equ: singlev} and \eqref{equ: singlev_fh}, the optimal policy is defined as the one that maximizes the gain of the MDP. Due to the structure of MDPs it is not difficult to show that it is sufficient to consider Markovian policies. Henceforth, we shall focus only on Markovian policies.
\fi


For an infinite horizon MDP with finite state and action space, and  finite reward <math>r</math>, a further useful observation is that such an MDP always has a stationary optimal policy, whenever an optimal policy exists.  
For an infinite horizon MDP with finite state and action space, and  finite reward <math>r</math>, a further useful observation is that such an MDP always has a stationary optimal policy, whenever an optimal policy exists.  
{{proofcard|Theorem (Theorem 6.2.7 in <ref name="puterman2014markov"/>)|thm:MDP_discounted|Assume <math>|\mathcal{A}| < \infty</math>, <math>|\mathcal{S}| < \infty</math> and <math>|r| < \infty</math> with probability one.
{{proofcard|Theorem (Theorem 6.2.7 in <ref name="puterman2014markov"/>)|thm:MDP_discounted|Assume <math>|\mathcal{A}| < \infty</math>, <math>|\mathcal{S}| < \infty</math> and <math>|r| < \infty</math> with probability one.
For any infinite horizon discounted MDP, there always exists a deterministic stationary policy that is optimal.|}}
For any infinite horizon discounted MDP, there always exists a deterministic stationary policy that is optimal.|}}
\iffalse
 
{{proofcard|Theorem (Theorem 8.1.2 in <ref name="puterman2014markov"/>)|thm:MDP_discounted|Assume <math>|\mathcal{A}| < \infty</math>, <math>|\mathcal{S}| < \infty</math> and <math>|r| < \infty</math> with probability one. For infinite horizon average reward MDP, there always exist a stationary (possibly randomized) policy which is an optimal policy.|}}
\fi
[[#thm:MDP_discounted |Theorem]] implies that there always exists a fixed policy so that taking actions specified by that policy at each time step maximizes the discounted reward. The agent does not need to change policies with time. There is a similar result for the average reward case, see Theorem 8.1.2 in <ref name="puterman2014markov"/>. This insight reduces the question of finding the best sequential decision-making strategy to the question of finding the '' best stationary policy''. To this end, we write <math>\pi:\mathcal{S}\rightarrow \mathcal{P}(\mathcal{A})</math> (without a time index) for a stationary policy throughout the paper.  
[[#thm:MDP_discounted |Theorem]] implies that there always exists a fixed policy so that taking actions specified by that policy at each time step maximizes the discounted reward. The agent does not need to change policies with time. There is a similar result for the average reward case, see Theorem 8.1.2 in <ref name="puterman2014markov"/>. This insight reduces the question of finding the best sequential decision-making strategy to the question of finding the '' best stationary policy''. To this end, we write <math>\pi:\mathcal{S}\rightarrow \mathcal{P}(\mathcal{A})</math> (without a time index) for a stationary policy throughout the paper.  


Line 272: Line 238:
It is not possible to solve MDPs analytically in general and therefore we will discuss numerical algorithms to determine the optimal policies.  
It is not possible to solve MDPs analytically in general and therefore we will discuss numerical algorithms to determine the optimal policies.  
In order to assess different algorithms we will need a measure of their performance. Here we will consider several types of performance measure: sample complexity, rate of convergence, regret analysis and asymptotic convergence.
In order to assess different algorithms we will need a measure of their performance. Here we will consider several types of performance measure: sample complexity, rate of convergence, regret analysis and asymptotic convergence.
For RL with a finite time horizon, one episode contains a sequence of states, actions and rewards, which starts at time <math>0</math> and ends at the terminal time <math>T</math>, and the performance is evaluated in terms of the total number of samples in all episodes.{{efn|Sample complexity with respect to the number of episodes has also been adopted in the literature <ref name="DannBrunskill2015"/><ref name="DannUBEV"/>. The translation between these two criteria is straightforward.}} Sometimes we also refer to this setting as episodic RL. For RL with infinite time horizon, the performance is evaluated in terms of the number of steps. In this setting one step contains one (state, action, reward, next state) tuple. It is worth noting that the episodic criterion can also be used to analyze infinite horizon problems. One popular technique is to truncate the trajectory (with infinite steps) at a large time and hence translate the problem into an approximating finite time horizon problem. Examples include REINFORCE <ref name="zhang2020sample"/> and the policy gradient method <ref name="liu2020improved"/> (see Section~[[#subsec:policy_based_methods |subsec:policy_based_methods]]).  
For RL with a finite time horizon, one episode contains a sequence of states, actions and rewards, which starts at time <math>0</math> and ends at the terminal time <math>T</math>, and the performance is evaluated in terms of the total number of samples in all episodes.{{efn|Sample complexity with respect to the number of episodes has also been adopted in the literature <ref name="DannBrunskill2015"/><ref name="DannUBEV"/>. The translation between these two criteria is straightforward.}} Sometimes we also refer to this setting as episodic RL. For RL with infinite time horizon, the performance is evaluated in terms of the number of steps. In this setting one step contains one (state, action, reward, next state) tuple. It is worth noting that the episodic criterion can also be used to analyze infinite horizon problems. One popular technique is to truncate the trajectory (with infinite steps) at a large time and hence translate the problem into an approximating finite time horizon problem. Examples include REINFORCE <ref name="zhang2020sample"/> and the policy gradient method <ref name="liu2020improved"/> (see [[#subsec:policy_based_methods |policy based methods]]).  
Throughout the analysis, we use the notation <math>\widetilde{O}(\cdot)</math>  to hide logarithmic factors when describing the order of the scaling in <math>\varepsilon</math>, <math>\delta</math>, <math>|\mathcal{A}|</math> and <math>|\mathcal{S}|</math> (when these are finite), and other possible model parameters such as the discount rate <math>\gamma</math> in the infinite horizon case or the dimension <math>d</math> of the features when functional approximation is used. We write <math>\widetilde{\mathcal{O}}^{\prime}</math> rather than <math>\widetilde{\mathcal{O}}</math> to indicate that, although there may be other parameters, we only include the dependence on <math>\varepsilon</math>.  
Throughout the analysis, we use the notation <math>\widetilde{O}(\cdot)</math>  to hide logarithmic factors when describing the order of the scaling in <math>\varepsilon</math>, <math>\delta</math>, <math>|\mathcal{A}|</math> and <math>|\mathcal{S}|</math> (when these are finite), and other possible model parameters such as the discount rate <math>\gamma</math> in the infinite horizon case or the dimension <math>d</math> of the features when functional approximation is used. We write <math>\widetilde{\mathcal{O}}^{\prime}</math> rather than <math>\widetilde{\mathcal{O}}</math> to indicate that, although there may be other parameters, we only include the dependence on <math>\varepsilon</math>.  
We denote by <math>\mbox{poly}(\cdot)</math> a polynomial function of its arguments.
We denote by <math>\mbox{poly}(\cdot)</math> a polynomial function of its arguments.
Line 379: Line 345:
</math>
</math>
The term <math>\delta_n</math>, commonly called the TD error, measures the difference between the estimated value at <math>s</math> and the better estimate <math>r+\gamma V^{\pi,(n)}(s')</math>, which is often called the  TD target in the literature. The TD error and TD target are two important components in analyzing the convergence of the approximate value function and arise in various forms in the reinforcement learning literature.  
The term <math>\delta_n</math>, commonly called the TD error, measures the difference between the estimated value at <math>s</math> and the better estimate <math>r+\gamma V^{\pi,(n)}(s')</math>, which is often called the  TD target in the literature. The TD error and TD target are two important components in analyzing the convergence of the approximate value function and arise in various forms in the reinforcement learning literature.  
<proc id="alg:TD" label = "\textbf{TD(0) Method for estimating $V^{\pi}$}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>; the policy <math>\pi</math> used to sample observations; rule to set learning rate <math>\beta_n \in (0,1]</math> <math>(0\leq n \leq N-1)</math>
<proc id="alg:TD" label = "TD(0) Method for estimating $V^{\pi}$"><div class="ms-2"> '''input''': total number of iterations <math>N</math>; the policy <math>\pi</math> used to sample observations; rule to set learning rate <math>\beta_n \in (0,1]</math> <math>(0\leq n \leq N-1)</math>
</div>
</div>
<div class="ms-2"> Initialize <math>V^{\pi,(0)}(s)</math> for all <math>s\in \mathcal{S}</math>
<div class="ms-2"> Initialize <math>V^{\pi,(0)}(s)</math> for all <math>s\in \mathcal{S}</math>
Line 395: Line 361:
</div>
</div>
'''end for'''</div></proc>
'''end for'''</div></proc>
====<math>Q</math>-learning Algorithm====
====<math>Q</math>-learning Algorithm====
A version of TD learning is <math>Q</math>-learning. This is a stochastic approximation to the Bellman equation \eqref{equ: singleQ} for the <math>Q</math>-function with samples observed by the agent. At iteration <math>n</math> (<math>1 \leq n \leq N-1</math>), the <math>Q</math>-function is updated using a sample <math>(s,a,r,s^{\prime})</math> where <math>s' \sim P(s,a)</math>,
A version of TD learning is <math>Q</math>-learning. This is a stochastic approximation to the Bellman equation \eqref{equ: singleQ} for the <math>Q</math>-function with samples observed by the agent. At iteration <math>n</math> (<math>1 \leq n \leq N-1</math>), the <math>Q</math>-function is updated using a sample <math>(s,a,r,s^{\prime})</math> where <math>s' \sim P(s,a)</math>,
Line 405: Line 373:
Here <math>\beta_n(s,a)</math> is the learning rate which balances the weight between the current estimate <math>Q^{(n)}</math> from iteration <math>n</math> and the new estimate <math>r(s,a)+\gamma\max_{a'}Q^{(n)}(s',a')</math> calculated using the sample <math>(s,a,r,s^{\prime})</math>. [[#alg:Q |Algorithm]] provides pseudo-code for an implementation of  Q-learning.  
Here <math>\beta_n(s,a)</math> is the learning rate which balances the weight between the current estimate <math>Q^{(n)}</math> from iteration <math>n</math> and the new estimate <math>r(s,a)+\gamma\max_{a'}Q^{(n)}(s',a')</math> calculated using the sample <math>(s,a,r,s^{\prime})</math>. [[#alg:Q |Algorithm]] provides pseudo-code for an implementation of  Q-learning.  


<proc id="alg:Q" label = "\textbf{$Q$-learning with samples from a policy $\pi$}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>; the policy <math>\pi</math> to be evaluated; rule to set learning rate <math>\beta_n \in (0,1]</math> <math>(0 \leq n\leq N-1)</math>
<proc id="alg:Q" label = "$Q$-learning with samples from a policy $\pi$"><div class="ms-2"> '''input''': total number of iterations <math>N</math>; the policy <math>\pi</math> to be evaluated; rule to set learning rate <math>\beta_n \in (0,1]</math> <math>(0 \leq n\leq N-1)</math>
</div>
</div>
<div class="ms-2"> Initialize <math>Q^{(0)}(s,a)</math> for all <math>s\in \mathcal{S}</math> and <math>a\in \mathcal{A}</math>
<div class="ms-2"> Initialize <math>Q^{(0)}(s,a)</math> for all <math>s\in \mathcal{S}</math> and <math>a\in \mathcal{A}</math>
Line 421: Line 389:
</div>
</div>
'''end for'''</div></proc>
'''end for'''</div></proc>
The following theorem guarantees the asymptotic convergence of the <math>Q</math>-learning update \eqref{eq:Q_learning_update} to the <math>Q</math>-function defined in \eqref{defsingleQ}.
The following theorem guarantees the asymptotic convergence of the <math>Q</math>-learning update \eqref{eq:Q_learning_update} to the <math>Q</math>-function defined in \eqref{defsingleQ}.
{{proofcard|Theorem (Watkins and Dayan (1992) <ref name="watkins1992q"/>)|Theorem-1|\label{thm:Q_conv} Assume <math>|\mathcal{A}| < \infty</math> and <math>|\mathcal{S}| < \infty</math>.
{{proofcard|Theorem (Watkins and Dayan (1992) <ref name="watkins1992q"/>)|Theorem-1|\label{thm:Q_conv} Assume <math>|\mathcal{A}| < \infty</math> and <math>|\mathcal{S}| < \infty</math>.
Let <math>n^i(s, a)</math> be the index of the <math>i</math>-th time that the action <math>a</math> is used in state <math>s</math>.
Let <math>n^i(s, a)</math> be the index of the <math>i</math>-th time that the action <math>a</math> is used in state <math>s</math>.
Line 432: Line 403:
</math>
</math>
then <math>Q^{(n)}(s,a)\rightarrow Q^\star(s,a)</math> as <math>n\rightarrow \infty</math>, <math>\forall s,a</math> with probability <math>1</math>.|}}
then <math>Q^{(n)}(s,a)\rightarrow Q^\star(s,a)</math> as <math>n\rightarrow \infty</math>, <math>\forall s,a</math> with probability <math>1</math>.|}}
This is a classical result (one of the first few theoretical results in RL) and the proof of the convergence is based on  the action-replay process (ARP), which is an artificial controlled Markov process constructed from the episode sequence and the learning rate sequence <math>\beta</math>. [[#thm:Q_conv |Theorem]] implies that eventually <math>Q^{(n)}</math> will convergence to the true value function <math>Q^\star</math> when condition \eqref{exploration} is satisfied. However, this asymptotic result does not provide any insights on how many samples are needed to reach a given accuracy. More results on the sample complexity for <math>Q</math>-learning related algorithms are discussed in Section~[[#subsubsec:value_theo |subsubsec:value_theo]].
 
This is a classical result (one of the first few theoretical results in RL) and the proof of the convergence is based on  the action-replay process (ARP), which is an artificial controlled Markov process constructed from the episode sequence and the learning rate sequence <math>\beta</math>. [[#thm:Q_conv |Theorem]] implies that eventually <math>Q^{(n)}</math> will convergence to the true value function <math>Q^\star</math> when condition \eqref{exploration} is satisfied. However, this asymptotic result does not provide any insights on how many samples are needed to reach a given accuracy. More results on the sample complexity for <math>Q</math>-learning related algorithms are discussed in [[#subsubsec:value_theo |Discussion]].


====<span id="sec:sarsa"></span>SARSA====
====<span id="sec:sarsa"></span>SARSA====
In contrast to the <math>Q</math>-learning algorithm, which takes samples from any policy <math>\pi</math> as the input where these samples could be collected in advance before performing the <math>Q</math>-learning algorithm, SARSA adopts a policy which is based on the agent's current estimate of the <math>Q</math>-function. The different source of samples is indeed the key difference between on-policy and off-policy learning, which will be discussed after the SARSA algorithm.  
In contrast to the <math>Q</math>-learning algorithm, which takes samples from any policy <math>\pi</math> as the input where these samples could be collected in advance before performing the <math>Q</math>-learning algorithm, SARSA adopts a policy which is based on the agent's current estimate of the <math>Q</math>-function. The different source of samples is indeed the key difference between on-policy and off-policy learning, which will be discussed after the SARSA algorithm.  
<proc id="alg:SARSA" label = "\textbf{SARSA: On-policy TD Learning}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>, the learning rate <math>\beta\in(0,1)</math>\footnotemark[1], and small parameter <math>\varepsilon > 0</math>.
<proc id="alg:SARSA" label = "SARSA: On-policy TD Learning"><div class="ms-2"> '''input''': total number of iterations <math>N</math>, the learning rate <math>\beta\in(0,1)</math>\footnotemark[1], and small parameter <math>\varepsilon > 0</math>.
</div>
</div>
<div class="ms-2"> Initialize <math>Q^{(0)}(s, a)</math> for all <math>s\in \mathcal{S}</math> and <math>a\in \mathcal{A}</math>
<div class="ms-2"> Initialize <math>Q^{(0)}(s, a)</math> for all <math>s\in \mathcal{S}</math> and <math>a\in \mathcal{A}</math>
Line 458: Line 430:
</div>
</div>
'''end for'''</div></proc>
'''end for'''</div></proc>
{{efn|The learning rate can depend on the number of iterations, the state-action pair and other parameters in the algorithm. For notational simplicity, we demonstrate the SARSA Algorithm (and the rest of the algorithms mentioned) with a constant learning rate <math>\beta</math>. In the convergence analysis, we use <math>\beta_n</math> to denote a learning rate which depends on the number of iterations.}}
{{efn|The learning rate can depend on the number of iterations, the state-action pair and other parameters in the algorithm. For notational simplicity, we demonstrate the SARSA Algorithm (and the rest of the algorithms mentioned) with a constant learning rate <math>\beta</math>. In the convergence analysis, we use <math>\beta_n</math> to denote a learning rate which depends on the number of iterations.}}
The policy derived from <math>Q^{(n)}</math> using <math>\varepsilon</math>-greedy (line 4 and line 8 in [[#alg:SARSA |Algorithm]]) suggests the following choice of action <math>a</math> when in state <math>s</math>:
The policy derived from <math>Q^{(n)}</math> using <math>\varepsilon</math>-greedy (line 4 and line 8 in [[#alg:SARSA |Algorithm]]) suggests the following choice of action <math>a</math> when in state <math>s</math>:
Line 477: Line 451:


'''On-policy Learning vs. Off-policy Learning.'''  An off-policy agent learns the value of the optimal policy independently of the agent's actions. An on-policy agent learns the value of the policy being carried out by the agent including the exploration steps.  <math>Q</math>-learning is an off-policy agent as the samples <math>(s,a,r,s')</math> used in updating \eqref{eq:Q_learning_update} may be collected from any policy and may be independent of the agent's current estimate of the <math>Q</math>-function. The convergence of the <math>Q</math>-function relies on the exploration scheme of the policy <math>\pi</math> (see [[#thm:Q_conv |Theorem]]). In contrast to <math>Q</math>-learning, SARSA is an on-policy learning algorithm and uses only its own estimation of the <math>Q</math>-function to select an action and receive a real-time sample in each iteration. The difference is seen in how the new estimate in the update is calculated. In the update step of <math>Q</math>-learning \eqref{eq:Q_learning_update}, we have <math>\max_{a'} Q^{(n)}(s',a')</math> which is the maximum value of the <math>Q</math>-function at a given state <math>s'</math>. On the other hand, the new estimate in the update of SARSA \eqref{eq:SARSA_update} takes <math>Q^{(n)}(s',a')</math> where <math>a'</math> is selected based on a policy derived from <math>Q^{(n)}</math> (via the <math>\varepsilon</math>-greedy policy \eqref{epsilon_policy}).
'''On-policy Learning vs. Off-policy Learning.'''  An off-policy agent learns the value of the optimal policy independently of the agent's actions. An on-policy agent learns the value of the policy being carried out by the agent including the exploration steps.  <math>Q</math>-learning is an off-policy agent as the samples <math>(s,a,r,s')</math> used in updating \eqref{eq:Q_learning_update} may be collected from any policy and may be independent of the agent's current estimate of the <math>Q</math>-function. The convergence of the <math>Q</math>-function relies on the exploration scheme of the policy <math>\pi</math> (see [[#thm:Q_conv |Theorem]]). In contrast to <math>Q</math>-learning, SARSA is an on-policy learning algorithm and uses only its own estimation of the <math>Q</math>-function to select an action and receive a real-time sample in each iteration. The difference is seen in how the new estimate in the update is calculated. In the update step of <math>Q</math>-learning \eqref{eq:Q_learning_update}, we have <math>\max_{a'} Q^{(n)}(s',a')</math> which is the maximum value of the <math>Q</math>-function at a given state <math>s'</math>. On the other hand, the new estimate in the update of SARSA \eqref{eq:SARSA_update} takes <math>Q^{(n)}(s',a')</math> where <math>a'</math> is selected based on a policy derived from <math>Q^{(n)}</math> (via the <math>\varepsilon</math>-greedy policy \eqref{epsilon_policy}).
%{\color{RedOrange}Note that there are several  differences between the <math>Q</math>-learning Algorithm


====<span id="subsubsec:value_theo"></span>Discussion====
====<span id="subsubsec:value_theo"></span>Discussion====
In this section, we provide a brief overview of sample complexity  
In this section, we provide a brief overview of sample complexity results for model-free and value-based RL with infinite-time horizon and discounted reward. Sample complexity results for the model-based counterpart are deferred to [[#sec:discussion |Section]].
results for model-free and value-based RL with infinite-time horizon and discounted reward. Sample complexity results for the model-based counterpart are deferred to [[#sec:discussion |Section]].
There has been very little non-asymptotic analysis of TD(0) learning. Existing results have focused on linear function approximations. For example, <ref name="dalal2018finite"/> showed that the mean-squared error converges at a rate of <math>\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon^{1/\sigma}})</math> with decaying learning rate <math>\beta_n= \frac{1}{(1+n)^{\sigma}}</math> for some <math>\sigma\in (0,1)</math> where i.i.d. samples are drawn from a stationary distribution; <ref name="lakshminarayanan2018linear"/> provided
There has been very little non-asymptotic analysis of TD(0) learning. Existing results have focused on linear function approximations. For example, <ref name="dalal2018finite"/> showed that the mean-squared error converges at a rate of <math>\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon^{1/\sigma}})</math> with decaying learning rate <math>\beta_n= \frac{1}{(1+n)^{\sigma}}</math> for some <math>\sigma\in (0,1)</math> where i.i.d. samples are drawn from a stationary distribution; <ref name="lakshminarayanan2018linear"/> provided
an improved <math>\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})</math> bound for iterate-averaged TD(0) with constant step-size; and <ref name="bhandari2018finite"/> reached the same mean-square sample complexity <math>\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})</math> with a simpler proof and extends the framework to the case where non-i.i.d. but Markov samples are drawn from the online trajectory.
an improved <math>\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})</math> bound for iterate-averaged TD(0) with constant step-size; and <ref name="bhandari2018finite"/> reached the same mean-square sample complexity <math>\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})</math> with a simpler proof and extends the framework to the case where non-i.i.d. but Markov samples are drawn from the online trajectory.
Line 491: Line 462:
This was followed by <ref name="beck2012error"/> which showed that the same order of <math>Q</math>-sample complexity <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\right)</math> could be achieved with a constant learning rate <math>\beta_n \equiv \frac{(1-\gamma)^4\varepsilon^2}{|\mathcal{A}||\mathcal{S}|}</math> <math>(n\leq N)</math>. Without access to a simulator, delayed <math>Q</math>-learning <ref name="strehl2009reinforcement"/> was shown to achieve a sample complexity of exploration of order <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^8\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)</math>  assuming a  
This was followed by <ref name="beck2012error"/> which showed that the same order of <math>Q</math>-sample complexity <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\right)</math> could be achieved with a constant learning rate <math>\beta_n \equiv \frac{(1-\gamma)^4\varepsilon^2}{|\mathcal{A}||\mathcal{S}|}</math> <math>(n\leq N)</math>. Without access to a simulator, delayed <math>Q</math>-learning <ref name="strehl2009reinforcement"/> was shown to achieve a sample complexity of exploration of order <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^8\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)</math>  assuming a  
known reward. An improvement to this result to  <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^7\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)</math> has recently been obtained using an Upper-Confidence-Bound (UCB) exploration policy <ref name="dong2019q"/> and a single trajectory.   
known reward. An improvement to this result to  <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^7\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)</math> has recently been obtained using an Upper-Confidence-Bound (UCB) exploration policy <ref name="dong2019q"/> and a single trajectory.   
%{\color{RedOrange}In particular, the dependence on the discount rate <math>\gamma</math> is improved from <math>\frac{1}{(1-\gamma)^8}</math> to  <math>\frac{1}{(1-\gamma)^7}</math>.}
 
 
In addition, sample complexities of <math>Q</math>-learning variants with (linear) function approximations have been established in several recent papers.  <ref name="yang2019sample"/> assumed a linear MDP framework (see the end of [[#sec:set_up |Section]])
In addition, sample complexities of <math>Q</math>-learning variants with (linear) function approximations have been established in several recent papers.  <ref name="yang2019sample"/> assumed a linear MDP framework (see the end of [[#sec:set_up |Section]])
and the authors provided a V-sample complexity <math>\widetilde{\mathcal{O}}\left(\frac{d}{(1-\gamma)^3\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)</math> with <math>d</math> denoting the dimension of the features. This result matches the information-theoretic lower bound up to <math>\log(\cdot)</math> factors. Later work, <ref name="lattimore2020learning"/> considered the setting where the transition kernel can be approximated by linear functions with small approximation errors to make the model more robust to model mis-specification. In this case, the authors provided a V-sample complexity  of order <math>\widetilde{\mathcal{O}}\left(\frac{K}{(1-\gamma)^4\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)</math>. Although the dependence on <math>\gamma</math> is not as good as in <ref name="yang2019sample"/>, this framework can be applied to a more general class of models which are “near” linear. Although <math>Q</math>-learning is one of the most successful algorithms for finding the best
and the authors provided a V-sample complexity <math>\widetilde{\mathcal{O}}\left(\frac{d}{(1-\gamma)^3\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)</math> with <math>d</math> denoting the dimension of the features. This result matches the information-theoretic lower bound up to <math>\log(\cdot)</math> factors. Later work, <ref name="lattimore2020learning"/> considered the setting where the transition kernel can be approximated by linear functions with small approximation errors to make the model more robust to model mis-specification. In this case, the authors provided a V-sample complexity  of order <math>\widetilde{\mathcal{O}}\left(\frac{K}{(1-\gamma)^4\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)</math>. Although the dependence on <math>\gamma</math> is not as good as in <ref name="yang2019sample"/>, this framework can be applied to a more general class of models which are “near” linear. Although <math>Q</math>-learning is one of the most successful algorithms for finding the best
action-value function <math>Q</math>, its implementation often suffers from large high biased estimates of the <math>Q</math>-function values incurred by random sampling. The double <math>Q</math>-learning algorithm proposed in <ref name="hasselt2010double"/>
action-value function <math>Q</math>, its implementation often suffers from large high biased estimates of the <math>Q</math>-function values incurred by random sampling. The double <math>Q</math>-learning algorithm proposed in <ref name="hasselt2010double"/>
tackled this high estimation issue by randomly switching the update between two <math>Q</math>-estimators, and has thus gained significant popularity in practice. The first sample complexity result for the double <math>Q</math>-learning algorithm was established in <ref name="xiong2020finite"/> with a Q-sample complexity on the order of <math>\widetilde{\mathcal{O}}\left(\left(\frac{1}{(1-\gamma)^6\varepsilon^2}\ln\,\frac{|\mathcal{A}||\mathcal{S}|}{(1-\gamma)^7\varepsilon^2\delta}\right)^{1/\omega}+\left(\frac{1}{1-\gamma}\ln\,\frac{1}{(1-\gamma)^2\varepsilon}\right)^{1/(1-\omega)}\right)</math> and learning rate <math>\beta_n = \frac{1}{n^{w}}</math> for some constant <math>\omega \in (0,1)</math>.
tackled this high estimation issue by randomly switching the update between two <math>Q</math>-estimators, and has thus gained significant popularity in practice. The first sample complexity result for the double <math>Q</math>-learning algorithm was established in <ref name="xiong2020finite"/> with a Q-sample complexity on the order of <math>\widetilde{\mathcal{O}}\left(\left(\frac{1}{(1-\gamma)^6\varepsilon^2}\ln\,\frac{|\mathcal{A}||\mathcal{S}|}{(1-\gamma)^7\varepsilon^2\delta}\right)^{1/\omega}+\left(\frac{1}{1-\gamma}\ln\,\frac{1}{(1-\gamma)^2\varepsilon}\right)^{1/(1-\omega)}\right)</math> and learning rate <math>\beta_n = \frac{1}{n^{w}}</math> for some constant <math>\omega \in (0,1)</math>.
%{\color{RedOrange}[TODO Renyuan: add some comments on these orders -- which one is better than the others under what situation.]}
 
 
===<span id="subsec:policy_based_methods"></span>Policy-based Methods===
===<span id="subsec:policy_based_methods"></span>Policy-based Methods===
%{\color{ForestGreen}[TODO Huining: describe the setup: infinite horizon and Markov policy.]}
 
So far we have focused on value-based methods where we learn the value function to generate the policy using, for instance, the  <math>\varepsilon</math>-greedy approach. However, these methods may suffer from high computational complexity as the dimensionality of the state or action space increases (for example in continuous or unbounded spaces). In this case, it may be more effective to directly parametrize the policy rather than the value function. Furthermore, it is empirically observed that policy-based methods have better convergence properties than value-based methods <ref name="sewak2019policy"/><ref name="yu2020policy"/><ref name="daberius2019deep"/>. This stems from the fact that value-based methods  can have big oscillations during the training process since the choice of actions may need to change dramatically in order to have an arbitrarily small change in the estimated value functions.
So far we have focused on value-based methods where we learn the value function to generate the policy using, for instance, the  <math>\varepsilon</math>-greedy approach. However, these methods may suffer from high computational complexity as the dimensionality of the state or action space increases (for example in continuous or unbounded spaces). In this case, it may be more effective to directly parametrize the policy rather than the value function. Furthermore, it is empirically observed that policy-based methods have better convergence properties than value-based methods <ref name="sewak2019policy"/><ref name="yu2020policy"/><ref name="daberius2019deep"/>. This stems from the fact that value-based methods  can have big oscillations during the training process since the choice of actions may need to change dramatically in order to have an arbitrarily small change in the estimated value functions.
In this section, we focus on model-free policy-based methods, where we learn a parametrized policy without inferring the value function in the setting of infinite time horizon with discounting, finite state and action spaces, and stationary policies. Examples of policy-based methods include REINFORCE <ref name="williams1992"/>, Actor-Critic methods <ref name="konda2000"/>, Trust Region Policy Optimization (TRPO) <ref name="schulman2015"/>, Proximal Policy Optimization (PPO) <ref name="schulman2017proximal"/> among others. We first parametrize the policy <math>\pi</math> by a vector <math>\theta</math>, thus we define  <math>\pi(s,\cdot;\theta)\in \mathcal{P}(\mathcal{A})</math>,  the probability distribution parameterized by <math>\theta</math> over the action space given the state <math>s</math> at time <math>t</math>. Here we will use <math>\pi(s,a;\theta)</math> and <math>\pi_{\theta}</math> interchangeably to represent the policy parameterized by <math>\theta</math>. We write <math>\mu^{\pi_{\theta}}</math> for the stationary distribution of the state under policy <math>\pi(s,a;\theta)</math>. Then the policy objective function is defined to be
In this section, we focus on model-free policy-based methods, where we learn a parametrized policy without inferring the value function in the setting of infinite time horizon with discounting, finite state and action spaces, and stationary policies. Examples of policy-based methods include REINFORCE <ref name="williams1992"/>, Actor-Critic methods <ref name="konda2000"/>, Trust Region Policy Optimization (TRPO) <ref name="schulman2015"/>, Proximal Policy Optimization (PPO) <ref name="schulman2017proximal"/> among others. We first parametrize the policy <math>\pi</math> by a vector <math>\theta</math>, thus we define  <math>\pi(s,\cdot;\theta)\in \mathcal{P}(\mathcal{A})</math>,  the probability distribution parameterized by <math>\theta</math> over the action space given the state <math>s</math> at time <math>t</math>. Here we will use <math>\pi(s,a;\theta)</math> and <math>\pi_{\theta}</math> interchangeably to represent the policy parameterized by <math>\theta</math>. We write <math>\mu^{\pi_{\theta}}</math> for the stationary distribution of the state under policy <math>\pi(s,a;\theta)</math>. Then the policy objective function is defined to be
Line 531: Line 504:
Based on the policy gradient expression in \eqref{eqn:policy_gra_thm}, it is natural to estimate the expectation by averaging over samples of actual rewards, which is essentially the spirit of a standard ''Monte Carlo'' method that repeatedly simulates a trajectory of length <math>M</math> within each iteration. Now we introduce our first policy-based algorithm, called REINFORCE <ref name="williams1992"/> or Monte Carlo policy gradient (see [[#alg:REINFORCE |Algorithm]]). We use a simple empirical return function <math>G_t</math> to approximate the <math>Q</math>-function in \eqref{eqn:policy_gra_thm}. The return <math>G_t</math> is defined as the sum of the discounted rewards and given by <math>G_t:=\sum_{s=t+1}^{M} \gamma^{s-t-1}r_{s}</math>. Note that REINFORCE is a Monte Carlo algorithm since it employs the complete sample trajectories, that is, when estimating the return from time <math>t</math>, it uses all future rewards up until the end of the trajectory (see line 7 in [[#alg:REINFORCE |Algorithm]]). Within the <math>n</math>-iteration the policy parameter <math>\theta^{(n+1)}</math> is updated <math>M</math> times with <math>G_t</math> for <math>t=0,1,\cdots,M-1</math>.  
Based on the policy gradient expression in \eqref{eqn:policy_gra_thm}, it is natural to estimate the expectation by averaging over samples of actual rewards, which is essentially the spirit of a standard ''Monte Carlo'' method that repeatedly simulates a trajectory of length <math>M</math> within each iteration. Now we introduce our first policy-based algorithm, called REINFORCE <ref name="williams1992"/> or Monte Carlo policy gradient (see [[#alg:REINFORCE |Algorithm]]). We use a simple empirical return function <math>G_t</math> to approximate the <math>Q</math>-function in \eqref{eqn:policy_gra_thm}. The return <math>G_t</math> is defined as the sum of the discounted rewards and given by <math>G_t:=\sum_{s=t+1}^{M} \gamma^{s-t-1}r_{s}</math>. Note that REINFORCE is a Monte Carlo algorithm since it employs the complete sample trajectories, that is, when estimating the return from time <math>t</math>, it uses all future rewards up until the end of the trajectory (see line 7 in [[#alg:REINFORCE |Algorithm]]). Within the <math>n</math>-iteration the policy parameter <math>\theta^{(n+1)}</math> is updated <math>M</math> times with <math>G_t</math> for <math>t=0,1,\cdots,M-1</math>.  
As a standard Monte Carlo method, REINFORCE provides an unbiased estimate of the policy gradient, however it suffers from high variance which therefore may lead to slow convergence. A popular variant of REINFORCE is to subtract a ''baseline'', which is a function of the state <math>s</math>, from the return <math>G_t</math>. This reduces the variance of the policy gradient estimate while keeping the mean of the estimate unchanged. A commonly used baseline is an estimate of the value function, <math>\widehat{V}(s)</math>, which can be learnt using one of the methods introduced in [[#sec:value_based_methods |Section]]. Replacing the <math>G_t</math> in \eqref{eqn:alg_REINFORCE_update} by <math>G_t-\widehat{V}(s)</math> gives a REINFORCE algorithm with baseline.
As a standard Monte Carlo method, REINFORCE provides an unbiased estimate of the policy gradient, however it suffers from high variance which therefore may lead to slow convergence. A popular variant of REINFORCE is to subtract a ''baseline'', which is a function of the state <math>s</math>, from the return <math>G_t</math>. This reduces the variance of the policy gradient estimate while keeping the mean of the estimate unchanged. A commonly used baseline is an estimate of the value function, <math>\widehat{V}(s)</math>, which can be learnt using one of the methods introduced in [[#sec:value_based_methods |Section]]. Replacing the <math>G_t</math> in \eqref{eqn:alg_REINFORCE_update} by <math>G_t-\widehat{V}(s)</math> gives a REINFORCE algorithm with baseline.
<proc id="alg:REINFORCE" label = "\textbf{REINFORCE: Monte Carlo Policy Gradient}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>, learning rate <math>\beta</math>, a differentiable policy parametrization <math>\pi(s,a;\theta)</math>, finite length of the trajectory <math>M</math>.
<proc id="alg:REINFORCE" label = "REINFORCE: Monte Carlo Policy Gradient"><div class="ms-2"> '''input''': total number of iterations <math>N</math>, learning rate <math>\beta</math>, a differentiable policy parametrization <math>\pi(s,a;\theta)</math>, finite length of the trajectory <math>M</math>.
</div>
</div>
<div class="ms-2"> Initialize policy parameter <math>\theta^{(0)}</math>.
<div class="ms-2"> Initialize policy parameter <math>\theta^{(0)}</math>.
Line 552: Line 525:
</div>
</div>
'''end for'''</div>'''end for'''</div></proc>
'''end for'''</div>'''end for'''</div></proc>
====Actor-Critic Methods====
====Actor-Critic Methods====


Line 558: Line 533:
Here for the critic, we approximate the <math>Q</math>-function by a linear combination of features, that is, <math>Q(s,a;w)=\phi(s,a)^\top w</math> for some known feature <math>\phi:\mathcal{S}\times \mathcal{A}\rightarrow\mathbb{R}^d</math>, and use TD(0) to update the parameter <math>w</math> to minimize the least-square error of the TD error with the gradient of the least-square error taking the form <math>\delta \phi(s,a)</math> (see lines 9-10 in [[#alg:actor_critic |Algorithm]]). For the actor, we use the (vanilla) policy-based method to update the policy parameter <math>\theta</math> (see line 11 in [[#alg:actor_critic |Algorithm]]).
Here for the critic, we approximate the <math>Q</math>-function by a linear combination of features, that is, <math>Q(s,a;w)=\phi(s,a)^\top w</math> for some known feature <math>\phi:\mathcal{S}\times \mathcal{A}\rightarrow\mathbb{R}^d</math>, and use TD(0) to update the parameter <math>w</math> to minimize the least-square error of the TD error with the gradient of the least-square error taking the form <math>\delta \phi(s,a)</math> (see lines 9-10 in [[#alg:actor_critic |Algorithm]]). For the actor, we use the (vanilla) policy-based method to update the policy parameter <math>\theta</math> (see line 11 in [[#alg:actor_critic |Algorithm]]).
There are three main ways to execute the algorithm. In the ''nested-loop'' setting (see, e.g. <ref name="kumar2019sample"/><ref name="xu2020improving"/>), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the ''two time-scale'' setting (see, e.g. <ref name="xu2020non"/>), where the actor and the critic update their parameters simultaneously with different learning rates. Note that the learning rate for the actor is typically much smaller than that of the critic in this setting. In the third way, the ''single-scale'' setting, the actor and the critic update their parameters simultaneously but with a much larger learning rate for the actor than for the critic (see, e.g. <ref name="fu2020single"/><ref name="xu2021doubly"/>).
There are three main ways to execute the algorithm. In the ''nested-loop'' setting (see, e.g. <ref name="kumar2019sample"/><ref name="xu2020improving"/>), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the ''two time-scale'' setting (see, e.g. <ref name="xu2020non"/>), where the actor and the critic update their parameters simultaneously with different learning rates. Note that the learning rate for the actor is typically much smaller than that of the critic in this setting. In the third way, the ''single-scale'' setting, the actor and the critic update their parameters simultaneously but with a much larger learning rate for the actor than for the critic (see, e.g. <ref name="fu2020single"/><ref name="xu2021doubly"/>).
<proc id="alg:actor_critic" label = "\textbf{Actor-Critic Algorithm}"><div class="ms-2"> '''input''': A differentiable policy parametrization <math>\pi(s,a;\theta)</math>, a differentiable <math>Q</math>-function parameterization <math>Q(s,a;w)</math>, learning rates <math>\beta^{\theta} > 0</math> and <math>\beta^{w} > 0</math>, number of iterations <math>N</math>
<proc id="alg:actor_critic" label = "Actor-Critic Algorithm"><div class="ms-2"> '''input''': A differentiable policy parametrization <math>\pi(s,a;\theta)</math>, a differentiable <math>Q</math>-function parameterization <math>Q(s,a;w)</math>, learning rates <math>\beta^{\theta} > 0</math> and <math>\beta^{w} > 0</math>, number of iterations <math>N</math>
</div>
</div>
<div class="ms-2"> Initialize policy parameter <math>\theta^{(0)}</math> and <math>Q</math>-function parameter <math>w^{(0)}</math>
<div class="ms-2"> Initialize policy parameter <math>\theta^{(0)}</math> and <math>Q</math>-function parameter <math>w^{(0)}</math>
Line 571: Line 546:
<div class="ms-2"> Sample action <math>a^\prime\sim\pi(s^\prime,\cdot;\theta^{(n)})</math>
<div class="ms-2"> Sample action <math>a^\prime\sim\pi(s^\prime,\cdot;\theta^{(n)})</math>
</div>
</div>
<div class="ms-2"> {Compute the TD error:} <math>\delta\leftarrow r+\gamma Q(s^\prime,a^\prime;w^{(n)})-Q(s,a;w^{(n)})</math>
<div class="ms-2"> Compute the TD error: <math>\delta\leftarrow r+\gamma Q(s^\prime,a^\prime;w^{(n)})-Q(s,a;w^{(n)})</math>
</div>
</div>
<div class="ms-2"> {Update <math>w</math>:} <math>w^{(n+1)}\leftarrow w^{(n)}+\beta^{w}\delta\phi(s,a)</math>  
<div class="ms-2"> Update <math>w</math>: <math>w^{(n+1)}\leftarrow w^{(n)}+\beta^{w}\delta\phi(s,a)</math>  
</div>
</div>
<div class="ms-2"> <math>\theta^{(n+1)}\leftarrow \theta^{(n)} + \beta^{\theta}Q(s,a;w^{(n)})\nabla_{\theta}\ln\pi(s,a;\theta^{(n)})</math>
<div class="ms-2"> <math>\theta^{(n+1)}\leftarrow \theta^{(n)} + \beta^{\theta}Q(s,a;w^{(n)})\nabla_{\theta}\ln\pi(s,a;\theta^{(n)})</math>
Line 580: Line 555:
</div>
</div>
'''end for'''</div></proc>
'''end for'''</div></proc>


====<span id="sec:value_discussion"></span>Discussion====
====<span id="sec:value_discussion"></span>Discussion====
Line 647: Line 623:
</math>
</math>
where <math>s_{u+1} \sim P_{k,u}(s_u,a_u)</math> and <math>a_u\sim \pi_{k,u}(s_u)</math>.
where <math>s_{u+1} \sim P_{k,u}(s_u,a_u)</math> and <math>a_u\sim \pi_{k,u}(s_u)</math>.
\iffalse
'''Non-stationary RL with Infinite Time Horizon.'''
For the infinite horizon setting, an agent interacts with a non-stationary MDP starting from an initial timestamp <math>t</math>. The non-stationary environment can be denoted by a tuple
<math>(\mathcal{S},\mathcal{A},  \pmb{P}, \pmb{r})</math>, where <math>\mathcal{S}</math> is the state space, <math>\mathcal{A}</math> is the action space, <math>\pmb{P} = \{P_{t} \}_{t \ge 0}</math>
is the set of transition kernels with <math>P_{t}: \mathcal{S}\times \mathcal{A}\rightarrow\mathcal{P}(\mathcal{S})</math>,
and <math>\pmb{r} = \{r_{t} \}_{t \ge 0}</math> is the set of reward functions with <math>r_{t}(s,a)</math> a random variable in <math>\mathbb{R}</math> (depending on both the time-step and the episode number) for each pair <math>(s,a)\in \mathcal{S}\times \mathcal{A}</math>. Denote by <math>\Pi=\{\pi_{t} \}_{t \ge 0}</math> the Markovian policy in which <math>\pi_{t}</math> can be either deterministic or randomized. Now we consider the associated value function with a discount factor <math>\gamma</math>:
<math display="block">
\begin{eqnarray}
V^{\Pi}_{t}(s) = \left.\mathbb{E}^{\Pi}\left[\sum_{u=t}^\infty \gamma^{u-t} r_{u}(s_u,a_u)\right|s_t = s\right],
\end{eqnarray}
</math>
where <math>s_{u+1} \sim P_{u}(\cdot|s_u,a_u)</math> and <math>a_u\sim \pi_{u}(s_u)</math>.\\
\ben{This repeats too much how about:}
\fi


'''Non-stationary RL with Infinite Time Horizon.'''
'''Non-stationary RL with Infinite Time Horizon.''' For the infinite horizon setting, we drop the index indicating episodes and let the finite horizon <math>T</math> become infinite. Then the non-stationary environment can be denoted by a tuple
For the infinite horizon setting, we drop the index indicating episodes and let the finite horizon <math>T</math> become infinite. Then the non-stationary environment can be denoted by a tuple
<math>(\mathcal{S},\mathcal{A},  \pmb{P}, \pmb{r})</math>, with the obvious changes from the episodic case. Again we write <math>\Pi=\{\pi_{t} \}_{t \ge 0}</math> for the Markovian policy in which <math>\pi_{t}</math> can be either deterministic or randomized and consider the associated value function with a discount factor <math>\gamma</math>:
<math>(\mathcal{S},\mathcal{A},  \pmb{P}, \pmb{r})</math>, with the obvious changes from the episodic case. Again we write <math>\Pi=\{\pi_{t} \}_{t \ge 0}</math> for the Markovian policy in which <math>\pi_{t}</math> can be either deterministic or randomized and consider the associated value function with a discount factor <math>\gamma</math>:


Line 673: Line 632:
\end{eqnarray}
\end{eqnarray}
</math>
</math>
where <math>s_{u+1} \sim P_{u}(\cdot|s_u,a_u)</math> and <math>a_u\sim \pi_{u}(s_u)</math>.\\
where <math>s_{u+1} \sim P_{u}(\cdot|s_u,a_u)</math> and <math>a_u\sim \pi_{u}(s_u)</math>.


Indeed, there has been a recent surge of theoretical studies on this topic for various settings such as infinite-horizon  MDPs with finite state and action spaces <ref name="gajane2018sliding"/><ref name="cheung2019learning"/>, episodic MDPs with finite state and action spaces <ref name="mao2020model"/>, and episodic linear MDPs <ref name="touati2020efficient"/>.
Indeed, there has been a recent surge of theoretical studies on this topic for various settings such as infinite-horizon  MDPs with finite state and action spaces <ref name="gajane2018sliding"/><ref name="cheung2019learning"/>, episodic MDPs with finite state and action spaces <ref name="mao2020model"/>, and episodic linear MDPs <ref name="touati2020efficient"/>.

Revision as of 01:18, 12 May 2024

[math] \newcommand*{\rom}[1]{\expandafter\@slowromancap\romannumeral #1@} \DeclareMathOperator*{\dprime}{\prime \prime} \DeclareMathOperator{\Tr}{Tr} \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\N}{\mathbb{N}} \DeclareMathOperator{\R}{\mathbb{R}} \DeclareMathOperator{\Sc}{\mathcal{S}} \DeclareMathOperator{\Ac}{\mathcal{A}} \DeclareMathOperator{\Pc}{\mathcal{P}} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\sx}{\underline{\sigma}_{\pmb{X}}} \DeclareMathOperator{\sqmin}{\underline{\sigma}_{\pmb{Q}}} \DeclareMathOperator{\sqmax}{\overline{\sigma}_{\pmb{Q}}} \DeclareMathOperator{\sqi}{\underline{\sigma}_{Q,\textit{i}}} \DeclareMathOperator{\sqnoti}{\underline{\sigma}_{\pmb{Q},-\textit{i}}} \DeclareMathOperator{\sqfir}{\underline{\sigma}_{\pmb{Q},1}} \DeclareMathOperator{\sqsec}{\underline{\sigma}_{\pmb{Q},2}} \DeclareMathOperator{\sru}{\underline{\sigma}_{\pmb{R}}^{u}} \DeclareMathOperator{\srv}{\underline{\sigma}_{\pmb{R}}^v} \DeclareMathOperator{\sri}{\underline{\sigma}_{R,\textit{i}}} \DeclareMathOperator{\srnoti}{\underline{\sigma}_{\pmb{R},\textit{-i}}} \DeclareMathOperator{\srfir}{\underline{\sigma}_{\pmb{R},1}} \DeclareMathOperator{\srsec}{\underline{\sigma}_{\pmb{R},2}} \DeclareMathOperator{\srmin}{\underline{\sigma}_{\pmb{R}}} \DeclareMathOperator{\srmax}{\overline{\sigma}_{\pmb{R}}} \DeclareMathOperator{\HH}{\mathcal{H}} \DeclareMathOperator{\HE}{\mathcal{H}(1/\varepsilon)} \DeclareMathOperator{\HD}{\mathcal{H}(1/\varepsilon)} \DeclareMathOperator{\HCKI}{\mathcal{H}(C(\pmb{K}^0))} \DeclareMathOperator{\HECK}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HECKI}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))} \DeclareMathOperator{\HC}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HCK}{\mathcal{H}(C(\pmb{K}))} \DeclareMathOperator{\HCKR}{\mathcal{H}(1/\varepsilon,1/{\it{r}},C(\pmb{K}))} \DeclareMathOperator{\HCKR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,1/{\it{r}},C(\pmb{K}^0))} \DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))} \newcommand{\mathds}{\mathbb}[/math]

Reinforcement learning is an approach to understanding and automating goal-directed learning and decision-making. It leads naturally to algorithms for such problems and is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with {their} environment, without relying on exemplary supervision or complete models of the environment. RL uses a formal framework defining the interaction between a learning agent and {their} environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and non-determinism, and the existence of explicit goals. The history of RL has two main threads that were pursued independently before intertwining in modern RL. One thread concerns learning by trial and error and started in the psychology of animal learning. The other thread concerns the problem of optimal control for an evolving system and its solution using value functions and dynamic programming. Although this thread did not involve learning directly, the Bellman equation developed from this line of literature is viewed as the foundation of many important modern RL algorithms such as [math]Q[/math]-learning and Actor-Critic. Over the last few years, RL, grounded on combining classical theoretical results with deep learning and the functional approximation paradigm, has proved to be a fruitful approach to many artificial intelligence tasks from diverse domains. Breakthrough achievements include reaching human-level performance in such complex tasks as Go [1] and multi-player StarCraft II [2], as well as the development of ChatGPT [3]. The generality of the reinforcement learning framework allows its application in both discrete and continuous spaces to solve tasks in both real and simulated environments [4]. Classical RL research during the last third of the previous century developed an extensive theoretical core on which modern algorithms are based. Several algorithms, such as Temporal-Difference Learning and [math]Q[/math]-learning, were developed and are able to solve small-scale problems when either the states of the environment can be enumerated (and stored in memory) or the optimal policy lies in the space of linear or quadratic functions of the state variable. Although these restrictions are extremely limiting, these foundations of classical RL theory underlie the modern approaches which benefit from increased computational power. Combining this framework with deep learning [5] was popularized by the Deep [math]Q[/math]-learning algorithm, introduced in [6], which was able to play any of 57 Atari console games without tweaking the network architecture or algorithm hyperparameters. This novel approach was extensively researched and significantly improved over the following years [7][8][9].

Our aim in this section is to introduce the foundations of reinforcement learning. We start with the setup for MDP in Section with both an infinite time horizon and a finite time horizon, as there are financial applications of both settings in the literature. In Section we then introduce the learning procedure when the reward and transition dynamics of the MDPs are unknown to the decision-maker. In particular, we introduce various criteria to measure the performance of learning algorithms in different settings. Then we focus on two major classes of model-free reinforcement learning algorithms, value-based methods in section value based methods and policy-based methods in section policy based methods, with an infinite-time horizon. Finally, section discussion contains a general discussion of (model-based and model-free) RL with a finite-time horizon, model-based RL with an infinite-time horizon, and regularization methods for RL.

Setup: Markov Decision Processes

We first introduce MDPs with infinite time horizon. Many portfolio optimization problems, such as those arising from pension funds, investigate long-term investment strategies and hence it is natural to formulate them as a decision-making problem over an infinite time horizon. We then introduce MDPs with finite time horizon. Problems such as the optimal execution of the purchase or liquidation of a quantity of asset usually involve trading over a short time horizon and there may be a penalty at the terminal time if the targeted amount is not completely executed by then. Finally we discuss different policy classes that are commonly adopted by the decision-maker.


Infinite Time Horizon and Discounted Reward. Let us start with a discrete time MDP with an infinite time horizon and discounted reward. We have a Markov process taking values in a state space [math]\Sc[/math], where we can influence the evolution by taking an action from a set [math]\Ac[/math]. The aim is to optimize the expected discounted return from the system by choosing a policy, that is a sequence of actions through time. We formulate this mathematically by defining the value function [math]V^\star[/math] for each [math]s\in \Sc[/math] to be

[[math]] \begin{eqnarray} \label{equ: singlev} V^\star(s) = \sup_{\Pi} V^{\Pi}(s):= \sup_{\Pi}\E^{\Pi}\biggl[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t) \biggl| s_0 = s\biggl], \end{eqnarray} [[/math]]

subject to

[[math]] \begin{eqnarray} \label{equ: singledynamics} s_{t + 1} \sim P(s_t, a_t), \;\;\; a_t \sim \pi_t({s_t}). \end{eqnarray} [[/math]]

Here and throughout the paper, [math]\E^{\Pi}[/math] denotes the expectation under the policy [math]{\Pi}[/math], where the probability measure [math]\mathbb{P}[/math] describes both dynamics and the rewards in the MDP. We will write [math]\mathcal{P}(\mathcal{X})[/math] for the probability measures over a space [math]\mathcal{X}[/math]. The state space [math](\Sc, d_{\Sc})[/math] and the action space [math](\Ac, d_{\Ac})[/math] are both complete separable metric spaces, which includes the case of [math]\Sc[/math] and [math]\Ac[/math] being finite, as often seen in the RL literature; [math]\gamma[/math] [math]\in[/math] [math](0, 1)[/math] is a discount factor; [math]s_t \in \Sc[/math] and [math]a_t \in \Ac[/math] are the state and the action at time [math]t[/math]; [math]P: \Sc \times \Ac \to \Pc(\Sc)[/math] is the transition function of the underlying Markov process; the notation [math] s_{t + 1} \sim P(s_t, a_t)[/math] denotes that [math] s_{t + 1}[/math] is sampled from the distribution [math] P(s_t, a_t)\in \mathcal{P}(\mathcal{S})[/math]; the reward [math]r(s, a)[/math] is a random variable in [math]\R[/math] for each pair [math](s, a)\in \Sc \times \Ac[/math]; and the policy [math]\Pi =\{\pi_t\}_{t=0}^{\infty}[/math] is Markovian, in that it just depends on the current state, and can be either deterministic or randomized. For a deterministic policy, [math]\pi_t: \Sc \to \Ac[/math] maps the current state [math]s_t[/math] to a deterministic action. On the other hand, a randomized policy [math]\pi_t: \Sc\to \Pc(\Ac)[/math] maps the current state [math]s_t[/math] to a distribution over the action space. For a randomized policy [math]\pi_t[/math], we will use [math]\pi_t(s)\in \mathcal{P}(\mathcal{A})[/math] and [math]\pi_t(s,a)[/math] to represent, respectively, the distribution over the action space given state [math]s[/math] and the probability of taking action [math]a[/math] at state [math]s[/math]. In this case, with infinite-time horizon, we assume the reward [math]r[/math] and the transition dynamics [math]P[/math] are time homogeneous, which is a standard assumption in the MDP literature [10]. We also note that the setup is essentially the same if we consider the problem of minimizing costs rather than maximizing a reward.

The well-known Dynamic Programming Principle (DPP), that the optimal policy can be obtained by maximizing the reward from one step and then proceeding optimally from the new state, can be used to derive the following Bellman equation for the value function \eqref{equ: singlev};

[[math]] \begin{eqnarray} \label{equ:classicalV} V^\star(s) = \sup_{a \in \Ac} \Big\lbrace \E[r(s, a)] + \gamma \E_{s' \sim P(s, a)}[V^\star(s')] \Big\rbrace. \end{eqnarray} [[/math]]

We write the value function as

[[math]] V^\star(s) = \sup_{a \in \Ac} Q^\star(s, a), [[/math]]

where the [math]Q[/math]-function, one of the basic quantities used for RL, is defined to be

[[math]] \begin{eqnarray} \label{defsingleQ} Q^\star(s, a) = \E[r(s, a)] + \gamma \E_{s' \sim P(s, a)}[V^\star(s')], \end{eqnarray} [[/math]]

the expected reward from taking action [math]a[/math] at state [math]s[/math] and then following the optimal policy thereafter. There is also a Bellman equation for the [math]Q[/math]-function given by

[[math]] \begin{eqnarray} \label{equ: singleQ} Q^\star(s, a) = \E[r(s, a)] + \gamma \E_{s' \sim P(s, a)}\sup_{a' \in \Ac} Q^\star(s', a'). \end{eqnarray} [[/math]]

This allows us to retrieve the optimal (stationary) policy [math]\pi^*(s, a)[/math] (if it exists) from [math]Q(s, a)[/math], in that [math]\pi^*(s, a) \in \arg\max_{a \in \Ac} Q(s, a)[/math]. An alternative setting over an infinite time horizon is the case of average reward, which is also referred to as an ergodic reward. This ergodic setting is not as relevant to financial applications and hence will not be covered in this review paper. We refer the reader to [11][12] for a more detailed discussion of RL with infinite-time horizon and average reward.


Finite Time Horizon. When there is a finite time horizon [math]T \lt \infty[/math] we no longer discount future values and have a terminal reward. The MDP problem with finite time horizon can be expressed as

[[math]] \begin{eqnarray} \label{equ: singlev_fh} V^\star_t(s) = \sup_{\Pi} V_t^{\Pi}(s):= \sup_{\Pi}\E^{\Pi}\biggl[\sum_{u=t}^{T-1} r_u(s_u, a_u) +r_T(s_T)\biggl| s_t = s\biggl], \;\; \forall s\in \Sc, \end{eqnarray} [[/math]]

subject to

[[math]] \begin{eqnarray} \label{equ: singledynamics_fh} s_{u + 1} \sim P_u(s_u, a_u), \;\;\; a_u \sim \pi_u(s_u),\;\; t \leq u \leq T-1. \end{eqnarray} [[/math]]

As in the infinite horizon case, we denote by [math]s_u \in \Sc[/math] and [math]a_u \in \Ac[/math] the state and the action of the agent at time [math]u[/math]. However, where this differs from the infinite horizon case, is that we allow time-dependent transition and reward functions. Now [math]P_u: \Sc \times \Ac \to \Pc(\Sc)[/math] denotes the transition function and [math]r_u(s, a)[/math] a real-valued random variable for each pair [math](s, a)\in \Sc \times \Ac[/math] for [math]t \leq u \leq T-1[/math]. Similarly [math]r_T(s)[/math], the terminal reward, is a real-valued random variable for all [math]s\in \Sc[/math]. Finally, the Markovian policy [math]\Pi =\{\pi_u\}_{t=0}^T[/math] can be either deterministic or randomized. The corresponding Bellman equation for the value function \eqref{equ: singlev_fh} is defined as

[[math]] \begin{eqnarray} \label{equ:classicalV_fh} V^\star_t(s) = \sup_{a \in \Ac} \Big\lbrace \E[r_t(s, a)] + \E_{s' \sim P_{t}(s, a)}[V^\star_{t+1}(s')] \Big\rbrace, \end{eqnarray} [[/math]]

with the terminal condition [math]V^\star_T(s) = \mathbb{E}[r_{T}(s)][/math]. We write the value function as

[[math]] V^\star_t(s) = \sup_{a \in \Ac} Q^\star_t(s, a), [[/math]]

where the [math]Q^\star_t[/math] function is defined to be

[[math]] \begin{eqnarray} \label{defsingleQ_fh} Q^\star_t(s, a) = \E[r_t(s, a)] + \E_{s' \sim P_t(s, a)}[V^\star_t(s')]. \end{eqnarray} [[/math]]

There is also a Bellman equation for the [math]Q[/math]-function given by

[[math]] \begin{eqnarray} \label{equ: singleQ_fh} Q^\star_t(s, a) = \E[r_t(s, a)] + \E_{s' \sim P_t(s, a)}\left[\sup_{a' \in \Ac} Q^\star_{t+1}(s', a')\right], \end{eqnarray} [[/math]]

with the terminal condition [math]Q^\star_T(s,a) = \mathbb{E}[r_T(s)][/math] for all [math]a\in \mathcal{A}[/math]. We note that since financial time series data are typically non-stationary, time-varying transition kernels and reward functions in \eqref{equ: singlev_fh}-\eqref{equ: singledynamics_fh} are particularly important for financial applications.

For an infinite horizon MDP with finite state and action space, and finite reward [math]r[/math], a further useful observation is that such an MDP always has a stationary optimal policy, whenever an optimal policy exists.

Theorem (Theorem 6.2.7 in [10])

Assume [math]|\mathcal{A}| \lt \infty[/math], [math]|\mathcal{S}| \lt \infty[/math] and [math]|r| \lt \infty[/math] with probability one. For any infinite horizon discounted MDP, there always exists a deterministic stationary policy that is optimal.

Theorem implies that there always exists a fixed policy so that taking actions specified by that policy at each time step maximizes the discounted reward. The agent does not need to change policies with time. There is a similar result for the average reward case, see Theorem 8.1.2 in [10]. This insight reduces the question of finding the best sequential decision-making strategy to the question of finding the best stationary policy. To this end, we write [math]\pi:\mathcal{S}\rightarrow \mathcal{P}(\mathcal{A})[/math] (without a time index) for a stationary policy throughout the paper.


Linear MDPs and Linear Functional Approximation. Linear MDPs have been extensively studied in the recent literature on theoretical RL in order to establish tractable and efficient algorithms. In a linear MDP, both the transition kernels and the reward function are assumed to be linear with respect to some feature mappings [13][14]. In the infinite horizon setting, an MDP is said to be linear with a feature map [math]\phi: \mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R}^d[/math], if there exist [math]d[/math] unknown (signed) measures [math]\mu = (\mu^{(1)},\cdots,\mu^{(d)})[/math] over [math]\mathcal{S}[/math] and an unknown vector [math]\theta \in \mathbb{R}^d[/math] such that for any [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math], we have

[[math]] \begin{eqnarray}\label{eq:linearMDP_infinite} P(\cdot|s,a) = \langle\, \phi(s,a),\mu(\cdot)\,\rangle,\quad r(s,a) = \langle\, \phi(s,a),\theta\,\rangle. \end{eqnarray} [[/math]]


Similarly for the finite horizon setting, an MDP is said to be linear with a feature map [math]\phi: \mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R}^d[/math], if for any [math]0\leq t \leq T[/math], there exist [math]d[/math] unknown (signed) measures [math]\mu_t = (\mu_t^{(1)},\cdots,\mu_t^{(d)})[/math] over [math]\mathcal{S}[/math] and an unknown vector [math]\theta_t \in \mathbb{R}^d[/math] such that for any [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math], we have

[[math]] \begin{eqnarray}\label{eq:linearMDP_finite} P_t(\cdot|s,a) = \langle\, \phi(s,a),\mu_t(\cdot)\,\rangle,\quad r_t(s,a) = \langle\, \phi(s,a),\theta_t\,\rangle, \end{eqnarray} [[/math]]

Typically features are assumed to be known to the agent and bounded, namely, [math]\|\phi(s,a)\|\leq 1[/math] for all [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math]. The linear MDP framework is closely related to the literature on RL with linear functional approximation, where the value function is assumed to be of the form

[[math]] \begin{eqnarray}\label{eq:lfa_infinite} Q(s,a) = \langle\, \psi(s,a), \omega \,\rangle, \quad V(s) = \langle\, \xi(s), \eta \,\rangle \end{eqnarray} [[/math]]

for the infinite horizon case, and

[[math]] \begin{eqnarray}\label{eq:lfa_finite} Q_t(s,a) = \langle\, \psi(s,a), \omega_t \,\rangle, \quad V_t(s) = \langle\, \xi(s), \eta_t \,\rangle, \forall\, 0\leq t \leq T \end{eqnarray} [[/math]]

for the finite horizon case. Here [math]\psi: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}^d[/math] and [math]\xi: \mathcal{S}\rightarrow \mathbb{R}^d[/math] are known feature mappings and [math]\omega[/math], [math]\omega_t[/math] [math]\eta[/math], and [math]\eta_t[/math] are unknown vectors. It has been shown that the linear MDP (i.e., \eqref{eq:linearMDP_infinite} or \eqref{eq:linearMDP_finite}) and the linear functional approximation formulation (i.e., \eqref{eq:lfa_infinite} or \eqref{eq:lfa_finite}) are equivalent under mild conditions [15][16]. In addition to linear functional approximations, we refer the reader to [17] for a general discussion of RL with nonlinear functional approximations and to Section \ref{sec:deep_value_based} for neural network approximation, one of the most popular nonlinear functional approximations used in practice.

Nonlinear Functional Approximation. Compared to linear functional approximations, nonlinear functional approximations do not require knowledge of the kernel functions a priori. Nonlinear functional approximation could potentially address the mis-specification issue when the agent has an incorrect understanding of the functional space that the MDP lies in. The most popular nonlinear functional approximation approach is to use neural networks, leading to deep RL. Thanks to the universal approximation theorem, this is a theoretically sound approach and neural networks show promising performance for a wide range of applications. Meanwhile gradient-based algorithms for certain neural network architectures enjoy provable convergence guarantees. We defer the discussion of deep RL to Section \ref{sec:deep_value_based} and its financial applications to Section \ref{sec:fin_app}.

From MDP to Learning

When the transition dynamics [math]P[/math] and the reward function [math]r[/math] for the infinite horizon MDP problem are unknown, this MDP becomes a reinforcement learning problem, which is to find an optimal stationary policy [math]\pi[/math] (if it exists) while simultaneously learning the unknown [math]P[/math] and [math]r[/math]. The learning of [math]P[/math] and [math]r[/math] can be either explicit or implicit, which leads to model-based and model-free RL, respectively. The analogous ideas hold for the finite horizon case. We introduce some standard RL terminology. A more detailed introduction to RL can be found in textbooks such as [18][19].

Agent-environment Interface. In the RL setting, the learner or the decision-maker is called the agent. The physical world that the agent operates in and interacts with, comprising everything outside the agent, is called the environment. The agent and environment interact at each of a sequence of discrete time steps, [math]t =0,1,2,3,\cdots[/math], in the following way. At the beginning of each time step [math]t[/math], the agent receives some representation of the environment's state, [math]s_t \in \mathcal{S}[/math] and selects an action [math]a_t\in \mathcal{A}[/math]. At the end of this time step, in part as a consequence of its action, the agent receives a numerical reward [math]r_t[/math] (possibly stochastic) and a new state [math]s_{t+1}[/math] from the environment. The tuple [math](s_t,a_t,r_t,s_{t+1})[/math] is called a sample at time [math]t[/math] and [math]h_t :=\{(s_u,a_u,r_u,s_{u+1})\}_{u=0}^t[/math] is referred to as the history or experience up to time [math]t[/math]. An RL algorithm is a finite sequence of well-defined policies for the agent to interact with the environment.


Exploration vs Exploitation. In order to maximize the accumulated reward over time, the agent learns to select her actions based on her past experiences (exploitation) and/or by trying new choices (exploration). Exploration provides opportunities to improve performance from the current sub-optimal solution to the ultimate globally optimal one, yet it is time consuming and computationally expensive as over-exploration may impair the convergence to the optimal solution. Meanwhile, pure exploitation, i.e., myopically picking the current solution based solely on past experience, though easy to implement, tends to yield sub-optimal global solutions. Therefore, an appropriate trade-off between exploration and exploitation is crucial in the design of RL algorithms in order to improve the learning and the optimization performance. See Section for a more detailed discussion.


Simulator. In the procedure described above on how the agent interacts with the environment, the agent does so in an online fashion. Namely, the initial state at time step [math]t+1[/math], [math]s_{t+1}[/math], is the state that the agent moves to after taking action [math]a_t[/math] from state [math]s_t[/math]. This is a challenging setting since an efficient exploration scheme is needed. For instance, [math]Q[/math]-learning with the [math]\varepsilon[/math]-greedy policy (to be introduced in Section) may take exponentially many episodes to learn the optimal policy [20]. Some results assume access to a simulator [21] (a.k.a., a generative model [22]), which allows the algorithm to query arbitrary state-action pairs and return the reward and the next state. This implies the agent can “restart” the system at any time. That is, the initial state at time step [math]t+1[/math] does not need to be the state that agent moves to after taking action [math]a_t[/math] from state [math]s_t[/math]. The “simulator” significantly alleviates the difficulty of exploration, since a naive exploration strategy which queries all state-action pairs uniformly at random already leads to the most efficient algorithm for finding an optimal policy [22].

Randomized Policy vs Deterministic Policy. A randomized policy [math]\pi_t:\Sc \rightarrow \mathcal{P}(\Ac)[/math] is also known in the control literature as a relaxed control and in game theory as a mixed strategy. Despite the existence of a deterministic optimal policy as stated in Theorem for the infinite time horizon case, most of the RL algorithms adopt randomized policies to encourage exploration when RL agents are not certain about the environment.


An Example: Multi-armed bandits. We illustrate these points by discussing multi-armed bandit problems, a special case of RL problems. The multi-armed bandit is a model for a set of slot machines. A simple version is that there are a number of arms, each with a stochastic reward coming from a fixed probability distribution which is initially unknown. We try these arms in some order, which may depend on the sequence of rewards that have been observed so far and attempt to maximize our return. A common objective in this context is to find a policy for choosing the next arm to be tried, under which the sum of the expected rewards comes as close as possible to the ideal reward, i.e., the expected reward that would be obtained if we were to try the “best” arm at all times. Thus the agent interacts with the environment by selecting an action - pulling an arm - and receiving a reward. The policy of the order of pulling on the arms has to balance exploitation, the continued use of an arm that is returning a decent reward, with exploration, sampling arms about which there is less information to find one with a possibly better return. In this setting for a multi-armed bandit problem we have not involved state dynamics. In this case, an admissible policy is simply a distribution over the action space for a randomized policy and a single arm for a deterministic policy. A simple example of a bandit with a state space is the stochastic contextual bandit which is used extensively in the personalized advertisement recommendation and clinical treatment recommendation literature. In this case the state dynamics (also referred to as the context) are sampled i.i.d. from an unknown distribution. For the personalized advertisement recommendation example, the state at time [math]t[/math] can be interpreted as the personal information and purchasing behavior of a targeted client. The problem was first considered in the seminal work of Robbins [23], which derives policies that asymptotically attain an average reward that converges in the limit to the reward of the best arm. The multi-armed bandit problem was later studied in discounted, Bayesian, Markovian, expected reward, and adversarial setups. See Berry and Fristedt [24] for a review of the classical results on the multi-armed bandit problem.

Performance Evaluation

It is not possible to solve MDPs analytically in general and therefore we will discuss numerical algorithms to determine the optimal policies. In order to assess different algorithms we will need a measure of their performance. Here we will consider several types of performance measure: sample complexity, rate of convergence, regret analysis and asymptotic convergence. For RL with a finite time horizon, one episode contains a sequence of states, actions and rewards, which starts at time [math]0[/math] and ends at the terminal time [math]T[/math], and the performance is evaluated in terms of the total number of samples in all episodes.[a] Sometimes we also refer to this setting as episodic RL. For RL with infinite time horizon, the performance is evaluated in terms of the number of steps. In this setting one step contains one (state, action, reward, next state) tuple. It is worth noting that the episodic criterion can also be used to analyze infinite horizon problems. One popular technique is to truncate the trajectory (with infinite steps) at a large time and hence translate the problem into an approximating finite time horizon problem. Examples include REINFORCE [27] and the policy gradient method [28] (see policy based methods). Throughout the analysis, we use the notation [math]\widetilde{O}(\cdot)[/math] to hide logarithmic factors when describing the order of the scaling in [math]\varepsilon[/math], [math]\delta[/math], [math]|\mathcal{A}|[/math] and [math]|\mathcal{S}|[/math] (when these are finite), and other possible model parameters such as the discount rate [math]\gamma[/math] in the infinite horizon case or the dimension [math]d[/math] of the features when functional approximation is used. We write [math]\widetilde{\mathcal{O}}^{\prime}[/math] rather than [math]\widetilde{\mathcal{O}}[/math] to indicate that, although there may be other parameters, we only include the dependence on [math]\varepsilon[/math]. We denote by [math]\mbox{poly}(\cdot)[/math] a polynomial function of its arguments.

Sample Complexity. In RL, a sample [math](s,a,r,s^{\prime})[/math] is defined as a tuple consisting of a state [math]s[/math], an action [math]a[/math], an instantaneous reward received by taking action [math]a[/math] at state [math]s[/math], and the next state [math]s^{\prime}[/math] observed afterwards. Sample complexity is defined as the total number of samples required to find an approximately optimal policy. This evaluation criterion can be used for any kind of RL problem. For episodic RL, an algorithm returns, after the end of the [math](k-1)[/math]-th episode with [math]M_{k-1}[/math] samples used in this episode, a policy [math]\pi_k[/math] to be applied in the [math]k[/math]-th episode. The sample complexity of this algorithm is the minimum number [math]\sum_{k=1}^K M_{k-1}(\varepsilon)[/math] of samples such that for all [math]k \ge K[/math], [math]\pi_k[/math] is [math]\varepsilon[/math]-optimal with probability at least [math]1-\delta[/math], i.e., for [math]k \ge K[/math],

[[math]] \begin{eqnarray}\label{eq:sample_complexity_epi} \mathbb{P}\left(V_0^{\pi_k}\ge V_0^\star-\varepsilon\right) \ge 1-\delta. \end{eqnarray} [[/math]]


For discounted RL, an algorithm returns, after the end of the [math](n-1)[/math]-th step with [math]M_{n-1}[/math] samples used in this iteration, a policy [math]\pi_n[/math] to be applied in the [math]n[/math]-th step. There are several notions of sample complexity in this setting. The most commonly used ones are defined through the value function and the [math]Q[/math]-function with all input variables (i.e., [math]s\in \mathcal{S}[/math] for [math]V[/math] and [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math] for [math]Q[/math]) and they are referred to as the [math]V[/math]-sample complexity and [math]Q[/math]-sample complexity in the literature. The [math]Q[/math]-sample complexity of a given algorithm is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]Q^{(n)}[/math] is [math]\varepsilon[/math]-close to [math]Q^*[/math], that is

[[math]] \begin{eqnarray}\label{eq:sample_complexity_infi_Q} \|Q^{(n)} - Q^*\|_{\infty}:=\sup_{s\in\mathcal{S},a\in \mathcal{A}}\Big|Q^*(s,a)-Q^{(n)}(s,a)\Big| \leq \varepsilon, \end{eqnarray} [[/math]]

holds either with high probability (that is [math]\mathbb{P}\left(\|Q^{(n)}-Q^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math]) or in the expectation sense [math]\mathbb{E}[\|Q^{(n)}-Q^*\|_{\infty}] \lt \varepsilon[/math]. Similarly, the [math]V[/math]-sample complexity is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]V^{(n)}[/math] is [math]\varepsilon[/math]-close to [math]V^*[/math], that is

[[math]] \begin{eqnarray}\label{eq:sample_complexity_infi_V_2} \|V^{(n)} - V^*\|_{\infty}:=\sup_{s\in\mathcal{S}}\left|V^{(n)}(s)-V^*(s)\right|\leq \varepsilon, \end{eqnarray} [[/math]]

holds either with high probability (in that [math]\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math]) or in the expectation sense [math]\mathbb{E}[\|V^{(n)}-V^*\|_{\infty}] \lt \varepsilon[/math]. Note that \eqref{eq:sample_complexity_infi_Q} implies \eqref{eq:sample_complexity_infi_V_2} since [math]V^{(n)}(s) = \sup_{a\in \mathcal{A}}Q^{(n)}(s,a)[/math] and [math]V^{*}(s) = \sup_{a\in \mathcal{A}}Q^*(s,a)[/math]. In our analysis we do not distinguish between whether the sample complexity bounds for \eqref{eq:sample_complexity_infi_Q} and \eqref{eq:sample_complexity_infi_V_2} are in the high probability sense or in the expectation sense. The second type of sample complexity, the sample complexity of exploration, is defined as the number of samples such that the non-stationary policy [math]\pi_n[/math] at time [math]n[/math] is not [math]\varepsilon[/math]-optimal for the current state [math]s_n[/math]. This measure counts the number of mistakes along the whole trajectory. Mathematically speaking, the sample complexity of exploration for a given algorithm is the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]\pi_n[/math] is [math]\varepsilon[/math]-optimal when starting from the current state [math]s_n[/math] with probability at least [math]1-\delta[/math], i.e., for [math]n \ge N[/math],

[[math]] \begin{eqnarray}\label{eq:sample_complexity_infi_V} \mathbb{P}\left(\left|V^{\pi_n}(s_n)- V^\star(s_n)\right|\leq \varepsilon\right) \ge 1-\delta. \end{eqnarray} [[/math]]

Note that the condition [math]\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math] associated with \eqref{eq:sample_complexity_infi_V_2} is stronger and implies the condition in \eqref{eq:sample_complexity_infi_V}. Another weaker criterion, the mean-square sample complexity, measures the sample complexity in the average sense with respect to the steady state distribution or the initial distribution [math]\mu[/math]. In this case, the mean-square sample complexity is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math],

[[math]] \begin{eqnarray}\label{eq:sample_complexity_infi_V_avg} \|V^{\pi_n} - V^*\|_{\mu}:=\sqrt{\int_{\mathcal{S}}(V^{\pi_n}(s)-V^*(s))^2\mu(ds)} \leq \varepsilon. \end{eqnarray} [[/math]]

Since \eqref{eq:sample_complexity_infi_V_avg} is an aggregated guarantee via the [math]V[/math] function over a state distribution [math]\mu[/math], it is implied by \eqref{eq:sample_complexity_infi_V_2}.

{\color{blue} What does [math]\widetilde{\mathcal{O}}[/math] mean?}

Rate of Convergence. To emphasize the convergence with respect to the number of iterations required, we introduce the notion of rate of convergence, which uses the relationship between the number of iterations/steps [math]N[/math] and the error term [math]\varepsilon[/math], to quantify how fast the learned policy converges to the optimal solution. This is also called the iteration complexity in the RL and optimization literature. Compared to the notion of sample complexity introduced above, the rate of convergence calculates the number of iterations needed to reach a given accuracy while ignoring the number of samples used within each iteration. When only one sample is used in each iteration, the rate of convergence coincides with the sample complexity. In addition when a constant number of samples (i.e., independent from [math]\varepsilon[/math]) are used within each iteration, the rate of convergence and the sample complexity are of the same order with respect to [math]\varepsilon[/math]. In particular, an algorithm is said to converge at a linear rate if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(\log(1/\varepsilon))[/math]. Similarly, an algorithm is said to converge at a sublinear rate (slower than linear) if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^p)[/math] with [math]p\geq 1[/math], and is said to converge at a superlinear (faster than linear) if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(\log(\log(1/\varepsilon))[/math].

Regret Analysis. The regret of a given policy [math]\pi[/math] is defined as the difference between the cumulative reward of the optimal policy and that gathered by [math]\pi[/math]. It quantifies the exploration-exploitation trade-off. For episodic RL with horizon [math]T[/math] in formulation \eqref{equ: singlev_fh}, the regret of an algorithm after [math]K [/math] episodes with [math]M=T\,K[/math] steps is

[[math]] \begin{eqnarray}\label{eq:regret_epi} R(M) = \left\| K \times V_0^{*} - \sum_{k=1}^K\mathbb{E}[V_0^{\pi_k}]\right\|_{\infty} = \sup_{s\in \mathcal{S}} \Big(K \times V_0^{*} - \sum_{k=1}^K\mathbb{E}[V_0^{\pi_k}]\Big). \end{eqnarray} [[/math]]

There is currently no regret analysis for RL problems with infinite time horizon and discounted reward. The natural definition of regret as given above is less meaningful in this case as the cumulative discounted reward is bounded and hence does not scale with [math]T[/math].

Asymptotic Convergence. In addition to sample complexity and regret analysis discussed above, asymptotic convergence is another weaker convergence guarantee, which requires the error term to decay to zero as [math]N[/math] goes to infinity (without specifying the order of [math]N[/math]). This is often a first step in the theoretical analysis of RL algorithms.

Classification of RL Algorithms

An RL algorithm includes one or more of the following components:

  • a representation of a value function that provides a prediction of how good each state or each state/action pair is,
  • a direct representation of the policy [math]\pi(s)[/math] or [math]\pi(s,a)[/math],
  • a model of the environment (the estimated transition function and the estimated reward function) in conjunction with a planning algorithm (any computational process that uses a model to create or improve a policy).

The first two components are related to what is called model-free RL. When the latter component is used, the algorithm is referred to as model-based RL. In the MDP setting, model-based algorithms maintain an approximate MDP model by estimating the transition probabilities and the reward function, and derive a value function from the approximate MDP. The policy is then derived from the value function. Examples include [29][20][30] and [31]. Another line of model-based algorithms make structural assumptions on the model, using some prior knowledge, and utilize the structural information for algorithm design. For example, see [32] for learning linear-quadratic regulators over an infinite time horizon and [33][34] for the finite time horizon case. Unlike the model-based method, model-free algorithms directly learn a value (or state-value) function or the optimal policy without inferring the model. Model-free algorithms can be further divided into two categories, value-based methods and policy-based methods. Policy-based methods explicitly build a representation of a policy and keep it in memory during learning. Examples include policy gradient methods and trust-region policy optimization methods. As an alternative, value-based methods store only a value function without an explicit policy during the learning process. In this case, the policy is implicit and can be derived directly from the value function (by picking the action with the best value). In our discussion of methodology, we focus on model-free RL algorithms for MDP with infinite horizon and discounted reward. In particular, we introduce some classical value-based and policy-based methods in Sections and respectively. For the episodic setting and model-based algorithms, see the discussion in Section.

Value-based Methods

In this section, we introduce the methodologies of several classical value-based methods in the setting of an infinite time horizon with discounting, finite state and action spaces ([math]|\mathcal{A}| \lt \infty[/math] and [math]|\mathcal{S}| \lt \infty[/math]), and stationary policies. The setting with finite state and action spaces is also referred to as the tabular case in the RL literature. For more general setups with an infinite number of actions and states, we refer the reader to Section for a discussion of linear functional approximations and to Section for neural network approximations. Recall that for reinforcement learning, the distribution of the reward function [math]r[/math] and the transition function [math]P[/math] are unknown to the agent. Therefore the expectations in the Bellman equations \eqref{equ:classicalV} and \eqref{equ: singleQ} cannot be calculated directly. On the other hand, the agents can observe samples [math](s,a,r,s^{\prime})[/math] by interacting with the system without a model of the system's dynamics [math]P[/math] or any prior knowledge of [math]r[/math]. The agents can then learn the value function or [math]Q[/math]-function using the samples. Following this idea, we next introduce the classical temporal-difference learning algorithm, with [math]Q[/math]-learning and State-Action-Reward-State-Action (SARSA) as two special cases.

Temporal-Difference Learning

Given some samples [math](s,a,r,s')[/math] obtained by following a policy [math]\pi[/math], the agent can update her estimate of the value function [math]V^{\pi}[/math] (defined in \eqref{equ: singlev}) at the [math](n+1)[/math]-th iteration by

[[math]] \begin{eqnarray}\label{eq:TD_update} V^{\pi,(n+1)}(s) \leftarrow (1-\beta_n(s,a))\underbrace{V^{\pi,(n)}(s)}_{\text{current estimate}} + \beta_n(s,a) \underbrace{[r+\gamma V^{\pi,(n)}(s')]}_{\text{new estimate}}, \end{eqnarray} [[/math]]

with some initialization of the value function [math]V^{\pi,(0)}[/math]. Here [math]\beta_n(s,a)[/math] is the learning rate at iteration [math]n+1[/math] which balances the weighting between the current estimate and the new estimate. The quantity [math]\beta_n[/math] can be a constant, can depend on the current state [math]s[/math], or even depend on the observed samples up to iteration [math]n+1[/math]. Algorithm provides some pseudo-code for an implementation of the TD method. Equation \eqref{eq:TD_update} is sometimes referred to as a TD(0) method. This is a special case of a TD[math](\lambda)[/math] method (for some [math]\lambda \in [0,1][/math]) which is a combination of TD learning (with weight [math]\lambda[/math]) and a Monte Carlo method (with weight [math]1-\lambda[/math]). Here a Monte Carlo method indicates a simulation-based method to calculate the value function with a longer period of data (in the episode-by-episode sense) rather than a single sample in each update. See Section for a detailed discussion of the REINFORCE method which is an example of such a Monte Carlo method. {\color{blue} I still dont understand. A Monte Carlo method is a simulation method for calculating expectations of functions of a stochastic process. Here we have a method, TD, which uses a single sample to update the value function. We could use more samples to update with some average of the samples from a given [math](s,a)[/math] - is this what you mean? Or do we run sample the policy for a longer time period and average these to get an approximation using the definition of [math]V^{\pi}[/math] as an expectation over an infinite discounted sum?} Equation \eqref{eq:TD_update} is also referred to as a one-step TD method as it only uses information from one update step of the system. There are multi-step TD methods; see Chapter 7 and Chapter 12 in [18] for more details. The TD update in \eqref{eq:TD_update} can also be written as

[[math]] \begin{eqnarray}\label{eq:TD_update_v2} V^{\pi,(n+1)}(s) \leftarrow V^{\pi,(n)}(s) + \beta_n(s,a) \underbrace{\left[\overbrace{r+\gamma V^{\pi,(n)}(s')}^{\text{TD target}}-V^{\pi,(n)}(s)\right]}_{\text{TD error $\delta_n$}}. \end{eqnarray} [[/math]]

The term [math]\delta_n[/math], commonly called the TD error, measures the difference between the estimated value at [math]s[/math] and the better estimate [math]r+\gamma V^{\pi,(n)}(s')[/math], which is often called the TD target in the literature. The TD error and TD target are two important components in analyzing the convergence of the approximate value function and arise in various forms in the reinforcement learning literature.

TD(0) Method for estimating $V^{\pi}$
input: total number of iterations [math]N[/math]; the policy [math]\pi[/math] used to sample observations; rule to set learning rate [math]\beta_n \in (0,1][/math] [math](0\leq n \leq N-1)[/math]
Initialize [math]V^{\pi,(0)}(s)[/math] for all [math]s\in \mathcal{S}[/math]
Initialize [math]s[/math]
if [math]n=0,\cdots,N-1[/math] then
Sample action [math]a[/math] according to [math]\pi(s)[/math]
Observe [math]r[/math] and [math]s'[/math] after taking action [math]a[/math]
Update [math]V^{\pi,(n+1)}[/math] according to \eqref{eq:TD_update} with [math](s,a,r,s')[/math]
[math]s\leftarrow s'[/math]
end for


[math]Q[/math]-learning Algorithm

A version of TD learning is [math]Q[/math]-learning. This is a stochastic approximation to the Bellman equation \eqref{equ: singleQ} for the [math]Q[/math]-function with samples observed by the agent. At iteration [math]n[/math] ([math]1 \leq n \leq N-1[/math]), the [math]Q[/math]-function is updated using a sample [math](s,a,r,s^{\prime})[/math] where [math]s' \sim P(s,a)[/math],

[[math]] \begin{eqnarray}\label{eq:Q_learning_update} Q^{(n+1)}(s,a)\leftarrow (1-\beta_n(s,a))\underbrace{Q^{(n)}(s,a)}_{\text{current estimate}}+\beta_n(s,a)\underbrace{\left[r(s,a)+\gamma\max_{a'}Q^{(n)}(s',a')\right]}_{\text{new estimate}}. \end{eqnarray} [[/math]]

Here [math]\beta_n(s,a)[/math] is the learning rate which balances the weight between the current estimate [math]Q^{(n)}[/math] from iteration [math]n[/math] and the new estimate [math]r(s,a)+\gamma\max_{a'}Q^{(n)}(s',a')[/math] calculated using the sample [math](s,a,r,s^{\prime})[/math]. Algorithm provides pseudo-code for an implementation of Q-learning.

$Q$-learning with samples from a policy $\pi$
input: total number of iterations [math]N[/math]; the policy [math]\pi[/math] to be evaluated; rule to set learning rate [math]\beta_n \in (0,1][/math] [math](0 \leq n\leq N-1)[/math]
Initialize [math]Q^{(0)}(s,a)[/math] for all [math]s\in \mathcal{S}[/math] and [math]a\in \mathcal{A}[/math]
Initialize [math]s[/math]
if [math]n=0,\cdots,N-1[/math] then
Sample action [math]a[/math] according to [math]\pi(s)[/math]
Observe [math]r[/math] and [math]s'[/math] after taking action [math]a[/math]
Update [math]Q^{(n+1)}[/math] according to \eqref{eq:Q_learning_update} with sample [math](s,a,r,s')[/math]
[math]s\leftarrow s'[/math]
end for


The following theorem guarantees the asymptotic convergence of the [math]Q[/math]-learning update \eqref{eq:Q_learning_update} to the [math]Q[/math]-function defined in \eqref{defsingleQ}.

Theorem (Watkins and Dayan (1992) [35])

\label{thm:Q_conv} Assume [math]|\mathcal{A}| \lt \infty[/math] and [math]|\mathcal{S}| \lt \infty[/math]. Let [math]n^i(s, a)[/math] be the index of the [math]i[/math]-th time that the action [math]a[/math] is used in state [math]s[/math]. Let [math]R \lt \infty[/math] be a constant. Given bounded rewards [math]|r|\leq R[/math], learning rate [math]0\leq\beta_n \lt 1[/math] and

[[math]] \begin{eqnarray}\label{exploration} \sum_{i=1}^{\infty}\beta_{n^i(s,a)} = \infty,\,\,\sum_{i=1}^{\infty}(\beta_{n^i(s,a)})^2 \lt \infty, \,\,\forall s,a, \end{eqnarray} [[/math]]
then [math]Q^{(n)}(s,a)\rightarrow Q^\star(s,a)[/math] as [math]n\rightarrow \infty[/math], [math]\forall s,a[/math] with probability [math]1[/math].

This is a classical result (one of the first few theoretical results in RL) and the proof of the convergence is based on the action-replay process (ARP), which is an artificial controlled Markov process constructed from the episode sequence and the learning rate sequence [math]\beta[/math]. Theorem implies that eventually [math]Q^{(n)}[/math] will convergence to the true value function [math]Q^\star[/math] when condition \eqref{exploration} is satisfied. However, this asymptotic result does not provide any insights on how many samples are needed to reach a given accuracy. More results on the sample complexity for [math]Q[/math]-learning related algorithms are discussed in Discussion.

SARSA

In contrast to the [math]Q[/math]-learning algorithm, which takes samples from any policy [math]\pi[/math] as the input where these samples could be collected in advance before performing the [math]Q[/math]-learning algorithm, SARSA adopts a policy which is based on the agent's current estimate of the [math]Q[/math]-function. The different source of samples is indeed the key difference between on-policy and off-policy learning, which will be discussed after the SARSA algorithm.

SARSA: On-policy TD Learning
input: total number of iterations [math]N[/math], the learning rate [math]\beta\in(0,1)[/math]\footnotemark[1], and small parameter [math]\varepsilon \gt 0[/math].
Initialize [math]Q^{(0)}(s, a)[/math] for all [math]s\in \mathcal{S}[/math] and [math]a\in \mathcal{A}[/math]
Initialize [math]s\in \mathcal{S}[/math]
Initialize [math]a[/math]: Choose [math]a[/math] when in [math]s[/math] using a policy derived from [math]Q^{(0)}[/math] (e.g, [math]\varepsilon[/math]-greedy)
if [math]n=0,1,\cdots,N-1[/math] then
Take action [math]a[/math], observe reward [math]r[/math] and the next step [math]s'[/math]
Choose [math]a'[/math] when in [math]s'[/math] using a policy derived from [math]Q^{(n)}[/math] (e.g., [math]\varepsilon[/math]-greedy)

[[math]] \begin{equation}\label{eq:SARSA_update} Q^{(n+1)}(s,a) \leftarrow (1-\beta)\underbrace{ Q^{(n)}(s,a)}_{\text{current estimate}} +\beta\underbrace{(r+\gamma Q^{(n)}(s',a'))}_{\text{new estimate}} \end{equation} [[/math]]

[math]s\leftarrow s'[/math], [math]a\leftarrow a'[/math]
end for


[b] The policy derived from [math]Q^{(n)}[/math] using [math]\varepsilon[/math]-greedy (line 4 and line 8 in Algorithm) suggests the following choice of action [math]a[/math] when in state [math]s[/math]:

[[math]] \begin{eqnarray}\label{epsilon_policy} \begin{cases} \text{ select from }\arg\max_{a'} Q^{(n)}(s,a') & \text{with probability } 1-\varepsilon,\\ \text{ uniformly sample from } \mathcal{A} & \text{with probability } \varepsilon. \end{cases} \end{eqnarray} [[/math]]


In Algorithm, SARSA uses the [math]Q^{(n)}[/math] function with an [math]\epsilon[/math]-greedy policy to choose [math]a'[/math] and to perform exploration. In contrast, Q-learning uses the maximum value of [math]Q^{(n)}[/math] over all possible actions for the next step, which is equivalent to setting [math]\varepsilon=0[/math] when choosing [math]a'[/math].

In addition to the [math]\varepsilon[/math]-greedy policy, other exploration methods such as the softmax operator (resulting in a Boltzmann policy) [36][37][38] can also be applied. See Section for a more detailed discussion.


On-policy Learning vs. Off-policy Learning. An off-policy agent learns the value of the optimal policy independently of the agent's actions. An on-policy agent learns the value of the policy being carried out by the agent including the exploration steps. [math]Q[/math]-learning is an off-policy agent as the samples [math](s,a,r,s')[/math] used in updating \eqref{eq:Q_learning_update} may be collected from any policy and may be independent of the agent's current estimate of the [math]Q[/math]-function. The convergence of the [math]Q[/math]-function relies on the exploration scheme of the policy [math]\pi[/math] (see Theorem). In contrast to [math]Q[/math]-learning, SARSA is an on-policy learning algorithm and uses only its own estimation of the [math]Q[/math]-function to select an action and receive a real-time sample in each iteration. The difference is seen in how the new estimate in the update is calculated. In the update step of [math]Q[/math]-learning \eqref{eq:Q_learning_update}, we have [math]\max_{a'} Q^{(n)}(s',a')[/math] which is the maximum value of the [math]Q[/math]-function at a given state [math]s'[/math]. On the other hand, the new estimate in the update of SARSA \eqref{eq:SARSA_update} takes [math]Q^{(n)}(s',a')[/math] where [math]a'[/math] is selected based on a policy derived from [math]Q^{(n)}[/math] (via the [math]\varepsilon[/math]-greedy policy \eqref{epsilon_policy}).

Discussion

In this section, we provide a brief overview of sample complexity results for model-free and value-based RL with infinite-time horizon and discounted reward. Sample complexity results for the model-based counterpart are deferred to Section. There has been very little non-asymptotic analysis of TD(0) learning. Existing results have focused on linear function approximations. For example, [39] showed that the mean-squared error converges at a rate of [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon^{1/\sigma}})[/math] with decaying learning rate [math]\beta_n= \frac{1}{(1+n)^{\sigma}}[/math] for some [math]\sigma\in (0,1)[/math] where i.i.d. samples are drawn from a stationary distribution; [40] provided an improved [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})[/math] bound for iterate-averaged TD(0) with constant step-size; and [41] reached the same mean-square sample complexity [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})[/math] with a simpler proof and extends the framework to the case where non-i.i.d. but Markov samples are drawn from the online trajectory. A similar analysis was applied by [42] to SARSA and [math]Q[/math]-learning algorithms for a continuous state space using linear function approximation (see the end of Section).

The [math]Q[/math]-sample complexity for the [math]Q[/math]-learning algorithm of [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\mbox{poly}(\log \delta^{-1})\right)[/math] was first established in [43] with decaying learning rate [math]\beta_n = \frac{1}{(1+n)^{4/5}}[/math] and access to a simulator. This was followed by [44] which showed that the same order of [math]Q[/math]-sample complexity [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\right)[/math] could be achieved with a constant learning rate [math]\beta_n \equiv \frac{(1-\gamma)^4\varepsilon^2}{|\mathcal{A}||\mathcal{S}|}[/math] [math](n\leq N)[/math]. Without access to a simulator, delayed [math]Q[/math]-learning [45] was shown to achieve a sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^8\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)[/math] assuming a known reward. An improvement to this result to [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^7\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)[/math] has recently been obtained using an Upper-Confidence-Bound (UCB) exploration policy [46] and a single trajectory.


In addition, sample complexities of [math]Q[/math]-learning variants with (linear) function approximations have been established in several recent papers. [16] assumed a linear MDP framework (see the end of Section) and the authors provided a V-sample complexity [math]\widetilde{\mathcal{O}}\left(\frac{d}{(1-\gamma)^3\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)[/math] with [math]d[/math] denoting the dimension of the features. This result matches the information-theoretic lower bound up to [math]\log(\cdot)[/math] factors. Later work, [47] considered the setting where the transition kernel can be approximated by linear functions with small approximation errors to make the model more robust to model mis-specification. In this case, the authors provided a V-sample complexity of order [math]\widetilde{\mathcal{O}}\left(\frac{K}{(1-\gamma)^4\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)[/math]. Although the dependence on [math]\gamma[/math] is not as good as in [16], this framework can be applied to a more general class of models which are “near” linear. Although [math]Q[/math]-learning is one of the most successful algorithms for finding the best action-value function [math]Q[/math], its implementation often suffers from large high biased estimates of the [math]Q[/math]-function values incurred by random sampling. The double [math]Q[/math]-learning algorithm proposed in [48] tackled this high estimation issue by randomly switching the update between two [math]Q[/math]-estimators, and has thus gained significant popularity in practice. The first sample complexity result for the double [math]Q[/math]-learning algorithm was established in [49] with a Q-sample complexity on the order of [math]\widetilde{\mathcal{O}}\left(\left(\frac{1}{(1-\gamma)^6\varepsilon^2}\ln\,\frac{|\mathcal{A}||\mathcal{S}|}{(1-\gamma)^7\varepsilon^2\delta}\right)^{1/\omega}+\left(\frac{1}{1-\gamma}\ln\,\frac{1}{(1-\gamma)^2\varepsilon}\right)^{1/(1-\omega)}\right)[/math] and learning rate [math]\beta_n = \frac{1}{n^{w}}[/math] for some constant [math]\omega \in (0,1)[/math].


Policy-based Methods

So far we have focused on value-based methods where we learn the value function to generate the policy using, for instance, the [math]\varepsilon[/math]-greedy approach. However, these methods may suffer from high computational complexity as the dimensionality of the state or action space increases (for example in continuous or unbounded spaces). In this case, it may be more effective to directly parametrize the policy rather than the value function. Furthermore, it is empirically observed that policy-based methods have better convergence properties than value-based methods [50][51][52]. This stems from the fact that value-based methods can have big oscillations during the training process since the choice of actions may need to change dramatically in order to have an arbitrarily small change in the estimated value functions. In this section, we focus on model-free policy-based methods, where we learn a parametrized policy without inferring the value function in the setting of infinite time horizon with discounting, finite state and action spaces, and stationary policies. Examples of policy-based methods include REINFORCE [53], Actor-Critic methods [54], Trust Region Policy Optimization (TRPO) [55], Proximal Policy Optimization (PPO) [56] among others. We first parametrize the policy [math]\pi[/math] by a vector [math]\theta[/math], thus we define [math]\pi(s,\cdot;\theta)\in \mathcal{P}(\mathcal{A})[/math], the probability distribution parameterized by [math]\theta[/math] over the action space given the state [math]s[/math] at time [math]t[/math]. Here we will use [math]\pi(s,a;\theta)[/math] and [math]\pi_{\theta}[/math] interchangeably to represent the policy parameterized by [math]\theta[/math]. We write [math]\mu^{\pi_{\theta}}[/math] for the stationary distribution of the state under policy [math]\pi(s,a;\theta)[/math]. Then the policy objective function is defined to be

[[math]] J(\theta):=\int_{\mathcal{S}}V^{\pi_{\theta}}(s)\mu^{\pi_{\theta}}(ds) [[/math]]

for RL with infinite time horizon, or

[[math]] J(\theta):=\int_{\mathcal{S}}V_0^{\pi_{\theta}}(s)\mu^{\pi_{\theta}}(ds) [[/math]]

for RL with finite time horizon. In order to maximize this function, the policy parameter [math]\theta[/math] is updated using the gradient ascent rule:

[[math]] \begin{equation}\label{eqn:grad_ascent} \theta^\prime = \theta + \beta\,\widehat{\nabla_{\theta}J(\theta)}, \end{equation} [[/math]]

where [math]\beta[/math] is the learning rate, and [math]\widehat{\nabla_{\theta}J(\theta)}[/math] is an estimate of the gradient of [math]J(\theta)[/math] with respect to [math]\theta[/math].

Policy Gradient Theorem

Our task now is to estimate the gradient [math]\nabla_{\theta}J(\theta)[/math]. The objective function [math]J(\theta)[/math] depends on both the policy and the corresponding stationary distribution [math]\mu^{\pi_{\theta}}[/math], and the parameter [math]\theta[/math] affects both of them. Calculating the gradient of the action with respect to [math]\theta[/math] is straightforward given the parametrization [math]\pi(s,a;\theta)[/math], whereas it might be challenging to analyze the effect of [math]\theta[/math] on the state when the system is unknown. Fortunately the well-known policy gradient theorem provides an expression for [math]\nabla_{\theta}J(\theta)[/math] which does not involve the derivatives of the state distribution with respect to [math]\theta[/math]. We write the [math]Q[/math]-function under policy [math]\pi(s,a;\theta)[/math] as [math]Q^{\pi_{\theta}}(s,a)[/math], then we have the following theorem (for more details see [18]).

Theorem (Policy Gradient Theorem [57])

Assume that [math]\pi(s,a;\theta)[/math] is differentiable with respect to [math]\theta[/math] and that there exists [math]\mu^{\pi_{\theta}}[/math], the stationary distribution of the dynamics under policy [math]\pi_{\theta}[/math], which is independent of the initial state [math]s_0[/math]. Then the policy gradient is

[[math]] \begin{equation}\label{eqn:policy_gra_thm} \nabla_{\theta}J(\theta) = \mathbb{E}_{s\sim \mu^{\pi_{\theta}},a\sim \pi_{\theta}}[\nabla_{\theta}\ln\pi(s,a;\theta)Q^{\pi_{\theta}}(s,a)]. \end{equation} [[/math]]

For MDPs with finite state and action space, the stationary distribution exists under mild assumptions. For MDPs with infinite state and action space, more assumptions are needed, for example uniform geometric ergodicity [58].

REINFORCE: Monte Carlo Policy Gradient

Based on the policy gradient expression in \eqref{eqn:policy_gra_thm}, it is natural to estimate the expectation by averaging over samples of actual rewards, which is essentially the spirit of a standard Monte Carlo method that repeatedly simulates a trajectory of length [math]M[/math] within each iteration. Now we introduce our first policy-based algorithm, called REINFORCE [53] or Monte Carlo policy gradient (see Algorithm). We use a simple empirical return function [math]G_t[/math] to approximate the [math]Q[/math]-function in \eqref{eqn:policy_gra_thm}. The return [math]G_t[/math] is defined as the sum of the discounted rewards and given by [math]G_t:=\sum_{s=t+1}^{M} \gamma^{s-t-1}r_{s}[/math]. Note that REINFORCE is a Monte Carlo algorithm since it employs the complete sample trajectories, that is, when estimating the return from time [math]t[/math], it uses all future rewards up until the end of the trajectory (see line 7 in Algorithm). Within the [math]n[/math]-iteration the policy parameter [math]\theta^{(n+1)}[/math] is updated [math]M[/math] times with [math]G_t[/math] for [math]t=0,1,\cdots,M-1[/math]. As a standard Monte Carlo method, REINFORCE provides an unbiased estimate of the policy gradient, however it suffers from high variance which therefore may lead to slow convergence. A popular variant of REINFORCE is to subtract a baseline, which is a function of the state [math]s[/math], from the return [math]G_t[/math]. This reduces the variance of the policy gradient estimate while keeping the mean of the estimate unchanged. A commonly used baseline is an estimate of the value function, [math]\widehat{V}(s)[/math], which can be learnt using one of the methods introduced in Section. Replacing the [math]G_t[/math] in \eqref{eqn:alg_REINFORCE_update} by [math]G_t-\widehat{V}(s)[/math] gives a REINFORCE algorithm with baseline.

REINFORCE: Monte Carlo Policy Gradient
input: total number of iterations [math]N[/math], learning rate [math]\beta[/math], a differentiable policy parametrization [math]\pi(s,a;\theta)[/math], finite length of the trajectory [math]M[/math].
Initialize policy parameter [math]\theta^{(0)}[/math].
if [math]n=0,1,\ldots,N-1[/math] then
Sample a trajectory [math]s_0,a_0,r_1,\cdots,s_{T-1},a_{T-1},r_T\sim\pi(s,a;\theta^{(n)})[/math]
Initialize [math]\theta^{(n+1)} = \theta^{(n)}[/math]
if [math]t=0,\cdots,M-1[/math] then
Calculate the return: [math]G_t=\sum_{s=t+1}^{M} \gamma^{s-t-1}r_{s}[/math]
Update the policy parameter

[[math]] \begin{equation}\label{eqn:alg_REINFORCE_update} \theta^{(n+1)} \leftarrow \theta^{(n+1)} + \beta\gamma^t G_t\nabla_{\theta}\ln\pi(s_t,a_t;\theta^{(n+1)}) \end{equation} [[/math]]

end for
end for


Actor-Critic Methods

The fact that REINFORCE provides unbiased policy estimates but may suffer from high variance is an example of the bias-variance dilemma (see, e.g., [59] and [60]) which occurs in many machine learning problems. Also, as a Monte Carlo algorithm, REINFORCE requires complete trajectories so we need to wait until the end of each episode to collect the data. Actor-Critic methods can resolve the above issues by learning both the value function and the policy. The learned value function can improve the policy update through, for example, introducing bias into the estimation. Actor-Critic methods can also learn from incomplete experience in an online manner. In Actor-Critic methods the value function is parametrized by [math]w[/math] and the policy is parametrized by [math]\theta[/math]. These methods consist of two models: a critic which updates the value function or [math]Q[/math]-function parameter [math]w[/math], and an actor which updates the policy parameter [math]\theta[/math] based on the information given by the critic. A natural idea is to use policy-based methods for the actor and use one of the methods introduced in Section for the critic; for example SARSA or [math]Q[/math]-learning. We give the pseudocode for a simple Actor-Critic method in Algorithm. Here for the critic, we approximate the [math]Q[/math]-function by a linear combination of features, that is, [math]Q(s,a;w)=\phi(s,a)^\top w[/math] for some known feature [math]\phi:\mathcal{S}\times \mathcal{A}\rightarrow\mathbb{R}^d[/math], and use TD(0) to update the parameter [math]w[/math] to minimize the least-square error of the TD error with the gradient of the least-square error taking the form [math]\delta \phi(s,a)[/math] (see lines 9-10 in Algorithm). For the actor, we use the (vanilla) policy-based method to update the policy parameter [math]\theta[/math] (see line 11 in Algorithm). There are three main ways to execute the algorithm. In the nested-loop setting (see, e.g. [61][62]), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the two time-scale setting (see, e.g. [63]), where the actor and the critic update their parameters simultaneously with different learning rates. Note that the learning rate for the actor is typically much smaller than that of the critic in this setting. In the third way, the single-scale setting, the actor and the critic update their parameters simultaneously but with a much larger learning rate for the actor than for the critic (see, e.g. [64][65]).

Actor-Critic Algorithm
input: A differentiable policy parametrization [math]\pi(s,a;\theta)[/math], a differentiable [math]Q[/math]-function parameterization [math]Q(s,a;w)[/math], learning rates [math]\beta^{\theta} \gt 0[/math] and [math]\beta^{w} \gt 0[/math], number of iterations [math]N[/math]
Initialize policy parameter [math]\theta^{(0)}[/math] and [math]Q[/math]-function parameter [math]w^{(0)}[/math]
Initialize [math]s[/math]
if [math]n=0,1,\ldots,N-1[/math] then
Sample [math]a\sim\pi(s,a;\theta^{(n)})[/math]
Take action [math]a[/math], observe state [math]s^\prime[/math] and reward [math]r[/math]
Sample action [math]a^\prime\sim\pi(s^\prime,\cdot;\theta^{(n)})[/math]
Compute the TD error: [math]\delta\leftarrow r+\gamma Q(s^\prime,a^\prime;w^{(n)})-Q(s,a;w^{(n)})[/math]
Update [math]w[/math]: [math]w^{(n+1)}\leftarrow w^{(n)}+\beta^{w}\delta\phi(s,a)[/math]
[math]\theta^{(n+1)}\leftarrow \theta^{(n)} + \beta^{\theta}Q(s,a;w^{(n)})\nabla_{\theta}\ln\pi(s,a;\theta^{(n)})[/math]
[math]s\leftarrow s^\prime[/math], [math]a\leftarrow a^\prime[/math]
end for


Discussion

Variants of Policy-based Methods. There have been many variations of policy-based methods with improvements in various directions. For example, the natural policy-based algorithms [66][67] modify the search direction of the vanilla version by involving a Fisher information matrix in the gradient ascent updating rule, which speeds the convergence significantly. To enhance the stability of learning, we may request the updated policy estimate to be not too far away from the previous estimate. Along this line, TRPO [55] uses the Kullback-Leibler (KL)-divergence to measure the change between the consecutive updates as a constraint, while PPO [56] incorporate an adaptive KL-penalty or a clipped objective in the objective function to eliminate the constraint. When using the KL-penalty, PPO can be viewed as a Lagrangian relaxation of the TRPO algorithm. To reduce the sample complexity, Deterministic Policy Gradient (DPG) [68] uses a deterministic function (rather than the stochastic policy [math]\pi[/math]) to represent the policy to avoid sampling over actions (although enough exploration needs to be guaranteed in other ways); Actor-Critic with Experience Replay (ACER) [69] reuses some experience (tuples of data collected in previous iterations/episodes, see the discussion of Deep [math]Q[/math]-Networks in Section) which improves the sample efficiency and decreases the data correlation. In addition we mention Asynchronous Advantage Actor-Critic (A3C) and Advantage Actor-Critic (A2C), two popular Actor-Critic methods with a special focus on parallel training [70]. The latter is a synchronous, deterministic version of the former. For continuous-time framework, see developments in [71][72].

Convergence Guarantees. Now we discuss the convergence guarantees of policy-based algorithms. [57] provided the first asymptotic convergence result for policy gradient methods with arbitrary differentiable function approximation for MDPs with bounded reward. They showed that such algorithms, including REINFORCE and Actor-Critic methods, converge to a locally stationary point. For more examples of asymptotic convergence, see for example [54] and [73]. Based on recent progress in non-convex optimization, non-asymptotic analysis of policy-based methods were first established for convergence to a stationary point. For example, [61] provided a convergence rate analysis for a nested-loop Actor-Critic algorithm to the stationary point through quantifying the smallest number of actor updates [math]k[/math] required to attain [math]\inf_{0\leq m\leq k}\|\nabla J(\theta^{(k)})\|^2 \lt \varepsilon[/math]. We denote this smallest number as [math]K[/math]. When the actor uses a policy gradient method, the critic achieves [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^4)[/math] by employing TD(0), [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{3})[/math] by employing the Gradient Temporal Difference, and [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{5/2})[/math] by employing the Accelerated Gradient Temporal Difference, with continuous state and action spaces. [74] investigated a policy gradient method with Monte Carlo estimation of the policy gradient, called the RPG algorithm, where the rollout length of trajectories is drawn from a Geometric distribution. This generates unbiased estimates of the policy gradient with bounded variance, which was shown to converge to the first order stationary point with diminishing or constant learning rate at a sublinear convergence rate. They also proposed a periodically enlarged learning rate scheme and proved that the modified RPG algorithm with this scheme converges to the second order stationary point in a polynomial number of steps. For more examples of sample complexity analysis for convergence to a stationary point, see for example [75][76][77][78][79]. The global optimality of stationary points was studied in [80] where they identified certain situations under which the policy gradient objective function has no sub-optimal stationary points despite being non-convex. Recently there have been results on the sample complexity analysis of the convergence to the global optimum. [81] proved that the entropy regularized natural policy gradient method achieves a [math]Q[/math]-sample complexity (and [math]\varepsilon[/math]-optimal policies) with linear convergence rate. [82] showed that with high probability at least [math]1-\delta[/math], the TRPO algorithm has sublinear rate [math] \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^2)[/math] with mean-squared sample complexity [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^3\varepsilon^4}\big)[/math] in the unregularized (tabular) MDPs, and has an improved rate [math]\widetilde{\mathcal{O}}^{\prime}(1/\varepsilon)[/math] with mean-squared sample complexity [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^4\varepsilon^3}\big)[/math] in MDPs with entropy regularization. [63] gave the first non-asymptotic convergence guarantee for the two time-scale natural Actor-Critic algorithms with mean-squared sample complexity of order [math]\widetilde{\mathcal{O}}(\frac{1}{(1-\gamma)^9\varepsilon^{4}})[/math]. For single-scale Actor-Critic methods, the global convergence with sublinear convergence rate was established in both [64] and [65]. The non-asymptotic convergence of policy-based algorithms is shown in other settings, see [27] for the regret analysis of the REINFORCE algorithm for discounted MDPs and [83][84] for policy gradient methods in the setting of known model parameters.

General Discussion

So far we have focused on model-free RL with discounting over an infinite time horizon. In this section, we discuss two other cases: RL over a finite time horizon with both model-based and model-free methods, and model-based RL with an infinite time horizon.

RL with Finite Time Horizon

As discussed earlier, episodic RL with finite time horizon has also been widely used in many financial applications. In this section, we discuss some episodic RL algorithms (with both model-based and model-free methods) and their performance through both sample complexity and regret analysis. [85] proposed a model-based algorithm, known as Posterior Sampling for Reinforcement Learning (PSRL) which is a model-based algorithm, and establishes an [math]\widetilde{\mathcal{O}}(T|\mathcal{S}|\sqrt{|\mathcal{A}|M})[/math] expected regret bound. [25] proposed a so-called Upper-Confidence Fixed-Horizon (UCFH) RL algorithm that is model-based and has [math]V[/math]-sample complexity[c] of order [math]\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^3}{\varepsilon^2})[/math] assuming known rewards. [86] considered two model-based algorithms in the setting of known reward, called the UCBVI-CH and UCBVI-BF algorithms, which were proved to achieve a regret bound of [math]\widetilde{\mathcal{O}}(T\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})[/math] (when [math]M\geq T|\mathcal{S}|^3|\mathcal{A}|[/math] and [math]|\mathcal{S}||\mathcal{A}|\geq T[/math]) and [math]\widetilde{\mathcal{O}}(\sqrt{T|\mathcal{S}|\,|\mathcal{A}|M})[/math] (when [math]M\geq T^3|\mathcal{S}|^3|\mathcal{A}|[/math] and [math]|\mathcal{S}||\mathcal{A}|\geq T[/math]), respectively. [26] proposed an Upper Bounding the Expected Next State Value (UBEV) algorithm which achieves a sample complexity\footnotemark[2] of [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^5}{\varepsilon^2}{\rm polylog}(\frac{1}{\delta})\big)[/math]. They also proved that the algorithm has a regret bound of [math]\widetilde{\mathcal{O}}(T^2\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})[/math] with [math]M\ge |\mathcal{S}|^5|\mathcal{A}|^3[/math]. [87] proved that [math]Q[/math]-learning with UCB exploration achieves regret of [math]\widetilde{\mathcal{O}}(\sqrt{T^3|\mathcal{S}|\,|\mathcal{A}|M})[/math] without requiring access to a simulator. The above regret bounds depend on the size of the state and action space and thus may suffer from the curse of dimensionality as [math]|\mathcal{S}|[/math] and [math]|\mathcal{A}|[/math] increase. To tackle this issue, [88] learned a low-dimensional representation of the probability transition model and proved that their MatrixRL algorithm achieves a regret bound of [math]\widetilde{\mathcal{O}}(T^2\,d\sqrt{M})[/math] which depends on the number of features [math]d[/math] rather than [math]|\mathcal{S}|[/math] and [math]|\mathcal{A}|[/math]. [15] provided a [math]\widetilde{\mathcal{O}}(\sqrt{d^3T^3M})[/math] regret bound with high probability for a modified version of the Least-Squares Value Iteration (LSVI) algorithm with UCB exploration.

Model-based RL with Infinite Time Horizon

To derive policies by utilizing estimated transition probabilities and reward functions under the infinite time horizon setting, [29] proposed an R-MAX algorithm which can also be applied in zero-sum stochastic games. The R-MAX algorithm was proved to achieve a sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|}{(1-\lambda)^6\varepsilon^3}\big)[/math] by [89]. The Model-based Interval Estimation (MBIE) in [90] was proved to achieve a sample complexity of exploration of the same order as R-MAX. [31] further modified the R-MAX algorithm to an MoRmax algorithm with an improved sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\lambda)^6\varepsilon^2}\big)[/math]. [91] proved that the model-based [math]Q[/math]-Value Iteration (QVI) achieves the [math]Q[/math]-sample complexity of [math]\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^3\varepsilon^2)})[/math] for some small [math]\varepsilon[/math] with high probability at least [math]1-\delta[/math]. [92] studied the approximate MDP model with any accurate black-box planning algorithm and showed that this has the same sample complexity as [91], but with a larger range of [math]\varepsilon[/math]. See also [45][20][30] for model-based RL along this line. Linear-quadratic (LQ) problems are another type of model-based RL problem that assumes linear state dynamics and quadratic objective functions with continuous state and action space. These problems are one of the few optimal control problems for which there exists a closed-form analytical representation of the optimal feedback control. Thus LQ frameworks, including the (single-agent) linear-quadratic regulator (LQR) and (multi-agent) LQ games, can be used in a wide variety of applications, such as the optimal execution problems which we will discuss later. For a theoretical analysis of RL algorithms within the LQ framework, see for example [32] and [33] for the global linear convergence of policy gradient methods in LQR problems with infinite and finite horizon, respectively. See also [34] and [93] for the regret analysis of linear-quadratic reinforcement learning algorithms in continuous time.

Exploration Exploitation Trade-offs

As mentioned earlier, exploitation versus exploration is a critical topic in RL. RL agents want to find the optimal solution as fast as possible, meanwhile committing to solutions too fast without enough exploration could lead to local minima or total failure. Modern RL algorithms that optimize for the best returns can achieve good exploitation quite efficiently, while efficient exploration remains a challenging and less understood topic. For multi-armed bandit problems or MDP with finite state and action spaces, classic exploration algorithms such as [math]\varepsilon[/math]-greedy, UCB, Boltzmann exploration and Thompson sampling show promising performance in different settings. For the [math]\varepsilon[/math]-greedy algorithm [94][95], the agent randomly explores occasionally with probability [math]\varepsilon[/math] and takes the optimal action most of the time with probability [math]1-\varepsilon[/math]. For example, see Equation \eqref{epsilon_policy} for the [math]\varepsilon[/math]-greedy method combined with a value-based algorithm. The UCB type of algorithms are mainly designed for value-based algorithms [46][86][87][15]. The agent selects the greediest action to maximize the upper confidence bound [math]Q(s,a) + U(s,a)[/math], where [math]Q(s,a)[/math] is the Q value function and [math]U(s,a)[/math] is a function inversely proportional to how many times the state-action pair [math](s,a)[/math] has been taken (hence proportional to the agent's confidence in her estimation of the Q function). For Boltzmann exploration [36][37][38], the agent draws actions from a Boltzmann distribution (softmax) over the learned Q value function, regulated by a temperature parameter. For the Thompson sampling method [96][97][98], the agent keeps track of a belief of the optimal action via a distribution over the set of admissible actions (for each given state).

For deep RL training when neural networks are used for function approximation, entropy regularization (adding an entropy term into the loss function) and noise-based exploration (adding noise into the observation, action or even parameter space) are often used for better exploration of the parameter space.

Regularization in Reinforcement Learning

Many recent developments in RL benefit from regularization, which has been shown to improve not only the exploration but also the robustness. Examples include both policy-based methods and value-based methods in various settings. Here we provide a general overview of different regularization techniques and their advantages along with several examples. For example, TRPO and PPO use the KL-divergence between two consecutive policies as a penalty in policy improvement [55][56]. Soft-[math]Q[/math]-learning uses Shannon entropy as a penalty in value iteration [37]. The authors confirmed in simulated experiments that entropy regularization helps to improve exploration and allows transferring skills between tasks. [99] proposed a unified framework to analyze the above algorithms via a regularized Bellman operator. [81] showed that entropy regularization can also help to speed up the convergence rate from a theoretical perspective. It is worth noting that [38] explained the exploitation-exploration trade-off with entropy regularization from a continuous-time stochastic control perspective and provided theoretical support for Gaussian exploration from the special case of linear-quadratic regulator. Recently [100] and [101] extended this idea to a general control framework beyond the linear-quadratic case. [102] extended the idea of Shannon's entropy regularization to mutual-information regularization and showed its superior performance when actions have significantly different importance. When functional approximations are applied in RL training, the regularization technique is shown to be a powerful, flexible, and principled way to balance approximation and estimation errors [103][104]. The idea of regularization is that one starts from a large function space and controls the solution's complexity by a regularization (or penalization) term. Examples of regularization include the Hilbert norm (for [math]Q[/math]-functions) and the [math]l_1[/math] norm (for parameters).

Reinforcement Learning in Non-stationary Environment

Most existing work on reinforcement learning considers a stationary environment and aims to find the optimal policy or a policy with low (static) regret. In many financial applications, however, the environment is far from being stationary. We introduce the mathematical formulations of non-stationary RL (in both episodic and infinite horizon settings) below.

Non-stationary Episodic RL. For the episodic (or finite-time horizon) setting, an agent interacts with a non-stationary MDP for [math]K[/math] episodes, with each episode containing [math]T[/math] timestamps. Let [math](t, k)[/math] denote a (time,episode) index for the [math]t[/math]-th timestamp in the [math]k[/math]-th episode. The non-stationary environment can be denoted by a tuple [math](\mathcal{S},\mathcal{A}, T, \pmb{P}, \pmb{r})[/math], where [math]\mathcal{S}[/math] is the state space, [math]\mathcal{A}[/math] is the action space, [math]T[/math] is the number of timestamps in one episode, [math]\pmb{P} = \{P_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T}[/math] is the set of transition kernels with [math]P_{k,t}: \mathcal{S}\times \mathcal{A}\rightarrow\mathcal{P}(\mathcal{S})[/math] for all [math]1\leq k\leq K[/math] and [math]0\leq t\leq T[/math], and [math]\pmb{r} = \{r_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T}[/math] is the set of reward functions with [math]r_{k,t}(s,a)[/math] a random variable in [math]\mathbb{R}[/math] (depending on both the time-step and the episode number) for each pair [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math] and for [math]0\leq t\leq T-1[/math]. Similarly, the terminal reward [math]r_{k,T}(s)[/math] is a real-valued random variable for each [math]s\in\mathcal{S}[/math]. Denote by [math]\Pi=\{\pi_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T-1}[/math] the Markovian policy in which [math]\pi_{k,t}[/math] can be either deterministic or randomized. The value function for this policy at the [math]k[/math]-th episode can be written as

[[math]] \begin{eqnarray} V^{\Pi}_{k,t}(s) = \left.\mathbb{E}^{\Pi}\left[\sum_{u=t}^{T-1} r_{k,u}(s_u,a_u)+r_{k,T}(s_T)\right|s_t = s\right], \end{eqnarray} [[/math]]

where [math]s_{u+1} \sim P_{k,u}(s_u,a_u)[/math] and [math]a_u\sim \pi_{k,u}(s_u)[/math].

Non-stationary RL with Infinite Time Horizon. For the infinite horizon setting, we drop the index indicating episodes and let the finite horizon [math]T[/math] become infinite. Then the non-stationary environment can be denoted by a tuple [math](\mathcal{S},\mathcal{A}, \pmb{P}, \pmb{r})[/math], with the obvious changes from the episodic case. Again we write [math]\Pi=\{\pi_{t} \}_{t \ge 0}[/math] for the Markovian policy in which [math]\pi_{t}[/math] can be either deterministic or randomized and consider the associated value function with a discount factor [math]\gamma[/math]:

[[math]] \begin{eqnarray} V^{\Pi}_{t}(s) = \left.\mathbb{E}^{\Pi}\left[\sum_{u=t}^\infty \gamma^{u-t} r_{u}(s_u,a_u)\right|s_t = s\right], \end{eqnarray} [[/math]]

where [math]s_{u+1} \sim P_{u}(\cdot|s_u,a_u)[/math] and [math]a_u\sim \pi_{u}(s_u)[/math].

Indeed, there has been a recent surge of theoretical studies on this topic for various settings such as infinite-horizon MDPs with finite state and action spaces [105][106], episodic MDPs with finite state and action spaces [107], and episodic linear MDPs [108]. However, all these papers assume the agent has prior knowledge of the degree of nonstationarity such as the number of changes in the environment, which is not very practical for many applications. To relax this assumption, [109] proposes a Bandit-over-Reinforcement-Learning (BoRL) algorithm to extend their earlier result [106]. However, it introduces extra overheads and leads to suboptimal regret. More recently, [110] proposes a general approach that is applicable to various reinforcement learning settings (including both episodic MDPs and infinite-horizon MDPs) and achieves optimal dynamic regret without any prior knowledge of the degree of non-stationarity.

General references

Hambly, Ben; Xu, Renyuan; Yang, Huining (2023). "Recent Advances in Reinforcement Learning in Finance". arXiv:2112.04553 [q-fin.MF].

Notes

  1. Sample complexity with respect to the number of episodes has also been adopted in the literature [25][26]. The translation between these two criteria is straightforward.
  2. The learning rate can depend on the number of iterations, the state-action pair and other parameters in the algorithm. For notational simplicity, we demonstrate the SARSA Algorithm (and the rest of the algorithms mentioned) with a constant learning rate [math]\beta[/math]. In the convergence analysis, we use [math]\beta_n[/math] to denote a learning rate which depends on the number of iterations.
  3. In the original papers the sample complexity is defined in terms of the number of episodes. Here we multiply the original order in these papers by [math]T[/math] to match the definition of sample complexity in this paper.

References

  1. Cite error: Invalid <ref> tag; no text was provided for refs named silver2017mastering
  2. Cite error: Invalid <ref> tag; no text was provided for refs named vinyals2019grandmaster
  3. Cite error: Invalid <ref> tag; no text was provided for refs named radford2019language
  4. Cite error: Invalid <ref> tag; no text was provided for refs named lillicrap2015continuous
  5. Cite error: Invalid <ref> tag; no text was provided for refs named goodfellow2016deep
  6. Cite error: Invalid <ref> tag; no text was provided for refs named mnih2013playing
  7. Cite error: Invalid <ref> tag; no text was provided for refs named van2016deep
  8. Cite error: Invalid <ref> tag; no text was provided for refs named gu2016continuous
  9. Cite error: Invalid <ref> tag; no text was provided for refs named fan2020theoretical
  10. 10.0 10.1 10.2 Cite error: Invalid <ref> tag; no text was provided for refs named puterman2014markov
  11. Cite error: Invalid <ref> tag; no text was provided for refs named abbasi2019politex
  12. Cite error: Invalid <ref> tag; no text was provided for refs named wei2020model
  13. Cite error: Invalid <ref> tag; no text was provided for refs named bradtke1996linear
  14. Cite error: Invalid <ref> tag; no text was provided for refs named melo2007q
  15. 15.0 15.1 15.2 Cite error: Invalid <ref> tag; no text was provided for refs named jin2020provably
  16. 16.0 16.1 16.2 Cite error: Invalid <ref> tag; no text was provided for refs named yang2019sample
  17. Cite error: Invalid <ref> tag; no text was provided for refs named dai2018sbeed
  18. 18.0 18.1 18.2 Cite error: Invalid <ref> tag; no text was provided for refs named suttonbarto
  19. Cite error: Invalid <ref> tag; no text was provided for refs named powell2021reinforcement
  20. 20.0 20.1 20.2 Cite error: Invalid <ref> tag; no text was provided for refs named kearns2002near
  21. Cite error: Invalid <ref> tag; no text was provided for refs named koenig1993complexity
  22. 22.0 22.1 Cite error: Invalid <ref> tag; no text was provided for refs named azar2012sample
  23. Cite error: Invalid <ref> tag; no text was provided for refs named robbins1952some
  24. Cite error: Invalid <ref> tag; no text was provided for refs named berry1985bandit
  25. 25.0 25.1 Cite error: Invalid <ref> tag; no text was provided for refs named DannBrunskill2015
  26. 26.0 26.1 Cite error: Invalid <ref> tag; no text was provided for refs named DannUBEV
  27. 27.0 27.1 Cite error: Invalid <ref> tag; no text was provided for refs named zhang2020sample
  28. Cite error: Invalid <ref> tag; no text was provided for refs named liu2020improved
  29. 29.0 29.1 Cite error: Invalid <ref> tag; no text was provided for refs named brafman2002r
  30. 30.0 30.1 Cite error: Invalid <ref> tag; no text was provided for refs named lattimore2012pac
  31. 31.0 31.1 Cite error: Invalid <ref> tag; no text was provided for refs named MoRmax2010
  32. 32.0 32.1 Cite error: Invalid <ref> tag; no text was provided for refs named fazel2018global
  33. 33.0 33.1 Cite error: Invalid <ref> tag; no text was provided for refs named hambly2020policy
  34. 34.0 34.1 Cite error: Invalid <ref> tag; no text was provided for refs named basei2021logarithmic
  35. Cite error: Invalid <ref> tag; no text was provided for refs named watkins1992q
  36. 36.0 36.1 Cite error: Invalid <ref> tag; no text was provided for refs named asadi2017alternative
  37. 37.0 37.1 37.2 Cite error: Invalid <ref> tag; no text was provided for refs named haarnoja2017reinforcement
  38. 38.0 38.1 38.2 Cite error: Invalid <ref> tag; no text was provided for refs named WZZ2018
  39. Cite error: Invalid <ref> tag; no text was provided for refs named dalal2018finite
  40. Cite error: Invalid <ref> tag; no text was provided for refs named lakshminarayanan2018linear
  41. Cite error: Invalid <ref> tag; no text was provided for refs named bhandari2018finite
  42. Cite error: Invalid <ref> tag; no text was provided for refs named zou2019finite
  43. Cite error: Invalid <ref> tag; no text was provided for refs named even2003learning
  44. Cite error: Invalid <ref> tag; no text was provided for refs named beck2012error
  45. 45.0 45.1 Cite error: Invalid <ref> tag; no text was provided for refs named strehl2009reinforcement
  46. 46.0 46.1 Cite error: Invalid <ref> tag; no text was provided for refs named dong2019q
  47. Cite error: Invalid <ref> tag; no text was provided for refs named lattimore2020learning
  48. Cite error: Invalid <ref> tag; no text was provided for refs named hasselt2010double
  49. Cite error: Invalid <ref> tag; no text was provided for refs named xiong2020finite
  50. Cite error: Invalid <ref> tag; no text was provided for refs named sewak2019policy
  51. Cite error: Invalid <ref> tag; no text was provided for refs named yu2020policy
  52. Cite error: Invalid <ref> tag; no text was provided for refs named daberius2019deep
  53. 53.0 53.1 Cite error: Invalid <ref> tag; no text was provided for refs named williams1992
  54. 54.0 54.1 Cite error: Invalid <ref> tag; no text was provided for refs named konda2000
  55. 55.0 55.1 55.2 Cite error: Invalid <ref> tag; no text was provided for refs named schulman2015
  56. 56.0 56.1 56.2 Cite error: Invalid <ref> tag; no text was provided for refs named schulman2017proximal
  57. 57.0 57.1 Cite error: Invalid <ref> tag; no text was provided for refs named sutton2000policy
  58. Cite error: Invalid <ref> tag; no text was provided for refs named konda2002
  59. Cite error: Invalid <ref> tag; no text was provided for refs named franccois2019
  60. Cite error: Invalid <ref> tag; no text was provided for refs named von2011statistical
  61. 61.0 61.1 Cite error: Invalid <ref> tag; no text was provided for refs named kumar2019sample
  62. Cite error: Invalid <ref> tag; no text was provided for refs named xu2020improving
  63. 63.0 63.1 Cite error: Invalid <ref> tag; no text was provided for refs named xu2020non
  64. 64.0 64.1 Cite error: Invalid <ref> tag; no text was provided for refs named fu2020single
  65. 65.0 65.1 Cite error: Invalid <ref> tag; no text was provided for refs named xu2021doubly
  66. Cite error: Invalid <ref> tag; no text was provided for refs named kakade2001natural
  67. Cite error: Invalid <ref> tag; no text was provided for refs named peters2008natural
  68. Cite error: Invalid <ref> tag; no text was provided for refs named silver2014deterministic
  69. Cite error: Invalid <ref> tag; no text was provided for refs named wang2016
  70. Cite error: Invalid <ref> tag; no text was provided for refs named mnih2016asynchronous
  71. Cite error: Invalid <ref> tag; no text was provided for refs named jia2021policy
  72. Cite error: Invalid <ref> tag; no text was provided for refs named jia2022policy
  73. Cite error: Invalid <ref> tag; no text was provided for refs named bhatnagar2010actor
  74. Cite error: Invalid <ref> tag; no text was provided for refs named zhang2020global
  75. Cite error: Invalid <ref> tag; no text was provided for refs named papini2018stochastic
  76. Cite error: Invalid <ref> tag; no text was provided for refs named xu2020improved
  77. Cite error: Invalid <ref> tag; no text was provided for refs named xu2019sample
  78. Cite error: Invalid <ref> tag; no text was provided for refs named shen2019hessian
  79. Cite error: Invalid <ref> tag; no text was provided for refs named xiong2020non
  80. Cite error: Invalid <ref> tag; no text was provided for refs named bhandari2019
  81. 81.0 81.1 Cite error: Invalid <ref> tag; no text was provided for refs named cen2020fast
  82. Cite error: Invalid <ref> tag; no text was provided for refs named shani2020adaptive
  83. Cite error: Invalid <ref> tag; no text was provided for refs named agarwal2021theory
  84. Cite error: Invalid <ref> tag; no text was provided for refs named mei2020
  85. Cite error: Invalid <ref> tag; no text was provided for refs named PSRL2013
  86. 86.0 86.1 Cite error: Invalid <ref> tag; no text was provided for refs named azar2017minimax
  87. 87.0 87.1 Cite error: Invalid <ref> tag; no text was provided for refs named jin2018q
  88. Cite error: Invalid <ref> tag; no text was provided for refs named yang2020reinforcement
  89. Cite error: Invalid <ref> tag; no text was provided for refs named li2009rmax
  90. Cite error: Invalid <ref> tag; no text was provided for refs named Strehl2005
  91. 91.0 91.1 Cite error: Invalid <ref> tag; no text was provided for refs named azar2013minimax
  92. Cite error: Invalid <ref> tag; no text was provided for refs named agarwal2020model
  93. Cite error: Invalid <ref> tag; no text was provided for refs named guo2021reinforcement
  94. Cite error: Invalid <ref> tag; no text was provided for refs named dann2022guarantees
  95. Cite error: Invalid <ref> tag; no text was provided for refs named dabney2020temporally
  96. Cite error: Invalid <ref> tag; no text was provided for refs named thompson1933likelihood
  97. Cite error: Invalid <ref> tag; no text was provided for refs named ouyang2017learning
  98. Cite error: Invalid <ref> tag; no text was provided for refs named gopalan2015thompson
  99. Cite error: Invalid <ref> tag; no text was provided for refs named geist2019theory
  100. Cite error: Invalid <ref> tag; no text was provided for refs named gao2020state
  101. Cite error: Invalid <ref> tag; no text was provided for refs named tang2021exploratory
  102. Cite error: Invalid <ref> tag; no text was provided for refs named grau2018soft
  103. Cite error: Invalid <ref> tag; no text was provided for refs named massoud2009regularized
  104. Cite error: Invalid <ref> tag; no text was provided for refs named farahmand2008regularized
  105. Cite error: Invalid <ref> tag; no text was provided for refs named gajane2018sliding
  106. 106.0 106.1 Cite error: Invalid <ref> tag; no text was provided for refs named cheung2019learning
  107. Cite error: Invalid <ref> tag; no text was provided for refs named mao2020model
  108. Cite error: Invalid <ref> tag; no text was provided for refs named touati2020efficient
  109. Cite error: Invalid <ref> tag; no text was provided for refs named cheung2020reinforcement
  110. Cite error: Invalid <ref> tag; no text was provided for refs named wei2021non