<div class="d-none"><math>
\newcommand*{\rom}[1]{\expandafter\@slowromancap\romannumeral #1@}
\newcommand{\vertiii}[1]{{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert #1
\right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}}
\DeclareMathOperator*{\dprime}{\prime \prime}
\DeclareMathOperator{\Tr}{Tr}
\DeclareMathOperator{\E}{\mathbb{E}}
\DeclareMathOperator{\N}{\mathbb{N}}
\DeclareMathOperator{\R}{\mathbb{R}}
\DeclareMathOperator{\Sc}{\mathcal{S}}
\DeclareMathOperator{\Ac}{\mathcal{A}}
\DeclareMathOperator{\Pc}{\mathcal{P}}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator{\sx}{\underline{\sigma}_{\pmb{X}}}
\DeclareMathOperator{\sqmin}{\underline{\sigma}_{\pmb{Q}}}
\DeclareMathOperator{\sqmax}{\overline{\sigma}_{\pmb{Q}}}
\DeclareMathOperator{\sqi}{\underline{\sigma}_{Q,\textit{i}}}
\DeclareMathOperator{\sqnoti}{\underline{\sigma}_{\pmb{Q},-\textit{i}}}
\DeclareMathOperator{\sqfir}{\underline{\sigma}_{\pmb{Q},1}}
\DeclareMathOperator{\sqsec}{\underline{\sigma}_{\pmb{Q},2}}
\DeclareMathOperator{\sru}{\underline{\sigma}_{\pmb{R}}^{u}}
\DeclareMathOperator{\srv}{\underline{\sigma}_{\pmb{R}}^v}
\DeclareMathOperator{\sri}{\underline{\sigma}_{R,\textit{i}}}
\DeclareMathOperator{\srnoti}{\underline{\sigma}_{\pmb{R},\textit{-i}}}
\DeclareMathOperator{\srfir}{\underline{\sigma}_{\pmb{R},1}}
\DeclareMathOperator{\srsec}{\underline{\sigma}_{\pmb{R},2}}
\DeclareMathOperator{\srmin}{\underline{\sigma}_{\pmb{R}}}
\DeclareMathOperator{\srmax}{\overline{\sigma}_{\pmb{R}}}
\DeclareMathOperator{\HH}{\mathcal{H}}
\DeclareMathOperator{\HE}{\mathcal{H}(1/\varepsilon)}
\DeclareMathOperator{\HD}{\mathcal{H}(1/\varepsilon)}
\DeclareMathOperator{\HCKI}{\mathcal{H}(C(\pmb{K}^0))}
\DeclareMathOperator{\HECK}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))}
\DeclareMathOperator{\HECKI}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))}
\DeclareMathOperator{\HC}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))}
\DeclareMathOperator{\HCK}{\mathcal{H}(C(\pmb{K}))}
\DeclareMathOperator{\HCKR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))}
\DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))}
\newcommand{\mathds}{\mathbb}</math></div>
<span id="sec:rl_basics"></span>
Reinforcement learning is | |||
an approach to understanding and automating goal-directed learning and decision-making. It leads naturally to algorithms for such problems and is distinguished from other computational approaches by its emphasis on learning by an agent from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment.
RL uses a formal framework defining the interaction between a learning agent and its environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and non-determinism, and the existence of explicit goals.
The history of RL has two main threads that were pursued independently before intertwining in modern RL. One thread concerns learning by trial and error and started in the psychology of animal learning.
The other thread concerns the problem of optimal control for an evolving system and its solution using value functions and dynamic programming. Although this thread did not involve learning directly, the Bellman equation developed from this line of literature is viewed as the foundation of many important modern RL algorithms such as <math>Q</math>-learning and Actor-Critic. | |||
Over the last few years, RL, grounded in combining classical theoretical results with deep learning and the functional approximation paradigm, has proved to be a fruitful approach to many artificial intelligence tasks from diverse domains. Breakthrough achievements include reaching human-level performance in such complex tasks as Go <ref name="silver2017mastering"/> and multi-player StarCraft II <ref name="vinyals2019grandmaster"/>, as well as the development of ChatGPT <ref name="radford2019language"/>. The generality of the reinforcement learning framework allows its application in both discrete and
continuous spaces to solve tasks in both real and simulated environments <ref name="lillicrap2015continuous"/>. | |||
Classical RL research during the last third of the previous century developed an extensive theoretical core on which modern algorithms are based. Several algorithms, such as Temporal-Difference Learning and <math>Q</math>-learning, were developed and are able to solve small-scale problems when either the states of the environment can be enumerated (and stored in memory) or the optimal policy lies in the space of linear or quadratic functions of the state variable. Although these restrictions are extremely limiting, these foundations of classical RL theory underlie the modern approaches which benefit from increased computational power. | |||
Combining this framework with deep learning <ref name="goodfellow2016deep"/> was popularized by the Deep <math>Q</math>-learning algorithm, introduced in <ref name="mnih2013playing"/>, which was able to play any of 57 Atari console games without tweaking the network architecture or algorithm hyperparameters. This novel approach was extensively researched and significantly improved over the following years <ref name="van2016deep"/><ref name="gu2016continuous"/><ref name="fan2020theoretical"/>. | |||
Our aim in this section is to introduce the foundations of reinforcement learning. We start with the setup for MDPs in [[#sec:set_up |Section]] with both an infinite time horizon and a finite time horizon, as there are financial applications of both settings in the literature. In [[#sec:MDP2learning |Section]] we then introduce the learning procedure when the reward and transition dynamics of the MDPs are unknown to the decision-maker. In particular, we introduce various criteria to measure the performance of learning algorithms in different settings. Then we focus on two major classes of model-free reinforcement learning algorithms, value-based methods in [[#sec:value_based_methods |Section]] and policy-based methods in [[#subsec:policy_based_methods |Section]], with an infinite time horizon. Finally, [[#sec:discussion |Section]] contains a general discussion of (model-based and model-free) RL with a finite time horizon, model-based RL with an infinite time horizon, and regularization methods for RL.
===<span id="sec:set_up"></span>Setup: Markov Decision Processes=== | |||
We first introduce MDPs with infinite time horizon. Many portfolio optimization problems, such as those arising from pension funds, investigate long-term investment strategies and hence it is natural to formulate them as a decision-making problem over an infinite time horizon. We then introduce MDPs with finite time horizon. Problems such as the optimal execution of the purchase or liquidation of a quantity of asset usually involve trading over a short time horizon and there may be a penalty at the terminal time if the targeted amount is not completely executed by then. Finally we discuss different policy classes that are commonly adopted by the decision-maker. | |||
'''Infinite Time Horizon and Discounted Reward.''' Let us start with a discrete time MDP with an infinite time horizon and discounted reward. We have a Markov process taking values in a state space <math>\Sc</math>, where we can influence the evolution by taking an action from a set <math>\Ac</math>. The aim is to optimize the expected discounted return from the system by choosing a policy, that is a sequence of actions through time. We formulate this mathematically by defining the value function <math>V^\star</math> for each <math>s\in \Sc</math> to be | |||
<math display="block"> | |||
\begin{eqnarray} \label{equ: singlev} | |||
V^\star(s) = \sup_{\Pi} V^{\Pi}(s):= | |||
\sup_{\Pi}\E^{\Pi}\biggl[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t) \bigg| s_0 = s\biggr],
\end{eqnarray} | |||
</math> | |||
subject to | |||
<math display="block"> | |||
\begin{eqnarray} \label{equ: singledynamics} | |||
s_{t + 1} \sim P(s_t, a_t), \;\;\; a_t \sim \pi_t({s_t}). | |||
\end{eqnarray} | |||
</math> | |||
Here and throughout the paper, <math>\E^{\Pi}</math> denotes the expectation under the policy <math>{\Pi}</math>, where the probability measure <math>\mathbb{P}</math> describes both the dynamics and the rewards in the MDP. We will write <math>\mathcal{P}(\mathcal{X})</math> for the probability measures over a space <math>\mathcal{X}</math>. The state space <math>(\Sc, d_{\Sc})</math> and the action space <math>(\Ac, d_{\Ac})</math> are both complete separable metric spaces, which includes the case of <math>\Sc</math> and <math>\Ac</math> being finite, as often seen in the RL literature; <math>\gamma \in (0, 1)</math> is a discount factor; <math>s_t \in \Sc</math> and <math>a_t \in \Ac</math> are the state and the action at time <math>t</math>; <math>P: \Sc \times \Ac \to \Pc(\Sc)</math> is the transition function of the underlying Markov process; the notation <math> s_{t + 1} \sim P(s_t, a_t)</math> denotes that <math> s_{t + 1}</math> is sampled from the distribution <math> P(s_t, a_t)\in \mathcal{P}(\mathcal{S})</math>; the reward <math>r(s, a)</math> is a random variable in <math>\R</math> for each pair <math>(s, a)\in \Sc \times \Ac</math>; and the policy <math>\Pi =\{\pi_t\}_{t=0}^{\infty}</math> is Markovian, in that it depends only on the current state, and can be either deterministic or randomized. For a deterministic policy, <math>\pi_t: \Sc \to \Ac</math> maps the current state <math>s_t</math> to a deterministic action. On the other hand, a randomized policy <math>\pi_t: \Sc\to \Pc(\Ac)</math> maps the current state <math>s_t</math> to a distribution over the action space.
For a randomized policy <math>\pi_t</math>, we will use <math>\pi_t(s)\in \mathcal{P}(\mathcal{A})</math> and <math>\pi_t(s,a)</math> to represent, respectively, the distribution over the action space given state <math>s</math> and the probability of taking action <math>a</math> at state <math>s</math>. | |||
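As a concrete illustration of the two policy types over a finite action space, consider the following minimal sketch; the lookup rule and the row-stochastic matrix below are arbitrary assumptions for illustration only.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
actions = np.array([0, 1, 2])

def deterministic_pi(s):
    # pi_t: S -> A; a fixed (purely illustrative) lookup rule.
    return int(actions[s % len(actions)])

def randomized_pi(s):
    # pi_t: S -> P(A); pi_t(s, a) is read off a row-stochastic matrix,
    # and an action is sampled as a_t ~ pi_t(s_t).
    probs = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1]])[s % 2]
    return int(rng.choice(actions, p=probs))
</syntaxhighlight>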
In this case, with infinite-time horizon, we assume the reward <math>r</math> and the transition dynamics <math>P</math> are time homogeneous, which is a standard assumption in the MDP literature <ref name="puterman2014markov"/>. We also note that the setup is essentially the same if we consider the problem of minimizing costs rather than maximizing a reward. | |||
The well-known Dynamic Programming Principle (DPP), that the optimal policy can be obtained by maximizing the reward from one step and then proceeding optimally from the new state, can be used to derive the following Bellman equation for the value function \eqref{equ: singlev}:
<math display="block"> | |||
\begin{eqnarray} \label{equ:classicalV} | |||
V^\star(s) = \sup_{a \in \Ac} \Big\lbrace \E[r(s, a)] + \gamma \E_{s' \sim P(s, a)}[V^\star(s')] \Big\rbrace. | |||
\end{eqnarray} | |||
</math> | |||
We write the value function as | |||
<math display="block"> | |||
V^\star(s) = \sup_{a \in \Ac} Q^\star(s, a), | |||
</math> | |||
where the <math>Q</math>-function, one of the basic quantities used for RL, is defined to be | |||
<math display="block"> | |||
\begin{eqnarray} \label{defsingleQ} | |||
Q^\star(s, a) = \E[r(s, a)] + \gamma \E_{s' \sim P(s, a)}[V^\star(s')], | |||
\end{eqnarray} | |||
</math> | |||
the expected reward from taking action <math>a</math> at state <math>s</math> and then following the optimal policy thereafter. | |||
There is also a Bellman equation for the <math>Q</math>-function given by | |||
<math display="block"> | |||
\begin{eqnarray} \label{equ: singleQ} | |||
Q^\star(s, a) = \E[r(s, a)] + \gamma \E_{s' \sim P(s, a)}\sup_{a' \in \Ac} Q^\star(s', a'). | |||
\end{eqnarray} | |||
</math> | |||
This allows us to retrieve an
optimal (stationary) policy <math>\pi^\star</math> (if it exists) from <math>Q^\star</math>, in that <math>\pi^\star(s) \in \arg\max_{a \in \Ac} Q^\star(s, a)</math>.
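To see the Bellman equation \eqref{equ: singleQ} in action, the following minimal sketch runs tabular value iteration on a toy two-state, two-action MDP with known transitions and rewards; the numbers are arbitrary and purely illustrative.
<syntaxhighlight lang="python">
import numpy as np

# A toy MDP, purely illustrative: 2 states, 2 actions.
# P[s, a, s'] is the transition probability, r[s, a] the expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Fixed-point update: Q(s,a) <- E[r(s,a)] + gamma * E_{s'~P(s,a)}[max_a' Q(s',a')]
    Q = r + gamma * P @ Q.max(axis=1)

V = Q.max(axis=1)        # V*(s) = sup_a Q*(s,a)
pi = Q.argmax(axis=1)    # a greedy stationary optimal policy
</syntaxhighlight>
Since <math>\gamma < 1</math> the update is a contraction, so the iterates converge to <math>Q^\star</math> regardless of the initialization.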
An alternative setting over an infinite time horizon is the case of average reward, which is also referred to as an ergodic reward. This ergodic setting is not as relevant to financial applications and hence will not be covered in this review paper. We refer the reader to <ref name="abbasi2019politex"/><ref name="wei2020model"/> for a more detailed discussion of RL with infinite-time horizon and average reward. | |||
'''Finite Time Horizon.''' When there is a finite time horizon <math>T < \infty</math> we no longer discount future values and have a terminal reward. The MDP problem with finite time horizon can be expressed as | |||
<math display="block"> | |||
\begin{eqnarray} \label{equ: singlev_fh} | |||
V^\star_t(s) = \sup_{\Pi} V_t^{\Pi}(s):= | |||
\sup_{\Pi}\E^{\Pi}\biggl[\sum_{u=t}^{T-1} r_u(s_u, a_u) +r_T(s_T)\bigg| s_t = s\biggr], \;\; \forall s\in \Sc,
\end{eqnarray} | |||
</math> | |||
subject to | |||
<math display="block"> | |||
\begin{eqnarray} \label{equ: singledynamics_fh} | |||
s_{u + 1} \sim P_u(s_u, a_u), \;\;\; a_u \sim \pi_u(s_u),\;\; t \leq u \leq T-1. | |||
\end{eqnarray} | |||
</math> | |||
As in the infinite horizon case, we denote by <math>s_u \in \Sc</math> and <math>a_u \in \Ac</math> the state and the action of the agent at time <math>u</math>.
However, this differs from the infinite horizon case in that we allow time-dependent transition and reward functions. Now <math>P_u: \Sc \times \Ac \to \Pc(\Sc)</math> denotes the transition function and <math>r_u(s, a)</math> a real-valued random variable for each pair <math>(s, a)\in \Sc \times \Ac</math> for <math>t \leq u \leq T-1</math>. Similarly <math>r_T(s)</math>, the terminal reward, is a real-valued random variable for all <math>s\in \Sc</math>. Finally, the Markovian policy <math>\Pi =\{\pi_u\}_{u=0}^{T-1}</math> can be either deterministic or randomized.
The corresponding Bellman equation for the value function \eqref{equ: singlev_fh} is
<math display="block"> | |||
\begin{eqnarray} \label{equ:classicalV_fh} | |||
V^\star_t(s) = \sup_{a \in \Ac} \Big\lbrace \E[r_t(s, a)] + \E_{s' \sim P_{t}(s, a)}[V^\star_{t+1}(s')] \Big\rbrace, | |||
\end{eqnarray} | |||
</math> | |||
with the terminal condition <math>V^\star_T(s) = \mathbb{E}[r_{T}(s)]</math>. | |||
We write the value function as | |||
<math display="block"> | |||
V^\star_t(s) = \sup_{a \in \Ac} Q^\star_t(s, a), | |||
</math> | |||
where the <math>Q^\star_t</math> function is defined to be | |||
<math display="block"> | |||
\begin{eqnarray} \label{defsingleQ_fh} | |||
Q^\star_t(s, a) = \E[r_t(s, a)] + \E_{s' \sim P_t(s, a)}[V^\star_{t+1}(s')],
\end{eqnarray} | |||
</math> | |||
the expected reward from taking action <math>a</math> at state <math>s</math> at time <math>t</math> and then following the optimal policy thereafter.
There is also a Bellman equation for the <math>Q</math>-function given by | |||
<math display="block"> | |||
\begin{eqnarray} \label{equ: singleQ_fh} | |||
Q^\star_t(s, a) = \E[r_t(s, a)] + \E_{s' \sim P_t(s, a)}\left[\sup_{a' \in \Ac} Q^\star_{t+1}(s', a')\right], | |||
\end{eqnarray} | |||
</math> | |||
with the terminal condition <math>Q^\star_T(s,a) = \mathbb{E}[r_T(s)]</math> for all <math>a\in \mathcal{A}</math>. We note that since financial time series data are typically non-stationary, time-varying transition kernels and reward functions in \eqref{equ: singlev_fh}-\eqref{equ: singledynamics_fh} are particularly important for financial applications. | |||
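Since the finite-horizon Bellman equations \eqref{equ:classicalV_fh}-\eqref{equ: singleQ_fh} start from the terminal condition and work backwards, they translate directly into a backward-induction sketch; the toy model below, which reuses a single transition kernel and reward for every <math>t</math>, is an illustrative assumption.
<syntaxhighlight lang="python">
import numpy as np

T = 5
# Toy time-homogeneous ingredients reused for every t (an assumption for brevity):
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P_t[s, a, s']
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # E[r_t(s, a)]
r_T = np.array([0.0, 1.0])                 # E[r_T(s)], terminal reward

V = r_T.copy()                             # V*_T(s) = E[r_T(s)]
Q = [None] * T
for t in reversed(range(T)):
    Q[t] = r + P @ V                       # Q*_t(s,a) = E[r_t(s,a)] + E_{s'}[V*_{t+1}(s')]
    V = Q[t].max(axis=1)                   # V*_t(s) = sup_a Q*_t(s,a)
</syntaxhighlight>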
For an infinite horizon MDP with finite state and action spaces and bounded reward <math>r</math>, a further useful observation is that, whenever an optimal policy exists, there is a stationary optimal policy.
{{proofcard|Theorem (Theorem 6.2.7 in <ref name="puterman2014markov"/>)|thm:MDP_discounted|Assume <math>|\mathcal{A}| < \infty</math>, <math>|\mathcal{S}| < \infty</math> and <math>|r| < \infty</math> with probability one. | |||
For any infinite horizon discounted MDP, there always exists a deterministic stationary policy that is optimal.|}} | |||
[[#thm:MDP_discounted |Theorem]] implies that there always exists a fixed policy such that taking the actions specified by that policy at each time step maximizes the discounted reward. The agent does not need to change policies with time. There is a similar result for the average reward case, see Theorem 8.1.2 in <ref name="puterman2014markov"/>. This insight reduces the question of finding the best sequential decision-making strategy to the question of finding the ''best stationary policy''. To this end, we write <math>\pi:\mathcal{S}\rightarrow \mathcal{P}(\mathcal{A})</math> (without a time index) for a stationary policy throughout the paper.
'''Linear MDPs and Linear Functional Approximation.''' Linear MDPs have been extensively studied in the recent literature on theoretical RL in order to establish tractable and efficient algorithms. In a linear MDP, both the transition kernels and the reward function are assumed to be linear with respect to some feature mappings <ref name="bradtke1996linear"/><ref name="melo2007q"/>. | |||
In the infinite horizon setting, an MDP is said to be linear with a feature map <math>\phi: \mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R}^d</math>, if there exist <math>d</math> unknown (signed) measures <math>\mu = (\mu^{(1)},\cdots,\mu^{(d)})</math> over <math>\mathcal{S}</math> and an unknown vector <math>\theta \in \mathbb{R}^d</math> such that for any <math>(s,a)\in \mathcal{S}\times \mathcal{A}</math>, we have | |||
<math display="block"> | |||
\begin{eqnarray}\label{eq:linearMDP_infinite} | |||
P(\cdot|s,a) = \langle\, \phi(s,a),\mu(\cdot)\,\rangle,\quad r(s,a) = \langle\, \phi(s,a),\theta\,\rangle. | |||
\end{eqnarray} | |||
</math> | |||
Similarly for the finite horizon setting, an MDP is said to be linear with a feature map <math>\phi: \mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R}^d</math>, if for any <math>0\leq t \leq T</math>, there exist <math>d</math> unknown (signed) measures <math>\mu_t = (\mu_t^{(1)},\cdots,\mu_t^{(d)})</math> over <math>\mathcal{S}</math> and an unknown vector <math>\theta_t \in \mathbb{R}^d</math> such that for any <math>(s,a)\in \mathcal{S}\times \mathcal{A}</math>, we have | |||
<math display="block"> | |||
\begin{eqnarray}\label{eq:linearMDP_finite} | |||
P_t(\cdot|s,a) = \langle\, \phi(s,a),\mu_t(\cdot)\,\rangle,\quad r_t(s,a) = \langle\, \phi(s,a),\theta_t\,\rangle.
\end{eqnarray} | |||
</math> | |||
Typically features are assumed to be known to the agent and bounded, namely, <math>\|\phi(s,a)\|\leq 1</math> for all <math>(s,a)\in \mathcal{S}\times \mathcal{A}</math>. | |||
The linear MDP framework is closely related to the literature on RL with ''linear functional approximation'', where the value function is assumed to be of the form
<math display="block"> | |||
\begin{eqnarray}\label{eq:lfa_infinite} | |||
Q(s,a) = \langle\, \psi(s,a), \omega \,\rangle, \quad V(s) = \langle\, \xi(s), \eta \,\rangle | |||
\end{eqnarray} | |||
</math> | |||
for the infinite horizon case, and | |||
<math display="block"> | |||
\begin{eqnarray}\label{eq:lfa_finite} | |||
Q_t(s,a) = \langle\, \psi(s,a), \omega_t \,\rangle, \quad V_t(s) = \langle\, \xi(s), \eta_t \,\rangle, \forall\, 0\leq t \leq T | |||
\end{eqnarray} | |||
</math> | |||
for the finite horizon case. Here <math>\psi: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}^d</math> and <math>\xi: \mathcal{S}\rightarrow \mathbb{R}^d</math> are known feature mappings and <math>\omega</math>, <math>\omega_t</math>, <math>\eta</math>, and <math>\eta_t</math> are unknown vectors. It has been shown that the linear MDP (i.e., \eqref{eq:linearMDP_infinite} or \eqref{eq:linearMDP_finite}) and the linear functional approximation formulation (i.e., \eqref{eq:lfa_infinite} or \eqref{eq:lfa_finite}) are equivalent under mild conditions <ref name="jin2020provably"/><ref name="yang2019sample"/>. In addition to linear functional approximations, we refer the reader to <ref name="dai2018sbeed"/> for a general discussion of RL with nonlinear functional approximations and to [[guide:576dcdd2b6#subsec:deep_value_based |Section]] for neural network approximation, one of the most popular nonlinear functional approximations used in practice.
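As a minimal sketch of the linear functional approximation \eqref{eq:lfa_infinite}, suppose the feature map <math>\psi</math> is known and we observe noisy targets for <math>Q(s,a)</math> (e.g., Monte Carlo return estimates); the unknown vector <math>\omega</math> can then be fit by least squares. The feature map and synthetic data below are assumptions for illustration only.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d = 4

def psi(s, a):
    # A hypothetical feature map psi: S x A -> R^d, for illustration only.
    return np.array([1.0, s, a, s * a])

# Synthetic "ground truth" omega and noisy Q-targets standing in for
# (e.g.) Monte Carlo return estimates.
omega_true = rng.normal(size=d)
pairs = [(s, a) for s in np.linspace(0.0, 1.0, 20) for a in (0.0, 1.0)]
Psi = np.stack([psi(s, a) for s, a in pairs])
y = Psi @ omega_true + 0.01 * rng.normal(size=len(pairs))

# Least-squares fit of omega in Q(s, a) = <psi(s, a), omega>.
omega_hat, *_ = np.linalg.lstsq(Psi, y, rcond=None)
</syntaxhighlight>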
'''Nonlinear Functional Approximation.''' Compared to linear functional approximations, nonlinear functional approximations do not require knowledge of the kernel functions a priori. Nonlinear functional approximation can also address the mis-specification issue that arises when the agent has an incorrect understanding of the functional space in which the MDP lies. The most popular nonlinear functional approximation approach is to use neural networks, leading to deep RL. Thanks to the universal approximation theorem, this is a theoretically sound approach, and neural networks show promising performance for a wide range of applications. Meanwhile, gradient-based algorithms for certain neural network architectures enjoy provable convergence guarantees. We defer the discussion of deep RL to [[guide:576dcdd2b6#subsec:deep_value_based |Section]] and of its financial applications to the section on financial applications.
===<span id="sec:MDP2learning"></span>From MDP to Learning=== | |||
When the transition dynamics <math>P</math> and the reward function <math>r</math> for the infinite horizon MDP problem are unknown, this MDP becomes a reinforcement learning problem, which is to find an optimal stationary policy <math>\pi</math> (if it exists) while simultaneously learning the unknown <math>P</math> and <math>r</math>. The learning of <math>P</math> and <math>r</math> can be either explicit or implicit, which leads to model-based and model-free RL, respectively. Analogous ideas hold for the finite horizon case. We now introduce some standard RL terminology. A more detailed introduction to RL can be found in textbooks such as <ref name="suttonbarto"/><ref name="powell2021reinforcement"/>.
'''Agent-environment Interface.''' In the RL setting, the learner or the decision-maker is called the agent. The physical world that the agent operates in and interacts with, comprising | |||
everything outside the agent, is called the environment. The agent and environment interact at each of a sequence of discrete time steps, <math>t =0,1,2,3,\cdots</math>, in the following way. At the beginning of each time step <math>t</math>, the agent receives some representation of the environment's state, <math>s_t \in \mathcal{S}</math>, and selects an action <math>a_t\in \mathcal{A}</math>. At the end of this time step, in part as a consequence of its action, the agent receives a numerical reward <math>r_t</math> (possibly stochastic) and a new state <math>s_{t+1}</math> from the environment. The tuple <math>(s_t,a_t,r_t,s_{t+1})</math> is called a sample at time <math>t</math> and <math>h_t :=\{(s_u,a_u,r_u,s_{u+1})\}_{u=0}^t</math> is referred to as the history or experience up to time <math>t</math>. An RL algorithm is a finite sequence of well-defined policies for the agent to interact with the environment.
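This interface can be summarized in a few lines of code; the <code>env</code> object below, with <code>reset</code> and <code>step</code> methods, is a hypothetical stand-in in the spirit of common RL toolkits rather than any particular library.
<syntaxhighlight lang="python">
def interact(env, policy, num_steps):
    """Roll out a policy and collect the experience h_t.

    `env` is assumed to expose reset() -> s and step(a) -> (r, s');
    `policy` maps a state to an action. Both are placeholders.
    """
    history = []
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)
        r, s_next = env.step(a)
        history.append((s, a, r, s_next))  # one sample per time step
        s = s_next
    return history
</syntaxhighlight>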
'''Exploration vs Exploitation.''' In order to maximize the accumulated reward over time, the agent learns to select her actions based on her past experiences (exploitation) and/or by trying new choices (exploration). Exploration provides opportunities to improve performance from the current sub-optimal solution to the ultimate globally optimal one, yet it is time consuming and computationally expensive, and over-exploration may impair the convergence to the optimal solution. Meanwhile, pure exploitation, i.e., myopically picking the current solution based solely on past experience, though easy to implement, tends to yield sub-optimal global solutions. Therefore, an appropriate trade-off between exploration and exploitation is crucial in the design of RL algorithms in order to improve the learning and the optimization performance. See [[#sec:ex-ex |Section]] for a more detailed discussion.
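One common way to encode this trade-off, used by several of the algorithms below, is the <math>\varepsilon</math>-greedy rule: exploit the current greedy action with probability <math>1-\varepsilon</math> and explore uniformly at random otherwise. A minimal sketch over a finite action space:
<syntaxhighlight lang="python">
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    # q_row holds the current estimates Q(s, .) for one state s.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore: uniform random action
    return int(np.argmax(q_row))              # exploit: current greedy action
</syntaxhighlight>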
'''Simulator.''' In the procedure described above on how the agent interacts with the environment, the agent does so in an online fashion. Namely, | |||
the initial state at time step <math>t+1</math>, <math>s_{t+1}</math>, is the state that the agent moves to after taking action <math>a_t</math> from state <math>s_t</math>. This is a challenging setting since an efficient exploration scheme is needed. For instance, <math>Q</math>-learning with the <math>\varepsilon</math>-greedy policy (to be introduced in [[#sec:sarsa |Section]]) may take exponentially many episodes | |||
to learn the optimal policy <ref name="kearns2002near"/>. | |||
Some results assume access to a simulator <ref name="koenig1993complexity"/> (a.k.a., a generative model <ref name="azar2012sample"/>), | |||
which allows the algorithm to query ''arbitrary'' state-action pairs and return the reward and the next state. This implies the agent can “restart” the system at any time. That is, the initial state at time step <math>t+1</math> does not need to be the state that the agent moves to after taking action <math>a_t</math> from state <math>s_t</math>. The “simulator” significantly alleviates the difficulty of exploration, since a naive exploration strategy which queries all state-action pairs uniformly at random already leads to the most efficient algorithm for finding an optimal policy <ref name="azar2012sample"/>.
'''Randomized Policy vs Deterministic Policy.''' | |||
A randomized policy <math>\pi_t:\Sc \rightarrow \mathcal{P}(\Ac)</math> is also known in the control literature as a relaxed control and in game theory as a mixed strategy. Despite the existence of a ''deterministic'' optimal policy as stated in [[#thm:MDP_discounted |Theorem]] for the infinite time horizon case, most RL algorithms adopt randomized policies to encourage exploration when RL agents are not certain about the environment.
'''An Example: Multi-armed bandits.''' We illustrate these points by discussing | |||
multi-armed bandit problems, a special case of RL problems. | |||
The multi-armed bandit is a model for a set of slot machines. A simple version is that there are a number of arms, each with a stochastic reward coming from a fixed probability distribution which is initially unknown. | |||
We try these arms in some order, which may depend on the sequence of rewards that have been | |||
observed so far and attempt to maximize our return. A common objective in this context is to find a policy for choosing the next arm | |||
to be tried, under which the sum of the expected rewards comes as close as possible to the ideal | |||
reward, i.e., the expected reward that would be obtained if we were to try the “best” arm at all times. | |||
Thus the agent interacts with the environment by selecting an action, pulling an arm, and receiving a reward.
The policy of the order of pulling on the arms has to balance exploitation, the continued use of an arm that is returning a decent reward, with exploration, sampling arms about which there is less information to find one with a possibly better return. | |||
In this setting for a multi-armed bandit problem there are no state dynamics. In this case, an admissible policy is simply a distribution over the action space for a randomized policy and a single arm for a deterministic policy. A simple example of a bandit with a state space is the stochastic contextual bandit, which is used extensively in the personalized advertisement recommendation and clinical treatment recommendation literature. In this case the state dynamics (also referred to as the context) are sampled i.i.d. from an unknown distribution. For the personalized advertisement recommendation example, the state at time <math>t</math> can be interpreted as the personal information and purchasing behavior of a targeted client.
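The following self-contained sketch simulates an <math>\varepsilon</math>-greedy agent on three Gaussian arms and reports the gap to the ideal reward of always pulling the best arm; the arm means and all parameters are arbitrary illustrative choices.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
means = np.array([0.2, 0.5, 0.8])   # arm means, unknown to the agent
eps, n_pulls = 0.1, 5000

est = np.zeros(3)      # running estimate of each arm's mean
counts = np.zeros(3)   # number of pulls per arm
total = 0.0
for _ in range(n_pulls):
    arm = int(rng.integers(3)) if rng.random() < eps else int(np.argmax(est))
    reward = rng.normal(means[arm], 1.0)
    counts[arm] += 1
    est[arm] += (reward - est[arm]) / counts[arm]  # incremental mean update
    total += reward

# Gap to the ideal reward of always pulling the best arm.
print(n_pulls * means.max() - total)
</syntaxhighlight>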
The problem was first considered in the seminal work of Robbins <ref name="robbins1952some"/>, which | |||
derives policies that asymptotically attain an average reward that converges in the limit to the reward of the best arm. | |||
The multi-armed bandit problem was later studied in discounted, Bayesian, Markovian, expected | |||
reward, and adversarial setups. See Berry and Fristedt <ref name="berry1985bandit"/> for a review of the classical results on the multi-armed bandit problem. | |||
====Performance Evaluation==== | |||
It is not possible to solve MDPs analytically in general and therefore we will discuss numerical algorithms to determine the optimal policies. | |||
In order to assess different algorithms we will need a measure of their performance. Here we will consider several types of performance measure: sample complexity, rate of convergence, regret analysis and asymptotic convergence. | |||
For RL with a finite time horizon, one episode contains a sequence of states, actions and rewards, which starts at time <math>0</math> and ends at the terminal time <math>T</math>, and the performance is evaluated in terms of the total number of samples in all episodes.{{efn|Sample complexity with respect to the number of episodes has also been adopted in the literature <ref name="DannBrunskill2015"/><ref name="DannUBEV"/>. The translation between these two criteria is straightforward.}} Sometimes we also refer to this setting as episodic RL. For RL with infinite time horizon, the performance is evaluated in terms of the number of steps. In this setting one step contains one (state, action, reward, next state) tuple. It is worth noting that the episodic criterion can also be used to analyze infinite horizon problems. One popular technique is to truncate the trajectory (with infinitely many steps) at a large time and hence translate the problem into an approximating finite time horizon problem. Examples include REINFORCE <ref name="zhang2020sample"/> and the policy gradient method <ref name="liu2020improved"/> (see [[#subsec:policy_based_methods |Section]]).
Throughout the analysis, we use the notation <math>\widetilde{\mathcal{O}}(\cdot)</math> to hide logarithmic factors when describing the order of the scaling in <math>\varepsilon</math>, <math>\delta</math>, <math>|\mathcal{A}|</math> and <math>|\mathcal{S}|</math> (when these are finite), and other possible model parameters such as the discount rate <math>\gamma</math> in the infinite horizon case or the dimension <math>d</math> of the features when functional approximation is used. We write <math>\widetilde{\mathcal{O}}^{\prime}</math> rather than <math>\widetilde{\mathcal{O}}</math> to indicate that, although there may be other parameters, we only include the dependence on <math>\varepsilon</math>.
We denote by <math>\mbox{poly}(\cdot)</math> a polynomial function of its arguments. | |||
'''Sample Complexity.''' In RL, a sample <math>(s,a,r,s^{\prime})</math> is defined as a tuple consisting of a state <math>s</math>, an action <math>a</math>, an instantaneous reward received by taking action <math>a</math> at state <math>s</math>, and the next state <math>s^{\prime}</math> observed afterwards. Sample complexity is defined as the total number of samples required to find an approximately optimal policy. This evaluation criterion can be used for any kind of RL problem. | |||
For episodic RL, an algorithm returns, after the end of the <math>(k-1)</math>-th episode with <math>M_{k-1}</math> samples used in this episode, a policy <math>\pi_k</math> to be applied in the <math>k</math>-th episode. | |||
The sample complexity of this algorithm is the minimum number <math>\sum_{k=1}^K M_{k-1}(\varepsilon)</math> of samples such that for all <math>k \ge K</math>, <math>\pi_k</math> is <math>\varepsilon</math>-optimal with probability at least <math>1-\delta</math>, i.e., for <math>k \ge K</math>, | |||
<math display="block"> | |||
\begin{eqnarray}\label{eq:sample_complexity_epi} | |||
\mathbb{P}\left(V_0^{\pi_k}\ge V_0^\star-\varepsilon\right) \ge 1-\delta. | |||
\end{eqnarray} | |||
</math> | |||
For discounted RL, an algorithm returns, after | |||
the end of the <math>(n-1)</math>-th step with <math>M_{n-1}</math> samples used in this iteration, a policy <math>\pi_n</math> to be applied in the <math>n</math>-th step. There are several notions of sample complexity in this setting. | |||
The most commonly used ones are defined through the value function and the <math>Q</math>-function with all input variables (i.e., <math>s\in \mathcal{S}</math> for <math>V</math> and <math>(s,a)\in \mathcal{S}\times \mathcal{A}</math> for <math>Q</math>) and they are referred to as the <math>V</math>-sample complexity and <math>Q</math>-sample complexity in the literature. The <math>Q</math>-sample complexity of a given algorithm is defined as the minimum number <math>\sum_{n=1}^N M_{n-1}(\varepsilon)</math> of samples such that for all <math>n \ge N</math>, <math>Q^{(n)}</math> is <math>\varepsilon</math>-close to <math>Q^*</math>, that is | |||
<math display="block"> | |||
\begin{eqnarray}\label{eq:sample_complexity_infi_Q} | |||
\|Q^{(n)} - Q^*\|_{\infty}:=\sup_{s\in\mathcal{S},a\in \mathcal{A}}\Big|Q^*(s,a)-Q^{(n)}(s,a)\Big| \leq \varepsilon, | |||
\end{eqnarray} | |||
</math> | |||
holds either with high probability (that is <math>\mathbb{P}\left(\|Q^{(n)}-Q^*\|_{\infty} < \varepsilon\right) \ge 1-\delta</math>) or in the expectation sense <math>\mathbb{E}[\|Q^{(n)}-Q^*\|_{\infty}] < \varepsilon</math>. | |||
Similarly, the <math>V</math>-sample complexity is defined as the minimum number <math>\sum_{n=1}^N M_{n-1}(\varepsilon)</math> of samples such that for all <math>n \ge N</math>, <math>V^{(n)}</math> is <math>\varepsilon</math>-close to <math>V^*</math>, that is | |||
<math display="block"> | |||
\begin{eqnarray}\label{eq:sample_complexity_infi_V_2} | |||
\|V^{(n)} - V^*\|_{\infty}:=\sup_{s\in\mathcal{S}}\left|V^{(n)}(s)-V^*(s)\right|\leq \varepsilon, | |||
\end{eqnarray} | |||
</math> | |||
holds either with high probability (in that <math>\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} < \varepsilon\right) \ge 1-\delta</math>) or in the expectation sense <math>\mathbb{E}[\|V^{(n)}-V^*\|_{\infty}] < \varepsilon</math>. | |||
Note that \eqref{eq:sample_complexity_infi_Q} implies \eqref{eq:sample_complexity_infi_V_2} since <math>V^{(n)}(s) = \sup_{a\in \mathcal{A}}Q^{(n)}(s,a)</math> and <math>V^{*}(s) = \sup_{a\in \mathcal{A}}Q^*(s,a)</math>. | |||
In our analysis we do not distinguish whether the sample complexity bounds for \eqref{eq:sample_complexity_infi_Q} and \eqref{eq:sample_complexity_infi_V_2} hold in the high probability sense or in the expectation sense.
The second type of sample complexity, the ''sample complexity of exploration'', is defined as the number of samples such that the non-stationary policy <math>\pi_n</math> at time <math>n</math> is not <math>\varepsilon</math>-optimal for the current state <math>s_n</math>. This measure counts the number of mistakes along the whole trajectory. Mathematically speaking, the sample complexity of exploration for a given algorithm is the minimum number <math>\sum_{n=1}^N M_{n-1}(\varepsilon)</math> of samples such that for all <math>n \ge N</math>, <math>\pi_n</math> is <math>\varepsilon</math>-optimal when starting from the current state <math>s_n</math> with probability at least <math>1-\delta</math>, i.e., for <math>n \ge N</math>,
<math display="block"> | |||
\begin{eqnarray}\label{eq:sample_complexity_infi_V} | |||
\mathbb{P}\left(\left|V^{\pi_n}(s_n)- V^\star(s_n)\right|\leq \varepsilon\right) \ge 1-\delta. | |||
\end{eqnarray} | |||
</math> | |||
Note that the condition <math>\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} < \varepsilon\right) \ge 1-\delta</math> associated with \eqref{eq:sample_complexity_infi_V_2} is stronger and implies the condition in \eqref{eq:sample_complexity_infi_V}. | |||
Another weaker criterion, the ''mean-square sample complexity'', measures the sample complexity in the average sense with respect to the steady state distribution or the initial distribution <math>\mu</math>.
In this case, the ''mean-square sample complexity'' is defined as the minimum number <math>\sum_{n=1}^N M_{n-1}(\varepsilon)</math> of samples such that for all <math>n \ge N</math>,
<math display="block"> | |||
\begin{eqnarray}\label{eq:sample_complexity_infi_V_avg} | |||
\|V^{\pi_n} - V^*\|_{\mu}:=\sqrt{\int_{\mathcal{S}}(V^{\pi_n}(s)-V^*(s))^2\mu(ds)} \leq \varepsilon. | |||
\end{eqnarray} | |||
</math> | |||
Since \eqref{eq:sample_complexity_infi_V_avg} is an aggregated guarantee via the <math>V</math> function over a state distribution <math>\mu</math>, it is implied by \eqref{eq:sample_complexity_infi_V_2}. | |||
'''Rate of Convergence.''' To emphasize the convergence with respect to the number of iterations required, we introduce the notion of ''rate of convergence'', which uses the relationship between the number of iterations/steps <math>N</math> and the error term <math>\varepsilon</math>, to quantify how fast the learned policy converges to the optimal solution. This is also called the iteration complexity in the RL and optimization literature. Compared to the notion of sample complexity introduced above, the rate of convergence calculates the number of iterations needed to reach a given accuracy while ignoring the number of samples used within each iteration. When only one sample is used in each iteration, the rate of convergence coincides with the sample complexity. In addition when a constant number of samples (i.e., independent from <math>\varepsilon</math>) are used within each iteration, the rate of convergence and the sample complexity are of the same order with respect to <math>\varepsilon</math>. | |||
In particular, an algorithm is said to converge at a ''linear'' rate if <math>N\sim \widetilde{\mathcal{O}}^{\prime}(\log(1/\varepsilon))</math>. Similarly, an algorithm is said to converge at a ''sublinear'' rate (slower than linear) if <math>N\sim \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^p)</math> with <math>p\geq 1</math>, and is said to converge at a ''superlinear'' rate (faster than linear) if <math>N\sim \widetilde{\mathcal{O}}^{\prime}(\log(\log(1/\varepsilon)))</math>.
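For example, in the linear-rate case, if the error after <math>N</math> iterations satisfies a contraction-type bound <math>\varepsilon_N \leq C\rho^N</math> for some constants <math>C > 0</math> and <math>\rho\in(0,1)</math> (an illustrative assumption), then reaching accuracy <math>\varepsilon</math> requires
<math display="block">
N \;\geq\; \frac{\log(C/\varepsilon)}{\log(1/\rho)} = \mathcal{O}(\log(1/\varepsilon)),
</math>
consistent with the definition above.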
'''Regret Analysis.''' The ''regret'' of a given policy <math>\pi</math> is defined as the difference between the cumulative reward of the optimal policy and that gathered by <math>\pi</math>. It quantifies the exploration-exploitation trade-off.
For episodic RL with horizon <math>T</math> in formulation \eqref{equ: singlev_fh}, the regret of an algorithm after <math>K </math> episodes with <math>M=T\,K</math> steps is | |||
<math display="block"> | |||
\begin{eqnarray}\label{eq:regret_epi} | |||
R(M) = \left\| K \, V_0^{\star} - \sum_{k=1}^K\mathbb{E}[V_0^{\pi_k}]\right\|_{\infty} = \sup_{s\in \mathcal{S}} \Big(K \, V_0^{\star}(s) - \sum_{k=1}^K\mathbb{E}[V_0^{\pi_k}(s)]\Big).
\end{eqnarray} | |||
</math> | |||
There is currently no regret analysis for RL problems with infinite time horizon and discounted reward. The natural definition of regret as given above is less meaningful in this case as the cumulative discounted reward is bounded and hence does not scale with <math>T</math>. | |||
'''Asymptotic Convergence.''' In addition to sample complexity and regret analysis discussed above, ''asymptotic'' convergence is another weaker convergence guarantee, which requires the error term to decay to zero as <math>N</math> goes to infinity (without specifying the order of <math>N</math>). This is often a first step in the theoretical analysis of RL algorithms. | |||
====Classification of RL Algorithms==== | |||
An RL algorithm includes one or more of the following components: | |||
<ul style{{=}}"list-style-type:lower-roman"><li>a representation of a value function that provides a prediction of how good each state or each state/action pair is,</li> | |||
<li>a direct representation of the policy <math>\pi(s)</math> or <math>\pi(s,a)</math>,</li>
<li>a model of the environment (the estimated transition function and the estimated reward function) in conjunction with a planning algorithm (any computational process that uses a model to create or improve a policy).</li> | |||
</ul> | |||
The first two components are related to what is called model-free RL. When the latter component is used, the algorithm is referred to as model-based RL. | |||
In the MDP setting, model-based algorithms maintain an approximate MDP model by estimating the transition probabilities and the reward function, and derive a value function from the approximate MDP. The policy is then derived from the value function. Examples include <ref name="brafman2002r"/><ref name="kearns2002near"/><ref name="lattimore2012pac"/> and <ref name="MoRmax2010"/>. Another line of model-based algorithms makes structural assumptions on the model, using some prior knowledge, and utilizes the structural information for algorithm design. For example, see <ref name="fazel2018global"/> for learning linear-quadratic regulators over an infinite time horizon and <ref name="hambly2020policy"/><ref name="basei2021logarithmic"/> for the finite time horizon case.
Unlike the model-based method, model-free algorithms ''directly'' learn a value function (or <math>Q</math>-function) or the optimal policy without inferring the model.
Model-free algorithms can be further divided into two categories, value-based methods and policy-based methods. Policy-based methods explicitly build a representation of a policy and keep it in memory during learning. Examples include policy gradient methods and trust-region policy optimization methods. As an alternative, value-based methods store only a value function without an explicit policy during the learning process. In this case, the policy is implicit and can be derived directly from the value function (by picking the action with the best value). | |||
In our discussion of methodology, we focus on model-free RL algorithms for MDPs with infinite horizon and discounted reward. In particular, we introduce some classical value-based methods in [[#sec:value_based_methods |Section]] and policy-based methods in [[#subsec:policy_based_methods |Section]], respectively. For the episodic setting and model-based algorithms, see the discussion in [[#sec:discussion |Section]].
===<span id="sec:value_based_methods"></span>Value-based Methods=== | |||
In this section, we introduce the methodologies of several classical value-based methods in the setting of an infinite time horizon with discounting, finite state and action spaces (<math>|\mathcal{A}| < \infty</math> and <math>|\mathcal{S}| < \infty</math>), and stationary policies. | |||
The setting with finite state and action spaces is also referred to as the tabular case in the RL literature. For more general setups with an infinite number of actions and states, we refer the reader to [[#sec:value_discussion |Section]] for a discussion of linear functional approximations and to [[guide:576dcdd2b6#subsec:deep_value_based |Section]] for neural network approximations. | |||
Recall that for reinforcement learning, the distribution of the reward function <math>r</math> and the transition function <math>P</math> are unknown to the agent. Therefore the expectations in the Bellman equations \eqref{equ:classicalV} and \eqref{equ: singleQ} cannot be calculated directly. On the other hand, the agent can observe samples <math>(s,a,r,s^{\prime})</math> by interacting with the system without a model of the system's dynamics <math>P</math> or any prior knowledge of <math>r</math>, and can then learn the value function or <math>Q</math>-function using these samples. Following this idea, we next introduce the classical temporal-difference learning algorithm, with <math>Q</math>-learning and State-Action-Reward-State-Action (SARSA) as two special cases.
====Temporal-Difference Learning==== | |||
Given some samples <math>(s,a,r,s')</math> obtained by following a policy <math>\pi</math>, the agent can update her estimate of the value function <math>V^{\pi}</math> (defined in \eqref{equ: singlev}) at the <math>(n+1)</math>-th iteration by | |||
<math display="block"> | |||
\begin{eqnarray}\label{eq:TD_update} | |||
V^{\pi,(n+1)}(s) \leftarrow (1-\beta_n(s,a))\underbrace{V^{\pi,(n)}(s)}_{\text{current estimate}} + \beta_n(s,a) \underbrace{[r+\gamma V^{\pi,(n)}(s')]}_{\text{new estimate}}, | |||
\end{eqnarray} | |||
</math> | |||
with some initialization of the value function <math>V^{\pi,(0)}</math>. Here | |||
<math>\beta_n(s,a)</math> is the learning rate at iteration <math>n+1</math> which balances the weighting between the current estimate and the new estimate. The quantity <math>\beta_n</math> can be a constant, can depend on the current state <math>s</math>, or even depend on the observed samples up to iteration <math>n+1</math>. [[#alg:TD |Algorithm]] provides some pseudo-code for an implementation of the TD method. Equation | |||
\eqref{eq:TD_update} is sometimes referred to as a TD(0) method. It is the special case <math>\lambda = 0</math> of the TD<math>(\lambda)</math> method (for <math>\lambda \in [0,1]</math>), which interpolates between one-step TD learning (<math>\lambda = 0</math>) and a Monte Carlo method (<math>\lambda = 1</math>). Here a Monte Carlo method indicates a simulation-based method that estimates the value function by averaging complete sampled returns, updating in an episode-by-episode sense rather than from a single transition. See [[#sec:REINFORCE |Section]] for a detailed discussion of the REINFORCE method which is an example of such a Monte Carlo method.
Equation \eqref{eq:TD_update} is also referred to as a one-step TD method as it only uses information from one update step of the system. There are multi-step TD methods; see Chapter 7 and Chapter 12 in <ref name="suttonbarto"/> for more details. | |||
The TD update in \eqref{eq:TD_update} can also be written as | |||
<math display="block"> | |||
\begin{eqnarray}\label{eq:TD_update_v2} | |||
V^{\pi,(n+1)}(s) \leftarrow V^{\pi,(n)}(s) + \beta_n(s,a) \underbrace{\left[\overbrace{r+\gamma V^{\pi,(n)}(s')}^{\text{TD target}}-V^{\pi,(n)}(s)\right]}_{\text{TD error $\delta_n$}}. | |||
\end{eqnarray} | |||
</math> | |||
The term <math>\delta_n</math>, commonly called the TD error, measures the difference between the estimated value at <math>s</math> and the better estimate <math>r+\gamma V^{\pi,(n)}(s')</math>, which is often called the TD target in the literature. The TD error and TD target are two important components in analyzing the convergence of the approximate value function and arise in various forms in the reinforcement learning literature. Note that since the TD error depends on the next state and the next reward, it is not actually available until one time step later.
<proc id="alg:TD" label = "\textbf{TD(0) Method for estimating $V^{\pi}$}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>; the policy <math>\pi</math> used to sample observations; rule to set learning rate <math>\beta_n \in (0,1]</math> <math>(0\leq n \leq N-1)</math> | |||
</div> | |||
<div class="ms-2"> Initialize <math>V^{\pi,(0)}(s)</math> for all <math>s\in \mathcal{S}</math> | |||
% \FOR {each episode} | |||
</div> | |||
<div class="ms-2"> Initialize <math>s</math> | |||
</div> | |||
<div class="ms-2">'''if''' <math>n=0,\cdots,N-1</math> '''then''' | |||
<div class="ms-2"> Sample action <math>a</math> according to <math>\pi(s)</math> | |||
</div> | |||
<div class="ms-2"> Observe <math>r</math> and <math>s'</math> after taking action <math>a</math> | |||
</div> | |||
<div class="ms-2"> Update <math>V^{\pi,(n+1)}</math> according to \eqref{eq:TD_update} with <math>(s,a,r,s')</math> | |||
</div> | |||
<div class="ms-2"> <math>s\leftarrow s'</math> | |||
% \ENDFOR | |||
</div> | |||
'''end for'''</div></proc> | |||
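To make the update concrete, the following is a minimal Python sketch of [[#alg:TD |Algorithm]] in the tabular setting with a constant learning rate. The environment interface (<code>env.reset()</code> returning an initial state and <code>env.step(s, a)</code> returning a sampled reward and next state) and the <code>policy</code> callable are illustrative assumptions, not objects defined in the text.
<syntaxhighlight lang="python">
import numpy as np

def td0_evaluate(env, policy, n_states, N=10_000, gamma=0.99, beta=0.1):
    """Tabular TD(0) evaluation of a fixed policy (constant learning rate)."""
    V = np.zeros(n_states)              # V^{pi,(0)}, initialized to zero
    s = env.reset()                     # assumed interface: initial state
    for _ in range(N):
        a = policy(s)                   # sample a ~ pi(s)
        r, s_next = env.step(s, a)      # assumed: one observed transition
        td_error = r + gamma * V[s_next] - V[s]   # TD error delta_n
        V[s] += beta * td_error                   # TD(0) update
        s = s_next
    return V
</syntaxhighlight>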
====<math>Q</math>-learning Algorithm==== | |||
<math>Q</math>-learning is a version of TD learning. It is a stochastic approximation to the Bellman equation \eqref{equ: singleQ} for the <math>Q</math>-function using samples observed by the agent. At iteration <math>n</math> (<math>0 \leq n \leq N-1</math>), the <math>Q</math>-function is updated using a sample <math>(s,a,r,s^{\prime})</math> where <math>s' \sim P(s,a)</math>,
<math display="block"> | |||
\begin{eqnarray}\label{eq:Q_learning_update} | |||
Q^{(n+1)}(s,a)\leftarrow (1-\beta_n(s,a))\underbrace{Q^{(n)}(s,a)}_{\text{current estimate}}+\beta_n(s,a)\underbrace{\left[r(s,a)+\gamma\max_{a'}Q^{(n)}(s',a')\right]}_{\text{new estimate}}. | |||
\end{eqnarray} | |||
</math> | |||
Here <math>\beta_n(s,a)</math> is the learning rate, which balances the weight between the current estimate <math>Q^{(n)}</math> from iteration <math>n</math> and the new estimate <math>r(s,a)+\gamma\max_{a'}Q^{(n)}(s',a')</math> calculated using the sample <math>(s,a,r,s^{\prime})</math>. [[#alg:Q |Algorithm]] provides pseudo-code for an implementation of <math>Q</math>-learning.
<proc id="alg:Q" label = "\textbf{$Q$-learning with samples from a policy $\pi$}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>; the policy <math>\pi</math> to be evaluated; rule to set learning rate <math>\beta_n \in (0,1]</math> <math>(0 \leq n\leq N-1)</math> | |||
</div> | |||
<div class="ms-2"> Initialize <math>Q^{(0)}(s,a)</math> for all <math>s\in \mathcal{S}</math> and <math>a\in \mathcal{A}</math> | |||
% \FOR {each episode} | |||
</div> | |||
<div class="ms-2"> Initialize <math>s</math> | |||
</div> | |||
<div class="ms-2">'''if''' <math>n=0,\cdots,N-1</math> '''then''' | |||
<div class="ms-2"> Sample action <math>a</math> according to <math>\pi(s)</math> | |||
</div> | |||
<div class="ms-2"> Observe <math>r</math> and <math>s'</math> after taking action <math>a</math> | |||
</div> | |||
<div class="ms-2"> Update <math>Q^{(n+1)}</math> according to \eqref{eq:Q_learning_update} with sample <math>(s,a,r,s')</math> | |||
</div> | |||
<div class="ms-2"> <math>s\leftarrow s'</math> | |||
% \ENDFOR | |||
</div> | |||
'''end for'''</div></proc> | |||
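A corresponding Python sketch of the <math>Q</math>-learning update \eqref{eq:Q_learning_update}, under the same assumed environment interface as the TD(0) sketch above; note that the behaviour policy generating the samples can be arbitrary, reflecting the off-policy nature of the algorithm discussed below.
<syntaxhighlight lang="python">
import numpy as np

def q_learning(env, behaviour_policy, n_states, n_actions,
               N=100_000, gamma=0.99, beta=0.1):
    """Tabular Q-learning from samples of an arbitrary behaviour policy."""
    Q = np.zeros((n_states, n_actions))
    s = env.reset()                            # assumed interface
    for _ in range(N):
        a = behaviour_policy(s)                # any policy may supply samples
        r, s_next = env.step(s, a)
        target = r + gamma * Q[s_next].max()   # greedy value at s'
        Q[s, a] += beta * (target - Q[s, a])   # the Q-learning update
        s = s_next
    return Q
</syntaxhighlight>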
The following theorem guarantees the asymptotic convergence of the <math>Q</math>-learning update \eqref{eq:Q_learning_update} to the <math>Q</math>-function defined in \eqref{defsingleQ}. | |||
{{proofcard|Theorem (Watkins and Dayan (1992) <ref name="watkins1992q"/>)|Theorem-1|\label{thm:Q_conv} Assume <math>|\mathcal{A}| < \infty</math> and <math>|\mathcal{S}| < \infty</math>. Let <math>n^i(s, a)</math> be the index of the <math>i</math>-th time that the action <math>a</math> is used in state <math>s</math>, and let <math>R < \infty</math> be a constant. Assume the rewards are bounded, <math>|r|\leq R</math>, the learning rates satisfy <math>0\leq\beta_n < 1</math>, and
<math display="block">
\begin{eqnarray}\label{exploration}
\sum_{i=1}^{\infty}\beta_{n^i(s,a)} = \infty,\,\,\sum_{i=1}^{\infty}(\beta_{n^i(s,a)})^2 < \infty, \,\,\forall s,a.
\end{eqnarray}
</math>
Then <math>Q^{(n)}(s,a)\rightarrow Q^\star(s,a)</math> as <math>n\rightarrow \infty</math> for all <math>s,a</math>, with probability <math>1</math>.|}}
[[#thm:Q_conv |Theorem]] says that, under the additional assumptions of [[#thm:MDP_discounted |Theorem]], the <math>Q</math>-learning update converges to the <math>Q</math>-function. This is a classical result (one of the first theoretical results in RL) and the proof of convergence is based on the action-replay process (ARP), an artificial controlled Markov process constructed from the episode sequence and the learning rate sequence <math>\beta</math>. [[#thm:Q_conv |Theorem]] implies that <math>Q^{(n)}</math> eventually converges to the true value function <math>Q^\star</math> when condition \eqref{exploration} is satisfied; for example, the choice <math>\beta_{n^i(s,a)} = 1/i</math> satisfies \eqref{exploration}. However, this asymptotic result does not provide any insight into how many samples are needed to reach a given accuracy. More results on the sample complexity for <math>Q</math>-learning related algorithms are discussed in [[#subsubsec:value_theo |Section]].
====<span id="sec:sarsa"></span>SARSA==== | |||
In contrast to the <math>Q</math>-learning algorithm, which can take samples generated by any policy <math>\pi</math> as input (such samples may even be collected in advance of running the algorithm), SARSA adopts a policy based on the agent's current estimate of the <math>Q</math>-function. This difference in the source of samples is the key distinction between on-policy and off-policy learning, which is discussed after the SARSA algorithm.
<proc id="alg:SARSA" label = "\textbf{SARSA: On-policy TD Learning}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>, the learning rate <math>\beta\in(0,1)</math>\footnotemark[1], and small parameter <math>\varepsilon > 0</math>. | |||
</div> | |||
<div class="ms-2"> Initialize <math>Q^{(0)}(s, a)</math> for all <math>s\in \mathcal{S}</math> and <math>a\in \mathcal{A}</math> | |||
</div> | |||
<div class="ms-2"> Initialize <math>s\in \mathcal{S}</math> | |||
</div> | |||
<div class="ms-2"> Initialize <math>a</math>: Choose <math>a</math> when in <math>s</math> using a policy derived from <math>Q^{(0)}</math> (e.g, <math>\varepsilon</math>-greedy) | |||
</div> | |||
<div class="ms-2">'''if''' <math>n=0,1,\cdots,N-1</math> '''then''' | |||
<div class="ms-2"> Take action <math>a</math>, observe reward <math>r</math> and the next step <math>s'</math> | |||
</div> | |||
<div class="ms-2"> Choose <math>a'</math> when in <math>s'</math> using a policy derived from <math>Q^{(n)}</math> (e.g., <math>\varepsilon</math>-greedy) | |||
<math display="block"> | |||
\begin{equation}\label{eq:SARSA_update} | |||
Q^{(n+1)}(s,a) \leftarrow (1-\beta)\underbrace{ Q^{(n)}(s,a)}_{\text{current estimate}} +\beta\underbrace{(r+\gamma Q^{(n)}(s',a'))}_{\text{new estimate}} | |||
\end{equation} | |||
</math> | |||
</div> | |||
<div class="ms-2"> <math>s\leftarrow s'</math>, <math>a\leftarrow a'</math> | |||
</div> | |||
'''end for'''</div></proc> | |||
The policy derived from <math>Q^{(n)}</math> using <math>\varepsilon</math>-greedy (line 4 and line 8 in [[#alg:SARSA |Algorithm]]) suggests the following choice of action <math>a</math> when in state <math>s</math>: | |||
<math display="block"> | |||
\begin{eqnarray}\label{epsilon_policy} | |||
\begin{cases} | |||
\text{ select from }\arg\max_{a'} Q^{(n)}(s,a') & \text{with probability } 1-\varepsilon,\\ | |||
\text{ uniformly sample from } \mathcal{A} & \text{with probability } \varepsilon. | |||
\end{cases} | |||
\end{eqnarray} | |||
</math> | |||
In [[#alg:SARSA |Algorithm]], SARSA uses the <math>Q^{(n)}</math> function with an <math>\varepsilon</math>-greedy policy to choose <math>a'</math> and to perform exploration. In contrast, <math>Q</math>-learning uses the maximum value of <math>Q^{(n)}</math> over all possible actions at the next state, which is equivalent to setting <math>\varepsilon=0</math> when choosing <math>a'</math>.
In addition to the <math>\varepsilon</math>-greedy policy, other exploration methods such as the softmax operator (resulting in a Boltzmann policy) <ref name="asadi2017alternative"/><ref name="haarnoja2017reinforcement"/><ref name="WZZ2018"/> can also be applied. See [[#sec:ex-ex |Section]] for a more detailed discussion. | |||
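The <math>\varepsilon</math>-greedy rule \eqref{epsilon_policy} is simple to implement; the following is a minimal sketch, assuming a tabular <math>Q</math> array as in the earlier sketches (ties in the argmax are broken deterministically here for simplicity).
<syntaxhighlight lang="python">
import numpy as np

def epsilon_greedy(Q, s, eps, rng=np.random.default_rng()):
    """Sample an action from the epsilon-greedy policy derived from Q."""
    n_actions = Q.shape[1]
    if rng.random() < eps:
        return int(rng.integers(n_actions))    # explore: uniform over A
    return int(np.argmax(Q[s]))                # exploit: a greedy action
</syntaxhighlight>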
'''On-policy Learning vs. Off-policy Learning.''' An off-policy agent learns the value of the optimal policy independently of the agent's actions, whereas an on-policy agent learns the value of the policy being carried out by the agent, including the exploration steps. <math>Q</math>-learning is off-policy since the samples <math>(s,a,r,s')</math> used in the update \eqref{eq:Q_learning_update} may be collected from any policy and may be independent of the agent's current estimate of the <math>Q</math>-function. The convergence of the <math>Q</math>-function relies on the exploration scheme of the policy <math>\pi</math> (see [[#thm:Q_conv |Theorem]]). In contrast, SARSA is an on-policy algorithm and uses only its own estimate of the <math>Q</math>-function to select an action, receiving a real-time sample in each iteration. The difference is seen in how the new estimate in the update is calculated. In the update step of <math>Q</math>-learning \eqref{eq:Q_learning_update}, we have <math>\max_{a'} Q^{(n)}(s',a')</math>, the maximum value of the <math>Q</math>-function at the given next state <math>s'</math>. On the other hand, the new estimate in the SARSA update \eqref{eq:SARSA_update} takes <math>Q^{(n)}(s',a')</math> where <math>a'</math> is selected based on a policy derived from <math>Q^{(n)}</math> (via the <math>\varepsilon</math>-greedy policy \eqref{epsilon_policy}).
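The one-line difference between the two updates can be isolated in the following sketch (hypothetical helper functions, with <math>Q</math> a tabular array as above):
<syntaxhighlight lang="python">
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target of the SARSA update: bootstrap with the action a'
    actually sampled from the current (e.g., epsilon-greedy) policy."""
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma):
    """Off-policy target of the Q-learning update: bootstrap with the
    greedy action at s', whatever the behaviour policy plays next."""
    return r + gamma * Q[s_next].max()
</syntaxhighlight>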
====<span id="subsubsec:value_theo"></span>Discussion==== | |||
In this section, we provide a brief overview of sample complexity results for model-free and value-based RL with an infinite time horizon and discounted rewards. Sample complexity results for the model-based counterpart are deferred to [[#sec:discussion |Section]].
There has been very little non-asymptotic analysis of TD(0) learning. Existing results have focused on linear function approximation. For example, <ref name="dalal2018finite"/> showed that the mean-squared error converges at a rate of <math>\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon^{1/\sigma}})</math> with decaying learning rate <math>\beta_n= \frac{1}{(1+n)^{\sigma}}</math> for some <math>\sigma\in (0,1)</math>, where i.i.d. samples are drawn from a stationary distribution; <ref name="lakshminarayanan2018linear"/> provided an improved <math>\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})</math> bound for iterate-averaged TD(0) with constant step-size; and <ref name="bhandari2018finite"/> reached the same mean-squared sample complexity <math>\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})</math> with a simpler proof and extended the framework to the case where non-i.i.d. Markov samples are drawn from the online trajectory.
A similar analysis was applied by | |||
<ref name="zou2019finite"/> to SARSA and <math>Q</math>-learning algorithms for a continuous state space using linear function approximation (see the end of [[#sec:set_up |Section]]). | |||
The <math>Q</math>-sample complexity for the <math>Q</math>-learning algorithm of <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\mbox{poly}(\log \delta^{-1})\right)</math> was first established in <ref name="even2003learning"/> with decaying learning rate <math>\beta_n = \frac{1}{(1+n)^{4/5}}</math> and access to a simulator. | |||
This was followed by <ref name="beck2012error"/> which showed that the same order of <math>Q</math>-sample complexity <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\right)</math> could be achieved with a constant learning rate <math>\beta_n \equiv \frac{(1-\gamma)^4\varepsilon^2}{|\mathcal{A}||\mathcal{S}|}</math> <math>(n\leq N)</math>. Without access to a simulator, delayed <math>Q</math>-learning <ref name="strehl2009reinforcement"/> was shown to achieve a sample complexity of exploration of order <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^8\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)</math> assuming a | |||
known reward. An improvement to this result to <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^7\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)</math> has recently been obtained using an Upper-Confidence-Bound (UCB) exploration policy <ref name="dong2019q"/> and a single trajectory. | |||
<!-- \color{RedOrange}In particular, the dependence on the discount rate <math>\gamma</math> is improved from <math>\frac{1}{(1-\gamma)^8}</math> to <math>\frac{1}{(1-\gamma)^7}</math>. --> | |||
In addition, sample complexities of <math>Q</math>-learning variants with (linear) function approximation have been established in several recent papers. <ref name="yang2019sample"/> assumed a linear MDP framework (see the end of [[#sec:set_up |Section]]) and provided a V-sample complexity <math>\widetilde{\mathcal{O}}\left(\frac{d}{(1-\gamma)^3\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)</math> with <math>d</math> denoting the dimension of the features. This result matches the information-theoretic lower bound up to <math>\log(\cdot)</math> factors. Later work <ref name="lattimore2020learning"/> considered the setting where the transition kernel can be approximated by linear functions with small approximation errors, making the model more robust to model mis-specification. In this case, the authors provided a V-sample complexity of order <math>\widetilde{\mathcal{O}}\left(\frac{K}{(1-\gamma)^4\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)</math>. Although the dependence on <math>\gamma</math> is not as good as in <ref name="yang2019sample"/>, this framework can be applied to a more general class of models which are “near” linear. Although <math>Q</math>-learning is one of the most successful algorithms for finding the best action-value function <math>Q</math>, its implementation often suffers from overestimation of the <math>Q</math>-function values incurred by random sampling. The double <math>Q</math>-learning algorithm proposed in <ref name="hasselt2010double"/> tackled this overestimation issue by randomly switching the update between two <math>Q</math>-estimators, and has thus gained significant popularity in practice. The first sample complexity result for the double <math>Q</math>-learning algorithm was established in <ref name="xiong2020finite"/> with a Q-sample complexity on the order of <math>\widetilde{\mathcal{O}}\left(\left(\frac{1}{(1-\gamma)^6\varepsilon^2}\ln\,\frac{|\mathcal{A}||\mathcal{S}|}{(1-\gamma)^7\varepsilon^2\delta}\right)^{1/\omega}+\left(\frac{1}{1-\gamma}\ln\,\frac{1}{(1-\gamma)^2\varepsilon}\right)^{1/(1-\omega)}\right)</math> and learning rate <math>\beta_n = \frac{1}{n^{\omega}}</math> for some constant <math>\omega \in (0,1)</math>.
===<span id="subsec:policy_based_methods"></span>Policy-based Methods=== | |||
So far we have focused on value-based methods, where we learn the value function and then generate a policy from it using, for instance, the <math>\varepsilon</math>-greedy approach. However, these methods may suffer from high computational complexity as the dimensionality of the state or action space increases (for example, continuous or unbounded spaces). In this case, it may be more effective to parametrize the policy directly rather than the value function. Furthermore, it has been empirically observed that policy-based methods have better convergence properties than value-based methods <ref name="sewak2019policy"/><ref name="yu2020policy"/><ref name="daberius2019deep"/>. This stems from the fact that value-based methods can have large oscillations during training, since an arbitrarily small change in the estimated value function can require a dramatic change in the choice of actions.
In this section, we focus on model-free policy-based methods, where we learn a parametrized policy without inferring the value function, in the setting of an infinite time horizon with discounting, finite state and action spaces, and stationary policies. Examples of policy-based methods include REINFORCE <ref name="williams1992"/>, Actor-Critic methods <ref name="konda2000"/>, Trust Region Policy Optimization (TRPO) <ref name="schulman2015"/>, and Proximal Policy Optimization (PPO) <ref name="schulman2017proximal"/>, among others. We first parametrize the policy <math>\pi</math> by a vector <math>\theta</math> and define <math>\pi(s,\cdot;\theta)\in \mathcal{P}(\mathcal{A})</math>, the probability distribution over the action space given the state <math>s</math>, parametrized by <math>\theta</math>. We will use <math>\pi(s,a;\theta)</math> and <math>\pi_{\theta}</math> interchangeably to represent the policy parametrized by <math>\theta</math>. We write <math>\mu^{\pi_{\theta}}</math> for the stationary distribution of the state under policy <math>\pi(s,a;\theta)</math>. Then the policy objective function is defined to be
<math display="block"> | |||
J(\theta):=\int_{\mathcal{S}}V^{\pi_{\theta}}(s)\mu^{\pi_{\theta}}(ds) | |||
</math> | |||
for RL with infinite time horizon, or | |||
<math display="block"> | |||
J(\theta):=\int_{\mathcal{S}}V_0^{\pi_{\theta}}(s)\mu^{\pi_{\theta}}(ds) | |||
</math> | |||
for RL with finite time horizon. In order to maximize this function, the policy parameter <math>\theta</math> is updated using the gradient ascent rule: | |||
<math display="block"> | |||
\begin{equation}\label{eqn:grad_ascent} | |||
\theta^\prime = \theta + \beta\,\widehat{\nabla_{\theta}J(\theta)}, | |||
\end{equation} | |||
</math> | |||
where <math>\beta</math> is the learning rate, and <math>\widehat{\nabla_{\theta}J(\theta)}</math> is an estimate of the gradient of <math>J(\theta)</math> with respect to <math>\theta</math>. | |||
====Policy Gradient Theorem==== | |||
Our task now is to estimate the gradient <math>\nabla_{\theta}J(\theta)</math>. The objective function <math>J(\theta)</math> depends on both the policy and the corresponding stationary distribution <math>\mu^{\pi_{\theta}}</math>, and the parameter <math>\theta</math> affects both of them. Calculating the gradient of the policy with respect to <math>\theta</math> is straightforward given the parametrization <math>\pi(s,a;\theta)</math>, whereas it can be challenging to analyze the effect of <math>\theta</math> on the state distribution when the system is unknown. Fortunately, the well-known ''policy gradient theorem'' provides an expression for <math>\nabla_{\theta}J(\theta)</math> that does not involve the derivatives of the state distribution with respect to <math>\theta</math>. We write the <math>Q</math>-function under policy <math>\pi(s,a;\theta)</math> as <math>Q^{\pi_{\theta}}(s,a)</math>; then we have the following theorem (for more details see <ref name="suttonbarto"/>).
{{proofcard|Theorem (Policy Gradient Theorem <ref name="sutton2000policy"/>)|Theorem-2|Assume that <math>\pi(s,a;\theta)</math> is differentiable with respect to <math>\theta</math> and that there exists <math>\mu^{\pi_{\theta}}</math>, the stationary distribution of the dynamics under policy <math>\pi_{\theta}</math>, which is independent of the initial state <math>s_0</math>. Then the policy gradient is | |||
<math display="block"> | |||
\begin{equation}\label{eqn:policy_gra_thm} | |||
\nabla_{\theta}J(\theta) = \mathbb{E}_{s\sim \mu^{\pi_{\theta}},a\sim \pi_{\theta}}[\nabla_{\theta}\ln\pi(s,a;\theta)Q^{\pi_{\theta}}(s,a)]. | |||
\end{equation} | |||
</math>|}} | |||
For MDPs with finite state and action space, the stationary distribution exists under mild assumptions. For MDPs with infinite state and action space, more assumptions are needed, for example uniform geometric ergodicity <ref name="konda2002"/>. | |||
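The appearance of the score function <math>\nabla_{\theta}\ln\pi</math> comes from the elementary identity <math>\nabla_{\theta}\pi(s,a;\theta) = \pi(s,a;\theta)\nabla_{\theta}\ln\pi(s,a;\theta)</math>, which turns a sum over actions into an expectation: for a fixed state <math>s</math>,
<math display="block">
\sum_{a}\nabla_{\theta}\pi(s,a;\theta)\,Q^{\pi_{\theta}}(s,a) = \sum_{a}\pi(s,a;\theta)\,\nabla_{\theta}\ln\pi(s,a;\theta)\,Q^{\pi_{\theta}}(s,a) = \mathbb{E}_{a\sim \pi_{\theta}}\left[\nabla_{\theta}\ln\pi(s,a;\theta)\,Q^{\pi_{\theta}}(s,a)\right].
</math>
This is the step that makes the gradient estimable from samples of actions drawn from the policy itself.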
====<span id="sec:REINFORCE"></span>REINFORCE: Monte Carlo Policy Gradient==== | |||
Based on the policy gradient expression in \eqref{eqn:policy_gra_thm}, it is natural to estimate the expectation by averaging over samples of actual rewards, which is essentially the spirit of a standard ''Monte Carlo'' method that repeatedly simulates a trajectory of length <math>M</math> within each iteration. We now introduce our first policy-based algorithm, called REINFORCE <ref name="williams1992"/> or Monte Carlo policy gradient (see [[#alg:REINFORCE |Algorithm]]).
We use a simple empirical return function <math>G_t</math> to approximate the <math>Q</math>-function in \eqref{eqn:policy_gra_thm}. The return <math>G_t</math> is defined as the sum of the discounted rewards, <math>G_t:=\sum_{s=t+1}^{M} \gamma^{s-t-1}r_{s}</math>. Note that REINFORCE is a Monte Carlo algorithm since it employs complete sample trajectories: when estimating the return from time <math>t</math>, it uses all future rewards up until the end of the trajectory (see line 7 in [[#alg:REINFORCE |Algorithm]]). Within the <math>n</math>-th iteration, the policy parameter <math>\theta^{(n+1)}</math> is updated <math>M</math> times using <math>G_t</math> for <math>t=0,1,\cdots,M-1</math>.
As a standard Monte Carlo method, REINFORCE provides an unbiased estimate of the policy gradient; however, it suffers from high variance, which may lead to slow convergence. A popular variant of REINFORCE is to subtract a ''baseline'', a function of the state <math>s</math>, from the return <math>G_t</math>. This reduces the variance of the policy gradient estimate while keeping its mean unchanged. A commonly used baseline is an estimate of the value function, <math>\widehat{V}(s)</math>, which can be learnt using one of the methods introduced in [[#sec:value_based_methods |Section]]. Replacing <math>G_t</math> in \eqref{eqn:alg_REINFORCE_update} by <math>G_t-\widehat{V}(s_t)</math> gives a REINFORCE algorithm with baseline.
% <proc id="alg:REINFORCE" label = "\textbf{REINFORCE: Monte Carlo Policy Gradient}"></proc> | |||
<proc id="alg:REINFORCE" label = "\textbf{REINFORCE: Monte Carlo Policy Gradient}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>, learning rate <math>\beta</math>, a differentiable policy parametrization <math>\pi(s,a;\theta)</math>, finite length of the trajectory <math>M</math>. | |||
</div> | |||
<div class="ms-2"> Initialize policy parameter <math>\theta^{(0)}</math>. | |||
</div> | |||
<div class="ms-2">'''if''' <math>n=0,1,\ldots,N-1</math> '''then''' | |||
<div class="ms-2"> Sample a trajectory <math>s_0,a_0,r_1,\cdots,s_{T-1},a_{T-1},r_T\sim\pi(s,a;\theta^{(n)})</math> | |||
</div> | |||
<div class="ms-2"> Initialize <math>\theta^{(n+1)} = \theta^{(n)}</math> | |||
</div> | |||
<div class="ms-2">'''if''' <math>t=0,\cdots,M-1</math> '''then''' | |||
<div class="ms-2"> Calculate the return: <math>G_t=\sum_{s=t+1}^{M} \gamma^{s-t-1}r_{s}</math> | |||
</div> | |||
<div class="ms-2"> Update the policy parameter | |||
<math display="block"> | |||
\begin{equation}\label{eqn:alg_REINFORCE_update} | |||
\theta^{(n+1)} \leftarrow \theta^{(n+1)} + \beta\gamma^t G_t\nabla_{\theta}\ln\pi(s_t,a_t;\theta^{(n+1)}) | |||
\end{equation} | |||
</math> | |||
</div> | |||
'''end for'''</div>'''end for'''</div></proc> | |||
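The following Python sketch implements [[#alg:REINFORCE |Algorithm]] with a tabular softmax parametrization, for which the score function has the closed form <math>\nabla_{\theta}\ln\pi(s,a;\theta) = e_a - \pi(s,\cdot;\theta)</math> (with <math>e_a</math> the one-hot vector of <math>a</math>). The environment interface is the same assumption as in the earlier sketches, and the returns are accumulated backwards purely for efficiency.
<syntaxhighlight lang="python">
import numpy as np

def softmax_probs(theta, s):
    """pi(s, .; theta) for a tabular softmax parametrization."""
    z = np.exp(theta[s] - theta[s].max())    # subtract max for stability
    return z / z.sum()

def reinforce(env, n_states, n_actions, N=1_000, M=100,
              gamma=0.99, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))
    for _ in range(N):
        # sample one trajectory of length M under the current policy
        s = env.reset()
        states, actions, rewards = [], [], []
        for _ in range(M):
            a = int(rng.choice(n_actions, p=softmax_probs(theta, s)))
            r, s_next = env.step(s, a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # returns G_t and gradient-ascent updates, as in the algorithm
        G = 0.0
        for t in reversed(range(M)):
            G = rewards[t] + gamma * G                 # G_t
            grad_log = -softmax_probs(theta, states[t])
            grad_log[actions[t]] += 1.0                # e_{a_t} - pi(s_t, .)
            theta[states[t]] += beta * (gamma ** t) * G * grad_log
    return theta
</syntaxhighlight>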
====Actor-Critic Methods==== | |||
The fact that REINFORCE provides unbiased policy gradient estimates but may suffer from high variance is an example of the bias-variance trade-off (see, e.g., <ref name="franccois2019"/> and <ref name="von2011statistical"/>) which occurs in many machine learning problems. Also, as a Monte Carlo algorithm, REINFORCE requires complete trajectories, so we need to wait until the end of each episode to collect data. ''Actor-Critic'' methods can resolve both issues by learning the value function as well as the policy. The learned value function can improve the policy update through, for example, introducing bias into the estimation, and Actor-Critic methods can learn from incomplete experience in an online manner.
In Actor-Critic methods the value function is parametrized by <math>w</math> and the policy is parametrized by <math>\theta</math>. These methods consist of two models: a ''critic'' which updates the value function or <math>Q</math>-function parameter <math>w</math>, and an ''actor'' which updates the policy parameter <math>\theta</math> based on the information given by the critic. A natural idea is to use policy-based methods for the actor and use one of the methods introduced in [[#sec:value_based_methods |Section]] for the critic; for example SARSA or <math>Q</math>-learning. We give the pseudocode for a simple Actor-Critic method in [[#alg:actor_critic |Algorithm]]. | |||
Here, for the critic, we approximate the <math>Q</math>-function by a linear combination of features, that is, <math>Q(s,a;w)=\phi(s,a)^\top w</math> for some known feature map <math>\phi:\mathcal{S}\times \mathcal{A}\rightarrow\mathbb{R}^d</math>, and use TD(0) to update the parameter <math>w</math> by minimizing the squared TD error, whose (semi-)gradient takes the form <math>\delta \phi(s,a)</math> (see lines 9-10 in [[#alg:actor_critic |Algorithm]]). For the actor, we use the (vanilla) policy-based method to update the policy parameter <math>\theta</math> (see line 11 in [[#alg:actor_critic |Algorithm]]).
There are three main ways to execute the algorithm. In the ''nested-loop'' setting (see, e.g. <ref name="kumar2019sample"/><ref name="xu2020improving"/>), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the ''two time-scale'' setting (see, e.g. <ref name="xu2020non"/>), where the actor and the critic update their parameters simultaneously with different learning rates. Note that the learning rate for the actor is typically much smaller than that of the critic in this setting. In the third way, the ''single-scale'' setting, the actor and the critic update their parameters simultaneously but with a much larger learning rate for the actor than for the critic (see, e.g. <ref name="fu2020single"/><ref name="xu2021doubly"/>). | |||
<proc id="alg:actor_critic" label = "\textbf{Actor-Critic Algorithm}"><div class="ms-2"> '''input''': A differentiable policy parametrization <math>\pi(s,a;\theta)</math>, a differentiable <math>Q</math>-function parameterization <math>Q(s,a;w)</math>, learning rates <math>\beta^{\theta} > 0</math> and <math>\beta^{w} > 0</math>, number of iterations <math>N</math> | |||
</div> | |||
<div class="ms-2"> Initialize policy parameter <math>\theta^{(0)}</math> and <math>Q</math>-function parameter <math>w^{(0)}</math> | |||
</div> | |||
<div class="ms-2"> Initialize <math>s</math> | |||
%\STATE <math>I\leftarrow 1</math> | |||
</div> | |||
<div class="ms-2">'''if''' <math>n=0,1,\ldots,N-1</math> '''then''' | |||
<div class="ms-2"> Sample <math>a\sim\pi(s,a;\theta^{(n)})</math> | |||
</div> | |||
<div class="ms-2"> Take action <math>a</math>, observe state <math>s^\prime</math> and reward <math>r</math> | |||
</div> | |||
<div class="ms-2"> Sample action <math>a^\prime\sim\pi(s^\prime,\cdot;\theta^{(n)})</math> | |||
</div> | |||
<div class="ms-2"> {Compute the TD error:} <math>\delta\leftarrow r+\gamma Q(s^\prime,a^\prime;w^{(n)})-Q(s,a;w^{(n)})</math> | |||
</div> | |||
<div class="ms-2"> {Update <math>w</math>:} <math>w^{(n+1)}\leftarrow w^{(n)}+\beta^{w}\delta\phi(s,a)</math> | |||
</div> | |||
<div class="ms-2"> <math>\theta^{(n+1)}\leftarrow \theta^{(n)} + \beta^{\theta}Q(s,a;w^{(n)})\nabla_{\theta}\ln\pi(s,a;\theta^{(n)})</math> | |||
</div> | |||
<div class="ms-2"> <math>s\leftarrow s^\prime</math>, <math>a\leftarrow a^\prime</math> | |||
</div> | |||
'''end for'''</div></proc> | |||
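A compact sketch of [[#alg:actor_critic |Algorithm]] with a tabular softmax actor and a linear critic using one-hot features, so that <math>Q(s,a;w)=\phi(s,a)^\top w</math> reduces to an entry of a table; the environment interface is again an illustrative assumption.
<syntaxhighlight lang="python">
import numpy as np

def actor_critic(env, n_states, n_actions, N=50_000,
                 gamma=0.99, beta_w=0.05, beta_theta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))   # actor parameters
    w = np.zeros((n_states, n_actions))       # critic: Q(s,a;w) = w[s,a]

    def probs(s):                             # pi(s, .; theta), softmax
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    s = env.reset()
    a = int(rng.choice(n_actions, p=probs(s)))
    for _ in range(N):
        r, s_next = env.step(s, a)
        a_next = int(rng.choice(n_actions, p=probs(s_next)))
        q_sa = w[s, a]                                 # Q(s,a;w^{(n)})
        delta = r + gamma * w[s_next, a_next] - q_sa   # TD error (line 9)
        w[s, a] += beta_w * delta                      # critic update (line 10)
        grad_log = -probs(s)
        grad_log[a] += 1.0                             # e_a - pi(s, .)
        theta[s] += beta_theta * q_sa * grad_log       # actor update (line 11)
        s, a = s_next, a_next
    return theta, w
</syntaxhighlight>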
====<span id="sec:value_discussion"></span>Discussion==== | |||
'''Variants of Policy-based Methods.''' There have been many variants of policy-based methods with improvements in various directions. For example, the ''natural'' policy-based algorithms <ref name="kakade2001natural"/><ref name="peters2008natural"/> modify the search direction of the vanilla version by incorporating a ''Fisher information matrix'' in the gradient ascent update, which speeds up convergence significantly. To enhance the stability of learning, we may require the updated policy estimate to stay close to the previous estimate. Along this line, TRPO <ref name="schulman2015"/> uses the ''Kullback-Leibler (KL) divergence'' between consecutive updates as a constraint, while PPO <ref name="schulman2017proximal"/> incorporates an ''adaptive KL-penalty'' or a ''clipped objective'' in the objective function to eliminate the constraint. When using the KL-penalty, PPO can be viewed as a Lagrangian relaxation of the TRPO algorithm. To reduce the sample complexity, Deterministic Policy Gradient (DPG) <ref name="silver2014deterministic"/> uses a deterministic function (rather than the stochastic policy <math>\pi</math>) to represent the policy in order to avoid sampling over actions (although sufficient exploration needs to be guaranteed in other ways); Actor-Critic with Experience Replay (ACER) <ref name="wang2016"/> reuses experience (tuples of data collected in previous iterations/episodes, see the discussion of Deep <math>Q</math>-Networks in [[guide:576dcdd2b6#subsec:deep_value_based |Section]]), which improves sample efficiency and decreases data correlation. In addition, we mention Asynchronous Advantage Actor-Critic (A3C) and Advantage Actor-Critic (A2C), two popular Actor-Critic methods with a special focus on parallel training <ref name="mnih2016asynchronous"/>; the latter is a synchronous, deterministic version of the former. For the continuous-time framework, see the developments in <ref name="jia2021policy"/><ref name="jia2022policy"/>.
'''Convergence Guarantees.''' | |||
Now we discuss the convergence guarantees of policy-based algorithms. <ref name="sutton2000policy"/> provided the first asymptotic convergence result for policy gradient methods with arbitrary differentiable function approximation for MDPs with bounded rewards. They showed that such algorithms, including REINFORCE and Actor-Critic methods, converge to a locally optimal policy.
% <ref name="konda2003onactor"/> provides the asymptotic convergence of the Actor-Critic algorithm in which the critic uses TD learning with a linearly function approximation, and the actor is updated in an approximate gradient direction. They observe that the actor parametrization and the critic parametrization should not be chosen independently. | |||
For more examples of asymptotic convergence, see for example <ref name="konda2000"/> and <ref name="bhatnagar2010actor"/>. | |||
Based on recent progress in non-convex optimization, non-asymptotic analysis of policy-based methods was first established for convergence to a stationary point. For example, <ref name="kumar2019sample"/> provided a convergence rate analysis for a nested-loop Actor-Critic algorithm to a stationary point by quantifying the smallest number of actor updates <math>k</math> required to attain <math>\inf_{0\leq m\leq k}\|\nabla J(\theta^{(m)})\|^2 < \varepsilon</math>. We denote this smallest number by <math>K</math>. When the actor uses a policy gradient method, the critic achieves <math>K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^4)</math> by employing TD(0), <math>K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{3})</math> by employing Gradient Temporal Difference, and <math>K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{5/2})</math> by employing Accelerated Gradient Temporal Difference, with continuous state and action spaces. <ref name="zhang2020global"/> investigated a policy gradient method with Monte Carlo estimation of the policy gradient, called the RPG algorithm, where the rollout length of trajectories is drawn from a Geometric distribution. This generates unbiased estimates of the policy gradient with bounded variance, which was shown to converge to a first-order stationary point at a sublinear rate with diminishing or constant learning rate. They also proposed a periodically enlarged learning rate scheme and proved that the modified RPG algorithm with this scheme converges to a second-order stationary point in a polynomial number of steps.
% <ref name="papini2018stochastic"/> proposed a Stochastic Variance-Reduced Policy Gradient (SVRPG) algorithm which uses either REINFORCE or the Gradient of a Partially Observable Markov Decision Process (GPOMDP) <ref name="baxter2001infinite"/> to estimate the policy gradient. | |||
% The key idea is to use a so-called ''semi-stochastic gradient'' method that involves a reference (stochastic gradient) iterate which is stored in earlier iterations. | |||
% They prove that the SVRPG algorithm converges to an <math>\varepsilon</math>-optimal stationary point, i.e. <math>\mathbb{E}[\|\nabla J(\theta)\|_2^2]\leq \varepsilon</math> after <math>\widetilde{O}^{\prime}(1/\varepsilon^2)</math> iterations. This convergence rate is then improved in <ref name="xu2020improved"/> to <math>\widetilde{O}^{\prime}(1/\varepsilon^{5/3})</math> and in <ref name="xu2019sample"/> to <math>\widetilde{O}^{\prime}(1/\varepsilon^{3/2})</math>. | |||
For more examples of sample complexity analysis for convergence to a stationary point, see, for example, <ref name="papini2018stochastic"/><ref name="xu2020improved"/><ref name="xu2019sample"/><ref name="shen2019hessian"/><ref name="xiong2020non"/>. The global optimality of stationary points was studied in <ref name="bhandari2019"/>, where the authors identified certain situations under which the policy gradient objective function has no sub-optimal stationary points despite being non-convex.
Recently there have been results on the sample complexity analysis of the convergence to the global optimum. | |||
% For example, <ref name="Zhang2021REINFORCE"/> proves a sublinear high probability regret bound for REINFORCE with a constant learning rate for MDPs with entropy regularization. {\color{red}[I doubt if <ref name="Zhang2021REINFORCE"/> provides the first theoretical guarentee for REINFORCE. Maybe it's better to cite in chronological order. Plus we mention in the beginning there is no regret notion for infinite horizon discounted setting.]} <ref name="agarwal2021theory"/> provides the convergence analysis of three methods under a discounted MDP and compares their convergence rates under fixed learning rates: (1) projected gradient ascent with mean-squared sample complexity <math>\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^6\varepsilon^2})</math>, (2) asymptotic convergence of gradient ascent with softmax policy parameterization, and (3) natural policy gradient ascent with softmax parametrization with mean-squared sample complexity <math>\widetilde{\mathcal{O}}(\frac{2}{(1-\gamma)^2\varepsilon})</math>. {\color{red}[please double check, <math>\leftarrow</math>this paper seems under known parameter setting]} | |||
% <ref name="mei2020"/> strengthens the asymptotic convergence of the softmax policy gradient method in (2) of <ref name="agarwal2021theory"/> to a mean-squared sample complexity <math>\widetilde{\mathcal{O}}(1/\varepsilon)</math> for {\color{ForestGreen}tabular} {\color{blue} what does this mean?} MDPs {\color{ForestGreen}introduced in [[#sec:set_up |Section]]} and a linear convergence rate for entropy-regularized MDPs; <ref name="mei2020"/> also analyzes why entropy regularization speeds the convergence of policy gradient methods from an optimization perspective. | |||
<ref name="cen2020fast"/> proved that the entropy regularized natural policy gradient method achieves a <math>Q</math>-sample complexity (and <math>\varepsilon</math>-optimal policies) with linear convergence rate. <ref name="shani2020adaptive"/> showed that with high probability at least <math>1-\delta</math>, the TRPO algorithm has sublinear rate <math> \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^2)</math> with mean-squared sample complexity <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^3\varepsilon^4}\big)</math> in the unregularized (tabular) MDPs, and has an improved rate <math>\widetilde{\mathcal{O}}^{\prime}(1/\varepsilon)</math> with mean-squared sample complexity <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^4\varepsilon^3}\big)</math> in MDPs with entropy regularization. | |||
% See also <ref name="xu2020improving"/> for the sample complexity for nested-loop (natural) Actor-Critic methods. | |||
<ref name="xu2020non"/> gave the first non-asymptotic convergence guarantee for the two time-scale natural Actor-Critic algorithms with mean-squared sample complexity of order <math>\widetilde{\mathcal{O}}(\frac{1}{(1-\gamma)^9\varepsilon^{4}})</math>. For single-scale Actor-Critic methods, the global convergence with sublinear convergence rate was established in both <ref name="fu2020single"/> and <ref name="xu2021doubly"/>. The non-asymptotic convergence of policy-based algorithms is shown in other settings, see <ref name="zhang2020sample"/> for the regret analysis of the REINFORCE algorithm for discounted MDPs and <ref name="agarwal2021theory"/><ref name="mei2020"/> for policy gradient methods in the setting of known model parameters. | |||
===<span id="sec:discussion"></span>General Discussion=== | |||
So far we have focused on model-free RL with discounting over an infinite time horizon. In this section, we discuss two other cases: RL over a finite time horizon with both model-based and model-free methods, and model-based RL with an infinite time horizon. | |||
====RL with Finite Time Horizon==== | |||
As discussed earlier, episodic RL with finite time horizon has also been widely used in many financial applications. In this section, we discuss some episodic RL algorithms (with both model-based and model-free methods) and their performance through both sample complexity and regret analysis. | |||
<ref name="PSRL2013"/> proposed a model-based algorithm, known as Posterior Sampling for Reinforcement Learning (PSRL) which is a model-based algorithm, and establishes an <math>\widetilde{\mathcal{O}}(T|\mathcal{S}|\sqrt{|\mathcal{A}|M})</math> expected regret bound. <ref name="DannBrunskill2015"/> proposed a so-called Upper-Confidence Fixed-Horizon (UCFH) RL algorithm that is model-based and has <math>V</math>-sample complexity{{efn|In the original papers the sample complexity is defined in terms of the number of episodes. Here we multiply the original order in these papers by <math>T</math> to match the definition of sample complexity in this paper.}} of order <math>\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^3}{\varepsilon^2})</math> assuming known rewards. | |||
<ref name="azar2017minimax"/> considered two model-based algorithms in the setting of known reward, called the UCBVI-CH and UCBVI-BF algorithms, which were proved to achieve a regret bound of <math>\widetilde{\mathcal{O}}(T\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})</math> (when <math>M\geq T|\mathcal{S}|^3|\mathcal{A}|</math> and <math>|\mathcal{S}||\mathcal{A}|\geq T</math>) and <math>\widetilde{\mathcal{O}}(\sqrt{T|\mathcal{S}|\,|\mathcal{A}|M})</math> (when <math>M\geq T^3|\mathcal{S}|^3|\mathcal{A}|</math> and <math>|\mathcal{S}||\mathcal{A}|\geq T</math>), respectively. | |||
<ref name="DannUBEV"/> proposed an Upper Bounding the Expected Next State Value (UBEV) algorithm which achieves a sample complexity\footnotemark[2] of <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^5}{\varepsilon^2}{\rm polylog}(\frac{1}{\delta})\big)</math>. | |||
They also proved that the algorithm | |||
has a regret bound of <math>\widetilde{\mathcal{O}}(T^2\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})</math> with <math>M\ge |\mathcal{S}|^5|\mathcal{A}|^3</math>. <ref name="jin2018q"/> proved that <math>Q</math>-learning with UCB exploration achieves regret of <math>\widetilde{\mathcal{O}}(\sqrt{T^3|\mathcal{S}|\,|\mathcal{A}|M})</math> without requiring access to a simulator. The above regret bounds depend on the size of the state and action space and thus may suffer from the curse of dimensionality as <math>|\mathcal{S}|</math> and <math>|\mathcal{A}|</math> increase. To tackle this issue, <ref name="yang2020reinforcement"/> learned a low-dimensional representation of the probability transition model and proved that their MatrixRL algorithm achieves a regret bound of <math>\widetilde{\mathcal{O}}(T^2\,d\sqrt{M})</math> which depends on the number of features <math>d</math> rather than <math>|\mathcal{S}|</math> and <math>|\mathcal{A}|</math>. | |||
<ref name="jin2020provably"/> provided a <math>\widetilde{\mathcal{O}}(\sqrt{d^3T^3M})</math> regret bound with high probability for a modified version of the Least-Squares Value Iteration (LSVI) algorithm with UCB exploration. | |||
====Model-based RL with Infinite Time Horizon==== | |||
To derive policies by utilizing estimated transition probabilities and reward functions in the infinite time horizon setting, <ref name="brafman2002r"/> proposed the R-MAX algorithm, which can also be applied in zero-sum stochastic games. The R-MAX algorithm was proved to achieve a sample complexity of exploration of order <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|}{(1-\gamma)^6\varepsilon^3}\big)</math> by <ref name="li2009rmax"/>. The Model-based Interval Estimation (MBIE) algorithm in <ref name="Strehl2005"/> was proved to achieve a sample complexity of exploration of the same order as R-MAX. <ref name="MoRmax2010"/> further modified the R-MAX algorithm to obtain the MoRmax algorithm with an improved sample complexity of exploration of order <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^6\varepsilon^2}\big)</math>.
<ref name="azar2013minimax"/> proved that the model-based <math>Q</math>-Value Iteration (QVI) achieves the <math>Q</math>-sample complexity of <math>\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^3\varepsilon^2)})</math> for some small <math>\varepsilon</math> | |||
% <math>\varepsilon\in[0,\frac{1}{\sqrt{1-\gamma}|\mathcal{S}|}]</math> | |||
with high probability at least <math>1-\delta</math>. <ref name="agarwal2020model"/> studied the approximate MDP model with any accurate black-box planning algorithm and showed that this has the same sample complexity as <ref name="azar2013minimax"/>, but with a larger range of <math>\varepsilon</math>. | |||
% <math>\varepsilon\in[0,\frac{1}{\sqrt{1-\gamma}}]</math>. | |||
See also <ref name="strehl2009reinforcement"/><ref name="kearns2002near"/><ref name="lattimore2012pac"/> for model-based RL along this line. | |||
''Linear-quadratic'' (LQ) problems are another type of model-based RL problem that assumes linear state dynamics and quadratic objective functions with continuous state and action space. These problems are one of the few optimal control problems for which there exists a closed-form analytical representation of the optimal feedback control. Thus LQ frameworks, including the (single-agent) ''linear-quadratic regulator'' (LQR) and (multi-agent) LQ games, can be used in a wide variety of applications, such as the optimal execution problems which we will discuss later. For a theoretical analysis of RL algorithms within the LQ framework, see for example <ref name="fazel2018global"/> and <ref name="hambly2020policy"/> for the global linear convergence of policy gradient methods in LQR problems with infinite and finite horizon, respectively. See also <ref name="basei2021logarithmic"/> and <ref name="guo2021reinforcement"/> for the regret analysis of linear-quadratic reinforcement learning algorithms in continuous time. | |||
====<span id="sec:ex-ex"></span>Exploration Exploitation Trade-offs==== | |||
As mentioned earlier, exploitation versus exploration is a critical topic in RL. RL agents want to find the optimal solution as fast as possible; meanwhile, committing to a solution too quickly without enough exploration can lead to a local minimum or total failure. Modern RL algorithms that optimize for the best returns can achieve good exploitation quite efficiently, while efficient exploration remains a challenging and less understood topic.
For multi-armed bandit problems or MDPs with finite state and action spaces, classic exploration algorithms such as <math>\varepsilon</math>-greedy, UCB, Boltzmann exploration and Thompson sampling show promising performance in different settings.
For the <math>\varepsilon</math>-greedy algorithm <ref name="dann2022guarantees"/><ref name="dabney2020temporally"/>, the agent explores randomly with probability <math>\varepsilon</math> and takes the greedy action with probability <math>1-\varepsilon</math>; see Equation \eqref{epsilon_policy} for the <math>\varepsilon</math>-greedy method combined with a value-based algorithm. UCB-type algorithms are mainly designed for value-based methods <ref name="dong2019q"/><ref name="azar2017minimax"/><ref name="jin2018q"/><ref name="jin2020provably"/>: the agent selects the greediest action to maximize the upper confidence bound <math>Q(s,a) + U(s,a)</math>, where <math>Q(s,a)</math> is the <math>Q</math>-value function and <math>U(s,a)</math> is a bonus inversely related to how many times the state-action pair <math>(s,a)</math> has been taken (and hence increasing with the agent's uncertainty about her estimate of the <math>Q</math>-function). For Boltzmann exploration <ref name="asadi2017alternative"/><ref name="haarnoja2017reinforcement"/><ref name="WZZ2018"/>, the agent draws actions from a Boltzmann distribution (softmax) over the learned <math>Q</math>-value function, regulated by a temperature parameter. For the Thompson sampling method <ref name="thompson1933likelihood"/><ref name="ouyang2017learning"/><ref name="gopalan2015thompson"/>, the agent keeps track of a ''belief about the optimal action'' via a distribution over the set of admissible actions (for each given state) and samples from this distribution.
For deep RL training, where neural networks are used for function approximation, entropy regularization (adding an entropy term to the loss function) and noise-based exploration (adding noise to the observation, action or even parameter space) are often used for better exploration of the parameter space.
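As an illustration of Boltzmann exploration, the following sketch draws an action from a softmax over the learned <math>Q</math>-values with temperature <math>\tau > 0</math> (large <math>\tau</math> approaches uniform exploration, small <math>\tau</math> approaches the greedy choice); the tabular <math>Q</math> array is an assumption carried over from the earlier sketches.
<syntaxhighlight lang="python">
import numpy as np

def boltzmann_action(Q, s, tau, rng=np.random.default_rng()):
    """Sample a ~ softmax(Q(s,.)/tau), tau > 0 being the temperature."""
    logits = Q[s] / tau
    p = np.exp(logits - logits.max())   # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
</syntaxhighlight>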
====<span id="subsec:regularization"></span>Regularization in Reinforcement Learning==== | |||
Many recent developments in RL benefit from regularization, which has been shown to improve not only exploration but also robustness. Examples include both policy-based and value-based methods in various settings. Here we provide a general overview of different regularization techniques and their advantages along with several examples. For example, TRPO and PPO use the KL-divergence between two consecutive policies as a penalty in the policy improvement step <ref name="schulman2015"/><ref name="schulman2017proximal"/>. Soft <math>Q</math>-learning uses ''Shannon entropy'' as a penalty in value iteration <ref name="haarnoja2017reinforcement"/>; the authors confirmed in simulated experiments that entropy regularization helps to improve exploration and allows transferring skills between tasks. <ref name="geist2019theory"/> proposed a unified framework to analyze the above algorithms via a regularized Bellman operator.
<ref name="cen2020fast"/> showed that entropy regularization can also help to speed up the convergence rate from a theoretical perspective. | |||
It is worth noting that <ref name="WZZ2018"/> explained the exploitation-exploration trade-off with entropy regularization from a continuous-time stochastic control perspective and provided theoretical support for Gaussian exploration from the special case of linear-quadratic regulator. Recently <ref name="gao2020state"/> and <ref name="tang2021exploratory"/> extended this idea to a general control framework beyond the linear-quadratic case. <ref name="grau2018soft"/> extended the idea of Shannon's entropy regularization to mutual-information regularization and showed its superior performance when actions have significantly different importance. | |||
When function approximation is applied in RL training, regularization has been shown to be a powerful, flexible, and principled way to balance approximation and estimation errors <ref name="massoud2009regularized"/><ref name="farahmand2008regularized"/>. The idea is that one starts from a large function space and controls the complexity of the solution by a regularization (or penalization) term. Examples of regularization include the Hilbert norm (for <math>Q</math>-functions) and the <math>l_1</math> norm (for parameters).
%See also <ref name="liu2019neural"/> and <ref name="shani2020adaptive"/> for convergence analysis of regularized policy iteration. | |||
====Reinforcement Learning in Non-stationary Environment==== | |||
Most existing work on reinforcement learning considers a stationary environment and aims to find the optimal policy or a policy with low (static) regret. In many financial applications, however, the environment is far from being stationary. We introduce | |||
the mathematical formulations of non-stationary RL (in both episodic and infinite horizon settings) below. | |||
'''Non-stationary Episodic RL.''' For the episodic (or finite-time horizon) setting, an agent interacts with a non-stationary MDP | |||
for <math>K</math> episodes, with each episode containing <math>T</math> timestamps. Let <math>(t, k)</math> denote a (time,episode) | |||
index for the <math>t</math>-th timestamp in the <math>k</math>-th episode. The non-stationary environment can be denoted by a tuple | |||
<math>(\mathcal{S},\mathcal{A}, T, \pmb{P}, \pmb{r})</math>, where <math>\mathcal{S}</math> is the state space, <math>\mathcal{A}</math> is the action space, <math>T</math> is the number of timestamps in one episode, <math>\pmb{P} = \{P_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T}</math> | |||
is the set of transition kernels with <math>P_{k,t}: \mathcal{S}\times \mathcal{A}\rightarrow\mathcal{P}(\mathcal{S})</math> for all <math>1\leq k\leq K</math> and <math>0\leq t\leq T</math>, | |||
and <math>\pmb{r} = \{r_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T}</math> is the set of reward functions with <math>r_{k,t}(s,a)</math> a random variable in <math>\mathbb{R}</math> (depending on both the time-step and the episode number) for each pair <math>(s,a)\in \mathcal{S}\times \mathcal{A}</math> and for <math>0\leq t\leq T-1</math>. Similarly, the terminal reward <math>r_{k,T}(s)</math> is a real-valued random variable for each <math>s\in\mathcal{S}</math>. Denote by <math>\Pi=\{\pi_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T-1}</math> the Markovian policy in which <math>\pi_{k,t}</math> can be either deterministic or randomized. The value function for this policy at the <math>k</math>-th episode can be written as | |||
<math display="block">
V^{\Pi}_{k,t}(s) = \mathbb{E}^{\Pi}\left[\left.\sum_{u=t}^{T-1} r_{k,u}(s_u,a_u)+r_{k,T}(s_T)\,\right|\,s_t = s\right],
</math>
where <math>s_{u+1} \sim P_{k,u}(s_u,a_u)</math> and <math>a_u\sim \pi_{k,u}(s_u)</math>. | |||
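As a concrete illustration of the episodic value function above, the following minimal Python sketch (array shapes and names are our own illustrative choices; finite state and action spaces are assumed) evaluates <math>V^{\Pi}_{k,t}</math> by backward induction for a given episode <math>k</math>:
<syntaxhighlight lang="python">
import numpy as np

def evaluate_policy_episode(P, r, r_T, pi, k):
    """Backward induction for V^Pi_{k,t} in a non-stationary episodic MDP.

    P[k][t]  : array (S, A, S), transition kernel P_{k,t}
    r[k][t]  : array (S, A), expected reward E[r_{k,t}(s, a)]
    r_T[k]   : array (S,), expected terminal reward E[r_{k,T}(s)]
    pi[k][t] : array (S, A), randomized policy pi_{k,t}(s, a)
    Returns V of shape (T+1, S) with V[t][s] = V^Pi_{k,t}(s).
    """
    T = len(P[k])
    S = r_T[k].shape[0]
    V = np.zeros((T + 1, S))
    V[T] = r_T[k]
    for t in reversed(range(T)):
        # Q_{k,t}(s,a) = E[r_{k,t}(s,a)] + E_{s'~P_{k,t}(s,a)}[V_{k,t+1}(s')]
        Q = r[k][t] + P[k][t] @ V[t + 1]
        V[t] = (pi[k][t] * Q).sum(axis=1)   # average over the policy
    return V
</syntaxhighlight>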
'''Non-stationary RL with Infinite Time Horizon.''' | |||
For the infinite horizon setting, we drop the index indicating episodes and let the finite horizon <math>T</math> become infinite. Then the non-stationary environment can be denoted by a tuple | |||
<math>(\mathcal{S},\mathcal{A}, \pmb{P}, \pmb{r})</math>, with the obvious changes from the episodic case. Again we write <math>\Pi=\{\pi_{t} \}_{t \ge 0}</math> for the Markovian policy in which <math>\pi_{t}</math> can be either deterministic or randomized and consider the associated value function with a discount factor <math>\gamma</math>: | |||
<math display="block">
V^{\Pi}_{t}(s) = \mathbb{E}^{\Pi}\left[\left.\sum_{u=t}^\infty \gamma^{u-t} r_{u}(s_u,a_u)\,\right|\,s_t = s\right],
</math>
where <math>s_{u+1} \sim P_{u}(\cdot|s_u,a_u)</math> and <math>a_u\sim \pi_{u}(s_u)</math>.
Indeed, there has been a recent surge of theoretical studies on this topic in various settings, such as infinite-horizon MDPs with finite state and action spaces <ref name="gajane2018sliding"/><ref name="cheung2019learning"/>, episodic MDPs with finite state and action spaces <ref name="mao2020model"/>, and episodic linear MDPs <ref name="touati2020efficient"/>.
However, all these papers assume that the agent has prior knowledge of the degree of non-stationarity, such as the number of changes in the environment, which is impractical in many applications.
To relax this assumption, <ref name="cheung2020reinforcement"/> proposes a Bandit-over-Reinforcement-Learning (BoRL) algorithm that extends their earlier result <ref name="cheung2019learning"/>. However, it introduces extra overhead and leads to suboptimal regret.
More recently, <ref name="wei2021non"/> proposes a general approach that is applicable to various reinforcement learning settings | |||
(including both episodic MDPs and infinite-horizon MDPs) and achieves optimal dynamic regret without any prior knowledge of the degree of non-stationarity. | |||
==General references== | |||
{{cite arXiv|last1=Hambly|first1=Ben|last2=Xu|first2=Renyuan|last3=Yang|first3=Huining|year=2023|title=Recent Advances in Reinforcement Learning in Finance|eprint=2112.04553|class=q-fin.MF}} | |||
==Notes== | |||
{{notelist}} | |||
==References== | |||
{{reflist}} |
The other of the two historical threads of RL concerns the problem of optimal control for an evolving system and its solution using value functions and dynamic programming. Although this thread did not involve learning directly, the Bellman equation developed from this line of literature is viewed as the foundation of many important modern RL algorithms such as [math]Q[/math]-learning and Actor-Critic. Over the last few years, RL, grounded in the combination of classical theoretical results with deep learning and the functional approximation paradigm, has proved to be a fruitful approach to many artificial intelligence tasks from diverse domains. Breakthrough achievements include reaching human-level performance in complex tasks such as Go [1] and multi-player StarCraft II [2], as well as the development of ChatGPT [3]. The generality of the reinforcement learning framework allows its application in both discrete and continuous spaces to solve tasks in both real and simulated environments [4]. Classical RL research during the last third of the previous century developed an extensive theoretical core on which modern algorithms are based. Several algorithms, such as Temporal-Difference Learning and [math]Q[/math]-learning, were developed and are able to solve small-scale problems when either the states of the environment can be enumerated (and stored in memory) or the optimal policy lies in the space of linear or quadratic functions of the state variable. Although these restrictions are extremely limiting, these foundations of classical RL theory underlie the modern approaches, which benefit from increased computational power. Combining this framework with deep learning [5] was popularized by the Deep [math]Q[/math]-learning algorithm, introduced in [6], which was able to play any of 57 Atari console games without tweaking the network architecture or algorithm hyperparameters. This novel approach was extensively researched and significantly improved over the following years [7][8][9].
Our aim in this section is to introduce the foundations of reinforcement learning. We start with the setup for MDPs, with both an infinite time horizon and a finite time horizon, as there are financial applications of both settings in the literature. We then introduce the learning procedure for the case where the reward and transition dynamics of the MDP are unknown to the decision-maker. In particular, we introduce various criteria to measure the performance of learning algorithms in different settings. We then focus on two major classes of model-free reinforcement learning algorithms, value-based methods and policy-based methods, with an infinite time horizon. Finally, we give a general discussion of (model-based and model-free) RL with a finite time horizon, model-based RL with an infinite time horizon, and regularization methods for RL.
==Setup: Markov Decision Processes==
We first introduce MDPs with infinite time horizon. Many portfolio optimization problems, such as those arising from pension funds, investigate long-term investment strategies and hence it is natural to formulate them as a decision-making problem over an infinite time horizon. We then introduce MDPs with finite time horizon. Problems such as the optimal execution of the purchase or liquidation of a quantity of asset usually involve trading over a short time horizon and there may be a penalty at the terminal time if the targeted amount is not completely executed by then. Finally we discuss different policy classes that are commonly adopted by the decision-maker.
'''Infinite Time Horizon and Discounted Reward.''' Let us start with a discrete time MDP with an infinite time horizon and discounted reward. We have a Markov process taking values in a state space [math]\Sc[/math], where we can influence the evolution by taking an action from a set [math]\Ac[/math]. The aim is to optimize the expected discounted return from the system by choosing a policy, that is, a sequence of actions through time. We formulate this mathematically by defining the value function [math]V^\star[/math] for each [math]s\in \Sc[/math] to be

[math]V^{\star}(s) = \sup_{\Pi}\, \E^{\Pi}\left[\left.\sum_{t=0}^{\infty} \gamma^t\, r(s_t,a_t)\,\right|\,s_0 = s\right],[/math]

subject to

[math]s_{t+1} \sim P(s_t,a_t), \qquad a_t \sim \pi_t(s_t), \qquad t=0,1,2,\ldots.[/math]
Here and throughout the paper, [math]\E^{\Pi}[/math] denotes the expectation under the policy [math]\Pi[/math], where the probability measure [math]\mathbb{P}[/math] describes both the dynamics and the rewards in the MDP. We will write [math]\mathcal{P}(\mathcal{X})[/math] for the probability measures over a space [math]\mathcal{X}[/math]. The state space [math](\Sc, d_{\Sc})[/math] and the action space [math](\Ac, d_{\Ac})[/math] are both complete separable metric spaces, which includes the case of [math]\Sc[/math] and [math]\Ac[/math] being finite, as often seen in the RL literature; [math]\gamma \in (0, 1)[/math] is a discount factor; [math]s_t \in \Sc[/math] and [math]a_t \in \Ac[/math] are the state and the action at time [math]t[/math]; [math]P: \Sc \times \Ac \to \Pc(\Sc)[/math] is the transition function of the underlying Markov process; the notation [math]s_{t + 1} \sim P(s_t, a_t)[/math] denotes that [math]s_{t + 1}[/math] is sampled from the distribution [math]P(s_t, a_t)\in \mathcal{P}(\mathcal{S})[/math]; the reward [math]r(s, a)[/math] is a random variable in [math]\R[/math] for each pair [math](s, a)\in \Sc \times \Ac[/math]; and the policy [math]\Pi =\{\pi_t\}_{t=0}^{\infty}[/math] is Markovian, in that it depends only on the current state, and can be either deterministic or randomized. For a deterministic policy, [math]\pi_t: \Sc \to \Ac[/math] maps the current state [math]s_t[/math] to a deterministic action. On the other hand, a randomized policy [math]\pi_t: \Sc\to \Pc(\Ac)[/math] maps the current state [math]s_t[/math] to a distribution over the action space. For a randomized policy [math]\pi_t[/math], we will use [math]\pi_t(s)\in \mathcal{P}(\mathcal{A})[/math] and [math]\pi_t(s,a)[/math] to represent, respectively, the distribution over the action space given state [math]s[/math] and the probability of taking action [math]a[/math] at state [math]s[/math]. In this infinite-time-horizon setting, we assume the reward [math]r[/math] and the transition dynamics [math]P[/math] are time homogeneous, a standard assumption in the MDP literature [10]. We also note that the setup is essentially the same if we consider the problem of minimizing costs rather than maximizing a reward. The well-known Dynamic Programming Principle (DPP), that the optimal policy can be obtained by maximizing the reward from one step and then proceeding optimally from the new state, can be used to derive the following Bellman equation for the value function above:

[math]V^{\star}(s) = \sup_{a\in\Ac}\left\{ \E[r(s,a)] + \gamma\, \E_{s^{\prime}\sim P(s,a)}\left[V^{\star}(s^{\prime})\right]\right\}.[/math]
We write the value function as

[math]V^{\star}(s) = \sup_{a\in\Ac} Q^{\star}(s,a),[/math]
where the [math]Q[/math]-function, one of the basic quantities used for RL, is defined to be

[math]Q^{\star}(s,a) = \E[r(s,a)] + \gamma\, \E_{s^{\prime}\sim P(s,a)}\left[V^{\star}(s^{\prime})\right],[/math]
the expected reward from taking action [math]a[/math] at state [math]s[/math] and then following the optimal policy thereafter. There is also a Bellman equation for the [math]Q[/math]-function given by

[math]Q^{\star}(s,a) = \E[r(s,a)] + \gamma\, \E_{s^{\prime}\sim P(s,a)}\left[\max_{a^{\prime}\in\Ac} Q^{\star}(s^{\prime},a^{\prime})\right].[/math]
This allows us to retrieve the optimal (stationary) policy (if it exists) from [math]Q^{\star}(s,a)[/math], in that [math]\pi^\star(s) \in \arg\max_{a \in \Ac} Q^{\star}(s, a)[/math]. An alternative setting over an infinite time horizon is the case of average reward, which is also referred to as an ergodic reward. This ergodic setting is not as relevant to financial applications and hence will not be covered in this review paper. We refer the reader to [11][12] for a more detailed discussion of RL with infinite time horizon and average reward.
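To illustrate how the Bellman equation for the [math]Q[/math]-function can be used computationally when the model is known, the following minimal Python sketch (illustrative names; a tabular setting with known kernel [math]P[/math] and known expected rewards is assumed) iterates the Bellman operator to approximate [math]Q^{\star}[/math] and extracts a greedy policy:
<syntaxhighlight lang="python">
import numpy as np

def q_value_iteration(P, r, gamma, tol=1e-8):
    """Tabular Q-value iteration for an infinite-horizon discounted MDP.

    P     : array (S, A, S), P[s, a, s'] = transition probability
    r     : array (S, A), expected one-step reward E[r(s, a)]
    gamma : discount factor in (0, 1)
    Iterates Q <- r + gamma * P V with V(s') = max_a' Q(s', a'),
    a contraction with modulus gamma, so convergence is geometric.
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    while True:
        V = Q.max(axis=1)                 # V(s') = max_a' Q(s', a')
        Q_new = r + gamma * (P @ V)       # Bellman operator applied to Q
        if np.abs(Q_new - Q).max() < tol:
            break
        Q = Q_new
    greedy_policy = Q.argmax(axis=1)      # pi*(s) in argmax_a Q*(s, a)
    return Q, greedy_policy
</syntaxhighlight>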
'''Finite Time Horizon.''' When there is a finite time horizon [math]T \lt \infty[/math] we no longer discount future values and have a terminal reward. The MDP problem with finite time horizon can be expressed as

[math]V^{\star}_t(s) = \sup_{\Pi}\, \E^{\Pi}\left[\left.\sum_{u=t}^{T-1} r_u(s_u,a_u) + r_T(s_T)\,\right|\,s_t = s\right],[/math]

subject to

[math]s_{u+1} \sim P_u(s_u,a_u), \qquad a_u \sim \pi_u(s_u), \qquad t \leq u \leq T-1.[/math]
As in the infinite horizon case, we denote by [math]s_u \in \Sc[/math] and [math]a_u \in \Ac[/math] the state and the action of the agent at time [math]u[/math]. Where this differs from the infinite horizon case is that we allow time-dependent transition and reward functions. Now [math]P_u: \Sc \times \Ac \to \Pc(\Sc)[/math] denotes the transition function and [math]r_u(s, a)[/math] a real-valued random variable for each pair [math](s, a)\in \Sc \times \Ac[/math] for [math]t \leq u \leq T-1[/math]. Similarly [math]r_T(s)[/math], the terminal reward, is a real-valued random variable for all [math]s\in \Sc[/math]. Finally, the Markovian policy [math]\Pi =\{\pi_u\}_{u=t}^{T-1}[/math] can be either deterministic or randomized. The corresponding Bellman equation for the finite-horizon value function above is

[math]V^{\star}_u(s) = \sup_{a\in\Ac}\left\{ \E[r_u(s,a)] + \E_{s^{\prime}\sim P_u(s,a)}\left[V^{\star}_{u+1}(s^{\prime})\right]\right\}, \qquad t \leq u \leq T-1,[/math]
with the terminal condition [math]V^\star_T(s) = \mathbb{E}[r_{T}(s)][/math]. We write the value function as

[math]V^{\star}_t(s) = \sup_{a\in\Ac} Q^{\star}_t(s,a),[/math]
where the [math]Q^\star_t[/math] function is defined to be

[math]Q^{\star}_u(s,a) = \E[r_u(s,a)] + \E_{s^{\prime}\sim P_u(s,a)}\left[V^{\star}_{u+1}(s^{\prime})\right],[/math]
the expected reward from taking action [math]a[/math] in state [math]s[/math] at time [math]u[/math] and then following the optimal policy thereafter. There is also a Bellman equation for the [math]Q[/math]-function given by

[math]Q^{\star}_u(s,a) = \E[r_u(s,a)] + \E_{s^{\prime}\sim P_u(s,a)}\left[\max_{a^{\prime}\in\Ac} Q^{\star}_{u+1}(s^{\prime},a^{\prime})\right], \qquad t \leq u \leq T-1,[/math]
with the terminal condition [math]Q^\star_T(s,a) = \mathbb{E}[r_T(s)][/math] for all [math]a\in \mathcal{A}[/math]. We note that since financial time series data are typically non-stationary, the time-varying transition kernels and reward functions in this finite-horizon formulation are particularly important for financial applications.
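For the finite-horizon Bellman equations above, the analogous computation is a single backward sweep with no discounting. A minimal Python sketch, again assuming a known tabular model with illustrative names:
<syntaxhighlight lang="python">
import numpy as np

def backward_induction(P, r, r_T):
    """Compute Q*_t by backward induction in a finite-horizon MDP.

    P   : list of length T of arrays (S, A, S), time-dependent kernels P_t
    r   : list of length T of arrays (S, A), expected rewards E[r_t(s, a)]
    r_T : array (S,), expected terminal reward E[r_T(s)]
    Returns Q, a list with Q[t] of shape (S, A).
    """
    T = len(P)
    V_next = r_T.copy()                 # V*_T(s) = E[r_T(s)]
    Q = [None] * T
    for t in reversed(range(T)):
        Q[t] = r[t] + P[t] @ V_next     # Q*_t = E[r_t] + E[V*_{t+1}]
        V_next = Q[t].max(axis=1)       # V*_t(s) = max_a Q*_t(s, a)
    return Q
</syntaxhighlight>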
For an infinite horizon MDP with finite state and action space, and finite reward [math]r[/math], a further useful observation is that such an MDP always has a stationary optimal policy, whenever an optimal policy exists.
{{proofcard|Theorem|thm-stationary-optimal|Assume [math]|\mathcal{A}| \lt \infty[/math], [math]|\mathcal{S}| \lt \infty[/math] and [math]|r| \lt \infty[/math] with probability one. For any infinite horizon discounted MDP, there always exists a deterministic stationary policy that is optimal.|}}
This theorem implies that there always exists a fixed policy such that taking the actions specified by that policy at each time step maximizes the discounted reward. The agent does not need to change policies with time. There is a similar result for the average reward case; see Theorem 8.1.2 in [10]. This insight reduces the question of finding the best sequential decision-making strategy to the question of finding the best stationary policy. To this end, we write [math]\pi:\mathcal{S}\rightarrow \mathcal{P}(\mathcal{A})[/math] (without a time index) for a stationary policy throughout the paper.
'''Linear MDPs and Linear Functional Approximation.''' Linear MDPs have been extensively studied in the recent literature on theoretical RL in order to establish tractable and efficient algorithms. In a linear MDP, both the transition kernels and the reward function are assumed to be linear with respect to some feature mappings [13][14].
In the infinite horizon setting, an MDP is said to be linear with a feature map [math]\phi: \mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R}^d[/math], if there exist [math]d[/math] unknown (signed) measures [math]\mu = (\mu^{(1)},\cdots,\mu^{(d)})[/math] over [math]\mathcal{S}[/math] and an unknown vector [math]\theta \in \mathbb{R}^d[/math] such that for any [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math], we have

[math]P(\cdot|s,a) = \langle \phi(s,a), \mu(\cdot)\rangle, \qquad \E[r(s,a)] = \langle \phi(s,a), \theta\rangle.[/math]
Similarly for the finite horizon setting, an MDP is said to be linear with a feature map [math]\phi: \mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R}^d[/math], if for any [math]0\leq t \leq T[/math], there exist [math]d[/math] unknown (signed) measures [math]\mu_t = (\mu_t^{(1)},\cdots,\mu_t^{(d)})[/math] over [math]\mathcal{S}[/math] and an unknown vector [math]\theta_t \in \mathbb{R}^d[/math] such that for any [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math], we have

[math]P_t(\cdot|s,a) = \langle \phi(s,a), \mu_t(\cdot)\rangle, \qquad \E[r_t(s,a)] = \langle \phi(s,a), \theta_t\rangle.[/math]
Typically features are assumed to be known to the agent and bounded, namely, [math]\|\phi(s,a)\|\leq 1[/math] for all [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math]. The linear MDP framework is closely related to the literature on RL with linear functional approximation, where the value function is assumed to be of the form

[math]Q(s,a) = \psi(s,a)^{\top}\omega, \qquad V(s) = \xi(s)^{\top}\eta[/math]
for the infinite horizon case, and

[math]Q_t(s,a) = \psi(s,a)^{\top}\omega_t, \qquad V_t(s) = \xi(s)^{\top}\eta_t[/math]
for the finite horizon case. Here [math]\psi: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}^d[/math] and [math]\xi: \mathcal{S}\rightarrow \mathbb{R}^d[/math] are known feature mappings and [math]\omega[/math], [math]\omega_t[/math], [math]\eta[/math], and [math]\eta_t[/math] are unknown vectors. It has been shown that the linear MDP assumption and the linear functional approximation formulation above are equivalent under mild conditions [15][16]. In addition to linear functional approximation, we refer the reader to [17] for a general discussion of RL with nonlinear functional approximation; neural network approximation, one of the most popular nonlinear functional approximations used in practice, is discussed below.
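As a rough illustration of how the linear architecture [math]Q(s,a) = \psi(s,a)^{\top}\omega[/math] is used in practice, the following minimal Python sketch performs one regularized least-squares regression of Bellman targets onto the features (a fitted-[math]Q[/math]-style step; the function name, the form of the targets, and the ridge penalty are our own illustrative choices, echoing the regularization discussion earlier):
<syntaxhighlight lang="python">
import numpy as np

def fit_linear_q(psi, targets, reg=1e-3):
    """Ridge regression of Bellman targets onto features: Q(s,a) ~ psi(s,a).w

    psi     : array (N, d), feature vectors psi(s_i, a_i) for N samples
    targets : array (N,), targets y_i = r_i + gamma * max_a' Q_old(s'_i, a')
    reg     : l2 regularization weight controlling solution complexity
    """
    d = psi.shape[1]
    A = psi.T @ psi + reg * np.eye(d)
    b = psi.T @ targets
    return np.linalg.solve(A, b)   # omega, so that Q(s,a) = psi(s,a) @ omega
</syntaxhighlight>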
'''Nonlinear Functional Approximation.''' Compared to linear functional approximation, nonlinear functional approximation does not require knowledge of the kernel functions a priori, and it can potentially address the mis-specification issue that arises when the agent has an incorrect understanding of the functional space in which the MDP lies. The most popular nonlinear functional approximation approach is to use neural networks, leading to deep RL. Thanks to the universal approximation theorem, this is a theoretically sound approach, and neural networks show promising performance for a wide range of applications. Meanwhile, gradient-based algorithms for certain neural network architectures enjoy provable convergence guarantees. We defer the discussion of deep RL and its financial applications to later parts of the survey.
==From MDP to Learning==
When the transition dynamics [math]P[/math] and the reward function [math]r[/math] for the infinite horizon MDP problem are unknown, this MDP becomes a reinforcement learning problem, which is to find an optimal stationary policy [math]\pi[/math] (if it exists) while simultaneously learning the unknown [math]P[/math] and [math]r[/math]. The learning of [math]P[/math] and [math]r[/math] can be either explicit or implicit, which leads to model-based and model-free RL, respectively. The analogous ideas hold for the finite horizon case. We introduce some standard RL terminology. A more detailed introduction to RL can be found in textbooks such as [18][19].
'''Agent-environment Interface.''' In the RL setting, the learner or the decision-maker is called the agent. The physical world that the agent operates in and interacts with, comprising everything outside the agent, is called the environment. The agent and environment interact at each of a sequence of discrete time steps, [math]t =0,1,2,3,\cdots[/math], in the following way. At the beginning of each time step [math]t[/math], the agent receives some representation of the environment's state, [math]s_t \in \mathcal{S}[/math], and selects an action [math]a_t\in \mathcal{A}[/math]. At the end of this time step, in part as a consequence of its action, the agent receives a numerical reward [math]r_t[/math] (possibly stochastic) and a new state [math]s_{t+1}[/math] from the environment. The tuple [math](s_t,a_t,r_t,s_{t+1})[/math] is called a sample at time [math]t[/math] and [math]h_t :=\{(s_u,a_u,r_u,s_{u+1})\}_{u=0}^t[/math] is referred to as the history or experience up to time [math]t[/math]. An RL algorithm is a finite sequence of well-defined policies for the agent to interact with the environment.
'''Exploration vs. Exploitation.''' In order to maximize the accumulated reward over time, the agent learns to select her actions based on her past experiences (exploitation) and/or by trying new choices (exploration). Exploration provides opportunities to improve performance from the current sub-optimal solution to the ultimately globally optimal one, yet it is time consuming and computationally expensive, and over-exploration may impair the convergence to the optimal solution. Meanwhile, pure exploitation, i.e., myopically picking the current solution based solely on past experience, though easy to implement, tends to yield sub-optimal global solutions. Therefore, an appropriate trade-off between exploration and exploitation is crucial in the design of RL algorithms in order to improve the learning and optimization performance; see the multi-armed bandit example below for a more detailed discussion.
'''Simulator.''' In the procedure described above, the agent interacts with the environment in an online fashion. Namely, the initial state at time step [math]t+1[/math], [math]s_{t+1}[/math], is the state that the agent moves to after taking action [math]a_t[/math] from state [math]s_t[/math]. This is a challenging setting since an efficient exploration scheme is needed. For instance, [math]Q[/math]-learning with the [math]\varepsilon[/math]-greedy policy (introduced below) may take exponentially many episodes to learn the optimal policy [20].
Some results assume access to a simulator [21] (a.k.a., a generative model [22]),
which allows the algorithm to query arbitrary state-action pairs and return the reward and the next state. This implies the agent can “restart” the system at any time. That is, the initial state at time step [math]t+1[/math] does not need to be the state that agent moves to after taking action [math]a_t[/math] from state [math]s_t[/math]. The “simulator” significantly alleviates the difficulty of exploration, since a naive exploration strategy which queries all state-action pairs uniformly at random already leads to the most efficient algorithm for finding an optimal policy [22].
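A generative model can be captured by a small interface of the following form (a minimal Python sketch with illustrative names; for simplicity the reward is taken to be its expectation):
<syntaxhighlight lang="python">
import numpy as np

class GenerativeModel:
    """A generative-model ('simulator') interface for a tabular MDP.

    Unlike online interaction, query() can be called at any (s, a) pair,
    so the agent can 'restart' the system wherever it likes.
    """
    def __init__(self, P, r, rng=None):
        self.P = P                    # (S, A, S) transition probabilities
        self.r = r                    # (S, A) expected rewards
        self.rng = rng or np.random.default_rng()

    def query(self, s, a):
        """Return a reward and a sampled next state for the pair (s, a)."""
        s_next = self.rng.choice(self.P.shape[2], p=self.P[s, a])
        return self.r[s, a], s_next
</syntaxhighlight>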
'''Randomized Policy vs. Deterministic Policy.''' A randomized policy [math]\pi_t:\Sc \rightarrow \mathcal{P}(\Ac)[/math] is also known in the control literature as a relaxed control and in game theory as a mixed strategy. Despite the existence of a deterministic optimal policy in the infinite time horizon case, as stated in the theorem above, most RL algorithms adopt randomized policies to encourage exploration when RL agents are not certain about the environment.
'''An Example: Multi-armed Bandits.''' We illustrate these points by discussing multi-armed bandit problems, a special case of RL problems.
The multi-armed bandit is a model for a set of slot machines. A simple version is that there are a number of arms, each with a stochastic reward coming from a fixed probability distribution which is initially unknown.
We try these arms in some order, which may depend on the sequence of rewards that have been
observed so far and attempt to maximize our return. A common objective in this context is to find a policy for choosing the next arm
to be tried, under which the sum of the expected rewards comes as close as possible to the ideal
reward, i.e., the expected reward that would be obtained if we were to try the “best” arm at all times.
Thus the agent interacts with the environment by selecting an action - pulling an arm - and receiving a reward.
The policy of the order of pulling on the arms has to balance exploitation, the continued use of an arm that is returning a decent reward, with exploration, sampling arms about which there is less information to find one with a possibly better return.
In this setting for a multi-armed bandit problem we have not involved state dynamics. In this case, an admissible policy is simply a distribution over the action space for a randomized policy and a single arm for a deterministic policy. A simple example of a bandit with a state space is the stochastic contextual bandit which is used extensively in the personalized advertisement recommendation and clinical treatment recommendation literature. In this case the state dynamics (also referred to as the context) are sampled i.i.d. from an unknown distribution. For the personalized advertisement recommendation example, the state at time [math]t[/math] can be interpreted as the personal information and purchasing behavior of a targeted client.
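The exploration-exploitation trade-off in the bandit setting can be made concrete in a few lines of code. The following minimal Python sketch (the names and the Gaussian reward noise are our own illustrative assumptions) runs the classical [math]\varepsilon[/math]-greedy strategy, which explores with probability [math]\varepsilon[/math] and otherwise exploits the empirically best arm:
<syntaxhighlight lang="python">
import numpy as np

def epsilon_greedy_bandit(true_means, n_rounds=10_000, eps=0.1, seed=0):
    """epsilon-greedy play of a stochastic multi-armed bandit.

    true_means : expected reward of each arm (unknown to the agent)
    With probability eps we explore (uniform arm); otherwise we exploit
    the arm with the highest empirical mean so far.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)            # empirical means per arm
    total_reward = 0.0
    for t in range(n_rounds):
        if rng.random() < eps:
            arm = int(rng.integers(n_arms))       # explore
        else:
            arm = int(np.argmax(means))           # exploit
        reward = true_means[arm] + rng.normal()   # noisy payoff
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total_reward += reward
    return total_reward, means
</syntaxhighlight>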
The problem was first considered in the seminal work of Robbins [23], which
derives policies that asymptotically attain an average reward that converges in the limit to the reward of the best arm.
The multi-armed bandit problem was later studied in discounted, Bayesian, Markovian, expected
reward, and adversarial setups. See Berry and Fristedt [24] for a review of the classical results on the multi-armed bandit problem.
===Performance Evaluation===
It is not possible to solve MDPs analytically in general and therefore we will discuss numerical algorithms to determine the optimal policies. In order to assess different algorithms we will need a measure of their performance. Here we will consider several types of performance measure: sample complexity, rate of convergence, regret analysis and asymptotic convergence. For RL with a finite time horizon, one episode contains a sequence of states, actions and rewards, which starts at time [math]0[/math] and ends at the terminal time [math]T[/math], and the performance is evaluated in terms of the total number of samples in all episodes.[a] Sometimes we also refer to this setting as episodic RL. For RL with infinite time horizon, the performance is evaluated in terms of the number of steps, where one step contains one (state, action, reward, next state) tuple. It is worth noting that the episodic criterion can also be used to analyze infinite horizon problems: one popular technique is to truncate the trajectory (with infinitely many steps) at a large time and hence translate the problem into an approximating finite time horizon problem. Examples include REINFORCE [27] and the policy gradient method [28] (see the section on policy-based methods below).
Throughout the analysis, we use the notation [math]\widetilde{\mathcal{O}}(\cdot)[/math] to hide logarithmic factors when describing the order of the scaling in [math]\varepsilon[/math], [math]\delta[/math], [math]|\mathcal{A}|[/math] and [math]|\mathcal{S}|[/math] (when these are finite), and other possible model parameters such as the discount rate [math]\gamma[/math] in the infinite horizon case or the dimension [math]d[/math] of the features when functional approximation is used. We write [math]\widetilde{\mathcal{O}}^{\prime}[/math] rather than [math]\widetilde{\mathcal{O}}[/math] to indicate that, although there may be other parameters, we only include the dependence on [math]\varepsilon[/math]. We denote by [math]\mbox{poly}(\cdot)[/math] a polynomial function of its arguments.
'''Sample Complexity.''' In RL, a sample [math](s,a,r,s^{\prime})[/math] is defined as a tuple consisting of a state [math]s[/math], an action [math]a[/math], an instantaneous reward received by taking action [math]a[/math] at state [math]s[/math], and the next state [math]s^{\prime}[/math] observed afterwards. Sample complexity is defined as the total number of samples required to find an approximately optimal policy. This evaluation criterion can be used for any kind of RL problem. For episodic RL, an algorithm returns, after the end of the [math](k-1)[/math]-th episode with [math]M_{k-1}[/math] samples used in this episode, a policy [math]\pi_k[/math] to be applied in the [math]k[/math]-th episode. The sample complexity of this algorithm is the minimum number [math]\sum_{k=1}^K M_{k-1}(\varepsilon)[/math] of samples such that for all [math]k \ge K[/math], [math]\pi_k[/math] is [math]\varepsilon[/math]-optimal with probability at least [math]1-\delta[/math], i.e., for [math]k \ge K[/math],

[math]\mathbb{P}\left(V^{\pi_k}(s) \geq V^{\star}(s) - \varepsilon \ \ \forall s\in\Sc\right)\geq 1-\delta.[/math]
For discounted RL, an algorithm returns, after the end of the [math](n-1)[/math]-th step with [math]M_{n-1}[/math] samples used in this iteration, a policy [math]\pi_n[/math] to be applied in the [math]n[/math]-th step. There are several notions of sample complexity in this setting. The most commonly used ones are defined through the value function and the [math]Q[/math]-function with all input variables (i.e., [math]s\in \mathcal{S}[/math] for [math]V[/math] and [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math] for [math]Q[/math]) and they are referred to as the [math]V[/math]-sample complexity and [math]Q[/math]-sample complexity in the literature. The [math]Q[/math]-sample complexity of a given algorithm is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]Q^{(n)}[/math] is [math]\varepsilon[/math]-close to [math]Q^*[/math], that is

[math]\|Q^{(n)}-Q^*\|_{\infty} := \sup_{s\in\Sc,\,a\in\Ac}\left|Q^{(n)}(s,a)-Q^*(s,a)\right| \lt \varepsilon[/math]
holds either with high probability (that is [math]\mathbb{P}\left(\|Q^{(n)}-Q^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math]) or in the expectation sense [math]\mathbb{E}[\|Q^{(n)}-Q^*\|_{\infty}] \lt \varepsilon[/math]. Similarly, the [math]V[/math]-sample complexity is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]V^{(n)}[/math] is [math]\varepsilon[/math]-close to [math]V^*[/math], that is

[math]\|V^{(n)}-V^*\|_{\infty} \lt \varepsilon[/math]
holds either with high probability (in that [math]\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math]) or in the expectation sense [math]\mathbb{E}[\|V^{(n)}-V^*\|_{\infty}] \lt \varepsilon[/math]. Note that the [math]Q[/math]-sample complexity criterion implies the [math]V[/math]-sample complexity criterion since [math]V^{(n)}(s) = \sup_{a\in \mathcal{A}}Q^{(n)}(s,a)[/math] and [math]V^{*}(s) = \sup_{a\in \mathcal{A}}Q^*(s,a)[/math]. In our analysis we do not distinguish between whether these sample complexity bounds hold in the high probability sense or in the expectation sense. The second type of sample complexity, the sample complexity of exploration, counts the number of steps at which the non-stationary policy [math]\pi_n[/math] employed at step [math]n[/math] is not [math]\varepsilon[/math]-optimal for the current state [math]s_n[/math]; in other words, it counts the number of mistakes along the whole trajectory. Mathematically speaking, the sample complexity of exploration for a given algorithm is the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]\pi_n[/math] is [math]\varepsilon[/math]-optimal when starting from the current state [math]s_n[/math] with probability at least [math]1-\delta[/math], i.e., for [math]n \ge N[/math],

[math]\mathbb{P}\left(V^{\pi_n}(s_n) \geq V^{\star}(s_n) - \varepsilon\right) \geq 1-\delta.[/math]
Note that the condition [math]\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math] associated with the [math]V[/math]-sample complexity is stronger and implies the condition for the sample complexity of exploration. Another weaker criterion, the mean-square sample complexity, measures the sample complexity in the average sense with respect to the steady state distribution or the initial distribution [math]\mu[/math]. In this case, the mean-square sample complexity is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math],

[math]\left\|V^{(n)}-V^{*}\right\|_{\mu} := \sqrt{\E_{s\sim \mu}\left[\left(V^{(n)}(s)-V^{*}(s)\right)^2\right]} \lt \varepsilon.[/math]
Since the mean-square criterion is an aggregated guarantee via the [math]V[/math] function over a state distribution [math]\mu[/math], it is implied by the [math]V[/math]-sample complexity criterion above.
'''Rate of Convergence.''' To emphasize the convergence with respect to the number of iterations required, we introduce the notion of the rate of convergence, which uses the relationship between the number of iterations/steps [math]N[/math] and the error term [math]\varepsilon[/math] to quantify how fast the learned policy converges to the optimal solution. This is also called the iteration complexity in the RL and optimization literature. Compared to the notion of sample complexity introduced above, the rate of convergence counts the number of iterations needed to reach a given accuracy while ignoring the number of samples used within each iteration. When only one sample is used in each iteration, the rate of convergence coincides with the sample complexity. In addition, when a constant number of samples (i.e., independent of [math]\varepsilon[/math]) is used within each iteration, the rate of convergence and the sample complexity are of the same order with respect to [math]\varepsilon[/math]. In particular, an algorithm is said to converge at a linear rate if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(\log(1/\varepsilon))[/math], at a sublinear rate (slower than linear) if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^p)[/math] with [math]p\geq 1[/math], and at a superlinear rate (faster than linear) if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(\log\log(1/\varepsilon))[/math].
'''Regret Analysis.''' The regret of a given policy [math]\pi[/math] is defined as the difference between the cumulative reward of the optimal policy and that gathered by [math]\pi[/math]. It quantifies the exploration-exploitation trade-off. For episodic RL with horizon [math]T[/math] in the finite-horizon formulation above, the regret of an algorithm after [math]K[/math] episodes with [math]M=T\,K[/math] steps is

[math]\text{Regret}(K) = \E\left[\sum_{k=1}^{K}\left(V^{\star}_0(s^k_0) - V^{\pi_k}_0(s^k_0)\right)\right],[/math]

where [math]s^k_0[/math] is the initial state and [math]\pi_k[/math] is the policy adopted in the [math]k[/math]-th episode.
There is currently no regret analysis for RL problems with infinite time horizon and discounted reward. The natural definition of regret as given above is less meaningful in this case as the cumulative discounted reward is bounded and hence does not scale with [math]T[/math].
'''Asymptotic Convergence.''' In addition to the sample complexity and regret analysis discussed above, asymptotic convergence is another, weaker convergence guarantee, which requires the error term to decay to zero as [math]N[/math] goes to infinity (without specifying the order of [math]N[/math]). This is often a first step in the theoretical analysis of RL algorithms.
===Classification of RL Algorithms===
An RL algorithm includes one or more of the following components:
- a representation of a value function that provides a prediction of how good each state or each state/action pair is,
- a direct representation of the policy [math]\pi(s)[/math] or [math]\pi(s,a)[/math],
- a model of the environment (the estimated transition function and the estimated reward function) in conjunction with a planning algorithm (any computational process that uses a model to create or improve a policy).
The first two components are related to what is called model-free RL. When the latter component is used, the algorithm is referred to as model-based RL. In the MDP setting, model-based algorithms maintain an approximate MDP model by estimating the transition probabilities and the reward function, and derive a value function from the approximate MDP. The policy is then derived from the value function. Examples include [29][20][30] and [31]. Another line of model-based algorithms makes structural assumptions on the model, using some prior knowledge, and utilizes the structural information for algorithm design. For example, see [32] for learning linear-quadratic regulators over an infinite time horizon and [33][34] for the finite time horizon case. Unlike the model-based approach, model-free algorithms directly learn a value (or state-value) function or the optimal policy without inferring the model. Model-free algorithms can be further divided into two categories, value-based methods and policy-based methods. Policy-based methods explicitly build a representation of a policy and keep it in memory during learning. Examples include policy gradient methods and trust-region policy optimization methods. As an alternative, value-based methods store only a value function without an explicit policy during the learning process. In this case, the policy is implicit and can be derived directly from the value function (by picking the action with the best value). In our discussion of methodology, we focus on model-free RL algorithms for MDPs with infinite horizon and discounted reward. In particular, we introduce some classical value-based and policy-based methods in the next two sections, respectively. A discussion of the episodic setting and of model-based algorithms is deferred to later in the survey.
==Value-based Methods==
In this section, we introduce the methodologies of several classical value-based methods in the setting of an infinite time horizon with discounting, finite state and action spaces ([math]|\mathcal{A}| \lt \infty[/math] and [math]|\mathcal{S}| \lt \infty[/math]), and stationary policies. The setting with finite state and action spaces is also referred to as the tabular case in the RL literature. For more general setups with an infinite number of actions and states, we refer the reader to the discussion of linear functional approximation above and to the literature on neural network approximation. Recall that for reinforcement learning, the distribution of the reward [math]r[/math] and the transition function [math]P[/math] are unknown to the agent. Therefore the expectations in the Bellman equations for the value function and the [math]Q[/math]-function cannot be calculated directly. On the other hand, the agent can observe samples [math](s,a,r,s^{\prime})[/math] by interacting with the system without a model of the system's dynamics [math]P[/math] or any prior knowledge of [math]r[/math], and can then learn the value function or [math]Q[/math]-function from these samples. Following this idea, we next introduce the classical temporal-difference learning algorithm, with [math]Q[/math]-learning and State-Action-Reward-State-Action (SARSA) as two special cases.
===Temporal-Difference Learning===
Given some samples [math](s,a,r,s')[/math] obtained by following a policy [math]\pi[/math], the agent can update her estimate of the value function [math]V^{\pi}[/math] (defined above) at the [math](n+1)[/math]-th iteration by

[math]V^{\pi,(n+1)}(s) = (1-\beta_n(s,a))\,V^{\pi,(n)}(s) + \beta_n(s,a)\left[r + \gamma V^{\pi,(n)}(s^{\prime})\right],[/math]
with some initialization of the value function [math]V^{\pi,(0)}[/math]. Here [math]\beta_n(s,a)[/math] is the learning rate at iteration [math]n+1[/math], which balances the weighting between the current estimate and the new estimate. The quantity [math]\beta_n[/math] can be a constant, can depend on the current state [math]s[/math], or can even depend on the observed samples up to iteration [math]n+1[/math]. The update above is sometimes referred to as a TD(0) method. This is a special case of a TD[math](\lambda)[/math] method (for some [math]\lambda \in [0,1][/math]), which is a combination of TD learning (with weight [math]\lambda[/math]) and a Monte Carlo method (with weight [math]1-\lambda[/math]). Here a Monte Carlo method refers to a simulation-based approach that estimates the value function by averaging realized discounted returns over longer stretches of data (in the episode-by-episode sense), rather than updating from a single transition; the REINFORCE method discussed in the section on policy-based methods is an example of such a Monte Carlo method. The TD(0) update is also referred to as a one-step TD method as it only uses information from one update step of the system. There are multi-step TD methods; see Chapters 7 and 12 in [18] for more details. The TD update can also be written as

[math]V^{\pi,(n+1)}(s) = V^{\pi,(n)}(s) + \beta_n(s,a)\,\delta_n, \qquad \delta_n := r + \gamma V^{\pi,(n)}(s^{\prime}) - V^{\pi,(n)}(s).[/math]
The term [math]\delta_n[/math], commonly called the TD error, measures the difference between the estimated value at [math]s[/math] and the better estimate [math]r+\gamma V^{\pi,(n)}(s')[/math], which is often called the TD target in the literature. The TD error and TD target are two important components in analyzing the convergence of the approximate value function and arise in various forms in the reinforcement learning literature.
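A minimal Python sketch of tabular TD(0) along a single online trajectory may help fix ideas; the <code>env_step</code> and <code>policy</code> interfaces and the constant learning rate are illustrative assumptions, not part of any particular library:
<syntaxhighlight lang="python">
import numpy as np

def td0_evaluation(env_step, policy, n_states, gamma,
                   n_iters=100_000, beta=0.05, s0=0, seed=0):
    """Tabular TD(0) policy evaluation from a single online trajectory.

    env_step(s, a, rng) -> (reward, next_state) : one interaction with the MDP
    policy(s, rng)      -> action               : the policy pi being evaluated
    Applies V(s) <- V(s) + beta * (r + gamma * V(s') - V(s)) at each step.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)
    s = s0
    for _ in range(n_iters):
        a = policy(s, rng)
        r, s_next = env_step(s, a, rng)
        td_error = r + gamma * V[s_next] - V[s]   # the TD error delta_n
        V[s] += beta * td_error                    # move toward the TD target
        s = s_next
    return V
</syntaxhighlight>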
===Q-learning Algorithm===
A version of TD learning is [math]Q[/math]-learning. This is a stochastic approximation to the Bellman equation for the [math]Q[/math]-function with samples observed by the agent. At iteration [math]n[/math] ([math]1 \leq n \leq N-1[/math]), the [math]Q[/math]-function is updated using a sample [math](s,a,r,s^{\prime})[/math] where [math]s' \sim P(s,a)[/math]:

[math]Q^{(n+1)}(s,a) = (1-\beta_n(s,a))\,Q^{(n)}(s,a) + \beta_n(s,a)\left[r(s,a) + \gamma \max_{a^{\prime}\in\Ac} Q^{(n)}(s^{\prime},a^{\prime})\right].[/math]
Here [math]\beta_n(s,a)[/math] is the learning rate which balances the weight between the current estimate [math]Q^{(n)}[/math] from iteration [math]n[/math] and the new estimate [math]r(s,a)+\gamma\max_{a'}Q^{(n)}(s',a')[/math] calculated using the sample [math](s,a,r,s^{\prime})[/math].
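For comparison with the TD(0) sketch above, a minimal tabular [math]Q[/math]-learning loop with [math]\varepsilon[/math]-greedy exploration might look as follows (the same illustrative <code>env_step</code> interface as before is assumed):
<syntaxhighlight lang="python">
import numpy as np

def q_learning(env_step, n_states, n_actions, gamma,
               n_iters=200_000, beta=0.1, eps=0.1, s0=0, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    env_step(s, a, rng) -> (reward, next_state).
    The update target uses max_a' Q(s', a'), so learning is off-policy:
    the target does not depend on which action is actually taken next.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = s0
    for _ in range(n_iters):
        if rng.random() < eps:                    # explore
            a = int(rng.integers(n_actions))
        else:                                     # exploit
            a = int(np.argmax(Q[s]))
        r, s_next = env_step(s, a, rng)
        target = r + gamma * Q[s_next].max()      # off-policy TD target
        Q[s, a] += beta * (target - Q[s, a])
        s = s_next
    return Q
</syntaxhighlight>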
The following theorem guarantees the asymptotic convergence of the [math]Q[/math]-learning update to the [math]Q[/math]-function defined above. {{proofcard|Theorem (Watkins and Dayan (1992) [35])|Theorem-1|Assume [math]|\mathcal{A}| \lt \infty[/math] and [math]|\mathcal{S}| \lt \infty[/math]. Let [math]n^i(s, a)[/math] be the index of the [math]i[/math]-th time that the action [math]a[/math] is used in state [math]s[/math], and let [math]R \lt \infty[/math] be a constant. Given bounded rewards [math]|r|\leq R[/math], learning rates [math]0\leq\beta_n \lt 1[/math] and

[math]\sum_{i=1}^{\infty}\beta_{n^i(s,a)} = \infty, \qquad \sum_{i=1}^{\infty}\left(\beta_{n^i(s,a)}\right)^2 \lt \infty, \qquad \forall (s,a)\in\Sc\times\Ac,[/math]

then [math]Q^{(n)}(s,a)\rightarrow Q^\star(s,a)[/math] as [math]n\rightarrow \infty[/math] for all [math](s,a)[/math], with probability [math]1[/math].|}} This is a classical result (one of the first theoretical results in RL) and the proof of convergence is based on the action-replay process (ARP), an artificial controlled Markov process constructed from the episode sequence and the learning rate sequence [math]\beta[/math]. The theorem implies that [math]Q^{(n)}[/math] will eventually converge to the true [math]Q[/math]-function [math]Q^\star[/math] when the condition above is satisfied. However, this asymptotic result does not provide any insight into how many samples are needed to reach a given accuracy. More results on the sample complexity of [math]Q[/math]-learning-type algorithms are discussed in the Discussion subsection below.
===SARSA===
In contrast to the [math]Q[/math]-learning algorithm, which can take as input samples generated by any policy [math]\pi[/math] (these samples could be collected in advance, before running the algorithm), SARSA adopts a policy based on the agent's current estimate of the [math]Q[/math]-function. Given a sample [math](s,a,r,s^{\prime},a^{\prime})[/math], where [math]a^{\prime}[/math] is the action the agent takes at state [math]s^{\prime}[/math] under the current policy, the SARSA update takes the form

[math]Q^{(n+1)}(s,a) = (1-\beta_n(s,a))\,Q^{(n)}(s,a) + \beta_n(s,a)\left[r + \gamma\, Q^{(n)}(s^{\prime},a^{\prime})\right].[/math]

The different source of samples is indeed the key difference between on-policy and off-policy learning, which will be discussed below.
[b] The policy derived from [math]Q^{(n)}[/math] using [math]\varepsilon[/math]-greedy suggests the following choice of action [math]a[/math] when in state [math]s[/math]:

[math]a = \begin{cases} \arg\max_{a^{\prime}\in\Ac} Q^{(n)}(s,a^{\prime}) & \text{with probability } 1-\varepsilon,\\ \text{an action sampled uniformly from } \Ac & \text{with probability } \varepsilon. \end{cases}[/math]
SARSA uses the [math]Q^{(n)}[/math] function with an [math]\varepsilon[/math]-greedy policy to choose [math]a'[/math] and to perform exploration. In contrast, [math]Q[/math]-learning uses the maximum value of [math]Q^{(n)}[/math] over all possible actions for the next step, which is equivalent to setting [math]\varepsilon=0[/math] when choosing [math]a'[/math].
In addition to the [math]\varepsilon[/math]-greedy policy, other exploration methods such as the softmax operator (resulting in a Boltzmann policy) [36][37][38] can also be applied.
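A minimal SARSA loop, using the same illustrative interface as the sketches above, makes the on-policy nature of the update visible: the action [math]a'[/math] used in the target is the action actually taken at the next step.
<syntaxhighlight lang="python">
import numpy as np

def sarsa(env_step, n_states, n_actions, gamma,
          n_iters=200_000, beta=0.1, eps=0.1, s0=0, seed=0):
    """Tabular SARSA: on-policy TD control with an epsilon-greedy policy.

    The target uses Q(s', a') for the action a' the agent will actually
    take, in contrast to Q-learning's max over a'.
    """
    rng = np.random.default_rng(seed)

    def eps_greedy(Q, s):
        if rng.random() < eps:
            return int(rng.integers(n_actions))   # explore
        return int(np.argmax(Q[s]))               # exploit

    Q = np.zeros((n_states, n_actions))
    s = s0
    a = eps_greedy(Q, s)
    for _ in range(n_iters):
        r, s_next = env_step(s, a, rng)
        a_next = eps_greedy(Q, s_next)            # on-policy next action
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += beta * (target - Q[s, a])
        s, a = s_next, a_next
    return Q
</syntaxhighlight>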
'''On-policy Learning vs. Off-policy Learning.''' An off-policy agent learns the value of the optimal policy independently of the agent's actions. An on-policy agent learns the value of the policy being carried out by the agent, including the exploration steps. [math]Q[/math]-learning is an off-policy method, as the samples [math](s,a,r,s')[/math] used in its update may be collected from any policy and may be independent of the agent's current estimate of the [math]Q[/math]-function. The convergence of the [math]Q[/math]-function relies on the exploration scheme of the policy [math]\pi[/math] (see the theorem above). In contrast to [math]Q[/math]-learning, SARSA is an on-policy algorithm and uses only its own estimate of the [math]Q[/math]-function to select an action and receive a real-time sample in each iteration. The difference is seen in how the new estimate in the update is calculated. In the [math]Q[/math]-learning update, we have [math]\max_{a'} Q^{(n)}(s',a')[/math], the maximum value of the [math]Q[/math]-function at the given state [math]s'[/math]. On the other hand, the new estimate in the SARSA update takes [math]Q^{(n)}(s',a')[/math], where [math]a'[/math] is selected based on a policy derived from [math]Q^{(n)}[/math] (via the [math]\varepsilon[/math]-greedy policy above).
===Discussion===
In this section, we provide a brief overview of sample complexity results for model-free and value-based RL with infinite time horizon and discounted reward. Sample complexity results for the model-based counterpart are deferred to later in the survey. There has been very little non-asymptotic analysis of TD(0) learning; existing results have focused on linear function approximations. For example, [39] showed that the mean-squared error converges at a rate of [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon^{1/\sigma}})[/math] with decaying learning rate [math]\beta_n= \frac{1}{(1+n)^{\sigma}}[/math] for some [math]\sigma\in (0,1)[/math], where i.i.d. samples are drawn from a stationary distribution; [40] provided an improved [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})[/math] bound for iterate-averaged TD(0) with constant step-size; and [41] reached the same mean-square sample complexity [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})[/math] with a simpler proof and extended the framework to the case where non-i.i.d. but Markov samples are drawn from the online trajectory. A similar analysis was applied by [42] to SARSA and [math]Q[/math]-learning for a continuous state space using linear function approximation (see the discussion of linear functional approximation above).
The [math]Q[/math]-sample complexity for the [math]Q[/math]-learning algorithm of [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\mbox{poly}(\log \delta^{-1})\right)[/math] was first established in [43] with decaying learning rate [math]\beta_n = \frac{1}{(1+n)^{4/5}}[/math] and access to a simulator. This was followed by [44], which showed that the same order of [math]Q[/math]-sample complexity [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\right)[/math] could be achieved with a constant learning rate [math]\beta_n \equiv \frac{(1-\gamma)^4\varepsilon^2}{|\mathcal{A}||\mathcal{S}|}[/math] [math](n\leq N)[/math]. Without access to a simulator, delayed [math]Q[/math]-learning [45] was shown to achieve a sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^8\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)[/math], assuming a known reward. An improvement of this result to [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^7\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)[/math] has recently been obtained using an Upper-Confidence-Bound (UCB) exploration policy [46] and a single trajectory. In addition, sample complexities of [math]Q[/math]-learning variants with (linear) function approximations have been established in several recent papers. [16] assumed a linear MDP framework (see above) and provided a [math]V[/math]-sample complexity [math]\widetilde{\mathcal{O}}\left(\frac{d}{(1-\gamma)^3\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)[/math], with [math]d[/math] denoting the dimension of the features. This result matches the information-theoretic lower bound up to [math]\log(\cdot)[/math] factors. Later work [47] considered the setting where the transition kernel can be approximated by linear functions with small approximation errors, to make the model more robust to model mis-specification. In this case, the authors provided a [math]V[/math]-sample complexity of order [math]\widetilde{\mathcal{O}}\left(\frac{K}{(1-\gamma)^4\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)[/math]. Although the dependence on [math]\gamma[/math] is not as good as in [16], this framework can be applied to a more general class of models which are “near” linear. Although [math]Q[/math]-learning is one of the most successful algorithms for finding the best action-value function [math]Q[/math], its implementation often suffers from a large overestimation bias in the [math]Q[/math]-function values incurred by random sampling. The double [math]Q[/math]-learning algorithm proposed in [48] tackles this overestimation issue by randomly switching the update between two [math]Q[/math]-estimators, and has thus gained significant popularity in practice.
The first sample complexity result for the double [math]Q[/math]-learning algorithm was established in [49], with a [math]Q[/math]-sample complexity on the order of [math]\widetilde{\mathcal{O}}\left(\left(\frac{1}{(1-\gamma)^6\varepsilon^2}\ln\,\frac{|\mathcal{A}||\mathcal{S}|}{(1-\gamma)^7\varepsilon^2\delta}\right)^{1/\omega}+\left(\frac{1}{1-\gamma}\ln\,\frac{1}{(1-\gamma)^2\varepsilon}\right)^{1/(1-\omega)}\right)[/math] and learning rate [math]\beta_n = \frac{1}{n^{\omega}}[/math] for some constant [math]\omega \in (0,1)[/math].
Policy-based Methods
So far we have focused on value-based methods, where we learn the value function and then generate a policy from it using, for instance, the [math]\varepsilon[/math]-greedy approach. However, these methods may suffer from high computational complexity as the dimensionality of the state or action space increases (for example in continuous or unbounded spaces). In this case, it may be more effective to parametrize the policy directly rather than the value function. Furthermore, it is empirically observed that policy-based methods have better convergence properties than value-based methods [50][51][52]. This stems from the fact that value-based methods can exhibit large oscillations during training, since an arbitrarily small change in the estimated value function can cause a dramatic change in the actions it selects. In this section, we focus on model-free policy-based methods, where we learn a parametrized policy without inferring the value function, in the setting of infinite time horizon with discounting, finite state and action spaces, and stationary policies. Examples of policy-based methods include REINFORCE [53], Actor-Critic methods [54], Trust Region Policy Optimization (TRPO) [55], and Proximal Policy Optimization (PPO) [56], among others. We first parametrize the policy [math]\pi[/math] by a vector [math]\theta[/math]; that is, [math]\pi(s,\cdot;\theta)\in \mathcal{P}(\mathcal{A})[/math] is the probability distribution over the action space given the state [math]s[/math], parametrized by [math]\theta[/math]. Here we use [math]\pi(s,a;\theta)[/math] and [math]\pi_{\theta}[/math] interchangeably to denote the policy parametrized by [math]\theta[/math]. We write [math]\mu^{\pi_{\theta}}[/math] for the stationary distribution of the state under policy [math]\pi(s,a;\theta)[/math]. Then the policy objective function is defined to be

[math]J(\theta) := \sum_{s\in\mathcal{S}} \mu^{\pi_{\theta}}(s)\, V^{\pi_{\theta}}(s)[/math]

for RL with infinite time horizon, or

[math]J(\theta) := \mathbb{E}\left[\sum_{t=0}^{T-1} r_t(s_t,a_t) + r_T(s_T)\,\Big|\, a_t\sim \pi(s_t,\cdot\,;\theta)\right][/math]

for RL with finite time horizon, where [math]V^{\pi_{\theta}}[/math] denotes the value function under [math]\pi_{\theta}[/math]. In order to maximize this function, the policy parameter [math]\theta[/math] is updated using the gradient ascent rule

[math]\theta \leftarrow \theta + \beta\, \widehat{\nabla_{\theta}J(\theta)},[/math]

where [math]\beta[/math] is the learning rate and [math]\widehat{\nabla_{\theta}J(\theta)}[/math] is an estimate of the gradient of [math]J(\theta)[/math] with respect to [math]\theta[/math].
Policy Gradient Theorem
Our task now is to estimate the gradient [math]\nabla_{\theta}J(\theta)[/math]. The objective function [math]J(\theta)[/math] depends on both the policy and the corresponding stationary distribution [math]\mu^{\pi_{\theta}}[/math], and the parameter [math]\theta[/math] affects both of them. Calculating the gradient of the policy with respect to [math]\theta[/math] is straightforward given the parametrization [math]\pi(s,a;\theta)[/math], whereas it can be challenging to analyze the effect of [math]\theta[/math] on the state distribution when the system is unknown. Fortunately, the well-known policy gradient theorem provides an expression for [math]\nabla_{\theta}J(\theta)[/math] which does not involve the derivatives of the state distribution with respect to [math]\theta[/math]. Writing the [math]Q[/math]-function under policy [math]\pi(s,a;\theta)[/math] as [math]Q^{\pi_{\theta}}(s,a)[/math], we have the following theorem (for more details see [18]).
Assume that [math]\pi(s,a;\theta)[/math] is differentiable with respect to [math]\theta[/math] and that there exists [math]\mu^{\pi_{\theta}}[/math], the stationary distribution of the dynamics under policy [math]\pi_{\theta}[/math], which is independent of the initial state [math]s_0[/math]. Then the policy gradient is

[math]\nabla_{\theta}J(\theta) = \mathbb{E}_{s\sim \mu^{\pi_{\theta}},\, a\sim \pi(s,\cdot\,;\theta)}\left[\nabla_{\theta}\log \pi(s,a;\theta)\, Q^{\pi_{\theta}}(s,a)\right].[/math]
For MDPs with finite state and action space, the stationary distribution exists under mild assumptions. For MDPs with infinite state and action space, more assumptions are needed, for example uniform geometric ergodicity [58].
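As a concrete illustration (a standard example, not a choice imposed by the theorem), consider a softmax policy with feature map [math]\phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^d[/math]:

[math]\pi(s,a;\theta) = \frac{\exp(\theta^\top \phi(s,a))}{\sum_{a'\in\mathcal{A}} \exp(\theta^\top \phi(s,a'))}, \qquad \nabla_{\theta}\log \pi(s,a;\theta) = \phi(s,a) - \sum_{a'\in\mathcal{A}} \pi(s,a';\theta)\,\phi(s,a'),[/math]

so the score function appearing in \eqref{eqn:policy_gra_thm} is available in closed form, and it only remains to estimate [math]Q^{\pi_{\theta}}(s,a)[/math].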
REINFORCE: Monte Carlo Policy Gradient
Based on the policy gradient expression in \eqref{eqn:policy_gra_thm}, it is natural to estimate the expectation by averaging over samples of actual rewards, which is essentially the spirit of a standard Monte Carlo method that repeatedly simulates a trajectory of length [math]M[/math] within each iteration. This leads to our first policy-based algorithm, called REINFORCE [53] or Monte Carlo policy gradient (see Algorithm). We use a simple empirical return function [math]G_t[/math] to approximate the [math]Q[/math]-function in \eqref{eqn:policy_gra_thm}. The return [math]G_t[/math] is defined as the sum of the discounted rewards, [math]G_t:=\sum_{u=t+1}^{M} \gamma^{u-t-1}r_{u}[/math]. Note that REINFORCE is a Monte Carlo algorithm since it employs complete sample trajectories: when estimating the return from time [math]t[/math], it uses all future rewards up until the end of the trajectory (see line 7 in Algorithm). Within the [math]n[/math]-th iteration, the policy parameter [math]\theta^{(n+1)}[/math] is updated [math]M[/math] times with [math]G_t[/math] for [math]t=0,1,\cdots,M-1[/math]. As a standard Monte Carlo method, REINFORCE provides an unbiased estimate of the policy gradient; however, it suffers from high variance, which may lead to slow convergence. A popular variant of REINFORCE is to subtract a baseline, a function of the state [math]s[/math], from the return [math]G_t[/math]. This reduces the variance of the policy gradient estimate while keeping its mean unchanged. A commonly used baseline is an estimate of the value function, [math]\widehat{V}(s)[/math], which can be learned using one of the methods introduced in Section. Replacing [math]G_t[/math] in \eqref{eqn:alg_REINFORCE_update} by [math]G_t-\widehat{V}(s)[/math] gives a REINFORCE algorithm with baseline.
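The following is a minimal sketch of the REINFORCE updates for a tabular softmax policy, with no baseline; the environment interface env.reset()/env.step(a), the trajectory length M, and the learning rate are illustrative assumptions.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(s, .; theta) for a tabular softmax parametrization theta[s, a]."""
    z = theta[s] - theta[s].max()          # stabilized logits
    p = np.exp(z)
    return p / p.sum()

def reinforce_episode(env, theta, gamma=0.99, beta=0.01, M=200,
                      rng=np.random):
    """Simulate one trajectory of length at most M, then apply the updates
    theta <- theta + beta * gamma^t * G_t * grad log pi(s_t, a_t; theta)."""
    traj, s = [], env.reset()              # assumed interface: reset() -> s
    for _ in range(M):
        p = softmax_policy(theta, s)
        a = rng.choice(len(p), p=p)
        s_next, r, done = env.step(a)      # assumed: step(a) -> (s', r, done)
        traj.append((s, a, r))
        s = s_next
        if done:
            break
    G = 0.0
    for t in reversed(range(len(traj))):   # compute returns backwards
        s, a, r = traj[t]
        G = r + gamma * G                  # G_t = sum_{u>t} gamma^{u-t-1} r_u
        grad_log = -softmax_policy(theta, s)
        grad_log[a] += 1.0                 # grad of log pi wrt theta[s, .]
        theta[s] += beta * (gamma ** t) * G * grad_log
    return theta
```

Subtracting a learned baseline [math]\widehat{V}(s)[/math] from G before the update would give the baseline variant described above.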
Actor-Critic Methods
The fact that REINFORCE provides unbiased policy estimates but may suffer from high variance is an example of the bias-variance dilemma (see, e.g., [59] and [60]), which occurs in many machine learning problems. Also, as a Monte Carlo algorithm, REINFORCE requires complete trajectories, so we need to wait until the end of each episode to collect data. Actor-Critic methods can resolve these issues by learning both the value function and the policy. The learned value function can improve the policy update by, for example, introducing bias into the estimation, and Actor-Critic methods can also learn from incomplete experience in an online manner. In Actor-Critic methods the value function is parametrized by [math]w[/math] and the policy by [math]\theta[/math]. These methods consist of two models: a critic, which updates the value function or [math]Q[/math]-function parameter [math]w[/math], and an actor, which updates the policy parameter [math]\theta[/math] based on the information given by the critic. A natural idea is to use a policy-based method for the actor and one of the methods introduced in Section for the critic, for example SARSA or [math]Q[/math]-learning. We give the pseudocode for a simple Actor-Critic method in Algorithm. Here, for the critic, we approximate the [math]Q[/math]-function by a linear combination of features, that is, [math]Q(s,a;w)=\phi(s,a)^\top w[/math] for some known feature map [math]\phi:\mathcal{S}\times \mathcal{A}\rightarrow\mathbb{R}^d[/math], and use TD(0) to update the parameter [math]w[/math] to minimize the least-squares TD error, whose gradient takes the form [math]\delta\, \phi(s,a)[/math] with [math]\delta[/math] the TD error (see lines 9-10 in Algorithm). For the actor, we use the (vanilla) policy-based method to update the policy parameter [math]\theta[/math] (see line 11 in Algorithm). There are three main ways to execute the algorithm. In the nested-loop setting (see, e.g. [61][62]), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the two time-scale setting (see, e.g. 
[63]), where the actor and the critic update their parameters simultaneously with different learning rates. Note that the learning rate for the actor is typically much smaller than that of the critic in this setting. In the third way, the single-scale setting, the actor and the critic update their parameters simultaneously but with a much larger learning rate for the actor than for the critic (see, e.g. [64][65]).
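The sketch below illustrates one online step of the simple Actor-Critic scheme described above, with a linear critic [math]Q(s,a;w)=\phi(s,a)^\top w[/math] updated by TD(0) and a tabular softmax actor updated in the direction [math]\nabla_\theta \log\pi \cdot Q[/math]; the feature map phi, the softmax actor, and the learning rates are illustrative assumptions.

```python
import numpy as np

def actor_critic_step(theta, w, phi, s, a, r, s_next, a_next,
                      gamma=0.99, beta_w=0.05, beta_theta=0.01):
    """One online Actor-Critic update.
    Critic: linear Q(s,a;w) = phi(s,a)^T w, TD(0): w <- w + beta_w*delta*phi.
    Actor:  theta <- theta + beta_theta * Q(s,a;w) * grad log pi(s,a;theta),
    for a tabular softmax policy with logits theta[s, a]."""
    # --- critic: TD(0) on the linear Q-function ---
    q_sa = phi(s, a) @ w
    q_next = phi(s_next, a_next) @ w
    delta = r + gamma * q_next - q_sa           # TD error
    w += beta_w * delta * phi(s, a)
    # --- actor: vanilla policy gradient step using the critic's estimate ---
    logits = theta[s] - theta[s].max()
    pi_s = np.exp(logits) / np.exp(logits).sum()
    grad_log = -pi_s
    grad_log[a] += 1.0                          # e_a - pi(s, .)
    theta[s] += beta_theta * q_sa * grad_log
    return theta, w
```

In the two time-scale setting one would take beta_theta much smaller than beta_w, while the nested-loop setting would run many critic steps between consecutive actor steps.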
Discussion
Variants of Policy-based Methods. There have been many variants of policy-based methods, improving on the vanilla version in various directions. For example, the natural policy-based algorithms [66][67] modify the search direction by involving a Fisher information matrix in the gradient ascent update rule, which speeds up convergence significantly. To enhance the stability of learning, we may require the updated policy estimate to remain close to the previous estimate. Along this line, TRPO [55] uses the Kullback-Leibler (KL) divergence between consecutive updates as a constraint, while PPO [56] incorporates an adaptive KL-penalty or a clipped objective into the objective function to eliminate the constraint (see the clipped objective below). When using the KL-penalty, PPO can be viewed as a Lagrangian relaxation of the TRPO algorithm. To reduce the sample complexity, Deterministic Policy Gradient (DPG) [68] uses a deterministic function (rather than a stochastic policy [math]\pi[/math]) to represent the policy, avoiding sampling over actions (although sufficient exploration then needs to be guaranteed in other ways); Actor-Critic with Experience Replay (ACER) [69] reuses experience (tuples of data collected in previous iterations/episodes, see the discussion of Deep [math]Q[/math]-Networks in Section), which improves sample efficiency and decreases data correlation. In addition we mention Asynchronous Advantage Actor-Critic (A3C) and Advantage Actor-Critic (A2C), two popular Actor-Critic methods with a special focus on parallel training [70]; the latter is a synchronous, deterministic version of the former. For the continuous-time framework, see the developments in [71][72].
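As an example of the clipping mentioned above, the clipped surrogate objective of PPO [56] can be written in terms of the likelihood ratio [math]\rho_t(\theta) = \frac{\pi(s_t,a_t;\theta)}{\pi(s_t,a_t;\theta^{\rm old})}[/math], an advantage estimate [math]\widehat{A}_t[/math], and a clipping parameter [math]\epsilon \gt 0[/math]:

[math]L^{\rm CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\widehat{A}_t,\ {\rm clip}\big(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\widehat{A}_t\right)\right],[/math]

so that updates which would move the ratio outside [math][1-\epsilon,1+\epsilon][/math] in the direction of a larger objective receive no gradient, keeping consecutive policies close without an explicit constraint.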
Convergence Guarantees. We now discuss the convergence guarantees of policy-based algorithms. [57] provided the first asymptotic convergence result for policy gradient methods with arbitrary differentiable function approximation for MDPs with bounded reward, showing that such algorithms, including REINFORCE and Actor-Critic methods, converge to a locally stationary point. For more examples of asymptotic convergence, see for example [54] and [74]. Based on recent progress in non-convex optimization, non-asymptotic analysis of policy-based methods was first established for convergence to a stationary point. For example, [61] provided a convergence rate analysis for a nested-loop Actor-Critic algorithm, in the setting of continuous state and action spaces, by quantifying the smallest number of actor updates [math]k[/math] required to attain [math]\inf_{0\leq m\leq k}\|\nabla J(\theta^{(m)})\|^2 \lt \varepsilon[/math]. Denoting this smallest number by [math]K[/math], when the actor uses a policy gradient method the critic achieves [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^4)[/math] by employing TD(0), [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{3})[/math] by employing the Gradient Temporal Difference method, and [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{5/2})[/math] by employing the Accelerated Gradient Temporal Difference method. [75] investigated a policy gradient method with Monte Carlo estimation of the policy gradient, called the RPG algorithm, where the rollout length of trajectories is drawn from a Geometric distribution. This generates unbiased estimates of the policy gradient with bounded variance, which were shown to converge to a first-order stationary point at a sublinear rate with diminishing or constant learning rate. They also proposed a periodically enlarged learning rate scheme and proved that the modified RPG algorithm with this scheme converges to a second-order stationary point in a polynomial number of steps. [76] proposed a Stochastic Variance-Reduced Policy Gradient (SVRPG) algorithm which uses either REINFORCE or the Gradient of a Partially Observable Markov Decision Process (GPOMDP) [77] to estimate the policy gradient; the key idea is a semi-stochastic gradient method that involves a reference (stochastic gradient) iterate stored from earlier iterations. They proved that the SVRPG algorithm converges to an [math]\varepsilon[/math]-optimal stationary point, i.e. [math]\mathbb{E}[\|\nabla J(\theta)\|_2^2]\leq \varepsilon[/math], after [math]\widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^2)[/math] iterations. This convergence rate was then improved in [78] to [math]\widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{5/3})[/math] and in [79] to [math]\widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{3/2})[/math]. For more examples of sample complexity analysis for convergence to a stationary point, see [76][78][79][80][81]. The global optimality of stationary points was studied in [82], which identified certain situations under which the policy gradient objective function has no sub-optimal stationary points despite being non-convex.
Recently there have also been results on the sample complexity of convergence to the global optimum. [84] provides the convergence analysis of three methods for a discounted MDP and compares their convergence rates under fixed learning rates: (1) projected gradient ascent, with mean-squared sample complexity [math]\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^6\varepsilon^2})[/math]; (2) gradient ascent with softmax policy parametrization, with asymptotic convergence; and (3) natural policy gradient ascent with softmax parametrization, with mean-squared sample complexity [math]\widetilde{\mathcal{O}}(\frac{2}{(1-\gamma)^2\varepsilon})[/math]. [85] strengthens the asymptotic convergence of the softmax policy gradient method in (2) of [84] to a mean-squared sample complexity [math]\widetilde{\mathcal{O}}(1/\varepsilon)[/math] for the tabular MDPs introduced in Section, and to a linear convergence rate for entropy-regularized MDPs; [85] also analyzes, from an optimization perspective, why entropy regularization speeds up the convergence of policy gradient methods. [86] proved that the entropy-regularized natural policy gradient method attains [math]\varepsilon[/math]-optimal policies, in the sense of [math]Q[/math]-sample complexity, at a linear convergence rate. [87] showed that, with probability at least [math]1-\delta[/math], the TRPO algorithm has sublinear rate [math]\widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^2)[/math] with mean-squared sample complexity [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^3\varepsilon^4}\big)[/math] for unregularized (tabular) MDPs, and an improved rate [math]\widetilde{\mathcal{O}}^{\prime}(1/\varepsilon)[/math] with mean-squared sample complexity [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^4\varepsilon^3}\big)[/math] for MDPs with entropy regularization. [63] gave the first non-asymptotic convergence guarantee for two time-scale natural Actor-Critic algorithms, with mean-squared sample complexity of order [math]\widetilde{\mathcal{O}}(\frac{1}{(1-\gamma)^9\varepsilon^{4}})[/math]. For single-scale Actor-Critic methods, global convergence with a sublinear convergence rate was established in both [64] and [65]. The non-asymptotic convergence of policy-based algorithms has also been shown in other settings; see [27] for the regret analysis of the REINFORCE algorithm for discounted MDPs, and [84][85] for policy gradient methods in the setting of known model parameters.
General Discussion
So far we have focused on model-free RL with discounting over an infinite time horizon. In this section, we discuss two other cases: RL over a finite time horizon with both model-based and model-free methods, and model-based RL with an infinite time horizon.
RL with Finite Time Horizon
As discussed earlier, episodic RL with finite time horizon has been widely used in many financial applications. In this section, we discuss some episodic RL algorithms (both model-based and model-free) and their performance, through both sample complexity and regret analysis. Recall that sample complexity in episodic RL with finite time horizon [math]T[/math] typically refers to the number of episodes [math]K=M/T[/math] (with [math]M[/math] the total number of steps) required to guarantee an [math]\varepsilon[/math]-optimal policy with probability at least [math]1-\delta[/math], as in \eqref{eq:sample_complexity_epi}, and the regret is defined in \eqref{eq:regret_epi}. [89] proposed a model-based algorithm known as Posterior Sampling for Reinforcement Learning (PSRL) and established an [math]\widetilde{\mathcal{O}}(T|\mathcal{S}|\sqrt{|\mathcal{A}|M})[/math] expected regret bound. [25] proposed a model-based algorithm called Upper-Confidence Fixed-Horizon (UCFH) RL, with [math]V[/math]-sample complexity[c] of order [math]\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^3}{\varepsilon^2})[/math], assuming known rewards. [90] considered two model-based algorithms in the setting of known reward, called UCBVI-CH and UCBVI-BF, which were proved to achieve regret bounds of [math]\widetilde{\mathcal{O}}(T\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})[/math] (when [math]M\geq T|\mathcal{S}|^3|\mathcal{A}|[/math] and [math]|\mathcal{S}||\mathcal{A}|\geq T[/math]) and [math]\widetilde{\mathcal{O}}(\sqrt{T|\mathcal{S}|\,|\mathcal{A}|M})[/math] (when [math]M\geq T^3|\mathcal{S}|^3|\mathcal{A}|[/math] and [math]|\mathcal{S}||\mathcal{A}|\geq T[/math]), respectively. [26] proposed an Upper Bounding the Expected Next State Value (UBEV) algorithm which achieves a sample complexity[c] of [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^5}{\varepsilon^2}{\rm polylog}(\frac{1}{\delta})\big)[/math]; they also proved that the algorithm has a regret bound of [math]\widetilde{\mathcal{O}}(T^2\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})[/math] when [math]M\ge |\mathcal{S}|^5|\mathcal{A}|^3[/math]. [91] proved that [math]Q[/math]-learning with UCB exploration achieves regret of [math]\widetilde{\mathcal{O}}(\sqrt{T^3|\mathcal{S}|\,|\mathcal{A}|M})[/math] without requiring access to a simulator. The above regret bounds depend on the size of the state and action spaces and thus may suffer from the curse of dimensionality as [math]|\mathcal{S}|[/math] and [math]|\mathcal{A}|[/math] increase. To tackle this issue, [92] learned a low-dimensional representation of the probability transition model and proved that their MatrixRL algorithm achieves a regret bound of [math]\widetilde{\mathcal{O}}(T^2\,d\sqrt{M})[/math], which depends on the number of features [math]d[/math] rather than on [math]|\mathcal{S}|[/math] and [math]|\mathcal{A}|[/math]. [15] provided an [math]\widetilde{\mathcal{O}}(\sqrt{d^3T^3M})[/math] regret bound, holding with high probability, for a modified version of the Least-Squares Value Iteration (LSVI) algorithm with UCB exploration.
Model-based RL with Infinite Time Horizon
To derive policies by utilizing estimated transition probabilities and reward functions in the infinite time horizon setting, [29] proposed the R-MAX algorithm, which can also be applied to zero-sum stochastic games. The R-MAX algorithm was proved in [93] to achieve a sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|}{(1-\gamma)^6\varepsilon^3}\big)[/math]. The Model-based Interval Estimation (MBIE) algorithm of [94] was proved to achieve a sample complexity of exploration of the same order as R-MAX. [31] further modified the R-MAX algorithm to obtain the MoRmax algorithm, with an improved sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^6\varepsilon^2}\big)[/math]. [95] proved that model-based [math]Q[/math]-Value Iteration (QVI) achieves a [math]Q[/math]-sample complexity of [math]\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^3\varepsilon^2})[/math] for sufficiently small [math]\varepsilon[/math], with probability at least [math]1-\delta[/math]. [96] studied the approximate MDP model with any accurate black-box planning algorithm and showed that this achieves the same sample complexity as [95], but for a larger range of [math]\varepsilon[/math]. See also [45][20][30] for model-based RL along this line.

Linear-quadratic (LQ) problems are another type of model-based RL problem, assuming linear state dynamics and quadratic objective functions with continuous state and action spaces. They are among the few optimal control problems for which the optimal feedback control admits a closed-form analytical representation. Thus LQ frameworks, including the (single-agent) linear-quadratic regulator (LQR) and (multi-agent) LQ games, can be used in a wide variety of applications, such as the optimal execution problems which we will discuss later. For a theoretical analysis of RL algorithms within the LQ framework, see for example [32] and [33] for the global linear convergence of policy gradient methods in LQR problems with infinite and finite horizon, respectively; see also [34] and [97] for the regret analysis of linear-quadratic reinforcement learning algorithms in continuous time.
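To illustrate why the LQ setting is so tractable, recall the standard discrete-time, infinite-horizon LQR fact (stated here for reference; the cost matrices [math]\pmb{Q},\pmb{R}[/math] below are unrelated to the [math]Q[/math]-function and reward used elsewhere): for dynamics [math]x_{t+1} = A x_t + B u_t + w_t[/math] and expected cost [math]\mathbb{E}\sum_t \left(x_t^\top \pmb{Q}\, x_t + u_t^\top \pmb{R}\, u_t\right)[/math], the optimal control is the linear feedback

[math]u_t = -K x_t, \qquad K = (\pmb{R} + B^\top P B)^{-1} B^\top P A,[/math]

where [math]P[/math] solves the discrete algebraic Riccati equation

[math]P = \pmb{Q} + A^\top P A - A^\top P B\,(\pmb{R} + B^\top P B)^{-1} B^\top P A.[/math]

Policy gradient methods such as [32] exploit this structure by searching directly over the gain matrix [math]K[/math].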
Exploration Exploitation Trade-offs
As mentioned earlier, exploitation versus exploration is a critical topic in RL: the agent wants to find the optimal solution as fast as possible, yet committing to a solution too fast without enough exploration can lead to local minima or total failure. Modern RL algorithms that optimize for the best returns can achieve good exploitation quite efficiently, while efficient exploration remains a challenging and less well understood topic. For multi-armed bandit problems or MDPs with finite state and action spaces, classic exploration algorithms such as [math]\varepsilon[/math]-greedy, UCB, Boltzmann exploration, and Thompson sampling show promising performance in different settings. In the [math]\varepsilon[/math]-greedy algorithm [98][99], the agent explores at random with probability [math]\varepsilon[/math] and takes the greedy action with probability [math]1-\varepsilon[/math]; see Equation \eqref{epsilon_policy} for the [math]\varepsilon[/math]-greedy method combined with a value-based algorithm. UCB-type algorithms are mainly designed for value-based methods [46][90][91][15]: the agent selects the action maximizing the upper confidence bound [math]Q(s,a) + U(s,a)[/math], where [math]Q(s,a)[/math] is the estimated [math]Q[/math]-function and the bonus [math]U(s,a)[/math] is inversely proportional to the number of times the state-action pair [math](s,a)[/math] has been visited (and hence reflects how uncertain the agent is about her estimate of the [math]Q[/math]-function). In Boltzmann exploration [36][37][38], the agent draws actions from a Boltzmann (softmax) distribution over the learned [math]Q[/math]-function, regulated by a temperature parameter. In Thompson sampling [100][101][102], the agent maintains a belief about the optimal action via a distribution over the set of admissible actions (for each given state) and samples from this distribution.
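The following is a minimal sketch contrasting [math]\varepsilon[/math]-greedy and UCB action selection over tabular [math]Q[/math]-estimates; the bonus constant c and the count smoothing are illustrative assumptions rather than any specific algorithm from the references above.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, s, epsilon=0.1):
    """Explore uniformly with probability epsilon, otherwise exploit
    argmax_a Q[s, a]."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def ucb_action(Q, N, s, t, c=2.0):
    """Pick argmax_a Q[s, a] + U(s, a), where the bonus U is inversely
    proportional to the visit count N[s, a]: rarely tried actions get a
    large bonus, reflecting the agent's uncertainty about their value.
    Here t is the total number of steps taken so far."""
    bonus = c * np.sqrt(np.log(t + 1) / (N[s] + 1e-8))
    return int(np.argmax(Q[s] + bonus))
```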
For deep RL training, where neural networks are used for function approximation, entropy regularization (adding an entropy term to the loss function, which encourages the policy to take diverse actions) and noise-based exploration (adding noise to the observations, actions, or even the parameter space) are often used to explore the parameter space more effectively.
Regularization in Reinforcement Learning
Many recent developments in RL benefit from regularization, which has been shown to improve not only exploration but also robustness. Examples include both policy-based and value-based methods in various settings; here we provide a general overview of different regularization techniques and their advantages, along with several examples. TRPO and PPO use the KL-divergence between two consecutive policies as a penalty in policy improvement [55][56], and soft [math]Q[/math]-learning uses Shannon entropy as a penalty in value iteration [37]; the authors confirmed in simulated experiments that entropy regularization helps to improve exploration and allows skills to be transferred between tasks. [103] proposed a unified framework for analyzing the above algorithms via a regularized Bellman operator, and [86] showed from a theoretical perspective that entropy regularization can also speed up convergence. It is worth noting that [38] explained the exploration-exploitation trade-off with entropy regularization from a continuous-time stochastic control perspective and provided theoretical support for Gaussian exploration in the special case of the linear-quadratic regulator; recently, [104] and [105] extended this idea to a general control framework beyond the linear-quadratic case. [106] extended the idea of Shannon entropy regularization to mutual-information regularization and showed its superior performance when actions have significantly different importance. When function approximation is applied in RL training, regularization has been shown to be a powerful, flexible, and principled way to balance approximation and estimation errors [107][108]. The idea is to start from a large function space and control the complexity of the solution by a regularization (or penalization) term; examples include the Hilbert norm (for [math]Q[/math]-functions) and the [math]l_1[/math] norm (for parameters).
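As a concrete instance of entropy regularization, the Shannon-entropy penalty of soft [math]Q[/math]-learning [37] replaces the max in the Bellman backup by a softmax at temperature [math]\tau \gt 0[/math] (a standard form of the soft backup, stated here for illustration):

[math]V(s) = \tau \log \sum_{a\in\mathcal{A}} \exp\left(\frac{Q(s,a)}{\tau}\right), \qquad \pi(s,a) \propto \exp\left(\frac{Q(s,a)}{\tau}\right),[/math]

which recovers the greedy max and a deterministic policy as [math]\tau\rightarrow 0[/math], and a uniform, maximally exploratory policy as [math]\tau\rightarrow\infty[/math].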
Reinforcement Learning in Non-stationary Environment
Most existing work on reinforcement learning considers a stationary environment and aims to find the optimal policy or a policy with low (static) regret. In many financial applications, however, the environment is far from being stationary. We introduce the mathematical formulations of non-stationary RL (in both episodic and infinite horizon settings) below.
Non-stationary Episodic RL. For the episodic (or finite time horizon) setting, an agent interacts with a non-stationary MDP for [math]K[/math] episodes, with each episode containing [math]T[/math] time steps. Let [math](t, k)[/math] denote the (time, episode) index for the [math]t[/math]-th time step in the [math]k[/math]-th episode. The non-stationary environment can be described by a tuple [math](\mathcal{S},\mathcal{A}, T, \pmb{P}, \pmb{r})[/math], where [math]\mathcal{S}[/math] is the state space, [math]\mathcal{A}[/math] is the action space, [math]T[/math] is the number of time steps in one episode, [math]\pmb{P} = \{P_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T}[/math] is the set of transition kernels with [math]P_{k,t}: \mathcal{S}\times \mathcal{A}\rightarrow\mathcal{P}(\mathcal{S})[/math] for all [math]1\leq k\leq K[/math] and [math]0\leq t\leq T[/math], and [math]\pmb{r} = \{r_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T}[/math] is the set of reward functions, with [math]r_{k,t}(s,a)[/math] a real-valued random variable (depending on both the time step and the episode number) for each pair [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math] and [math]0\leq t\leq T-1[/math], and the terminal reward [math]r_{k,T}(s)[/math] a real-valued random variable for each [math]s\in\mathcal{S}[/math]. Denote by [math]\Pi=\{\pi_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T-1}[/math] the Markovian policy, in which each [math]\pi_{k,t}[/math] can be either deterministic or randomized. The value function for this policy at the [math]k[/math]-th episode can be written as
[math]V^{\Pi}_{k,t}(s) := \mathbb{E}\left[\sum_{u=t}^{T-1} r_{k,u}(s_u,a_u) + r_{k,T}(s_T)\,\Big|\, s_t = s\right],[/math]
where [math]s_{u+1} \sim P_{k,u}(\cdot|s_u,a_u)[/math] and [math]a_u\sim \pi_{k,u}(s_u)[/math].
Non-stationary RL with Infinite Time Horizon. For the infinite horizon setting, we drop the index indicating episodes and let the finite horizon [math]T[/math] become infinite. Then the non-stationary environment can be denoted by a tuple [math](\mathcal{S},\mathcal{A}, \pmb{P}, \pmb{r})[/math], with the obvious changes from the episodic case. Again we write [math]\Pi=\{\pi_{t} \}_{t \ge 0}[/math] for the Markovian policy in which [math]\pi_{t}[/math] can be either deterministic or randomized and consider the associated value function with a discount factor [math]\gamma[/math]:
[math]V^{\Pi}_{t}(s) := \mathbb{E}\left[\sum_{u=t}^{\infty} \gamma^{u-t}\, r_{u}(s_u,a_u)\,\Big|\, s_t = s\right],[/math]
where [math]s_{u+1} \sim P_{u}(\cdot|s_u,a_u)[/math] and [math]a_u\sim \pi_{u}(s_u)[/math].
Indeed, there has been a recent surge of theoretical studies on this topic in various settings, such as infinite-horizon MDPs with finite state and action spaces [110][111], episodic MDPs with finite state and action spaces [112], and episodic linear MDPs [113]. However, all these papers assume the agent has prior knowledge of the degree of non-stationarity, such as the number of changes in the environment, which is not very practical in many applications. To relax this assumption, [114] proposed a Bandit-over-Reinforcement-Learning (BoRL) algorithm extending their earlier result [111]; however, it introduces extra overhead and leads to suboptimal regret. More recently, [115] proposed a general approach that is applicable to various reinforcement learning settings (including both episodic and infinite-horizon MDPs) and achieves optimal dynamic regret without any prior knowledge of the degree of non-stationarity.
General references
Hambly, Ben; Xu, Renyuan; Yang, Huining (2023). "Recent Advances in Reinforcement Learning in Finance". arXiv:2112.04553 [q-fin.MF].
Notes
- Sample complexity with respect to the number of episodes has also been adopted in the literature [25][26]. The translation between these two criteria is straightforward.
- The learning rate can depend on the number of iterations, the state-action pair and other parameters in the algorithm. For notational simplicity, we demonstrate the SARSA Algorithm (and the rest of the algorithms mentioned) with a constant learning rate [math]\beta[/math]. In the convergence analysis, we use [math]\beta_n[/math] to denote a learning rate which depends on the number of iterations.
- In the original papers the sample complexity is defined in terms of the number of episodes. Here we multiply the original order in these papers by [math]T[/math] to match the definition of sample complexity in this paper.
References
- [1] silver2017mastering
- [2] vinyals2019grandmaster
- [3] radford2019language
- [4] lillicrap2015continuous
- [5] goodfellow2016deep
- [6] mnih2013playing
- [7] van2016deep
- [8] gu2016continuous
- [9] fan2020theoretical
- [10] puterman2014markov
- [11] abbasi2019politex
- [12] wei2020model
- [13] bradtke1996linear
- [14] melo2007q
- [15] jin2020provably
- [16] yang2019sample
- [17] dai2018sbeed
- [18] suttonbarto
- [19] powell2021reinforcement
- [20] kearns2002near
- [21] koenig1993complexity
- [22] azar2012sample
- [23] robbins1952some
- [24] berry1985bandit
- [25] DannBrunskill2015
- [26] DannUBEV
- [27] zhang2020sample
- [28] liu2020improved
- [29] brafman2002r
- [30] lattimore2012pac
- [31] MoRmax2010
- [32] fazel2018global
- [33] hambly2020policy
- [34] basei2021logarithmic
- [35] watkins1992q
- [36] asadi2017alternative
- [37] haarnoja2017reinforcement
- [38] WZZ2018
- [39] dalal2018finite
- [40] lakshminarayanan2018linear
- [41] bhandari2018finite
- [42] zou2019finite
- [43] even2003learning
- [44] beck2012error
- [45] strehl2009reinforcement
- [46] dong2019q
- [47] lattimore2020learning
- [48] hasselt2010double
- [49] xiong2020finite
- [50] sewak2019policy
- [51] yu2020policy
- [52] daberius2019deep
- [53] williams1992
- [54] konda2000
- [55] schulman2015
- [56] schulman2017proximal
- [57] sutton2000policy
- [58] konda2002
- [59] franccois2019
- [60] von2011statistical
- [61] kumar2019sample
- [62] xu2020improving
- [63] xu2020non
- [64] fu2020single
- [65] xu2021doubly
- [66] kakade2001natural
- [67] peters2008natural
- [68] silver2014deterministic
- [69] wang2016
- [70] mnih2016asynchronous
- [71] jia2021policy
- [72] jia2022policy
- [73] konda2003onactor
- [74] bhatnagar2010actor
- [75] zhang2020global
- [76] papini2018stochastic
- [77] baxter2001infinite
- [78] xu2020improved
- [79] xu2019sample
- [80] shen2019hessian
- [81] xiong2020non
- [82] bhandari2019
- [83] Zhang2021REINFORCE
- [84] agarwal2021theory
- [85] mei2020
- [86] cen2020fast
- [87] shani2020adaptive
- [88] kaelbling1996reinforcement
- [89] PSRL2013
- [90] azar2017minimax
- [91] jin2018q
- [92] yang2020reinforcement
- [93] li2009rmax
- [94] Strehl2005
- [95] azar2013minimax
- [96] agarwal2020model
- [97] guo2021reinforcement
- [98] dann2022guarantees
- [99] dabney2020temporally
- [100] thompson1933likelihood
- [101] ouyang2017learning
- [102] gopalan2015thompson
- [103] geist2019theory
- [104] gao2020state
- [105] tang2021exploratory
- [106] grau2018soft
- [107] massoud2009regularized
- [108] farahmand2008regularized
- [109] liu2019neural
- [110] gajane2018sliding
- [111] cheung2019learning
- [112] mao2020model
- [113] touati2020efficient
- [114] cheung2020reinforcement
- [115] wei2021non