guide:C8c80a2ae8: Difference between revisions
No edit summary |
No edit summary |
||
Line 45: | Line 45: | ||
an approach to understanding and automating goal-directed learning and decision-making. It leads naturally to algorithms for such problems and is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with {their} environment, without relying on exemplary supervision or complete models of the environment. | an approach to understanding and automating goal-directed learning and decision-making. It leads naturally to algorithms for such problems and is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with {their} environment, without relying on exemplary supervision or complete models of the environment. | ||
RL uses a formal framework defining the interaction between a learning agent and {their} environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and non-determinism, and the existence of explicit goals. | RL uses a formal framework defining the interaction between a learning agent and {their} environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and non-determinism, and the existence of explicit goals. | ||
The history of RL has two main threads that were pursued independently before intertwining in modern RL. One thread concerns learning by trial and error and started in the psychology of animal learning. | The history of RL has two main threads that were pursued independently before intertwining in modern RL. One thread concerns learning by trial and error and started in the psychology of animal learning. The other thread concerns the problem of optimal control for an evolving system and its solution using value functions and dynamic programming. Although this thread did not involve learning directly, the Bellman equation developed from this line of literature is viewed as the foundation of many important modern RL algorithms such as <math>Q</math>-learning and Actor-Critic. | ||
The other thread concerns the problem of optimal control for an evolving system and its solution using value functions and dynamic programming. Although this thread did not involve learning directly, the Bellman equation developed from this line of literature is viewed as the foundation of many important modern RL algorithms such as <math>Q</math>-learning and Actor-Critic. | |||
Over the last few years, RL, grounded on combining classical theoretical results with deep learning and the functional approximation paradigm, has proved to be a fruitful approach to many artificial intelligence tasks from diverse domains. Breakthrough achievements include reaching human-level performance in such complex tasks as Go <ref name="silver2017mastering"/> and multi-player StarCraft II <ref name="vinyals2019grandmaster"/>, as well as the development of ChatGPT <ref name="radford2019language"/>. The generality of the reinforcement learning framework allows its application in both discrete and | Over the last few years, RL, grounded on combining classical theoretical results with deep learning and the functional approximation paradigm, has proved to be a fruitful approach to many artificial intelligence tasks from diverse domains. Breakthrough achievements include reaching human-level performance in such complex tasks as Go <ref name="silver2017mastering"/> and multi-player StarCraft II <ref name="vinyals2019grandmaster"/>, as well as the development of ChatGPT <ref name="radford2019language"/>. The generality of the reinforcement learning framework allows its application in both discrete and | ||
continuous spaces to solve tasks in both real and simulated environments <ref name="lillicrap2015continuous"/>. | continuous spaces to solve tasks in both real and simulated environments <ref name="lillicrap2015continuous"/>. | ||
Line 53: | Line 52: | ||
Our aim in this section is to introduce the foundations of reinforcement learning. We start with the setup for MDP in [[#sec:set_up |Section]] with both an infinite time horizon and a finite time horizon, as there are financial applications of both settings in the literature. In [[#sec:MDP2learning |Section]] we then introduce the learning procedure when the reward and transition dynamics of the MDPs are unknown to the decision-maker. In particular, we introduce various criteria to measure the performance of learning algorithms in different settings. Then we focus on two major classes of model-free reinforcement learning algorithms, value-based methods in [[#sec:value_based_methods |Section]] and policy-based methods in Section~[[#subsec:policy_based_methods |subsec:policy_based_methods]], with an infinite-time horizon. Finally, Section~[[#sec:discussion |sec:discussion]] contains a general discussion of (model-based and model-free) RL with a finite-time horizon, model-based RL with an infinite-time horizon, and regularization methods for RL. | Our aim in this section is to introduce the foundations of reinforcement learning. We start with the setup for MDP in [[#sec:set_up |Section]] with both an infinite time horizon and a finite time horizon, as there are financial applications of both settings in the literature. In [[#sec:MDP2learning |Section]] we then introduce the learning procedure when the reward and transition dynamics of the MDPs are unknown to the decision-maker. In particular, we introduce various criteria to measure the performance of learning algorithms in different settings. Then we focus on two major classes of model-free reinforcement learning algorithms, value-based methods in [[#sec:value_based_methods |Section]] and policy-based methods in Section~[[#subsec:policy_based_methods |subsec:policy_based_methods]], with an infinite-time horizon. Finally, Section~[[#sec:discussion |sec:discussion]] contains a general discussion of (model-based and model-free) RL with a finite-time horizon, model-based RL with an infinite-time horizon, and regularization methods for RL. | ||
===<span id="sec:set_up"></span>Setup: Markov Decision Processes=== | ===<span id="sec:set_up"></span>Setup: Markov Decision Processes=== | ||
We first introduce MDPs with infinite time horizon. Many portfolio optimization problems, such as those arising from pension funds, investigate long-term investment strategies and hence it is natural to formulate them as a decision-making problem over an infinite time horizon. We then introduce MDPs with finite time horizon. Problems such as the optimal execution of the purchase or liquidation of a quantity of asset usually involve trading over a short time horizon and there may be a penalty at the terminal time if the targeted amount is not completely executed by then. Finally we discuss different policy classes that are commonly adopted by the decision-maker. | We first introduce MDPs with infinite time horizon. Many portfolio optimization problems, such as those arising from pension funds, investigate long-term investment strategies and hence it is natural to formulate them as a decision-making problem over an infinite time horizon. We then introduce MDPs with finite time horizon. Problems such as the optimal execution of the purchase or liquidation of a quantity of asset usually involve trading over a short time horizon and there may be a penalty at the terminal time if the targeted amount is not completely executed by then. Finally we discuss different policy classes that are commonly adopted by the decision-maker. | ||
Line 73: | Line 72: | ||
\end{eqnarray} | \end{eqnarray} | ||
</math> | </math> | ||
Here and throughout the paper, <math>\E^{\Pi}</math> denotes the expectation under the policy <math>{\Pi}</math>, where the probability measure <math>\mathbb{P}</math> describes both dynamics and the rewards in the MDP. We will write <math>\mathcal{P}(\mathcal{X})</math> for the probability measures over a space <math>\mathcal{X}</math>. The state space <math>(\Sc, d_{\Sc})</math> and the action space <math>(\Ac, d_{\Ac})</math> are both complete separable metric spaces, which includes the case of <math>\Sc</math> and <math>\Ac</math> being finite, as often seen in the RL literature; <math>\gamma</math> <math>\in</math> <math>(0, 1)</math> is a discount factor; <math>s_t \in \Sc</math> and <math>a_t \in \Ac</math> are the state and the action at time <math>t</math>; <math>P: \Sc \times \Ac \to \Pc(\Sc)</math> is the transition function of the underlying Markov process; the notation <math> s_{t + 1} \sim P(s_t, a_t)</math> denotes that <math> s_{t + 1}</math> is sampled from the distribution <math> P(s_t, a_t)\in \mathcal{P}(\mathcal{S})</math>; the reward <math>r(s, a)</math> is a random variable in <math>\R</math> for each pair <math>(s, a)\in \Sc \times \Ac</math>; and the policy <math>\Pi =\{\pi_t\}_{t=0}^{\infty}</math> is Markovian, in that it just depends on the current state, and can be either deterministic or randomized. For a deterministic policy, <math>\pi_t: \Sc \to \Ac</math> maps the current state <math>s_t</math> to a deterministic action. On the other hand, a randomized policy <math>\pi_t: \Sc\to \Pc(\Ac)</math> maps the current state <math>s_t</math> to a distribution over the action space. | Here and throughout the paper, <math>\E^{\Pi}</math> denotes the expectation under the policy <math>{\Pi}</math>, where the probability measure <math>\mathbb{P}</math> describes both dynamics and the rewards in the MDP. We will write <math>\mathcal{P}(\mathcal{X})</math> for the probability measures over a space <math>\mathcal{X}</math>. The state space <math>(\Sc, d_{\Sc})</math> and the action space <math>(\Ac, d_{\Ac})</math> are both complete separable metric spaces, which includes the case of <math>\Sc</math> and <math>\Ac</math> being finite, as often seen in the RL literature; <math>\gamma</math> <math>\in</math> <math>(0, 1)</math> is a discount factor; <math>s_t \in \Sc</math> and <math>a_t \in \Ac</math> are the state and the action at time <math>t</math>; <math>P: \Sc \times \Ac \to \Pc(\Sc)</math> is the transition function of the underlying Markov process; the notation <math> s_{t + 1} \sim P(s_t, a_t)</math> denotes that <math> s_{t + 1}</math> is sampled from the distribution <math> P(s_t, a_t)\in \mathcal{P}(\mathcal{S})</math>; the reward <math>r(s, a)</math> is a random variable in <math>\R</math> for each pair <math>(s, a)\in \Sc \times \Ac</math>; and the policy <math>\Pi =\{\pi_t\}_{t=0}^{\infty}</math> is Markovian, in that it just depends on the current state, and can be either deterministic or randomized. For a deterministic policy, <math>\pi_t: \Sc \to \Ac</math> maps the current state <math>s_t</math> to a deterministic action. On the other hand, a randomized policy <math>\pi_t: \Sc\to \Pc(\Ac)</math> maps the current state <math>s_t</math> to a distribution over the action space. | ||
For a randomized policy <math>\pi_t</math>, we will use <math>\pi_t(s)\in \mathcal{P}(\mathcal{A})</math> and <math>\pi_t(s,a)</math> to represent, respectively, the distribution over the action space given state <math>s</math> and the probability of taking action <math>a</math> at state <math>s</math>. | For a randomized policy <math>\pi_t</math>, we will use <math>\pi_t(s)\in \mathcal{P}(\mathcal{A})</math> and <math>\pi_t(s,a)</math> to represent, respectively, the distribution over the action space given state <math>s</math> and the probability of taking action <math>a</math> at state <math>s</math>. | ||
In this case, with infinite-time horizon, we assume the reward <math>r</math> and the transition dynamics <math>P</math> are time homogeneous, which is a standard assumption in the MDP literature <ref name="puterman2014markov"/>. We also note that the setup is essentially the same if we consider the problem of minimizing costs rather than maximizing a reward. | In this case, with infinite-time horizon, we assume the reward <math>r</math> and the transition dynamics <math>P</math> are time homogeneous, which is a standard assumption in the MDP literature <ref name="puterman2014markov"/>. We also note that the setup is essentially the same if we consider the problem of minimizing costs rather than maximizing a reward. | ||
The well-known Dynamic Programming Principle (DPP), that the optimal policy can be obtained by maximizing the reward from one step and then proceeding optimally from the new state, can be used to derive the following Bellman equation for the value function \eqref{equ: singlev}; | The well-known Dynamic Programming Principle (DPP), that the optimal policy can be obtained by maximizing the reward from one step and then proceeding optimally from the new state, can be used to derive the following Bellman equation for the value function \eqref{equ: singlev}; | ||
Line 145: | Line 142: | ||
\end{eqnarray} | \end{eqnarray} | ||
</math> | </math> | ||
As in the infinite horizon case, we denote by <math>s_u \in \Sc</math> and <math>a_u \in \Ac</math> the state and the action of the agent at time <math>u</math>. However, where this differs from the infinite horizon case, is that we allow time-dependent transition and reward functions. Now <math>P_u: \Sc \times \Ac \to \Pc(\Sc)</math> denotes the transition function and <math>r_u(s, a)</math> a real-valued random variable for each pair <math>(s, a)\in \Sc \times \Ac</math> for <math>t \leq u \leq T-1</math>. Similarly <math>r_T(s)</math>, the terminal reward, is a real-valued random variable for all <math>s\in \Sc</math>. Finally, the Markovian policy <math>\Pi =\{\pi_u\}_{t=0}^T</math> can be either deterministic or randomized. | |||
As in the infinite horizon case, we denote by <math>s_u \in \Sc</math> and <math>a_u \in \Ac</math> the state and the action of the agent at time <math>u</math>. | |||
However, where this differs from the infinite horizon case, is that we allow time-dependent transition and reward functions. Now <math>P_u: \Sc \times \Ac \to \Pc(\Sc)</math> denotes the transition function and <math>r_u(s, a)</math> a real-valued random variable for each pair <math>(s, a)\in \Sc \times \Ac</math> for <math>t \leq u \leq T-1</math>. Similarly <math>r_T(s)</math>, the terminal reward, is a real-valued random variable for all <math>s\in \Sc</math>. Finally, the Markovian policy <math>\Pi =\{\pi_u\}_{t=0}^T</math> can be either deterministic or randomized. | |||
The corresponding Bellman equation for the value function \eqref{equ: singlev_fh} is defined as | The corresponding Bellman equation for the value function \eqref{equ: singlev_fh} is defined as | ||
Line 169: | Line 164: | ||
\end{eqnarray} | \end{eqnarray} | ||
</math> | </math> | ||
There is also a Bellman equation for the <math>Q</math>-function given by | There is also a Bellman equation for the <math>Q</math>-function given by | ||
Line 189: | Line 183: | ||
It is easy to observe that MD <math>\subset</math> MR <math>\subset</math> HR. Recall that in \eqref{equ: singlev} and \eqref{equ: singlev_fh}, the optimal policy is defined as the one that maximizes the gain of the MDP. Due to the structure of MDPs it is not difficult to show that it is sufficient to consider Markovian policies. Henceforth, we shall focus only on Markovian policies. | It is easy to observe that MD <math>\subset</math> MR <math>\subset</math> HR. Recall that in \eqref{equ: singlev} and \eqref{equ: singlev_fh}, the optimal policy is defined as the one that maximizes the gain of the MDP. Due to the structure of MDPs it is not difficult to show that it is sufficient to consider Markovian policies. Henceforth, we shall focus only on Markovian policies. | ||
\fi | \fi | ||
For an infinite horizon MDP with finite state and action space, and finite reward <math>r</math>, a further useful observation is that such an MDP always has a stationary optimal policy, whenever an optimal policy exists. | For an infinite horizon MDP with finite state and action space, and finite reward <math>r</math>, a further useful observation is that such an MDP always has a stationary optimal policy, whenever an optimal policy exists. | ||
{{proofcard|Theorem (Theorem 6.2.7 in <ref name="puterman2014markov"/>)|thm:MDP_discounted|Assume <math>|\mathcal{A}| < \infty</math>, <math>|\mathcal{S}| < \infty</math> and <math>|r| < \infty</math> with probability one. | {{proofcard|Theorem (Theorem 6.2.7 in <ref name="puterman2014markov"/>)|thm:MDP_discounted|Assume <math>|\mathcal{A}| < \infty</math>, <math>|\mathcal{S}| < \infty</math> and <math>|r| < \infty</math> with probability one. | ||
Line 237: | Line 231: | ||
===<span id="sec:MDP2learning"></span>From MDP to Learning=== | ===<span id="sec:MDP2learning"></span>From MDP to Learning=== | ||
When the transition dynamics <math>P</math> and the reward function <math>r</math> for the infinite horizon MDP problem are unknown, this MDP becomes a reinforcement learning problem, which is to find an optimal stationary policy <math>\pi</math> (if it exists) | When the transition dynamics <math>P</math> and the reward function <math>r</math> for the infinite horizon MDP problem are unknown, this MDP becomes a reinforcement learning problem, which is to find an optimal stationary policy <math>\pi</math> (if it exists) | ||
while simultaneously learning the unknown <math>P</math> and <math>r</math>. The learning of <math>P</math> and <math>r</math> can be either explicit or implicit, which leads to model-based and model-free RL, respectively. The analogous ideas hold for the finite horizon case. We introduce some standard RL terminology. A more detailed introduction to RL can be found in textbooks such as <ref name="suttonbarto"/><ref name="powell2021reinforcement"/>. | while simultaneously learning the unknown <math>P</math> and <math>r</math>. The learning of <math>P</math> and <math>r</math> can be either explicit or implicit, which leads to model-based and model-free RL, respectively. The analogous ideas hold for the finite horizon case. We introduce some standard RL terminology. A more detailed introduction to RL can be found in textbooks such as <ref name="suttonbarto"/><ref name="powell2021reinforcement"/>. | ||
Line 262: | Line 256: | ||
'''An Example: Multi-armed bandits.''' We illustrate these points by discussing | '''An Example: Multi-armed bandits.''' We illustrate these points by discussing | ||
multi-armed bandit problems, a special case of RL problems. | multi-armed bandit problems, a special case of RL problems. | ||
The multi-armed bandit is a model for a set of slot machines. A simple version is that there are a number of arms, each with a stochastic reward coming from a fixed probability distribution which is initially unknown. | The multi-armed bandit is a model for a set of slot machines. A simple version is that there are a number of arms, each with a stochastic reward coming from a fixed probability distribution which is initially unknown. | ||
We try these arms in some order, which may depend on the sequence of rewards that have been | We try these arms in some order, which may depend on the sequence of rewards that have been | ||
Line 269: | Line 261: | ||
to be tried, under which the sum of the expected rewards comes as close as possible to the ideal | to be tried, under which the sum of the expected rewards comes as close as possible to the ideal | ||
reward, i.e., the expected reward that would be obtained if we were to try the “best” arm at all times. | reward, i.e., the expected reward that would be obtained if we were to try the “best” arm at all times. | ||
Thus the agent interacts with the environment by selecting an action - pulling an arm - and receiving a reward. | Thus the agent interacts with the environment by selecting an action - pulling an arm - and receiving a reward. | ||
The policy of the order of pulling on the arms has to balance exploitation, the continued use of an arm that is returning a decent reward, with exploration, sampling arms about which there is less information to find one with a possibly better return. | The policy of the order of pulling on the arms has to balance exploitation, the continued use of an arm that is returning a decent reward, with exploration, sampling arms about which there is less information to find one with a possibly better return. | ||
In this setting for a multi-armed bandit problem we have not involved state dynamics. In this case, an admissible policy is simply a distribution over the action space for a randomized policy and a single arm for a deterministic policy. A simple example of a bandit with a state space is the stochastic contextual bandit which is used extensively in the personalized advertisement recommendation and clinical treatment recommendation literature. In this case the state dynamics (also referred to as the context) are sampled i.i.d. from an unknown distribution. For the personalized advertisement recommendation example, the state at time <math>t</math> can be interpreted as the personal information and purchasing behavior of a targeted client. | In this setting for a multi-armed bandit problem we have not involved state dynamics. In this case, an admissible policy is simply a distribution over the action space for a randomized policy and a single arm for a deterministic policy. A simple example of a bandit with a state space is the stochastic contextual bandit which is used extensively in the personalized advertisement recommendation and clinical treatment recommendation literature. In this case the state dynamics (also referred to as the context) are sampled i.i.d. from an unknown distribution. For the personalized advertisement recommendation example, the state at time <math>t</math> can be interpreted as the personal information and purchasing behavior of a targeted client. | ||
Line 276: | Line 268: | ||
The multi-armed bandit problem was later studied in discounted, Bayesian, Markovian, expected | The multi-armed bandit problem was later studied in discounted, Bayesian, Markovian, expected | ||
reward, and adversarial setups. See Berry and Fristedt <ref name="berry1985bandit"/> for a review of the classical results on the multi-armed bandit problem. | reward, and adversarial setups. See Berry and Fristedt <ref name="berry1985bandit"/> for a review of the classical results on the multi-armed bandit problem. | ||
====Performance Evaluation==== | ====Performance Evaluation==== | ||
It is not possible to solve MDPs analytically in general and therefore we will discuss numerical algorithms to determine the optimal policies. | It is not possible to solve MDPs analytically in general and therefore we will discuss numerical algorithms to determine the optimal policies. | ||
In order to assess different algorithms we will need a measure of their performance. Here we will consider several types of performance measure: sample complexity, rate of convergence, regret analysis and asymptotic convergence. | In order to assess different algorithms we will need a measure of their performance. Here we will consider several types of performance measure: sample complexity, rate of convergence, regret analysis and asymptotic convergence. | ||
For RL with a finite time horizon, one episode contains a sequence of states, actions and rewards, which starts at time <math>0</math> and ends at the terminal time <math>T</math>, and the performance is evaluated in terms of the total number of samples in all episodes.{{efn|Sample complexity with respect to the number of episodes has also been adopted in the literature <ref name="DannBrunskill2015"/><ref name="DannUBEV"/>. The translation between these two criteria is straightforward.}} Sometimes we also refer to this setting as episodic RL. For RL with infinite time horizon, the performance is evaluated in terms of the number of steps. In this setting one step contains one (state, action, reward, next state) tuple. It is worth noting that the episodic criterion can also be used to analyze infinite horizon problems. One popular technique is to truncate the trajectory (with infinite steps) at a large time and hence translate the problem into an approximating finite time horizon problem. Examples include REINFORCE <ref name="zhang2020sample"/> and the policy gradient method <ref name="liu2020improved"/> (see Section~[[#subsec:policy_based_methods |subsec:policy_based_methods]]). | For RL with a finite time horizon, one episode contains a sequence of states, actions and rewards, which starts at time <math>0</math> and ends at the terminal time <math>T</math>, and the performance is evaluated in terms of the total number of samples in all episodes.{{efn|Sample complexity with respect to the number of episodes has also been adopted in the literature <ref name="DannBrunskill2015"/><ref name="DannUBEV"/>. The translation between these two criteria is straightforward.}} Sometimes we also refer to this setting as episodic RL. For RL with infinite time horizon, the performance is evaluated in terms of the number of steps. In this setting one step contains one (state, action, reward, next state) tuple. It is worth noting that the episodic criterion can also be used to analyze infinite horizon problems. One popular technique is to truncate the trajectory (with infinite steps) at a large time and hence translate the problem into an approximating finite time horizon problem. Examples include REINFORCE <ref name="zhang2020sample"/> and the policy gradient method <ref name="liu2020improved"/> (see Section~[[#subsec:policy_based_methods |subsec:policy_based_methods]]). | ||
Throughout the analysis, we use the notation <math>\widetilde{O}(\cdot)</math> to hide logarithmic factors when describing the order of the scaling in <math>\varepsilon</math>, <math>\delta</math>, <math>|\mathcal{A}|</math> and <math>|\mathcal{S}|</math> (when these are finite), and other possible model parameters such as the discount rate <math>\gamma</math> in the infinite horizon case or the dimension <math>d</math> of the features when functional approximation is used. We write <math>\widetilde{\mathcal{O}}^{\prime}</math> rather than <math>\widetilde{\mathcal{O}}</math> to indicate that, although there may be other parameters, we only include the dependence on <math>\varepsilon</math>. | Throughout the analysis, we use the notation <math>\widetilde{O}(\cdot)</math> to hide logarithmic factors when describing the order of the scaling in <math>\varepsilon</math>, <math>\delta</math>, <math>|\mathcal{A}|</math> and <math>|\mathcal{S}|</math> (when these are finite), and other possible model parameters such as the discount rate <math>\gamma</math> in the infinite horizon case or the dimension <math>d</math> of the features when functional approximation is used. We write <math>\widetilde{\mathcal{O}}^{\prime}</math> rather than <math>\widetilde{\mathcal{O}}</math> to indicate that, although there may be other parameters, we only include the dependence on <math>\varepsilon</math>. | ||
We denote by <math>\mbox{poly}(\cdot)</math> a polynomial function of its arguments. | We denote by <math>\mbox{poly}(\cdot)</math> a polynomial function of its arguments. | ||
Line 297: | Line 285: | ||
\end{eqnarray} | \end{eqnarray} | ||
</math> | </math> | ||
For discounted RL, an algorithm returns, after | For discounted RL, an algorithm returns, after | ||
the end of the <math>(n-1)</math>-th step with <math>M_{n-1}</math> samples used in this iteration, a policy <math>\pi_n</math> to be applied in the <math>n</math>-th step. There are several notions of sample complexity in this setting. | the end of the <math>(n-1)</math>-th step with <math>M_{n-1}</math> samples used in this iteration, a policy <math>\pi_n</math> to be applied in the <math>n</math>-th step. There are several notions of sample complexity in this setting. | ||
Line 328: | Line 316: | ||
Note that the condition <math>\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} < \varepsilon\right) \ge 1-\delta</math> associated with \eqref{eq:sample_complexity_infi_V_2} is stronger and implies the condition in \eqref{eq:sample_complexity_infi_V}. | Note that the condition <math>\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} < \varepsilon\right) \ge 1-\delta</math> associated with \eqref{eq:sample_complexity_infi_V_2} is stronger and implies the condition in \eqref{eq:sample_complexity_infi_V}. | ||
Another weaker criterion, the '' mean-square sample complexity'', measures the sample complexity in the average sense with respect to the steady state distribution or the initial distribution <math>\mu</math>. | Another weaker criterion, the '' mean-square sample complexity'', measures the sample complexity in the average sense with respect to the steady state distribution or the initial distribution <math>\mu</math>. | ||
In this case, the '' mean-square sample complexity'' is defined as the minimum number <math>\sum_{n=1}^N M_{n-1}(\varepsilon)</math> of samples such that for all <math>n \ge N</math>, | In this case, the '' mean-square sample complexity'' is defined as the minimum number <math>\sum_{n=1}^N M_{n-1}(\varepsilon)</math> of samples such that for all <math>n \ge N</math>, | ||
Line 337: | Line 324: | ||
</math> | </math> | ||
Since \eqref{eq:sample_complexity_infi_V_avg} is an aggregated guarantee via the <math>V</math> function over a state distribution <math>\mu</math>, it is implied by \eqref{eq:sample_complexity_infi_V_2}. | Since \eqref{eq:sample_complexity_infi_V_avg} is an aggregated guarantee via the <math>V</math> function over a state distribution <math>\mu</math>, it is implied by \eqref{eq:sample_complexity_infi_V_2}. | ||
{\color{blue} What does <math>\widetilde{\mathcal{O}}</math> mean?} | |||
'''Rate of Convergence.''' To emphasize the convergence with respect to the number of iterations required, we introduce the notion of ''rate of convergence'', which uses the relationship between the number of iterations/steps <math>N</math> and the error term <math>\varepsilon</math>, to quantify how fast the learned policy converges to the optimal solution. This is also called the iteration complexity in the RL and optimization literature. Compared to the notion of sample complexity introduced above, the rate of convergence calculates the number of iterations needed to reach a given accuracy while ignoring the number of samples used within each iteration. When only one sample is used in each iteration, the rate of convergence coincides with the sample complexity. In addition when a constant number of samples (i.e., independent from <math>\varepsilon</math>) are used within each iteration, the rate of convergence and the sample complexity are of the same order with respect to <math>\varepsilon</math>. | '''Rate of Convergence.''' To emphasize the convergence with respect to the number of iterations required, we introduce the notion of ''rate of convergence'', which uses the relationship between the number of iterations/steps <math>N</math> and the error term <math>\varepsilon</math>, to quantify how fast the learned policy converges to the optimal solution. This is also called the iteration complexity in the RL and optimization literature. Compared to the notion of sample complexity introduced above, the rate of convergence calculates the number of iterations needed to reach a given accuracy while ignoring the number of samples used within each iteration. When only one sample is used in each iteration, the rate of convergence coincides with the sample complexity. In addition when a constant number of samples (i.e., independent from <math>\varepsilon</math>) are used within each iteration, the rate of convergence and the sample complexity are of the same order with respect to <math>\varepsilon</math>. | ||
Line 358: | Line 341: | ||
'''Asymptotic Convergence.''' In addition to sample complexity and regret analysis discussed above, ''asymptotic'' convergence is another weaker convergence guarantee, which requires the error term to decay to zero as <math>N</math> goes to infinity (without specifying the order of <math>N</math>). This is often a first step in the theoretical analysis of RL algorithms. | '''Asymptotic Convergence.''' In addition to sample complexity and regret analysis discussed above, ''asymptotic'' convergence is another weaker convergence guarantee, which requires the error term to decay to zero as <math>N</math> goes to infinity (without specifying the order of <math>N</math>). This is often a first step in the theoretical analysis of RL algorithms. | ||
====Classification of RL Algorithms==== | ====Classification of RL Algorithms==== | ||
An RL algorithm includes one or more of the following components: | An RL algorithm includes one or more of the following components: | ||
<ul style{{=}}"list-style-type:lower-roman"><li>a representation of a value function that provides a prediction of how good each state or each state/action pair is,</li> | <ul style{{=}}"list-style-type:lower-roman"><li>a representation of a value function that provides a prediction of how good each state or each state/action pair is,</li> | ||
<li>a direct representation of the policy <math>\pi(s)</math> or <math>\pi(s,a)</math>, | <li>a direct representation of the policy <math>\pi(s)</math> or <math>\pi(s,a)</math>,</li> | ||
<li>a model of the environment (the estimated transition function and the estimated reward function) in conjunction with a planning algorithm (any computational process that uses a model to create or improve a policy).</li> | <li>a model of the environment (the estimated transition function and the estimated reward function) in conjunction with a planning algorithm (any computational process that uses a model to create or improve a policy).</li> | ||
</ul> | </ul> | ||
Line 387: | Line 369: | ||
<math>\beta_n(s,a)</math> is the learning rate at iteration <math>n+1</math> which balances the weighting between the current estimate and the new estimate. The quantity <math>\beta_n</math> can be a constant, can depend on the current state <math>s</math>, or even depend on the observed samples up to iteration <math>n+1</math>. [[#alg:TD |Algorithm]] provides some pseudo-code for an implementation of the TD method. Equation | <math>\beta_n(s,a)</math> is the learning rate at iteration <math>n+1</math> which balances the weighting between the current estimate and the new estimate. The quantity <math>\beta_n</math> can be a constant, can depend on the current state <math>s</math>, or even depend on the observed samples up to iteration <math>n+1</math>. [[#alg:TD |Algorithm]] provides some pseudo-code for an implementation of the TD method. Equation | ||
\eqref{eq:TD_update} is sometimes referred to as a TD(0) method. This is a special case of a TD<math>(\lambda)</math> method (for some <math>\lambda \in [0,1]</math>) which is a combination of TD learning (with weight <math>\lambda</math>) and a Monte Carlo method (with weight <math>1-\lambda</math>). Here a Monte Carlo method indicates a simulation-based method to calculate the value function with a longer period of data (in the episode-by-episode sense) rather than a single sample in each update. See [[#sec:REINFORCE |Section]] for a detailed discussion of the REINFORCE method which is an example of such a Monte Carlo method. | \eqref{eq:TD_update} is sometimes referred to as a TD(0) method. This is a special case of a TD<math>(\lambda)</math> method (for some <math>\lambda \in [0,1]</math>) which is a combination of TD learning (with weight <math>\lambda</math>) and a Monte Carlo method (with weight <math>1-\lambda</math>). Here a Monte Carlo method indicates a simulation-based method to calculate the value function with a longer period of data (in the episode-by-episode sense) rather than a single sample in each update. See [[#sec:REINFORCE |Section]] for a detailed discussion of the REINFORCE method which is an example of such a Monte Carlo method. | ||
{\color{blue} I still dont understand. A Monte Carlo method is a simulation method for calculating expectations of functions of a stochastic process. Here we have a method, TD, which uses a single sample to update the value function. We could use more samples to update with some average of the samples from a given <math>(s,a)</math> - is this what you mean? Or do we run sample the policy for a longer time period and average these to get an approximation using the definition of <math>V^{\pi}</math> as an expectation over an infinite discounted sum?} | |||
Equation \eqref{eq:TD_update} is also referred to as a one-step TD method as it only uses information from one update step of the system. There are multi-step TD methods; see Chapter 7 and Chapter 12 in <ref name="suttonbarto"/> for more details. | Equation \eqref{eq:TD_update} is also referred to as a one-step TD method as it only uses information from one update step of the system. There are multi-step TD methods; see Chapter 7 and Chapter 12 in <ref name="suttonbarto"/> for more details. | ||
The TD update in \eqref{eq:TD_update} can also be written as | The TD update in \eqref{eq:TD_update} can also be written as | ||
Line 397: | Line 378: | ||
\end{eqnarray} | \end{eqnarray} | ||
</math> | </math> | ||
The term <math>\delta_n</math>, commonly called the TD error, measures the difference between the estimated value at <math>s</math> and the better estimate <math>r+\gamma V^{\pi,(n)}(s')</math>, which is often called the TD target in the literature. The TD error and TD target are two important components in analyzing the convergence of the approximate value function and arise in various forms in the reinforcement learning literature. | The term <math>\delta_n</math>, commonly called the TD error, measures the difference between the estimated value at <math>s</math> and the better estimate <math>r+\gamma V^{\pi,(n)}(s')</math>, which is often called the TD target in the literature. The TD error and TD target are two important components in analyzing the convergence of the approximate value function and arise in various forms in the reinforcement learning literature. | ||
<proc id="alg:TD" label = "\textbf{TD(0) Method for estimating $V^{\pi}$}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>; the policy <math>\pi</math> used to sample observations; rule to set learning rate <math>\beta_n \in (0,1]</math> <math>(0\leq n \leq N-1)</math> | <proc id="alg:TD" label = "\textbf{TD(0) Method for estimating $V^{\pi}$}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>; the policy <math>\pi</math> used to sample observations; rule to set learning rate <math>\beta_n \in (0,1]</math> <math>(0\leq n \leq N-1)</math> | ||
</div> | </div> | ||
<div class="ms-2"> Initialize <math>V^{\pi,(0)}(s)</math> for all <math>s\in \mathcal{S}</math> | <div class="ms-2"> Initialize <math>V^{\pi,(0)}(s)</math> for all <math>s\in \mathcal{S}</math> | ||
</div> | </div> | ||
<div class="ms-2"> Initialize <math>s</math> | <div class="ms-2"> Initialize <math>s</math> | ||
Line 415: | Line 393: | ||
</div> | </div> | ||
<div class="ms-2"> <math>s\leftarrow s'</math> | <div class="ms-2"> <math>s\leftarrow s'</math> | ||
</div> | </div> | ||
'''end for'''</div></proc> | '''end for'''</div></proc> | ||
Line 431: | Line 408: | ||
</div> | </div> | ||
<div class="ms-2"> Initialize <math>Q^{(0)}(s,a)</math> for all <math>s\in \mathcal{S}</math> and <math>a\in \mathcal{A}</math> | <div class="ms-2"> Initialize <math>Q^{(0)}(s,a)</math> for all <math>s\in \mathcal{S}</math> and <math>a\in \mathcal{A}</math> | ||
</div> | </div> | ||
<div class="ms-2"> Initialize <math>s</math> | <div class="ms-2"> Initialize <math>s</math> | ||
Line 443: | Line 419: | ||
</div> | </div> | ||
<div class="ms-2"> <math>s\leftarrow s'</math> | <div class="ms-2"> <math>s\leftarrow s'</math> | ||
</div> | </div> | ||
'''end for'''</div></proc> | '''end for'''</div></proc> | ||
Line 449: | Line 424: | ||
{{proofcard|Theorem (Watkins and Dayan (1992) <ref name="watkins1992q"/>)|Theorem-1|\label{thm:Q_conv} Assume <math>|\mathcal{A}| < \infty</math> and <math>|\mathcal{S}| < \infty</math>. | {{proofcard|Theorem (Watkins and Dayan (1992) <ref name="watkins1992q"/>)|Theorem-1|\label{thm:Q_conv} Assume <math>|\mathcal{A}| < \infty</math> and <math>|\mathcal{S}| < \infty</math>. | ||
Let <math>n^i(s, a)</math> be the index of the <math>i</math>-th time that the action <math>a</math> is used in state <math>s</math>. | Let <math>n^i(s, a)</math> be the index of the <math>i</math>-th time that the action <math>a</math> is used in state <math>s</math>. | ||
Let <math>R < \infty</math> be a constant. Given bounded rewards <math>|r|\leq R</math>, learning rate <math>0\leq\beta_n < 1</math> and | Let <math>R < \infty</math> be a constant. Given bounded rewards <math>|r|\leq R</math>, learning rate <math>0\leq\beta_n < 1</math> and | ||
Line 458: | Line 432: | ||
</math> | </math> | ||
then <math>Q^{(n)}(s,a)\rightarrow Q^\star(s,a)</math> as <math>n\rightarrow \infty</math>, <math>\forall s,a</math> with probability <math>1</math>.|}} | then <math>Q^{(n)}(s,a)\rightarrow Q^\star(s,a)</math> as <math>n\rightarrow \infty</math>, <math>\forall s,a</math> with probability <math>1</math>.|}} | ||
This is a classical result (one of the first few theoretical results in RL) and the proof of the convergence is based on the action-replay process (ARP), which is an artificial controlled Markov process constructed from the episode sequence and the learning rate sequence <math>\beta</math>. [[#thm:Q_conv |Theorem]] implies that eventually <math>Q^{(n)}</math> will convergence to the true value function <math>Q^\star</math> when condition \eqref{exploration} is satisfied. However, this asymptotic result does not provide any insights on how many samples are needed to reach a given accuracy. More results on the sample complexity for <math>Q</math>-learning related algorithms are discussed in Section~[[#subsubsec:value_theo |subsubsec:value_theo]]. | This is a classical result (one of the first few theoretical results in RL) and the proof of the convergence is based on the action-replay process (ARP), which is an artificial controlled Markov process constructed from the episode sequence and the learning rate sequence <math>\beta</math>. [[#thm:Q_conv |Theorem]] implies that eventually <math>Q^{(n)}</math> will convergence to the true value function <math>Q^\star</math> when condition \eqref{exploration} is satisfied. However, this asymptotic result does not provide any insights on how many samples are needed to reach a given accuracy. More results on the sample complexity for <math>Q</math>-learning related algorithms are discussed in Section~[[#subsubsec:value_theo |subsubsec:value_theo]]. | ||
Line 505: | Line 478: | ||
'''On-policy Learning vs. Off-policy Learning.''' An off-policy agent learns the value of the optimal policy independently of the agent's actions. An on-policy agent learns the value of the policy being carried out by the agent including the exploration steps. <math>Q</math>-learning is an off-policy agent as the samples <math>(s,a,r,s')</math> used in updating \eqref{eq:Q_learning_update} may be collected from any policy and may be independent of the agent's current estimate of the <math>Q</math>-function. The convergence of the <math>Q</math>-function relies on the exploration scheme of the policy <math>\pi</math> (see [[#thm:Q_conv |Theorem]]). In contrast to <math>Q</math>-learning, SARSA is an on-policy learning algorithm and uses only its own estimation of the <math>Q</math>-function to select an action and receive a real-time sample in each iteration. The difference is seen in how the new estimate in the update is calculated. In the update step of <math>Q</math>-learning \eqref{eq:Q_learning_update}, we have <math>\max_{a'} Q^{(n)}(s',a')</math> which is the maximum value of the <math>Q</math>-function at a given state <math>s'</math>. On the other hand, the new estimate in the update of SARSA \eqref{eq:SARSA_update} takes <math>Q^{(n)}(s',a')</math> where <math>a'</math> is selected based on a policy derived from <math>Q^{(n)}</math> (via the <math>\varepsilon</math>-greedy policy \eqref{epsilon_policy}). | '''On-policy Learning vs. Off-policy Learning.''' An off-policy agent learns the value of the optimal policy independently of the agent's actions. An on-policy agent learns the value of the policy being carried out by the agent including the exploration steps. <math>Q</math>-learning is an off-policy agent as the samples <math>(s,a,r,s')</math> used in updating \eqref{eq:Q_learning_update} may be collected from any policy and may be independent of the agent's current estimate of the <math>Q</math>-function. The convergence of the <math>Q</math>-function relies on the exploration scheme of the policy <math>\pi</math> (see [[#thm:Q_conv |Theorem]]). In contrast to <math>Q</math>-learning, SARSA is an on-policy learning algorithm and uses only its own estimation of the <math>Q</math>-function to select an action and receive a real-time sample in each iteration. The difference is seen in how the new estimate in the update is calculated. In the update step of <math>Q</math>-learning \eqref{eq:Q_learning_update}, we have <math>\max_{a'} Q^{(n)}(s',a')</math> which is the maximum value of the <math>Q</math>-function at a given state <math>s'</math>. On the other hand, the new estimate in the update of SARSA \eqref{eq:SARSA_update} takes <math>Q^{(n)}(s',a')</math> where <math>a'</math> is selected based on a policy derived from <math>Q^{(n)}</math> (via the <math>\varepsilon</math>-greedy policy \eqref{epsilon_policy}). | ||
%{\color{RedOrange}Note that there are several differences between the <math>Q</math>-learning Algorithm | |||
====<span id="subsubsec:value_theo"></span>Discussion==== | ====<span id="subsubsec:value_theo"></span>Discussion==== | ||
Line 519: | Line 491: | ||
This was followed by <ref name="beck2012error"/> which showed that the same order of <math>Q</math>-sample complexity <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\right)</math> could be achieved with a constant learning rate <math>\beta_n \equiv \frac{(1-\gamma)^4\varepsilon^2}{|\mathcal{A}||\mathcal{S}|}</math> <math>(n\leq N)</math>. Without access to a simulator, delayed <math>Q</math>-learning <ref name="strehl2009reinforcement"/> was shown to achieve a sample complexity of exploration of order <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^8\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)</math> assuming a | This was followed by <ref name="beck2012error"/> which showed that the same order of <math>Q</math>-sample complexity <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\right)</math> could be achieved with a constant learning rate <math>\beta_n \equiv \frac{(1-\gamma)^4\varepsilon^2}{|\mathcal{A}||\mathcal{S}|}</math> <math>(n\leq N)</math>. Without access to a simulator, delayed <math>Q</math>-learning <ref name="strehl2009reinforcement"/> was shown to achieve a sample complexity of exploration of order <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^8\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)</math> assuming a | ||
known reward. An improvement to this result to <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^7\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)</math> has recently been obtained using an Upper-Confidence-Bound (UCB) exploration policy <ref name="dong2019q"/> and a single trajectory. | known reward. An improvement to this result to <math>\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^7\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)</math> has recently been obtained using an Upper-Confidence-Bound (UCB) exploration policy <ref name="dong2019q"/> and a single trajectory. | ||
%{\color{RedOrange}In particular, the dependence on the discount rate <math>\gamma</math> is improved from <math>\frac{1}{(1-\gamma)^8}</math> to <math>\frac{1}{(1-\gamma)^7}</math>.} | |||
In addition, sample complexities of <math>Q</math>-learning variants with (linear) function approximations have been established in several recent papers. <ref name="yang2019sample"/> assumed a linear MDP framework (see the end of [[#sec:set_up |Section]]) | In addition, sample complexities of <math>Q</math>-learning variants with (linear) function approximations have been established in several recent papers. <ref name="yang2019sample"/> assumed a linear MDP framework (see the end of [[#sec:set_up |Section]]) | ||
and the authors provided a V-sample complexity <math>\widetilde{\mathcal{O}}\left(\frac{d}{(1-\gamma)^3\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)</math> with <math>d</math> denoting the dimension of the features. This result matches the information-theoretic lower bound up to <math>\log(\cdot)</math> factors. Later work, <ref name="lattimore2020learning"/> considered the setting where the transition kernel can be approximated by linear functions with small approximation errors to make the model more robust to model mis-specification. In this case, the authors provided a V-sample complexity of order <math>\widetilde{\mathcal{O}}\left(\frac{K}{(1-\gamma)^4\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)</math>. Although the dependence on <math>\gamma</math> is not as good as in <ref name="yang2019sample"/>, this framework can be applied to a more general class of models which are “near” linear. Although <math>Q</math>-learning is one of the most successful algorithms for finding the best | and the authors provided a V-sample complexity <math>\widetilde{\mathcal{O}}\left(\frac{d}{(1-\gamma)^3\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)</math> with <math>d</math> denoting the dimension of the features. This result matches the information-theoretic lower bound up to <math>\log(\cdot)</math> factors. Later work, <ref name="lattimore2020learning"/> considered the setting where the transition kernel can be approximated by linear functions with small approximation errors to make the model more robust to model mis-specification. In this case, the authors provided a V-sample complexity of order <math>\widetilde{\mathcal{O}}\left(\frac{K}{(1-\gamma)^4\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)</math>. Although the dependence on <math>\gamma</math> is not as good as in <ref name="yang2019sample"/>, this framework can be applied to a more general class of models which are “near” linear. Although <math>Q</math>-learning is one of the most successful algorithms for finding the best | ||
action-value function <math>Q</math>, its implementation often suffers from large high biased estimates of the <math>Q</math>-function values incurred by random sampling. The double <math>Q</math>-learning algorithm proposed in <ref name="hasselt2010double"/> | action-value function <math>Q</math>, its implementation often suffers from large high biased estimates of the <math>Q</math>-function values incurred by random sampling. The double <math>Q</math>-learning algorithm proposed in <ref name="hasselt2010double"/> | ||
tackled this high estimation issue by randomly switching the update between two <math>Q</math>-estimators, and has thus gained significant popularity in practice. The first sample complexity result for the double <math>Q</math>-learning algorithm was established in <ref name="xiong2020finite"/> with a Q-sample complexity on the order of <math>\widetilde{\mathcal{O}}\left(\left(\frac{1}{(1-\gamma)^6\varepsilon^2}\ln\,\frac{|\mathcal{A}||\mathcal{S}|}{(1-\gamma)^7\varepsilon^2\delta}\right)^{1/\omega}+\left(\frac{1}{1-\gamma}\ln\,\frac{1}{(1-\gamma)^2\varepsilon}\right)^{1/(1-\omega)}\right)</math> and learning rate <math>\beta_n = \frac{1}{n^{w}}</math> for some constant <math>\omega \in (0,1)</math>. | tackled this high estimation issue by randomly switching the update between two <math>Q</math>-estimators, and has thus gained significant popularity in practice. The first sample complexity result for the double <math>Q</math>-learning algorithm was established in <ref name="xiong2020finite"/> with a Q-sample complexity on the order of <math>\widetilde{\mathcal{O}}\left(\left(\frac{1}{(1-\gamma)^6\varepsilon^2}\ln\,\frac{|\mathcal{A}||\mathcal{S}|}{(1-\gamma)^7\varepsilon^2\delta}\right)^{1/\omega}+\left(\frac{1}{1-\gamma}\ln\,\frac{1}{(1-\gamma)^2\varepsilon}\right)^{1/(1-\omega)}\right)</math> and learning rate <math>\beta_n = \frac{1}{n^{w}}</math> for some constant <math>\omega \in (0,1)</math>. | ||
%{\color{RedOrange}[TODO Renyuan: add some comments on these orders -- which one is better than the others under what situation.]} | |||
===<span id="subsec:policy_based_methods"></span>Policy-based Methods=== | ===<span id="subsec:policy_based_methods"></span>Policy-based Methods=== | ||
%{\color{ForestGreen}[TODO Huining: describe the setup: infinite horizon and Markov policy.]} | |||
So far we have focused on value-based methods where we learn the value function to generate the policy using, for instance, the <math>\varepsilon</math>-greedy approach. However, these methods may suffer from high computational complexity as the dimensionality of the state or action space increases (for example in continuous or unbounded spaces). In this case, it may be more effective to directly parametrize the policy rather than the value function. Furthermore, it is empirically observed that policy-based methods have better convergence properties than value-based methods <ref name="sewak2019policy"/><ref name="yu2020policy"/><ref name="daberius2019deep"/>. This stems from the fact that value-based methods can have big oscillations during the training process since the choice of actions may need to change dramatically in order to have an arbitrarily small change in the estimated value functions. | So far we have focused on value-based methods where we learn the value function to generate the policy using, for instance, the <math>\varepsilon</math>-greedy approach. However, these methods may suffer from high computational complexity as the dimensionality of the state or action space increases (for example in continuous or unbounded spaces). In this case, it may be more effective to directly parametrize the policy rather than the value function. Furthermore, it is empirically observed that policy-based methods have better convergence properties than value-based methods <ref name="sewak2019policy"/><ref name="yu2020policy"/><ref name="daberius2019deep"/>. This stems from the fact that value-based methods can have big oscillations during the training process since the choice of actions may need to change dramatically in order to have an arbitrarily small change in the estimated value functions. | ||
In this section, we focus on model-free policy-based methods, where we learn a parametrized policy without inferring the value function in the setting of infinite time horizon with discounting, finite state and action spaces, and stationary policies. Examples of policy-based methods include REINFORCE <ref name="williams1992"/>, Actor-Critic methods <ref name="konda2000"/>, Trust Region Policy Optimization (TRPO) <ref name="schulman2015"/>, Proximal Policy Optimization (PPO) <ref name="schulman2017proximal"/> among others. We first parametrize the policy <math>\pi</math> by a vector <math>\theta</math>, thus we define <math>\pi(s,\cdot;\theta)\in \mathcal{P}(\mathcal{A})</math>, the probability distribution parameterized by <math>\theta</math> over the action space given the state <math>s</math> at time <math>t</math>. Here we will use <math>\pi(s,a;\theta)</math> and <math>\pi_{\theta}</math> interchangeably to represent the policy parameterized by <math>\theta</math>. We write <math>\mu^{\pi_{\theta}}</math> for the stationary distribution of the state under policy <math>\pi(s,a;\theta)</math>. Then the policy objective function is defined to be | In this section, we focus on model-free policy-based methods, where we learn a parametrized policy without inferring the value function in the setting of infinite time horizon with discounting, finite state and action spaces, and stationary policies. Examples of policy-based methods include REINFORCE <ref name="williams1992"/>, Actor-Critic methods <ref name="konda2000"/>, Trust Region Policy Optimization (TRPO) <ref name="schulman2015"/>, Proximal Policy Optimization (PPO) <ref name="schulman2017proximal"/> among others. We first parametrize the policy <math>\pi</math> by a vector <math>\theta</math>, thus we define <math>\pi(s,\cdot;\theta)\in \mathcal{P}(\mathcal{A})</math>, the probability distribution parameterized by <math>\theta</math> over the action space given the state <math>s</math> at time <math>t</math>. Here we will use <math>\pi(s,a;\theta)</math> and <math>\pi_{\theta}</math> interchangeably to represent the policy parameterized by <math>\theta</math>. We write <math>\mu^{\pi_{\theta}}</math> for the stationary distribution of the state under policy <math>\pi(s,a;\theta)</math>. Then the policy objective function is defined to be | ||
Line 547: | Line 518: | ||
</math> | </math> | ||
where <math>\beta</math> is the learning rate, and <math>\widehat{\nabla_{\theta}J(\theta)}</math> is an estimate of the gradient of <math>J(\theta)</math> with respect to <math>\theta</math>. | where <math>\beta</math> is the learning rate, and <math>\widehat{\nabla_{\theta}J(\theta)}</math> is an estimate of the gradient of <math>J(\theta)</math> with respect to <math>\theta</math>. | ||
====Policy Gradient Theorem==== | ====Policy Gradient Theorem==== | ||
Our task now is to estimate the gradient <math>\nabla_{\theta}J(\theta)</math>. The objective function <math>J(\theta)</math> depends on both the policy and the corresponding stationary distribution <math>\mu^{\pi_{\theta}}</math>, and the parameter <math>\theta</math> affects both of them. Calculating the gradient of the action with respect to <math>\theta</math> is straightforward given the parametrization <math>\pi(s,a;\theta)</math>, whereas it might be challenging to analyze the effect of <math>\theta</math> on the state when the system is unknown. Fortunately the well-known ''policy gradient theorem'' provides an expression for <math>\nabla_{\theta}J(\theta)</math> which does not involve the derivatives of the state distribution with respect to <math>\theta</math>. We write the <math>Q</math>-function under policy <math>\pi(s,a;\theta)</math> as <math>Q^{\pi_{\theta}}(s,a)</math>, then we have the following theorem (for more details see <ref name="suttonbarto"/>). | Our task now is to estimate the gradient <math>\nabla_{\theta}J(\theta)</math>. The objective function <math>J(\theta)</math> depends on both the policy and the corresponding stationary distribution <math>\mu^{\pi_{\theta}}</math>, and the parameter <math>\theta</math> affects both of them. Calculating the gradient of the action with respect to <math>\theta</math> is straightforward given the parametrization <math>\pi(s,a;\theta)</math>, whereas it might be challenging to analyze the effect of <math>\theta</math> on the state when the system is unknown. Fortunately the well-known ''policy gradient theorem'' provides an expression for <math>\nabla_{\theta}J(\theta)</math> which does not involve the derivatives of the state distribution with respect to <math>\theta</math>. We write the <math>Q</math>-function under policy <math>\pi(s,a;\theta)</math> as <math>Q^{\pi_{\theta}}(s,a)</math>, then we have the following theorem (for more details see <ref name="suttonbarto"/>). | ||
Line 559: | Line 529: | ||
For MDPs with finite state and action space, the stationary distribution exists under mild assumptions. For MDPs with infinite state and action space, more assumptions are needed, for example uniform geometric ergodicity <ref name="konda2002"/>. | For MDPs with finite state and action space, the stationary distribution exists under mild assumptions. For MDPs with infinite state and action space, more assumptions are needed, for example uniform geometric ergodicity <ref name="konda2002"/>. | ||
====<span id="sec:REINFORCE"></span>REINFORCE: Monte Carlo Policy Gradient==== | ====<span id="sec:REINFORCE"></span>REINFORCE: Monte Carlo Policy Gradient==== | ||
Based on the policy gradient expression in \eqref{eqn:policy_gra_thm}, it is natural to estimate the expectation by averaging over samples of actual rewards, which is essentially the spirit of a standard ''Monte Carlo'' method that repeatedly simulates a trajectory of length <math>M</math> within each iteration. Now we introduce our first policy-based algorithm, called REINFORCE <ref name="williams1992"/> or Monte Carlo policy gradient (see [[#alg:REINFORCE |Algorithm]]). | Based on the policy gradient expression in \eqref{eqn:policy_gra_thm}, it is natural to estimate the expectation by averaging over samples of actual rewards, which is essentially the spirit of a standard ''Monte Carlo'' method that repeatedly simulates a trajectory of length <math>M</math> within each iteration. Now we introduce our first policy-based algorithm, called REINFORCE <ref name="williams1992"/> or Monte Carlo policy gradient (see [[#alg:REINFORCE |Algorithm]]). We use a simple empirical return function <math>G_t</math> to approximate the <math>Q</math>-function in \eqref{eqn:policy_gra_thm}. The return <math>G_t</math> is defined as the sum of the discounted rewards and given by <math>G_t:=\sum_{s=t+1}^{M} \gamma^{s-t-1}r_{s}</math>. Note that REINFORCE is a Monte Carlo algorithm since it employs the complete sample trajectories, that is, when estimating the return from time <math>t</math>, it uses all future rewards up until the end of the trajectory (see line 7 in [[#alg:REINFORCE |Algorithm]]). Within the <math>n</math>-iteration the policy parameter <math>\theta^{(n+1)}</math> is updated <math>M</math> times with <math>G_t</math> for <math>t=0,1,\cdots,M-1</math>. | ||
We use a simple empirical return function <math>G_t</math> to approximate the <math>Q</math>-function in \eqref{eqn:policy_gra_thm}. The return <math>G_t</math> is defined as the sum of the discounted rewards and given by <math>G_t:=\sum_{s=t+1}^{M} \gamma^{s-t-1}r_{s}</math>. Note that REINFORCE is a Monte Carlo algorithm since it employs the complete sample trajectories, that is, when estimating the return from time <math>t</math>, it uses all future rewards up until the end of the trajectory (see line 7 in [[#alg:REINFORCE |Algorithm]]). Within the <math>n</math>-iteration the policy parameter <math>\theta^{(n+1)}</math> is updated <math>M</math> times with <math>G_t</math> for <math>t=0,1,\cdots,M-1</math> | |||
As a standard Monte Carlo method, REINFORCE provides an unbiased estimate of the policy gradient, however it suffers from high variance which therefore may lead to slow convergence. A popular variant of REINFORCE is to subtract a ''baseline'', which is a function of the state <math>s</math>, from the return <math>G_t</math>. This reduces the variance of the policy gradient estimate while keeping the mean of the estimate unchanged. A commonly used baseline is an estimate of the value function, <math>\widehat{V}(s)</math>, which can be learnt using one of the methods introduced in [[#sec:value_based_methods |Section]]. Replacing the <math>G_t</math> in \eqref{eqn:alg_REINFORCE_update} by <math>G_t-\widehat{V}(s)</math> gives a REINFORCE algorithm with baseline. | As a standard Monte Carlo method, REINFORCE provides an unbiased estimate of the policy gradient, however it suffers from high variance which therefore may lead to slow convergence. A popular variant of REINFORCE is to subtract a ''baseline'', which is a function of the state <math>s</math>, from the return <math>G_t</math>. This reduces the variance of the policy gradient estimate while keeping the mean of the estimate unchanged. A commonly used baseline is an estimate of the value function, <math>\widehat{V}(s)</math>, which can be learnt using one of the methods introduced in [[#sec:value_based_methods |Section]]. Replacing the <math>G_t</math> in \eqref{eqn:alg_REINFORCE_update} by <math>G_t-\widehat{V}(s)</math> gives a REINFORCE algorithm with baseline. | ||
<proc id="alg:REINFORCE" label = "\textbf{REINFORCE: Monte Carlo Policy Gradient}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>, learning rate <math>\beta</math>, a differentiable policy parametrization <math>\pi(s,a;\theta)</math>, finite length of the trajectory <math>M</math>. | <proc id="alg:REINFORCE" label = "\textbf{REINFORCE: Monte Carlo Policy Gradient}"><div class="ms-2"> '''input''': total number of iterations <math>N</math>, learning rate <math>\beta</math>, a differentiable policy parametrization <math>\pi(s,a;\theta)</math>, finite length of the trajectory <math>M</math>. | ||
</div> | </div> | ||
Line 586: | Line 553: | ||
'''end for'''</div>'''end for'''</div></proc> | '''end for'''</div>'''end for'''</div></proc> | ||
====Actor-Critic Methods==== | ====Actor-Critic Methods==== | ||
The fact that REINFORCE provides unbiased policy estimates but may suffer from high variance is an example of the bias-variance dilemma (see, e.g., <ref name="franccois2019"/> and <ref name="von2011statistical"/>) which occurs in many machine learning problems. Also, as a Monte Carlo algorithm, REINFORCE requires complete trajectories so we need to wait until the end of each episode to collect the data. Actor-Critic methods can resolve the above issues by learning both the value function and the policy. The learned value function can improve the policy update through, for example, introducing bias into the estimation. Actor-Critic methods can also learn from incomplete experience in an online manner. | The fact that REINFORCE provides unbiased policy estimates but may suffer from high variance is an example of the bias-variance dilemma (see, e.g., <ref name="franccois2019"/> and <ref name="von2011statistical"/>) which occurs in many machine learning problems. Also, as a Monte Carlo algorithm, REINFORCE requires complete trajectories so we need to wait until the end of each episode to collect the data. Actor-Critic methods can resolve the above issues by learning both the value function and the policy. The learned value function can improve the policy update through, for example, introducing bias into the estimation. Actor-Critic methods can also learn from incomplete experience in an online manner. | ||
In Actor-Critic methods the value function is parametrized by <math>w</math> and the policy is parametrized by <math>\theta</math>. These methods consist of two models: a ''critic'' which updates the value function or <math>Q</math>-function parameter <math>w</math>, and an ''actor'' which updates the policy parameter <math>\theta</math> based on the information given by the critic. A natural idea is to use policy-based methods for the actor and use one of the methods introduced in [[#sec:value_based_methods |Section]] for the critic; for example SARSA or <math>Q</math>-learning. We give the pseudocode for a simple Actor-Critic method in [[#alg:actor_critic |Algorithm]]. | In Actor-Critic methods the value function is parametrized by <math>w</math> and the policy is parametrized by <math>\theta</math>. These methods consist of two models: a ''critic'' which updates the value function or <math>Q</math>-function parameter <math>w</math>, and an ''actor'' which updates the policy parameter <math>\theta</math> based on the information given by the critic. A natural idea is to use policy-based methods for the actor and use one of the methods introduced in [[#sec:value_based_methods |Section]] for the critic; for example SARSA or <math>Q</math>-learning. We give the pseudocode for a simple Actor-Critic method in [[#alg:actor_critic |Algorithm]]. | ||
Here for the critic, we approximate the <math>Q</math>-function by a linear combination of features, that is, <math>Q(s,a;w)=\phi(s,a)^\top w</math> for some known feature <math>\phi:\mathcal{S}\times \mathcal{A}\rightarrow\mathbb{R}^d</math>, and use TD(0) to update the parameter <math>w</math> to minimize the least-square error of the TD error with the gradient of the least-square error taking the form <math>\delta \phi(s,a)</math> (see lines 9-10 in [[#alg:actor_critic |Algorithm]]). For the actor, we use the (vanilla) policy-based method to update the policy parameter <math>\theta</math> (see line 11 in [[#alg:actor_critic |Algorithm]]). | Here for the critic, we approximate the <math>Q</math>-function by a linear combination of features, that is, <math>Q(s,a;w)=\phi(s,a)^\top w</math> for some known feature <math>\phi:\mathcal{S}\times \mathcal{A}\rightarrow\mathbb{R}^d</math>, and use TD(0) to update the parameter <math>w</math> to minimize the least-square error of the TD error with the gradient of the least-square error taking the form <math>\delta \phi(s,a)</math> (see lines 9-10 in [[#alg:actor_critic |Algorithm]]). For the actor, we use the (vanilla) policy-based method to update the policy parameter <math>\theta</math> (see line 11 in [[#alg:actor_critic |Algorithm]]). | ||
There are three main ways to execute the algorithm. In the ''nested-loop'' setting (see, e.g. <ref name="kumar2019sample"/><ref name="xu2020improving"/>), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the ''two time-scale'' setting (see, e.g. <ref name="xu2020non"/>), where the actor and the critic update their parameters simultaneously with different learning rates. Note that the learning rate for the actor is typically much smaller than that of the critic in this setting. In the third way, the ''single-scale'' setting, the actor and the critic update their parameters simultaneously but with a much larger learning rate for the actor than for the critic (see, e.g. <ref name="fu2020single"/><ref name="xu2021doubly"/>). | There are three main ways to execute the algorithm. In the ''nested-loop'' setting (see, e.g. <ref name="kumar2019sample"/><ref name="xu2020improving"/>), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the ''two time-scale'' setting (see, e.g. <ref name="xu2020non"/>), where the actor and the critic update their parameters simultaneously with different learning rates. Note that the learning rate for the actor is typically much smaller than that of the critic in this setting. In the third way, the ''single-scale'' setting, the actor and the critic update their parameters simultaneously but with a much larger learning rate for the actor than for the critic (see, e.g. <ref name="fu2020single"/><ref name="xu2021doubly"/>). | ||
Line 597: | Line 563: | ||
</div> | </div> | ||
<div class="ms-2"> Initialize <math>s</math> | <div class="ms-2"> Initialize <math>s</math> | ||
</div> | </div> | ||
<div class="ms-2">'''if''' <math>n=0,1,\ldots,N-1</math> '''then''' | <div class="ms-2">'''if''' <math>n=0,1,\ldots,N-1</math> '''then''' | ||
Line 624: | Line 589: | ||
'''Convergence Guarantees.''' | '''Convergence Guarantees.''' | ||
Now we discuss the convergence guarantees of policy-based algorithms. <ref name="sutton2000policy"/> provided the first asymptotic convergence result for policy gradient methods with arbitrary differentiable function approximation for MDPs with bounded reward. They showed that such algorithms, including REINFORCE and Actor-Critic methods, converge to a locally stationary point. | Now we discuss the convergence guarantees of policy-based algorithms. <ref name="sutton2000policy"/> provided the first asymptotic convergence result for policy gradient methods with arbitrary differentiable function approximation for MDPs with bounded reward. They showed that such algorithms, including REINFORCE and Actor-Critic methods, converge to a locally stationary point. | ||
For more examples of asymptotic convergence, see for example <ref name="konda2000"/> and <ref name="bhatnagar2010actor"/>. | For more examples of asymptotic convergence, see for example <ref name="konda2000"/> and <ref name="bhatnagar2010actor"/>. | ||
Based on recent progress in non-convex optimization, non-asymptotic analysis of policy-based methods were first established for convergence to a stationary point. For example, <ref name="kumar2019sample"/> provided a convergence rate analysis for a nested-loop Actor-Critic algorithm to the stationary point through quantifying the smallest number of actor updates <math>k</math> required to attain <math>\inf_{0\leq m\leq k}\|\nabla J(\theta^{(k)})\|^2 < \varepsilon</math>. We denote this smallest number as <math>K</math>. When the actor uses a policy gradient method, the critic achieves <math>K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^4)</math> by employing TD(0), <math>K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{3})</math> by employing the Gradient Temporal Difference, and <math>K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{5/2})</math> by employing the Accelerated Gradient Temporal Difference, with continuous state and action spaces. <ref name="zhang2020global"/> investigated a policy gradient method with Monte Carlo estimation of the policy gradient, called the RPG algorithm, where the rollout length of trajectories is drawn from a Geometric distribution. This generates unbiased estimates of the policy gradient with bounded variance, which was shown to converge to the first order stationary point with diminishing or constant learning rate at a sublinear convergence rate. They also proposed a periodically enlarged learning rate scheme and proved that the modified RPG algorithm with this scheme converges to the second order stationary point in a polynomial number of steps. | Based on recent progress in non-convex optimization, non-asymptotic analysis of policy-based methods were first established for convergence to a stationary point. For example, <ref name="kumar2019sample"/> provided a convergence rate analysis for a nested-loop Actor-Critic algorithm to the stationary point through quantifying the smallest number of actor updates <math>k</math> required to attain <math>\inf_{0\leq m\leq k}\|\nabla J(\theta^{(k)})\|^2 < \varepsilon</math>. We denote this smallest number as <math>K</math>. When the actor uses a policy gradient method, the critic achieves <math>K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^4)</math> by employing TD(0), <math>K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{3})</math> by employing the Gradient Temporal Difference, and <math>K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{5/2})</math> by employing the Accelerated Gradient Temporal Difference, with continuous state and action spaces. <ref name="zhang2020global"/> investigated a policy gradient method with Monte Carlo estimation of the policy gradient, called the RPG algorithm, where the rollout length of trajectories is drawn from a Geometric distribution. This generates unbiased estimates of the policy gradient with bounded variance, which was shown to converge to the first order stationary point with diminishing or constant learning rate at a sublinear convergence rate. They also proposed a periodically enlarged learning rate scheme and proved that the modified RPG algorithm with this scheme converges to the second order stationary point in a polynomial number of steps. | ||
For more examples of sample complexity analysis for convergence to a stationary point, see for example <ref name="papini2018stochastic"/><ref name="xu2020improved"/><ref name="xu2019sample"/><ref name="shen2019hessian"/><ref name="xiong2020non"/>. The global optimality of stationary points was studied in <ref name="bhandari2019"/> where they identified certain situations under which the policy gradient objective function has no sub-optimal stationary points despite being non-convex. | For more examples of sample complexity analysis for convergence to a stationary point, see for example <ref name="papini2018stochastic"/><ref name="xu2020improved"/><ref name="xu2019sample"/><ref name="shen2019hessian"/><ref name="xiong2020non"/>. The global optimality of stationary points was studied in <ref name="bhandari2019"/> where they identified certain situations under which the policy gradient objective function has no sub-optimal stationary points despite being non-convex. | ||
Recently there have been results on the sample complexity analysis of the convergence to the global optimum. | Recently there have been results on the sample complexity analysis of the convergence to the global optimum. | ||
<ref name="cen2020fast"/> proved that the entropy regularized natural policy gradient method achieves a <math>Q</math>-sample complexity (and <math>\varepsilon</math>-optimal policies) with linear convergence rate. <ref name="shani2020adaptive"/> showed that with high probability at least <math>1-\delta</math>, the TRPO algorithm has sublinear rate <math> \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^2)</math> with mean-squared sample complexity <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^3\varepsilon^4}\big)</math> in the unregularized (tabular) MDPs, and has an improved rate <math>\widetilde{\mathcal{O}}^{\prime}(1/\varepsilon)</math> with mean-squared sample complexity <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^4\varepsilon^3}\big)</math> in MDPs with entropy regularization. | <ref name="cen2020fast"/> proved that the entropy regularized natural policy gradient method achieves a <math>Q</math>-sample complexity (and <math>\varepsilon</math>-optimal policies) with linear convergence rate. <ref name="shani2020adaptive"/> showed that with high probability at least <math>1-\delta</math>, the TRPO algorithm has sublinear rate <math> \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^2)</math> with mean-squared sample complexity <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^3\varepsilon^4}\big)</math> in the unregularized (tabular) MDPs, and has an improved rate <math>\widetilde{\mathcal{O}}^{\prime}(1/\varepsilon)</math> with mean-squared sample complexity <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^4\varepsilon^3}\big)</math> in MDPs with entropy regularization. | ||
<ref name="xu2020non"/> gave the first non-asymptotic convergence guarantee for the two time-scale natural Actor-Critic algorithms with mean-squared sample complexity of order <math>\widetilde{\mathcal{O}}(\frac{1}{(1-\gamma)^9\varepsilon^{4}})</math>. For single-scale Actor-Critic methods, the global convergence with sublinear convergence rate was established in both <ref name="fu2020single"/> and <ref name="xu2021doubly"/>. The non-asymptotic convergence of policy-based algorithms is shown in other settings, see <ref name="zhang2020sample"/> for the regret analysis of the REINFORCE algorithm for discounted MDPs and <ref name="agarwal2021theory"/><ref name="mei2020"/> for policy gradient methods in the setting of known model parameters. | <ref name="xu2020non"/> gave the first non-asymptotic convergence guarantee for the two time-scale natural Actor-Critic algorithms with mean-squared sample complexity of order <math>\widetilde{\mathcal{O}}(\frac{1}{(1-\gamma)^9\varepsilon^{4}})</math>. For single-scale Actor-Critic methods, the global convergence with sublinear convergence rate was established in both <ref name="fu2020single"/> and <ref name="xu2021doubly"/>. The non-asymptotic convergence of policy-based algorithms is shown in other settings, see <ref name="zhang2020sample"/> for the regret analysis of the REINFORCE algorithm for discounted MDPs and <ref name="agarwal2021theory"/><ref name="mei2020"/> for policy gradient methods in the setting of known model parameters. | ||
===<span id="sec:discussion"></span>General Discussion=== | ===<span id="sec:discussion"></span>General Discussion=== | ||
Line 643: | Line 600: | ||
====RL with Finite Time Horizon==== | ====RL with Finite Time Horizon==== | ||
As discussed earlier, episodic RL with finite time horizon has also been widely used in many financial applications. In this section, we discuss some episodic RL algorithms (with both model-based and model-free methods) and their performance through both sample complexity and regret analysis. | As discussed earlier, episodic RL with finite time horizon has also been widely used in many financial applications. In this section, we discuss some episodic RL algorithms (with both model-based and model-free methods) and their performance through both sample complexity and regret analysis. | ||
<ref name="PSRL2013"/> proposed a model-based algorithm, known as Posterior Sampling for Reinforcement Learning (PSRL) which is a model-based algorithm, and establishes an <math>\widetilde{\mathcal{O}}(T|\mathcal{S}|\sqrt{|\mathcal{A}|M})</math> expected regret bound. <ref name="DannBrunskill2015"/> proposed a so-called Upper-Confidence Fixed-Horizon (UCFH) RL algorithm that is model-based and has <math>V</math>-sample complexity{{efn|In the original papers the sample complexity is defined in terms of the number of episodes. Here we multiply the original order in these papers by <math>T</math> to match the definition of sample complexity in this paper.}} of order <math>\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^3}{\varepsilon^2})</math> assuming known rewards. | <ref name="PSRL2013"/> proposed a model-based algorithm, known as Posterior Sampling for Reinforcement Learning (PSRL) which is a model-based algorithm, and establishes an <math>\widetilde{\mathcal{O}}(T|\mathcal{S}|\sqrt{|\mathcal{A}|M})</math> expected regret bound. <ref name="DannBrunskill2015"/> proposed a so-called Upper-Confidence Fixed-Horizon (UCFH) RL algorithm that is model-based and has <math>V</math>-sample complexity{{efn|In the original papers the sample complexity is defined in terms of the number of episodes. Here we multiply the original order in these papers by <math>T</math> to match the definition of sample complexity in this paper.}} of order <math>\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^3}{\varepsilon^2})</math> assuming known rewards. | ||
<ref name="azar2017minimax"/> considered two model-based algorithms in the setting of known reward, called the UCBVI-CH and UCBVI-BF algorithms, which were proved to achieve a regret bound of <math>\widetilde{\mathcal{O}}(T\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})</math> (when <math>M\geq T|\mathcal{S}|^3|\mathcal{A}|</math> and <math>|\mathcal{S}||\mathcal{A}|\geq T</math>) and <math>\widetilde{\mathcal{O}}(\sqrt{T|\mathcal{S}|\,|\mathcal{A}|M})</math> (when <math>M\geq T^3|\mathcal{S}|^3|\mathcal{A}|</math> and <math>|\mathcal{S}||\mathcal{A}|\geq T</math>), respectively. | <ref name="azar2017minimax"/> considered two model-based algorithms in the setting of known reward, called the UCBVI-CH and UCBVI-BF algorithms, which were proved to achieve a regret bound of <math>\widetilde{\mathcal{O}}(T\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})</math> (when <math>M\geq T|\mathcal{S}|^3|\mathcal{A}|</math> and <math>|\mathcal{S}||\mathcal{A}|\geq T</math>) and <math>\widetilde{\mathcal{O}}(\sqrt{T|\mathcal{S}|\,|\mathcal{A}|M})</math> (when <math>M\geq T^3|\mathcal{S}|^3|\mathcal{A}|</math> and <math>|\mathcal{S}||\mathcal{A}|\geq T</math>), respectively. | ||
Line 653: | Line 609: | ||
====Model-based RL with Infinite Time Horizon==== | ====Model-based RL with Infinite Time Horizon==== | ||
To derive policies by utilizing estimated transition probabilities and reward functions under the infinite time horizon setting, <ref name="brafman2002r"/> proposed an R-MAX algorithm which can also be applied in zero-sum stochastic games. The R-MAX algorithm was proved to achieve a sample complexity of exploration of order <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|}{(1-\lambda)^6\varepsilon^3}\big)</math> by <ref name="li2009rmax"/>. The Model-based Interval Estimation (MBIE) in <ref name="Strehl2005"/> was proved to achieve a sample complexity of exploration of the same order as R-MAX. <ref name="MoRmax2010"/> further modified the R-MAX algorithm to an MoRmax algorithm with an improved sample complexity of exploration of order <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\lambda)^6\varepsilon^2}\big)</math>. | To derive policies by utilizing estimated transition probabilities and reward functions under the infinite time horizon setting, <ref name="brafman2002r"/> proposed an R-MAX algorithm which can also be applied in zero-sum stochastic games. The R-MAX algorithm was proved to achieve a sample complexity of exploration of order <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|}{(1-\lambda)^6\varepsilon^3}\big)</math> by <ref name="li2009rmax"/>. The Model-based Interval Estimation (MBIE) in <ref name="Strehl2005"/> was proved to achieve a sample complexity of exploration of the same order as R-MAX. <ref name="MoRmax2010"/> further modified the R-MAX algorithm to an MoRmax algorithm with an improved sample complexity of exploration of order <math>\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\lambda)^6\varepsilon^2}\big)</math>. | ||
<ref name="azar2013minimax"/> proved that the model-based <math>Q</math>-Value Iteration (QVI) achieves the <math>Q</math>-sample complexity of <math>\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^3\varepsilon^2)})</math> for some small <math>\varepsilon</math> | <ref name="azar2013minimax"/> proved that the model-based <math>Q</math>-Value Iteration (QVI) achieves the <math>Q</math>-sample complexity of <math>\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^3\varepsilon^2)})</math> for some small <math>\varepsilon</math> | ||
with high probability at least <math>1-\delta</math>. <ref name="agarwal2020model"/> studied the approximate MDP model with any accurate black-box planning algorithm and showed that this has the same sample complexity as <ref name="azar2013minimax"/>, but with a larger range of <math>\varepsilon</math>. | with high probability at least <math>1-\delta</math>. <ref name="agarwal2020model"/> studied the approximate MDP model with any accurate black-box planning algorithm and showed that this has the same sample complexity as <ref name="azar2013minimax"/>, but with a larger range of <math>\varepsilon</math>. | ||
See also <ref name="strehl2009reinforcement"/><ref name="kearns2002near"/><ref name="lattimore2012pac"/> for model-based RL along this line. | See also <ref name="strehl2009reinforcement"/><ref name="kearns2002near"/><ref name="lattimore2012pac"/> for model-based RL along this line. | ||
''Linear-quadratic'' (LQ) problems are another type of model-based RL problem that assumes linear state dynamics and quadratic objective functions with continuous state and action space. These problems are one of the few optimal control problems for which there exists a closed-form analytical representation of the optimal feedback control. Thus LQ frameworks, including the (single-agent) ''linear-quadratic regulator'' (LQR) and (multi-agent) LQ games, can be used in a wide variety of applications, such as the optimal execution problems which we will discuss later. For a theoretical analysis of RL algorithms within the LQ framework, see for example <ref name="fazel2018global"/> and <ref name="hambly2020policy"/> for the global linear convergence of policy gradient methods in LQR problems with infinite and finite horizon, respectively. See also <ref name="basei2021logarithmic"/> and <ref name="guo2021reinforcement"/> for the regret analysis of linear-quadratic reinforcement learning algorithms in continuous time. | ''Linear-quadratic'' (LQ) problems are another type of model-based RL problem that assumes linear state dynamics and quadratic objective functions with continuous state and action space. These problems are one of the few optimal control problems for which there exists a closed-form analytical representation of the optimal feedback control. Thus LQ frameworks, including the (single-agent) ''linear-quadratic regulator'' (LQR) and (multi-agent) LQ games, can be used in a wide variety of applications, such as the optimal execution problems which we will discuss later. For a theoretical analysis of RL algorithms within the LQ framework, see for example <ref name="fazel2018global"/> and <ref name="hambly2020policy"/> for the global linear convergence of policy gradient methods in LQR problems with infinite and finite horizon, respectively. See also <ref name="basei2021logarithmic"/> and <ref name="guo2021reinforcement"/> for the regret analysis of linear-quadratic reinforcement learning algorithms in continuous time. | ||
====<span id="sec:ex-ex"></span>Exploration Exploitation Trade-offs==== | ====<span id="sec:ex-ex"></span>Exploration Exploitation Trade-offs==== | ||
Line 668: | Line 620: | ||
is the Q value function and <math>U(s,a)</math> | is the Q value function and <math>U(s,a)</math> | ||
is a function inversely proportional to how many times the state-action pair <math>(s,a)</math> has been taken (hence proportional to the agent's confidence in her estimation of the Q function). For Boltzmann exploration <ref name="asadi2017alternative"/><ref name="haarnoja2017reinforcement"/><ref name="WZZ2018"/>, the agent draws actions from a Boltzmann distribution (softmax) over the learned Q value function, regulated by a temperature parameter. For the Thompson sampling method <ref name="thompson1933likelihood"/><ref name="ouyang2017learning"/><ref name="gopalan2015thompson"/>, the agent keeps track of a '' belief of the optimal action'' via a distribution over the set of admissible actions (for each given state). | is a function inversely proportional to how many times the state-action pair <math>(s,a)</math> has been taken (hence proportional to the agent's confidence in her estimation of the Q function). For Boltzmann exploration <ref name="asadi2017alternative"/><ref name="haarnoja2017reinforcement"/><ref name="WZZ2018"/>, the agent draws actions from a Boltzmann distribution (softmax) over the learned Q value function, regulated by a temperature parameter. For the Thompson sampling method <ref name="thompson1933likelihood"/><ref name="ouyang2017learning"/><ref name="gopalan2015thompson"/>, the agent keeps track of a '' belief of the optimal action'' via a distribution over the set of admissible actions (for each given state). | ||
For deep RL training when neural networks are used for function approximation, entropy regularization (adding an entropy term into the loss function) and noise-based exploration (adding noise into the observation, action or even parameter space) are often used for better exploration of the parameter space. | For deep RL training when neural networks are used for function approximation, entropy regularization (adding an entropy term into the loss function) and noise-based exploration (adding noise into the observation, action or even parameter space) are often used for better exploration of the parameter space. | ||
====<span id="subsec:regularization"></span>Regularization in Reinforcement Learning==== | ====<span id="subsec:regularization"></span>Regularization in Reinforcement Learning==== | ||
Line 679: | Line 629: | ||
It is worth noting that <ref name="WZZ2018"/> explained the exploitation-exploration trade-off with entropy regularization from a continuous-time stochastic control perspective and provided theoretical support for Gaussian exploration from the special case of linear-quadratic regulator. Recently <ref name="gao2020state"/> and <ref name="tang2021exploratory"/> extended this idea to a general control framework beyond the linear-quadratic case. <ref name="grau2018soft"/> extended the idea of Shannon's entropy regularization to mutual-information regularization and showed its superior performance when actions have significantly different importance. | It is worth noting that <ref name="WZZ2018"/> explained the exploitation-exploration trade-off with entropy regularization from a continuous-time stochastic control perspective and provided theoretical support for Gaussian exploration from the special case of linear-quadratic regulator. Recently <ref name="gao2020state"/> and <ref name="tang2021exploratory"/> extended this idea to a general control framework beyond the linear-quadratic case. <ref name="grau2018soft"/> extended the idea of Shannon's entropy regularization to mutual-information regularization and showed its superior performance when actions have significantly different importance. | ||
When functional approximations are applied in RL training, the regularization technique is shown to be a powerful, flexible, and principled way to balance approximation and estimation errors <ref name="massoud2009regularized"/><ref name="farahmand2008regularized"/>. The idea of regularization is that one starts from a large function space and controls the solution's complexity by a regularization (or penalization) term. Examples of regularization include the Hilbert norm (for <math>Q</math>-functions) and the <math>l_1</math> norm (for parameters). | When functional approximations are applied in RL training, the regularization technique is shown to be a powerful, flexible, and principled way to balance approximation and estimation errors <ref name="massoud2009regularized"/><ref name="farahmand2008regularized"/>. The idea of regularization is that one starts from a large function space and controls the solution's complexity by a regularization (or penalization) term. Examples of regularization include the Hilbert norm (for <math>Q</math>-functions) and the <math>l_1</math> norm (for parameters). | ||
====Reinforcement Learning in Non-stationary Environment==== | ====Reinforcement Learning in Non-stationary Environment==== | ||
Most existing work on reinforcement learning considers a stationary environment and aims to find the optimal policy or a policy with low (static) regret. In many financial applications, however, the environment is far from being stationary. We introduce | Most existing work on reinforcement learning considers a stationary environment and aims to find the optimal policy or a policy with low (static) regret. In many financial applications, however, the environment is far from being stationary. We introduce | ||
Line 725: | Line 675: | ||
where <math>s_{u+1} \sim P_{u}(\cdot|s_u,a_u)</math> and <math>a_u\sim \pi_{u}(s_u)</math>.\\ | where <math>s_{u+1} \sim P_{u}(\cdot|s_u,a_u)</math> and <math>a_u\sim \pi_{u}(s_u)</math>.\\ | ||
Indeed, there has been a recent surge of theoretical studies on this topic for various settings such as infinite-horizon MDPs with finite state and action spaces <ref name="gajane2018sliding"/><ref name="cheung2019learning"/>, episodic MDPs with finite state and action spaces <ref name="mao2020model"/>, | Indeed, there has been a recent surge of theoretical studies on this topic for various settings such as infinite-horizon MDPs with finite state and action spaces <ref name="gajane2018sliding"/><ref name="cheung2019learning"/>, episodic MDPs with finite state and action spaces <ref name="mao2020model"/>, and episodic linear MDPs <ref name="touati2020efficient"/>. | ||
and episodic linear MDPs <ref name="touati2020efficient"/>. | |||
However, all these papers assume the agent has prior knowledge of the degree of nonstationarity such as the number of changes in the environment, which is not very practical for many applications. | However, all these papers assume the agent has prior knowledge of the degree of nonstationarity such as the number of changes in the environment, which is not very practical for many applications. | ||
To relax this assumption, <ref name="cheung2020reinforcement"/> proposes a Bandit-over-Reinforcement-Learning (BoRL) algorithm to extend their earlier result <ref name="cheung2019learning"/>. However, it introduces extra overheads and leads to suboptimal regret. | To relax this assumption, <ref name="cheung2020reinforcement"/> proposes a Bandit-over-Reinforcement-Learning (BoRL) algorithm to extend their earlier result <ref name="cheung2019learning"/>. However, it introduces extra overheads and leads to suboptimal regret. |
Revision as of 23:52, 11 May 2024
label{sec:rl_basics} Reinforcement learning is an approach to understanding and automating goal-directed learning and decision-making. It leads naturally to algorithms for such problems and is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with {their} environment, without relying on exemplary supervision or complete models of the environment. RL uses a formal framework defining the interaction between a learning agent and {their} environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and non-determinism, and the existence of explicit goals. The history of RL has two main threads that were pursued independently before intertwining in modern RL. One thread concerns learning by trial and error and started in the psychology of animal learning. The other thread concerns the problem of optimal control for an evolving system and its solution using value functions and dynamic programming. Although this thread did not involve learning directly, the Bellman equation developed from this line of literature is viewed as the foundation of many important modern RL algorithms such as [math]Q[/math]-learning and Actor-Critic. Over the last few years, RL, grounded on combining classical theoretical results with deep learning and the functional approximation paradigm, has proved to be a fruitful approach to many artificial intelligence tasks from diverse domains. Breakthrough achievements include reaching human-level performance in such complex tasks as Go [1] and multi-player StarCraft II [2], as well as the development of ChatGPT [3]. The generality of the reinforcement learning framework allows its application in both discrete and continuous spaces to solve tasks in both real and simulated environments [4]. Classical RL research during the last third of the previous century developed an extensive theoretical core on which modern algorithms are based. Several algorithms, such as Temporal-Difference Learning and [math]Q[/math]-learning, were developed and are able to solve small-scale problems when either the states of the environment can be enumerated (and stored in memory) or the optimal policy lies in the space of linear or quadratic functions of the state variable. Although these restrictions are extremely limiting, these foundations of classical RL theory underlie the modern approaches which benefit from increased computational power. Combining this framework with deep learning [5] was popularized by the Deep [math]Q[/math]-learning algorithm, introduced in [6], which was able to play any of 57 Atari console games without tweaking the network architecture or algorithm hyperparameters. This novel approach was extensively researched and significantly improved over the following years [7][8][9].
Our aim in this section is to introduce the foundations of reinforcement learning. We start with the setup for MDP in Section with both an infinite time horizon and a finite time horizon, as there are financial applications of both settings in the literature. In Section we then introduce the learning procedure when the reward and transition dynamics of the MDPs are unknown to the decision-maker. In particular, we introduce various criteria to measure the performance of learning algorithms in different settings. Then we focus on two major classes of model-free reinforcement learning algorithms, value-based methods in Section and policy-based methods in Section~subsec:policy_based_methods, with an infinite-time horizon. Finally, Section~sec:discussion contains a general discussion of (model-based and model-free) RL with a finite-time horizon, model-based RL with an infinite-time horizon, and regularization methods for RL.
Setup: Markov Decision Processes
We first introduce MDPs with infinite time horizon. Many portfolio optimization problems, such as those arising from pension funds, investigate long-term investment strategies and hence it is natural to formulate them as a decision-making problem over an infinite time horizon. We then introduce MDPs with finite time horizon. Problems such as the optimal execution of the purchase or liquidation of a quantity of asset usually involve trading over a short time horizon and there may be a penalty at the terminal time if the targeted amount is not completely executed by then. Finally we discuss different policy classes that are commonly adopted by the decision-maker.
Infinite Time Horizon and Discounted Reward. Let us start with a discrete time MDP with an infinite time horizon and discounted reward. We have a Markov process taking values in a state space [math]\Sc[/math], where we can influence the evolution by taking an action from a set [math]\Ac[/math]. The aim is to optimize the expected discounted return from the system by choosing a policy, that is a sequence of actions through time. We formulate this mathematically by defining the value function [math]V^\star[/math] for each [math]s\in \Sc[/math] to be
subject to
Here and throughout the paper, [math]\E^{\Pi}[/math] denotes the expectation under the policy [math]{\Pi}[/math], where the probability measure [math]\mathbb{P}[/math] describes both dynamics and the rewards in the MDP. We will write [math]\mathcal{P}(\mathcal{X})[/math] for the probability measures over a space [math]\mathcal{X}[/math]. The state space [math](\Sc, d_{\Sc})[/math] and the action space [math](\Ac, d_{\Ac})[/math] are both complete separable metric spaces, which includes the case of [math]\Sc[/math] and [math]\Ac[/math] being finite, as often seen in the RL literature; [math]\gamma[/math] [math]\in[/math] [math](0, 1)[/math] is a discount factor; [math]s_t \in \Sc[/math] and [math]a_t \in \Ac[/math] are the state and the action at time [math]t[/math]; [math]P: \Sc \times \Ac \to \Pc(\Sc)[/math] is the transition function of the underlying Markov process; the notation [math] s_{t + 1} \sim P(s_t, a_t)[/math] denotes that [math] s_{t + 1}[/math] is sampled from the distribution [math] P(s_t, a_t)\in \mathcal{P}(\mathcal{S})[/math]; the reward [math]r(s, a)[/math] is a random variable in [math]\R[/math] for each pair [math](s, a)\in \Sc \times \Ac[/math]; and the policy [math]\Pi =\{\pi_t\}_{t=0}^{\infty}[/math] is Markovian, in that it just depends on the current state, and can be either deterministic or randomized. For a deterministic policy, [math]\pi_t: \Sc \to \Ac[/math] maps the current state [math]s_t[/math] to a deterministic action. On the other hand, a randomized policy [math]\pi_t: \Sc\to \Pc(\Ac)[/math] maps the current state [math]s_t[/math] to a distribution over the action space. For a randomized policy [math]\pi_t[/math], we will use [math]\pi_t(s)\in \mathcal{P}(\mathcal{A})[/math] and [math]\pi_t(s,a)[/math] to represent, respectively, the distribution over the action space given state [math]s[/math] and the probability of taking action [math]a[/math] at state [math]s[/math]. In this case, with infinite-time horizon, we assume the reward [math]r[/math] and the transition dynamics [math]P[/math] are time homogeneous, which is a standard assumption in the MDP literature [10]. We also note that the setup is essentially the same if we consider the problem of minimizing costs rather than maximizing a reward.
The well-known Dynamic Programming Principle (DPP), that the optimal policy can be obtained by maximizing the reward from one step and then proceeding optimally from the new state, can be used to derive the following Bellman equation for the value function \eqref{equ: singlev};
We write the value function as
where the [math]Q[/math]-function, one of the basic quantities used for RL, is defined to be
the expected reward from taking action [math]a[/math] at state [math]s[/math] and then following the optimal policy thereafter. There is also a Bellman equation for the [math]Q[/math]-function given by
This allows us to retrieve the optimal (stationary) policy [math]\pi^*(s, a)[/math] (if it exists) from [math]Q(s, a)[/math], in that [math]\pi^*(s, a) \in \arg\max_{a \in \Ac} Q(s, a)[/math]. An alternative setting over an infinite time horizon is the case of average reward, which is also referred to as an ergodic reward. This ergodic setting is not as relevant to financial applications and hence will not be covered in this review paper. We refer the reader to [11][12] for a more detailed discussion of RL with infinite-time horizon and average reward. \iffalse
Infinite Time Horizon and Average Reward. Another setting over an infinite time horizon is to consider the average reward, which is also referred to as the ergodic reward,
subject to
where [math]h_t :=\{s_0,a_0,\cdots,s_{t-1},a_{t-1},s_t\} [/math] is the history up to time [math]t[/math]; [math]s_t \in \Sc[/math] and [math]a_t \in \Ac[/math] are the state and the action at time [math]t[/math]; [math]P: \Sc \times \Ac \to \Pc(\Sc)[/math] is the transition matrix of the underlying Markov system; the reward [math]r(s, a)[/math] is random valued in [math]\R[/math] for each pair [math](s, a)\in \Sc \times \Ac[/math]; and [math]\Pi =\{\pi_t\}_{t=0}^{\infty}[/math] is the policy adopted by the agent. \fi
Finite Time Horizon. When there is a finite time horizon [math]T \lt \infty[/math] we no longer discount future values and have a terminal reward. The MDP problem with finite time horizon can be expressed as
subject to
As in the infinite horizon case, we denote by [math]s_u \in \Sc[/math] and [math]a_u \in \Ac[/math] the state and the action of the agent at time [math]u[/math]. However, where this differs from the infinite horizon case, is that we allow time-dependent transition and reward functions. Now [math]P_u: \Sc \times \Ac \to \Pc(\Sc)[/math] denotes the transition function and [math]r_u(s, a)[/math] a real-valued random variable for each pair [math](s, a)\in \Sc \times \Ac[/math] for [math]t \leq u \leq T-1[/math]. Similarly [math]r_T(s)[/math], the terminal reward, is a real-valued random variable for all [math]s\in \Sc[/math]. Finally, the Markovian policy [math]\Pi =\{\pi_u\}_{t=0}^T[/math] can be either deterministic or randomized. The corresponding Bellman equation for the value function \eqref{equ: singlev_fh} is defined as
with the terminal condition [math]V^\star_T(s) = \mathbb{E}[r_{T}(s)][/math]. We write the value function as
where the [math]Q^\star_t[/math] function is defined to be
There is also a Bellman equation for the [math]Q[/math]-function given by
with the terminal condition [math]Q^\star_T(s,a) = \mathbb{E}[r_T(s)][/math] for all [math]a\in \mathcal{A}[/math]. We note that since financial time series data are typically non-stationary, time-varying transition kernels and reward functions in \eqref{equ: singlev_fh}-\eqref{equ: singledynamics_fh} are particularly important for financial applications. \iffalse
Policy Class and Optimal Policies. We consider the following types of policies for [math]\Pi = \{\pi_t\}_{t=0}^{T}[/math] with [math]T \lt \infty[/math] for the finite time horizon setting and [math]T=\infty[/math] for the infinite time horizon setting:
- History-dependent Randomized Policy (HR): [math]\pi_t: (\Sc\times \Ac)^t\times\Sc \to \Pc(\Ac)[/math],
- History-dependent Deterministic Policy (HD): [math]\pi_t: (\Sc\times \Ac)^t\times\Sc \to \Ac[/math],
- Markov Randomized Policy (MR): [math]\pi_t: \Sc \to \Pc(\Ac)[/math],
- Markov Deterministic Policy (MD): [math]\pi_t: \Sc \to \Ac[/math].
It is easy to observe that MD [math]\subset[/math] MR [math]\subset[/math] HR. Recall that in \eqref{equ: singlev} and \eqref{equ: singlev_fh}, the optimal policy is defined as the one that maximizes the gain of the MDP. Due to the structure of MDPs it is not difficult to show that it is sufficient to consider Markovian policies. Henceforth, we shall focus only on Markovian policies. \fi
For an infinite horizon MDP with finite state and action space, and finite reward [math]r[/math], a further useful observation is that such an MDP always has a stationary optimal policy, whenever an optimal policy exists.
Assume [math]|\mathcal{A}| \lt \infty[/math], [math]|\mathcal{S}| \lt \infty[/math] and [math]|r| \lt \infty[/math] with probability one. For any infinite horizon discounted MDP, there always exists a deterministic stationary policy that is optimal.
\iffalse
Assume [math]|\mathcal{A}| \lt \infty[/math], [math]|\mathcal{S}| \lt \infty[/math] and [math]|r| \lt \infty[/math] with probability one. For infinite horizon average reward MDP, there always exist a stationary (possibly randomized) policy which is an optimal policy.
\fi Theorem implies that there always exists a fixed policy so that taking actions specified by that policy at each time step maximizes the discounted reward. The agent does not need to change policies with time. There is a similar result for the average reward case, see Theorem 8.1.2 in [10]. This insight reduces the question of finding the best sequential decision-making strategy to the question of finding the best stationary policy. To this end, we write [math]\pi:\mathcal{S}\rightarrow \mathcal{P}(\mathcal{A})[/math] (without a time index) for a stationary policy throughout the paper.
Linear MDPs and Linear Functional Approximation. Linear MDPs have been extensively studied in the recent literature on theoretical RL in order to establish tractable and efficient algorithms. In a linear MDP, both the transition kernels and the reward function are assumed to be linear with respect to some feature mappings [13][14].
In the infinite horizon setting, an MDP is said to be linear with a feature map [math]\phi: \mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R}^d[/math], if there exist [math]d[/math] unknown (signed) measures [math]\mu = (\mu^{(1)},\cdots,\mu^{(d)})[/math] over [math]\mathcal{S}[/math] and an unknown vector [math]\theta \in \mathbb{R}^d[/math] such that for any [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math], we have
Similarly for the finite horizon setting, an MDP is said to be linear with a feature map [math]\phi: \mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R}^d[/math], if for any [math]0\leq t \leq T[/math], there exist [math]d[/math] unknown (signed) measures [math]\mu_t = (\mu_t^{(1)},\cdots,\mu_t^{(d)})[/math] over [math]\mathcal{S}[/math] and an unknown vector [math]\theta_t \in \mathbb{R}^d[/math] such that for any [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math], we have
Typically features are assumed to be known to the agent and bounded, namely, [math]\|\phi(s,a)\|\leq 1[/math] for all [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math]. The linear MDP framework is closely related to the literature on RL with linear functional approximation, where the value function is assumed to be of the form
for the infinite horizon case, and
for the finite horizon case. Here [math]\psi: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}^d[/math] and [math]\xi: \mathcal{S}\rightarrow \mathbb{R}^d[/math] are known feature mappings and [math]\omega[/math], [math]\omega_t[/math] [math]\eta[/math], and [math]\eta_t[/math] are unknown vectors. It has been shown that the linear MDP (i.e., \eqref{eq:linearMDP_infinite} or \eqref{eq:linearMDP_finite}) and the linear functional approximation formulation (i.e., \eqref{eq:lfa_infinite} or \eqref{eq:lfa_finite}) are equivalent under mild conditions [15][16]. In addition to linear functional approximations, we refer the reader to [17] for a general discussion of RL with nonlinear functional approximations and to Section \ref{sec:deep_value_based} for neural network approximation, one of the most popular nonlinear functional approximations used in practice.
Nonlinear Functional Approximation. Compared to linear functional approximations, nonlinear functional approximations do not require knowledge of the kernel functions a priori. Nonlinear functional approximation could potentially address the mis-specification issue when the agent has an incorrect understanding of the functional space that the MDP lies in. The most popular nonlinear functional approximation approach is to use neural networks, leading to deep RL. Thanks to the universal approximation theorem, this is a theoretically sound approach and neural networks show promising performance for a wide range of applications. Meanwhile gradient-based algorithms for certain neural network architectures enjoy provable convergence guarantees. We defer the discussion of deep RL to Section \ref{sec:deep_value_based} and its financial applications to Section \ref{sec:fin_app}.
From MDP to Learning
When the transition dynamics [math]P[/math] and the reward function [math]r[/math] for the infinite horizon MDP problem are unknown, this MDP becomes a reinforcement learning problem, which is to find an optimal stationary policy [math]\pi[/math] (if it exists) while simultaneously learning the unknown [math]P[/math] and [math]r[/math]. The learning of [math]P[/math] and [math]r[/math] can be either explicit or implicit, which leads to model-based and model-free RL, respectively. The analogous ideas hold for the finite horizon case. We introduce some standard RL terminology. A more detailed introduction to RL can be found in textbooks such as [18][19].
Agent-environment Interface. In the RL setting, the learner or the decision-maker is called the agent. The physical world that the agent operates in and interacts with, comprising everything outside the agent, is called the environment. The agent and environment interact at each of a sequence of discrete time steps, [math]t =0,1,2,3,\cdots[/math], in the following way. At the beginning of each time step [math]t[/math], the agent receives some representation of the environment's state, [math]s_t \in \mathcal{S}[/math] and selects an action [math]a_t\in \mathcal{A}[/math]. At the end of this time step, in part as a consequence of its action, the agent receives a numerical reward [math]r_t[/math] (possibly stochastic) and a new state [math]s_{t+1}[/math] from the environment. The tuple [math](s_t,a_t,r_t,s_{t+1})[/math] is called a sample at time [math]t[/math] and [math]h_t :=\{(s_u,a_u,r_u,s_{u+1})\}_{u=0}^t[/math] is referred to as the history or experience up to time [math]t[/math]. An RL algorithm is a finite sequence of well-defined policies for the agent to interact with the environment.
Exploration vs Exploitation. In order to maximize the accumulated reward over time, the agent
learns to select her actions based on her past experiences (exploitation) and/or by trying new choices (exploration). Exploration provides opportunities to improve performance from the current sub-optimal solution to the ultimate globally optimal one, yet it is time consuming and computationally expensive as over-exploration may impair the convergence to the optimal solution. Meanwhile, pure exploitation, i.e., myopically picking the current solution based solely on
past experience, though easy to implement, tends to yield sub-optimal global solutions. Therefore,
an appropriate trade-off between exploration and exploitation is crucial in the design of RL algorithms in order to improve the learning and the optimization performance. See Section for a more detailed discussion.
Simulator. In the procedure described above on how the agent interacts with the environment, the agent does so in an online fashion. Namely,
the initial state at time step [math]t+1[/math], [math]s_{t+1}[/math], is the state that the agent moves to after taking action [math]a_t[/math] from state [math]s_t[/math]. This is a challenging setting since an efficient exploration scheme is needed. For instance, [math]Q[/math]-learning with the [math]\varepsilon[/math]-greedy policy (to be introduced in Section) may take exponentially many episodes
to learn the optimal policy [20].
Some results assume access to a simulator [21] (a.k.a., a generative model [22]),
which allows the algorithm to query arbitrary state-action pairs and return the reward and the next state. This implies the agent can “restart” the system at any time. That is, the initial state at time step [math]t+1[/math] does not need to be the state that agent moves to after taking action [math]a_t[/math] from state [math]s_t[/math]. The “simulator” significantly alleviates the difficulty of exploration, since a naive exploration strategy which queries all state-action pairs uniformly at random already leads to the most efficient algorithm for finding an optimal policy [22].
Randomized Policy vs Deterministic Policy. A randomized policy [math]\pi_t:\Sc \rightarrow \mathcal{P}(\Ac)[/math] is also known in the control literature as a relaxed control and in game theory as a mixed strategy. Despite the existence of a deterministic optimal policy as stated in Theorem for the infinite time horizon case, most of the RL algorithms adopt randomized policies to encourage exploration when RL agents are not certain about the environment.
An Example: Multi-armed bandits. We illustrate these points by discussing
multi-armed bandit problems, a special case of RL problems.
The multi-armed bandit is a model for a set of slot machines. A simple version is that there are a number of arms, each with a stochastic reward coming from a fixed probability distribution which is initially unknown.
We try these arms in some order, which may depend on the sequence of rewards that have been
observed so far and attempt to maximize our return. A common objective in this context is to find a policy for choosing the next arm
to be tried, under which the sum of the expected rewards comes as close as possible to the ideal
reward, i.e., the expected reward that would be obtained if we were to try the “best” arm at all times.
Thus the agent interacts with the environment by selecting an action - pulling an arm - and receiving a reward.
The policy of the order of pulling on the arms has to balance exploitation, the continued use of an arm that is returning a decent reward, with exploration, sampling arms about which there is less information to find one with a possibly better return.
In this setting for a multi-armed bandit problem we have not involved state dynamics. In this case, an admissible policy is simply a distribution over the action space for a randomized policy and a single arm for a deterministic policy. A simple example of a bandit with a state space is the stochastic contextual bandit which is used extensively in the personalized advertisement recommendation and clinical treatment recommendation literature. In this case the state dynamics (also referred to as the context) are sampled i.i.d. from an unknown distribution. For the personalized advertisement recommendation example, the state at time [math]t[/math] can be interpreted as the personal information and purchasing behavior of a targeted client.
The problem was first considered in the seminal work of Robbins [23], which
derives policies that asymptotically attain an average reward that converges in the limit to the reward of the best arm.
The multi-armed bandit problem was later studied in discounted, Bayesian, Markovian, expected
reward, and adversarial setups. See Berry and Fristedt [24] for a review of the classical results on the multi-armed bandit problem.
Performance Evaluation
It is not possible to solve MDPs analytically in general and therefore we will discuss numerical algorithms to determine the optimal policies. In order to assess different algorithms we will need a measure of their performance. Here we will consider several types of performance measure: sample complexity, rate of convergence, regret analysis and asymptotic convergence. For RL with a finite time horizon, one episode contains a sequence of states, actions and rewards, which starts at time [math]0[/math] and ends at the terminal time [math]T[/math], and the performance is evaluated in terms of the total number of samples in all episodes.[a] Sometimes we also refer to this setting as episodic RL. For RL with infinite time horizon, the performance is evaluated in terms of the number of steps. In this setting one step contains one (state, action, reward, next state) tuple. It is worth noting that the episodic criterion can also be used to analyze infinite horizon problems. One popular technique is to truncate the trajectory (with infinite steps) at a large time and hence translate the problem into an approximating finite time horizon problem. Examples include REINFORCE [27] and the policy gradient method [28] (see Section~subsec:policy_based_methods). Throughout the analysis, we use the notation [math]\widetilde{O}(\cdot)[/math] to hide logarithmic factors when describing the order of the scaling in [math]\varepsilon[/math], [math]\delta[/math], [math]|\mathcal{A}|[/math] and [math]|\mathcal{S}|[/math] (when these are finite), and other possible model parameters such as the discount rate [math]\gamma[/math] in the infinite horizon case or the dimension [math]d[/math] of the features when functional approximation is used. We write [math]\widetilde{\mathcal{O}}^{\prime}[/math] rather than [math]\widetilde{\mathcal{O}}[/math] to indicate that, although there may be other parameters, we only include the dependence on [math]\varepsilon[/math]. We denote by [math]\mbox{poly}(\cdot)[/math] a polynomial function of its arguments.
Sample Complexity. In RL, a sample [math](s,a,r,s^{\prime})[/math] is defined as a tuple consisting of a state [math]s[/math], an action [math]a[/math], an instantaneous reward received by taking action [math]a[/math] at state [math]s[/math], and the next state [math]s^{\prime}[/math] observed afterwards. Sample complexity is defined as the total number of samples required to find an approximately optimal policy. This evaluation criterion can be used for any kind of RL problem. For episodic RL, an algorithm returns, after the end of the [math](k-1)[/math]-th episode with [math]M_{k-1}[/math] samples used in this episode, a policy [math]\pi_k[/math] to be applied in the [math]k[/math]-th episode. The sample complexity of this algorithm is the minimum number [math]\sum_{k=1}^K M_{k-1}(\varepsilon)[/math] of samples such that for all [math]k \ge K[/math], [math]\pi_k[/math] is [math]\varepsilon[/math]-optimal with probability at least [math]1-\delta[/math], i.e., for [math]k \ge K[/math],
For discounted RL, an algorithm returns, after
the end of the [math](n-1)[/math]-th step with [math]M_{n-1}[/math] samples used in this iteration, a policy [math]\pi_n[/math] to be applied in the [math]n[/math]-th step. There are several notions of sample complexity in this setting.
The most commonly used ones are defined through the value function and the [math]Q[/math]-function with all input variables (i.e., [math]s\in \mathcal{S}[/math] for [math]V[/math] and [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math] for [math]Q[/math]) and they are referred to as the [math]V[/math]-sample complexity and [math]Q[/math]-sample complexity in the literature. The [math]Q[/math]-sample complexity of a given algorithm is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]Q^{(n)}[/math] is [math]\varepsilon[/math]-close to [math]Q^*[/math], that is
holds either with high probability (that is [math]\mathbb{P}\left(\|Q^{(n)}-Q^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math]) or in the expectation sense [math]\mathbb{E}[\|Q^{(n)}-Q^*\|_{\infty}] \lt \varepsilon[/math]. Similarly, the [math]V[/math]-sample complexity is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]V^{(n)}[/math] is [math]\varepsilon[/math]-close to [math]V^*[/math], that is
holds either with high probability (in that [math]\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math]) or in the expectation sense [math]\mathbb{E}[\|V^{(n)}-V^*\|_{\infty}] \lt \varepsilon[/math]. Note that \eqref{eq:sample_complexity_infi_Q} implies \eqref{eq:sample_complexity_infi_V_2} since [math]V^{(n)}(s) = \sup_{a\in \mathcal{A}}Q^{(n)}(s,a)[/math] and [math]V^{*}(s) = \sup_{a\in \mathcal{A}}Q^*(s,a)[/math]. In our analysis we do not distinguish between whether the sample complexity bounds for \eqref{eq:sample_complexity_infi_Q} and \eqref{eq:sample_complexity_infi_V_2} are in the high probability sense or in the expectation sense. The second type of sample complexity, the sample complexity of exploration, is defined as the number of samples such that the non-stationary policy [math]\pi_n[/math] at time [math]n[/math] is not [math]\varepsilon[/math]-optimal for the current state [math]s_n[/math]. This measure counts the number of mistakes along the whole trajectory. Mathematically speaking, the sample complexity of exploration for a given algorithm is the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math], [math]\pi_n[/math] is [math]\varepsilon[/math]-optimal when starting from the current state [math]s_n[/math] with probability at least [math]1-\delta[/math], i.e., for [math]n \ge N[/math],
Note that the condition [math]\mathbb{P}\left(\|V^{(n)}-V^*\|_{\infty} \lt \varepsilon\right) \ge 1-\delta[/math] associated with \eqref{eq:sample_complexity_infi_V_2} is stronger and implies the condition in \eqref{eq:sample_complexity_infi_V}. Another weaker criterion, the mean-square sample complexity, measures the sample complexity in the average sense with respect to the steady state distribution or the initial distribution [math]\mu[/math]. In this case, the mean-square sample complexity is defined as the minimum number [math]\sum_{n=1}^N M_{n-1}(\varepsilon)[/math] of samples such that for all [math]n \ge N[/math],
Since \eqref{eq:sample_complexity_infi_V_avg} is an aggregated guarantee via the [math]V[/math] function over a state distribution [math]\mu[/math], it is implied by \eqref{eq:sample_complexity_infi_V_2}.
{\color{blue} What does [math]\widetilde{\mathcal{O}}[/math] mean?}
Rate of Convergence. To emphasize the convergence with respect to the number of iterations required, we introduce the notion of rate of convergence, which uses the relationship between the number of iterations/steps [math]N[/math] and the error term [math]\varepsilon[/math], to quantify how fast the learned policy converges to the optimal solution. This is also called the iteration complexity in the RL and optimization literature. Compared to the notion of sample complexity introduced above, the rate of convergence calculates the number of iterations needed to reach a given accuracy while ignoring the number of samples used within each iteration. When only one sample is used in each iteration, the rate of convergence coincides with the sample complexity. In addition when a constant number of samples (i.e., independent from [math]\varepsilon[/math]) are used within each iteration, the rate of convergence and the sample complexity are of the same order with respect to [math]\varepsilon[/math]. In particular, an algorithm is said to converge at a linear rate if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(\log(1/\varepsilon))[/math]. Similarly, an algorithm is said to converge at a sublinear rate (slower than linear) if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^p)[/math] with [math]p\geq 1[/math], and is said to converge at a superlinear (faster than linear) if [math]N\sim \widetilde{\mathcal{O}}^{\prime}(\log(\log(1/\varepsilon))[/math].
Regret Analysis. The regret of a given policy [math]\pi[/math] is defined as the difference between the cumulative reward of the optimal policy and that gathered by [math]\pi[/math]. It quantifies the exploration-exploitation trade-off. For episodic RL with horizon [math]T[/math] in formulation \eqref{equ: singlev_fh}, the regret of an algorithm after [math]K [/math] episodes with [math]M=T\,K[/math] steps is
There is currently no regret analysis for RL problems with infinite time horizon and discounted reward. The natural definition of regret as given above is less meaningful in this case as the cumulative discounted reward is bounded and hence does not scale with [math]T[/math].
Asymptotic Convergence. In addition to sample complexity and regret analysis discussed above, asymptotic convergence is another weaker convergence guarantee, which requires the error term to decay to zero as [math]N[/math] goes to infinity (without specifying the order of [math]N[/math]). This is often a first step in the theoretical analysis of RL algorithms.
Classification of RL Algorithms
An RL algorithm includes one or more of the following components:
- a representation of a value function that provides a prediction of how good each state or each state/action pair is,
- a direct representation of the policy [math]\pi(s)[/math] or [math]\pi(s,a)[/math],
- a model of the environment (the estimated transition function and the estimated reward function) in conjunction with a planning algorithm (any computational process that uses a model to create or improve a policy).
The first two components are related to what is called model-free RL. When the latter component is used, the algorithm is referred to as model-based RL. In the MDP setting, model-based algorithms maintain an approximate MDP model by estimating the transition probabilities and the reward function, and derive a value function from the approximate MDP. The policy is then derived from the value function. Examples include [29][20][30] and [31]. Another line of model-based algorithms make structural assumptions on the model, using some prior knowledge, and utilize the structural information for algorithm design. For example, see [32] for learning linear-quadratic regulators over an infinite time horizon and [33][34] for the finite time horizon case. Unlike the model-based method, model-free algorithms directly learn a value (or state-value) function or the optimal policy without inferring the model. Model-free algorithms can be further divided into two categories, value-based methods and policy-based methods. Policy-based methods explicitly build a representation of a policy and keep it in memory during learning. Examples include policy gradient methods and trust-region policy optimization methods. As an alternative, value-based methods store only a value function without an explicit policy during the learning process. In this case, the policy is implicit and can be derived directly from the value function (by picking the action with the best value). In our discussion of methodology, we focus on model-free RL algorithms for MDP with infinite horizon and discounted reward. In particular, we introduce some classical value-based and policy-based methods in Sections and respectively. For the episodic setting and model-based algorithms, see the discussion in Section.
Value-based Methods
In this section, we introduce the methodologies of several classical value-based methods in the setting of an infinite time horizon with discounting, finite state and action spaces ([math]|\mathcal{A}| \lt \infty[/math] and [math]|\mathcal{S}| \lt \infty[/math]), and stationary policies. The setting with finite state and action spaces is also referred to as the tabular case in the RL literature. For more general setups with an infinite number of actions and states, we refer the reader to Section for a discussion of linear functional approximations and to Section for neural network approximations. Recall that for reinforcement learning, the distribution of the reward function [math]r[/math] and the transition function [math]P[/math] are unknown to the agent. Therefore the expectations in the Bellman equations \eqref{equ:classicalV} and \eqref{equ: singleQ} cannot be calculated directly. On the other hand, the agents can observe samples [math](s,a,r,s^{\prime})[/math] by interacting with the system without a model of the system's dynamics [math]P[/math] or any prior knowledge of [math]r[/math]. The agents can then learn the value function or [math]Q[/math]-function using the samples. Following this idea, we next introduce the classical temporal-difference learning algorithm, with [math]Q[/math]-learning and State-Action-Reward-State-Action (SARSA) as two special cases.
Temporal-Difference Learning
Given some samples [math](s,a,r,s')[/math] obtained by following a policy [math]\pi[/math], the agent can update her estimate of the value function [math]V^{\pi}[/math] (defined in \eqref{equ: singlev}) at the [math](n+1)[/math]-th iteration by
with some initialization of the value function [math]V^{\pi,(0)}[/math]. Here [math]\beta_n(s,a)[/math] is the learning rate at iteration [math]n+1[/math] which balances the weighting between the current estimate and the new estimate. The quantity [math]\beta_n[/math] can be a constant, can depend on the current state [math]s[/math], or even depend on the observed samples up to iteration [math]n+1[/math]. Algorithm provides some pseudo-code for an implementation of the TD method. Equation \eqref{eq:TD_update} is sometimes referred to as a TD(0) method. This is a special case of a TD[math](\lambda)[/math] method (for some [math]\lambda \in [0,1][/math]) which is a combination of TD learning (with weight [math]\lambda[/math]) and a Monte Carlo method (with weight [math]1-\lambda[/math]). Here a Monte Carlo method indicates a simulation-based method to calculate the value function with a longer period of data (in the episode-by-episode sense) rather than a single sample in each update. See Section for a detailed discussion of the REINFORCE method which is an example of such a Monte Carlo method. {\color{blue} I still dont understand. A Monte Carlo method is a simulation method for calculating expectations of functions of a stochastic process. Here we have a method, TD, which uses a single sample to update the value function. We could use more samples to update with some average of the samples from a given [math](s,a)[/math] - is this what you mean? Or do we run sample the policy for a longer time period and average these to get an approximation using the definition of [math]V^{\pi}[/math] as an expectation over an infinite discounted sum?} Equation \eqref{eq:TD_update} is also referred to as a one-step TD method as it only uses information from one update step of the system. There are multi-step TD methods; see Chapter 7 and Chapter 12 in [18] for more details. The TD update in \eqref{eq:TD_update} can also be written as
The term [math]\delta_n[/math], commonly called the TD error, measures the difference between the estimated value at [math]s[/math] and the better estimate [math]r+\gamma V^{\pi,(n)}(s')[/math], which is often called the TD target in the literature. The TD error and TD target are two important components in analyzing the convergence of the approximate value function and arise in various forms in the reinforcement learning literature.
[math]Q[/math]-learning Algorithm
A version of TD learning is [math]Q[/math]-learning. This is a stochastic approximation to the Bellman equation \eqref{equ: singleQ} for the [math]Q[/math]-function with samples observed by the agent. At iteration [math]n[/math] ([math]1 \leq n \leq N-1[/math]), the [math]Q[/math]-function is updated using a sample [math](s,a,r,s^{\prime})[/math] where [math]s' \sim P(s,a)[/math],
Here [math]\beta_n(s,a)[/math] is the learning rate which balances the weight between the current estimate [math]Q^{(n)}[/math] from iteration [math]n[/math] and the new estimate [math]r(s,a)+\gamma\max_{a'}Q^{(n)}(s',a')[/math] calculated using the sample [math](s,a,r,s^{\prime})[/math]. Algorithm provides pseudo-code for an implementation of Q-learning.
The following theorem guarantees the asymptotic convergence of the [math]Q[/math]-learning update \eqref{eq:Q_learning_update} to the [math]Q[/math]-function defined in \eqref{defsingleQ}.
\label{thm:Q_conv} Assume [math]|\mathcal{A}| \lt \infty[/math] and [math]|\mathcal{S}| \lt \infty[/math]. Let [math]n^i(s, a)[/math] be the index of the [math]i[/math]-th time that the action [math]a[/math] is used in state [math]s[/math]. Let [math]R \lt \infty[/math] be a constant. Given bounded rewards [math]|r|\leq R[/math], learning rate [math]0\leq\beta_n \lt 1[/math] and
This is a classical result (one of the first few theoretical results in RL) and the proof of the convergence is based on the action-replay process (ARP), which is an artificial controlled Markov process constructed from the episode sequence and the learning rate sequence [math]\beta[/math]. Theorem implies that eventually [math]Q^{(n)}[/math] will convergence to the true value function [math]Q^\star[/math] when condition \eqref{exploration} is satisfied. However, this asymptotic result does not provide any insights on how many samples are needed to reach a given accuracy. More results on the sample complexity for [math]Q[/math]-learning related algorithms are discussed in Section~subsubsec:value_theo.
SARSA
In contrast to the [math]Q[/math]-learning algorithm, which takes samples from any policy [math]\pi[/math] as the input where these samples could be collected in advance before performing the [math]Q[/math]-learning algorithm, SARSA adopts a policy which is based on the agent's current estimate of the [math]Q[/math]-function. The different source of samples is indeed the key difference between on-policy and off-policy learning, which will be discussed after the SARSA algorithm.
[b] The policy derived from [math]Q^{(n)}[/math] using [math]\varepsilon[/math]-greedy (line 4 and line 8 in Algorithm) suggests the following choice of action [math]a[/math] when in state [math]s[/math]:
In Algorithm, SARSA uses the [math]Q^{(n)}[/math] function with an [math]\epsilon[/math]-greedy policy to choose [math]a'[/math] and to perform exploration. In contrast, Q-learning uses the maximum value of [math]Q^{(n)}[/math] over all possible actions for the next step, which is equivalent to setting [math]\varepsilon=0[/math] when choosing [math]a'[/math].
In addition to the [math]\varepsilon[/math]-greedy policy, other exploration methods such as the softmax operator (resulting in a Boltzmann policy) [36][37][38] can also be applied. See Section for a more detailed discussion.
On-policy Learning vs. Off-policy Learning. An off-policy agent learns the value of the optimal policy independently of the agent's actions. An on-policy agent learns the value of the policy being carried out by the agent including the exploration steps. [math]Q[/math]-learning is an off-policy agent as the samples [math](s,a,r,s')[/math] used in updating \eqref{eq:Q_learning_update} may be collected from any policy and may be independent of the agent's current estimate of the [math]Q[/math]-function. The convergence of the [math]Q[/math]-function relies on the exploration scheme of the policy [math]\pi[/math] (see Theorem). In contrast to [math]Q[/math]-learning, SARSA is an on-policy learning algorithm and uses only its own estimation of the [math]Q[/math]-function to select an action and receive a real-time sample in each iteration. The difference is seen in how the new estimate in the update is calculated. In the update step of [math]Q[/math]-learning \eqref{eq:Q_learning_update}, we have [math]\max_{a'} Q^{(n)}(s',a')[/math] which is the maximum value of the [math]Q[/math]-function at a given state [math]s'[/math]. On the other hand, the new estimate in the update of SARSA \eqref{eq:SARSA_update} takes [math]Q^{(n)}(s',a')[/math] where [math]a'[/math] is selected based on a policy derived from [math]Q^{(n)}[/math] (via the [math]\varepsilon[/math]-greedy policy \eqref{epsilon_policy}).
%{\color{RedOrange}Note that there are several differences between the [math]Q[/math]-learning Algorithm
Discussion
In this section, we provide a brief overview of sample complexity results for model-free and value-based RL with infinite-time horizon and discounted reward. Sample complexity results for the model-based counterpart are deferred to Section. There has been very little non-asymptotic analysis of TD(0) learning. Existing results have focused on linear function approximations. For example, [39] showed that the mean-squared error converges at a rate of [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon^{1/\sigma}})[/math] with decaying learning rate [math]\beta_n= \frac{1}{(1+n)^{\sigma}}[/math] for some [math]\sigma\in (0,1)[/math] where i.i.d. samples are drawn from a stationary distribution; [40] provided an improved [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})[/math] bound for iterate-averaged TD(0) with constant step-size; and [41] reached the same mean-square sample complexity [math]\widetilde{\mathcal{O}}^{\prime}(\frac{1}{\varepsilon})[/math] with a simpler proof and extends the framework to the case where non-i.i.d. but Markov samples are drawn from the online trajectory. A similar analysis was applied by [42] to SARSA and [math]Q[/math]-learning algorithms for a continuous state space using linear function approximation (see the end of Section).
The [math]Q[/math]-sample complexity for the [math]Q[/math]-learning algorithm of [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\mbox{poly}(\log \delta^{-1})\right)[/math] was first established in [43] with decaying learning rate [math]\beta_n = \frac{1}{(1+n)^{4/5}}[/math] and access to a simulator. This was followed by [44] which showed that the same order of [math]Q[/math]-sample complexity [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{5/2}}\right)[/math] could be achieved with a constant learning rate [math]\beta_n \equiv \frac{(1-\gamma)^4\varepsilon^2}{|\mathcal{A}||\mathcal{S}|}[/math] [math](n\leq N)[/math]. Without access to a simulator, delayed [math]Q[/math]-learning [45] was shown to achieve a sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^8\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)[/math] assuming a known reward. An improvement to this result to [math]\widetilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^7\varepsilon^{4}}\mbox{poly}(\log \delta^{-1})\right)[/math] has recently been obtained using an Upper-Confidence-Bound (UCB) exploration policy [46] and a single trajectory. %{\color{RedOrange}In particular, the dependence on the discount rate [math]\gamma[/math] is improved from [math]\frac{1}{(1-\gamma)^8}[/math] to [math]\frac{1}{(1-\gamma)^7}[/math].} In addition, sample complexities of [math]Q[/math]-learning variants with (linear) function approximations have been established in several recent papers. [16] assumed a linear MDP framework (see the end of Section) and the authors provided a V-sample complexity [math]\widetilde{\mathcal{O}}\left(\frac{d}{(1-\gamma)^3\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)[/math] with [math]d[/math] denoting the dimension of the features. This result matches the information-theoretic lower bound up to [math]\log(\cdot)[/math] factors. Later work, [47] considered the setting where the transition kernel can be approximated by linear functions with small approximation errors to make the model more robust to model mis-specification. In this case, the authors provided a V-sample complexity of order [math]\widetilde{\mathcal{O}}\left(\frac{K}{(1-\gamma)^4\varepsilon^{2}}\mbox{poly}(\log \delta^{-1})\right)[/math]. Although the dependence on [math]\gamma[/math] is not as good as in [16], this framework can be applied to a more general class of models which are “near” linear. Although [math]Q[/math]-learning is one of the most successful algorithms for finding the best action-value function [math]Q[/math], its implementation often suffers from large high biased estimates of the [math]Q[/math]-function values incurred by random sampling. The double [math]Q[/math]-learning algorithm proposed in [48] tackled this high estimation issue by randomly switching the update between two [math]Q[/math]-estimators, and has thus gained significant popularity in practice. The first sample complexity result for the double [math]Q[/math]-learning algorithm was established in [49] with a Q-sample complexity on the order of [math]\widetilde{\mathcal{O}}\left(\left(\frac{1}{(1-\gamma)^6\varepsilon^2}\ln\,\frac{|\mathcal{A}||\mathcal{S}|}{(1-\gamma)^7\varepsilon^2\delta}\right)^{1/\omega}+\left(\frac{1}{1-\gamma}\ln\,\frac{1}{(1-\gamma)^2\varepsilon}\right)^{1/(1-\omega)}\right)[/math] and learning rate [math]\beta_n = \frac{1}{n^{w}}[/math] for some constant [math]\omega \in (0,1)[/math]. %{\color{RedOrange}[TODO Renyuan: add some comments on these orders -- which one is better than the others under what situation.]}
Policy-based Methods
%{\color{ForestGreen}[TODO Huining: describe the setup: infinite horizon and Markov policy.]} So far we have focused on value-based methods where we learn the value function to generate the policy using, for instance, the [math]\varepsilon[/math]-greedy approach. However, these methods may suffer from high computational complexity as the dimensionality of the state or action space increases (for example in continuous or unbounded spaces). In this case, it may be more effective to directly parametrize the policy rather than the value function. Furthermore, it is empirically observed that policy-based methods have better convergence properties than value-based methods [50][51][52]. This stems from the fact that value-based methods can have big oscillations during the training process since the choice of actions may need to change dramatically in order to have an arbitrarily small change in the estimated value functions. In this section, we focus on model-free policy-based methods, where we learn a parametrized policy without inferring the value function in the setting of infinite time horizon with discounting, finite state and action spaces, and stationary policies. Examples of policy-based methods include REINFORCE [53], Actor-Critic methods [54], Trust Region Policy Optimization (TRPO) [55], Proximal Policy Optimization (PPO) [56] among others. We first parametrize the policy [math]\pi[/math] by a vector [math]\theta[/math], thus we define [math]\pi(s,\cdot;\theta)\in \mathcal{P}(\mathcal{A})[/math], the probability distribution parameterized by [math]\theta[/math] over the action space given the state [math]s[/math] at time [math]t[/math]. Here we will use [math]\pi(s,a;\theta)[/math] and [math]\pi_{\theta}[/math] interchangeably to represent the policy parameterized by [math]\theta[/math]. We write [math]\mu^{\pi_{\theta}}[/math] for the stationary distribution of the state under policy [math]\pi(s,a;\theta)[/math]. Then the policy objective function is defined to be
for RL with infinite time horizon, or
for RL with finite time horizon. In order to maximize this function, the policy parameter [math]\theta[/math] is updated using the gradient ascent rule:
where [math]\beta[/math] is the learning rate, and [math]\widehat{\nabla_{\theta}J(\theta)}[/math] is an estimate of the gradient of [math]J(\theta)[/math] with respect to [math]\theta[/math].
Policy Gradient Theorem
Our task now is to estimate the gradient [math]\nabla_{\theta}J(\theta)[/math]. The objective function [math]J(\theta)[/math] depends on both the policy and the corresponding stationary distribution [math]\mu^{\pi_{\theta}}[/math], and the parameter [math]\theta[/math] affects both of them. Calculating the gradient of the action with respect to [math]\theta[/math] is straightforward given the parametrization [math]\pi(s,a;\theta)[/math], whereas it might be challenging to analyze the effect of [math]\theta[/math] on the state when the system is unknown. Fortunately the well-known policy gradient theorem provides an expression for [math]\nabla_{\theta}J(\theta)[/math] which does not involve the derivatives of the state distribution with respect to [math]\theta[/math]. We write the [math]Q[/math]-function under policy [math]\pi(s,a;\theta)[/math] as [math]Q^{\pi_{\theta}}(s,a)[/math], then we have the following theorem (for more details see [18]).
Assume that [math]\pi(s,a;\theta)[/math] is differentiable with respect to [math]\theta[/math] and that there exists [math]\mu^{\pi_{\theta}}[/math], the stationary distribution of the dynamics under policy [math]\pi_{\theta}[/math], which is independent of the initial state [math]s_0[/math]. Then the policy gradient is
For MDPs with finite state and action space, the stationary distribution exists under mild assumptions. For MDPs with infinite state and action space, more assumptions are needed, for example uniform geometric ergodicity [58].
REINFORCE: Monte Carlo Policy Gradient
Based on the policy gradient expression in \eqref{eqn:policy_gra_thm}, it is natural to estimate the expectation by averaging over samples of actual rewards, which is essentially the spirit of a standard Monte Carlo method that repeatedly simulates a trajectory of length [math]M[/math] within each iteration. Now we introduce our first policy-based algorithm, called REINFORCE [53] or Monte Carlo policy gradient (see Algorithm). We use a simple empirical return function [math]G_t[/math] to approximate the [math]Q[/math]-function in \eqref{eqn:policy_gra_thm}. The return [math]G_t[/math] is defined as the sum of the discounted rewards and given by [math]G_t:=\sum_{s=t+1}^{M} \gamma^{s-t-1}r_{s}[/math]. Note that REINFORCE is a Monte Carlo algorithm since it employs the complete sample trajectories, that is, when estimating the return from time [math]t[/math], it uses all future rewards up until the end of the trajectory (see line 7 in Algorithm). Within the [math]n[/math]-iteration the policy parameter [math]\theta^{(n+1)}[/math] is updated [math]M[/math] times with [math]G_t[/math] for [math]t=0,1,\cdots,M-1[/math]. As a standard Monte Carlo method, REINFORCE provides an unbiased estimate of the policy gradient, however it suffers from high variance which therefore may lead to slow convergence. A popular variant of REINFORCE is to subtract a baseline, which is a function of the state [math]s[/math], from the return [math]G_t[/math]. This reduces the variance of the policy gradient estimate while keeping the mean of the estimate unchanged. A commonly used baseline is an estimate of the value function, [math]\widehat{V}(s)[/math], which can be learnt using one of the methods introduced in Section. Replacing the [math]G_t[/math] in \eqref{eqn:alg_REINFORCE_update} by [math]G_t-\widehat{V}(s)[/math] gives a REINFORCE algorithm with baseline.
Actor-Critic Methods
The fact that REINFORCE provides unbiased policy estimates but may suffer from high variance is an example of the bias-variance dilemma (see, e.g., [59] and [60]) which occurs in many machine learning problems. Also, as a Monte Carlo algorithm, REINFORCE requires complete trajectories so we need to wait until the end of each episode to collect the data. Actor-Critic methods can resolve the above issues by learning both the value function and the policy. The learned value function can improve the policy update through, for example, introducing bias into the estimation. Actor-Critic methods can also learn from incomplete experience in an online manner. In Actor-Critic methods the value function is parametrized by [math]w[/math] and the policy is parametrized by [math]\theta[/math]. These methods consist of two models: a critic which updates the value function or [math]Q[/math]-function parameter [math]w[/math], and an actor which updates the policy parameter [math]\theta[/math] based on the information given by the critic. A natural idea is to use policy-based methods for the actor and use one of the methods introduced in Section for the critic; for example SARSA or [math]Q[/math]-learning. We give the pseudocode for a simple Actor-Critic method in Algorithm. Here for the critic, we approximate the [math]Q[/math]-function by a linear combination of features, that is, [math]Q(s,a;w)=\phi(s,a)^\top w[/math] for some known feature [math]\phi:\mathcal{S}\times \mathcal{A}\rightarrow\mathbb{R}^d[/math], and use TD(0) to update the parameter [math]w[/math] to minimize the least-square error of the TD error with the gradient of the least-square error taking the form [math]\delta \phi(s,a)[/math] (see lines 9-10 in Algorithm). For the actor, we use the (vanilla) policy-based method to update the policy parameter [math]\theta[/math] (see line 11 in Algorithm). There are three main ways to execute the algorithm. In the nested-loop setting (see, e.g. [61][62]), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the two time-scale setting (see, e.g. [63]), where the actor and the critic update their parameters simultaneously with different learning rates. Note that the learning rate for the actor is typically much smaller than that of the critic in this setting. In the third way, the single-scale setting, the actor and the critic update their parameters simultaneously but with a much larger learning rate for the actor than for the critic (see, e.g. [64][65]).
Discussion
Variants of Policy-based Methods. There have been many variations of policy-based methods with improvements in various directions. For example, the natural policy-based algorithms [66][67] modify the search direction of the vanilla version by involving a Fisher information matrix in the gradient ascent updating rule, which speeds the convergence significantly. To enhance the stability of learning, we may request the updated policy estimate to be not too far away from the previous estimate. Along this line, TRPO [55] uses the Kullback-Leibler (KL)-divergence to measure the change between the consecutive updates as a constraint, while PPO [56] incorporate an adaptive KL-penalty or a clipped objective in the objective function to eliminate the constraint. When using the KL-penalty, PPO can be viewed as a Lagrangian relaxation of the TRPO algorithm. To reduce the sample complexity, Deterministic Policy Gradient (DPG) [68] uses a deterministic function (rather than the stochastic policy [math]\pi[/math]) to represent the policy to avoid sampling over actions (although enough exploration needs to be guaranteed in other ways); Actor-Critic with Experience Replay (ACER) [69] reuses some experience (tuples of data collected in previous iterations/episodes, see the discussion of Deep [math]Q[/math]-Networks in Section) which improves the sample efficiency and decreases the data correlation. In addition we mention Asynchronous Advantage Actor-Critic (A3C) and Advantage Actor-Critic (A2C), two popular Actor-Critic methods with a special focus on parallel training [70]. The latter is a synchronous, deterministic version of the former. For continuous-time framework, see developments in [71][72].
Convergence Guarantees. Now we discuss the convergence guarantees of policy-based algorithms. [57] provided the first asymptotic convergence result for policy gradient methods with arbitrary differentiable function approximation for MDPs with bounded reward. They showed that such algorithms, including REINFORCE and Actor-Critic methods, converge to a locally stationary point. For more examples of asymptotic convergence, see for example [54] and [73]. Based on recent progress in non-convex optimization, non-asymptotic analysis of policy-based methods were first established for convergence to a stationary point. For example, [61] provided a convergence rate analysis for a nested-loop Actor-Critic algorithm to the stationary point through quantifying the smallest number of actor updates [math]k[/math] required to attain [math]\inf_{0\leq m\leq k}\|\nabla J(\theta^{(k)})\|^2 \lt \varepsilon[/math]. We denote this smallest number as [math]K[/math]. When the actor uses a policy gradient method, the critic achieves [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^4)[/math] by employing TD(0), [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{3})[/math] by employing the Gradient Temporal Difference, and [math]K\leq \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^{5/2})[/math] by employing the Accelerated Gradient Temporal Difference, with continuous state and action spaces. [74] investigated a policy gradient method with Monte Carlo estimation of the policy gradient, called the RPG algorithm, where the rollout length of trajectories is drawn from a Geometric distribution. This generates unbiased estimates of the policy gradient with bounded variance, which was shown to converge to the first order stationary point with diminishing or constant learning rate at a sublinear convergence rate. They also proposed a periodically enlarged learning rate scheme and proved that the modified RPG algorithm with this scheme converges to the second order stationary point in a polynomial number of steps. For more examples of sample complexity analysis for convergence to a stationary point, see for example [75][76][77][78][79]. The global optimality of stationary points was studied in [80] where they identified certain situations under which the policy gradient objective function has no sub-optimal stationary points despite being non-convex. Recently there have been results on the sample complexity analysis of the convergence to the global optimum. [81] proved that the entropy regularized natural policy gradient method achieves a [math]Q[/math]-sample complexity (and [math]\varepsilon[/math]-optimal policies) with linear convergence rate. [82] showed that with high probability at least [math]1-\delta[/math], the TRPO algorithm has sublinear rate [math] \widetilde{\mathcal{O}}^{\prime}(1/\varepsilon^2)[/math] with mean-squared sample complexity [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^3\varepsilon^4}\big)[/math] in the unregularized (tabular) MDPs, and has an improved rate [math]\widetilde{\mathcal{O}}^{\prime}(1/\varepsilon)[/math] with mean-squared sample complexity [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{A}|^2(|\mathcal{S}|+\log(1/\delta))}{(1-\gamma)^4\varepsilon^3}\big)[/math] in MDPs with entropy regularization. [63] gave the first non-asymptotic convergence guarantee for the two time-scale natural Actor-Critic algorithms with mean-squared sample complexity of order [math]\widetilde{\mathcal{O}}(\frac{1}{(1-\gamma)^9\varepsilon^{4}})[/math]. For single-scale Actor-Critic methods, the global convergence with sublinear convergence rate was established in both [64] and [65]. The non-asymptotic convergence of policy-based algorithms is shown in other settings, see [27] for the regret analysis of the REINFORCE algorithm for discounted MDPs and [83][84] for policy gradient methods in the setting of known model parameters.
General Discussion
So far we have focused on model-free RL with discounting over an infinite time horizon. In this section, we discuss two other cases: RL over a finite time horizon with both model-based and model-free methods, and model-based RL with an infinite time horizon.
RL with Finite Time Horizon
As discussed earlier, episodic RL with finite time horizon has also been widely used in many financial applications. In this section, we discuss some episodic RL algorithms (with both model-based and model-free methods) and their performance through both sample complexity and regret analysis. [85] proposed a model-based algorithm, known as Posterior Sampling for Reinforcement Learning (PSRL) which is a model-based algorithm, and establishes an [math]\widetilde{\mathcal{O}}(T|\mathcal{S}|\sqrt{|\mathcal{A}|M})[/math] expected regret bound. [25] proposed a so-called Upper-Confidence Fixed-Horizon (UCFH) RL algorithm that is model-based and has [math]V[/math]-sample complexity[c] of order [math]\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^3}{\varepsilon^2})[/math] assuming known rewards. [86] considered two model-based algorithms in the setting of known reward, called the UCBVI-CH and UCBVI-BF algorithms, which were proved to achieve a regret bound of [math]\widetilde{\mathcal{O}}(T\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})[/math] (when [math]M\geq T|\mathcal{S}|^3|\mathcal{A}|[/math] and [math]|\mathcal{S}||\mathcal{A}|\geq T[/math]) and [math]\widetilde{\mathcal{O}}(\sqrt{T|\mathcal{S}|\,|\mathcal{A}|M})[/math] (when [math]M\geq T^3|\mathcal{S}|^3|\mathcal{A}|[/math] and [math]|\mathcal{S}||\mathcal{A}|\geq T[/math]), respectively. [26] proposed an Upper Bounding the Expected Next State Value (UBEV) algorithm which achieves a sample complexity\footnotemark[2] of [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|\,T^5}{\varepsilon^2}{\rm polylog}(\frac{1}{\delta})\big)[/math]. They also proved that the algorithm has a regret bound of [math]\widetilde{\mathcal{O}}(T^2\sqrt{|\mathcal{S}|\,|\mathcal{A}|M})[/math] with [math]M\ge |\mathcal{S}|^5|\mathcal{A}|^3[/math]. [87] proved that [math]Q[/math]-learning with UCB exploration achieves regret of [math]\widetilde{\mathcal{O}}(\sqrt{T^3|\mathcal{S}|\,|\mathcal{A}|M})[/math] without requiring access to a simulator. The above regret bounds depend on the size of the state and action space and thus may suffer from the curse of dimensionality as [math]|\mathcal{S}|[/math] and [math]|\mathcal{A}|[/math] increase. To tackle this issue, [88] learned a low-dimensional representation of the probability transition model and proved that their MatrixRL algorithm achieves a regret bound of [math]\widetilde{\mathcal{O}}(T^2\,d\sqrt{M})[/math] which depends on the number of features [math]d[/math] rather than [math]|\mathcal{S}|[/math] and [math]|\mathcal{A}|[/math]. [15] provided a [math]\widetilde{\mathcal{O}}(\sqrt{d^3T^3M})[/math] regret bound with high probability for a modified version of the Least-Squares Value Iteration (LSVI) algorithm with UCB exploration.
Model-based RL with Infinite Time Horizon
To derive policies by utilizing estimated transition probabilities and reward functions under the infinite time horizon setting, [29] proposed an R-MAX algorithm which can also be applied in zero-sum stochastic games. The R-MAX algorithm was proved to achieve a sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|^2|\mathcal{A}|}{(1-\lambda)^6\varepsilon^3}\big)[/math] by [89]. The Model-based Interval Estimation (MBIE) in [90] was proved to achieve a sample complexity of exploration of the same order as R-MAX. [31] further modified the R-MAX algorithm to an MoRmax algorithm with an improved sample complexity of exploration of order [math]\widetilde{\mathcal{O}}\big(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\lambda)^6\varepsilon^2}\big)[/math]. [91] proved that the model-based [math]Q[/math]-Value Iteration (QVI) achieves the [math]Q[/math]-sample complexity of [math]\widetilde{\mathcal{O}}(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^3\varepsilon^2)})[/math] for some small [math]\varepsilon[/math] with high probability at least [math]1-\delta[/math]. [92] studied the approximate MDP model with any accurate black-box planning algorithm and showed that this has the same sample complexity as [91], but with a larger range of [math]\varepsilon[/math]. See also [45][20][30] for model-based RL along this line. Linear-quadratic (LQ) problems are another type of model-based RL problem that assumes linear state dynamics and quadratic objective functions with continuous state and action space. These problems are one of the few optimal control problems for which there exists a closed-form analytical representation of the optimal feedback control. Thus LQ frameworks, including the (single-agent) linear-quadratic regulator (LQR) and (multi-agent) LQ games, can be used in a wide variety of applications, such as the optimal execution problems which we will discuss later. For a theoretical analysis of RL algorithms within the LQ framework, see for example [32] and [33] for the global linear convergence of policy gradient methods in LQR problems with infinite and finite horizon, respectively. See also [34] and [93] for the regret analysis of linear-quadratic reinforcement learning algorithms in continuous time.
Exploration Exploitation Trade-offs
As mentioned earlier, exploitation versus exploration is a critical topic in RL. RL agents want to find the optimal solution as fast as possible, meanwhile committing to solutions too fast without enough exploration could lead to local minima or total failure. Modern RL algorithms that optimize for the best returns can achieve good exploitation quite efficiently, while efficient exploration remains a challenging and less understood topic. For multi-armed bandit problems or MDP with finite state and action spaces, classic exploration algorithms such as [math]\varepsilon[/math]-greedy, UCB, Boltzmann exploration and Thompson sampling show promising performance in different settings. For the [math]\varepsilon[/math]-greedy algorithm [94][95], the agent randomly explores occasionally with probability [math]\varepsilon[/math] and takes the optimal action most of the time with probability [math]1-\varepsilon[/math]. For example, see Equation \eqref{epsilon_policy} for the [math]\varepsilon[/math]-greedy method combined with a value-based algorithm. The UCB type of algorithms are mainly designed for value-based algorithms [46][86][87][15]. The agent selects the greediest action to maximize the upper confidence bound [math]Q(s,a) + U(s,a)[/math], where [math]Q(s,a)[/math] is the Q value function and [math]U(s,a)[/math] is a function inversely proportional to how many times the state-action pair [math](s,a)[/math] has been taken (hence proportional to the agent's confidence in her estimation of the Q function). For Boltzmann exploration [36][37][38], the agent draws actions from a Boltzmann distribution (softmax) over the learned Q value function, regulated by a temperature parameter. For the Thompson sampling method [96][97][98], the agent keeps track of a belief of the optimal action via a distribution over the set of admissible actions (for each given state).
For deep RL training when neural networks are used for function approximation, entropy regularization (adding an entropy term into the loss function) and noise-based exploration (adding noise into the observation, action or even parameter space) are often used for better exploration of the parameter space.
Regularization in Reinforcement Learning
Many recent developments in RL benefit from regularization, which has been shown to improve not only the exploration but also the robustness. Examples include both policy-based methods and value-based methods in various settings. Here we provide a general overview of different regularization techniques and their advantages along with several examples. For example, TRPO and PPO use the KL-divergence between two consecutive policies as a penalty in policy improvement [55][56]. Soft-[math]Q[/math]-learning uses Shannon entropy as a penalty in value iteration [37]. The authors confirmed in simulated experiments that entropy regularization helps to improve exploration and allows transferring skills between tasks. [99] proposed a unified framework to analyze the above algorithms via a regularized Bellman operator. [81] showed that entropy regularization can also help to speed up the convergence rate from a theoretical perspective. It is worth noting that [38] explained the exploitation-exploration trade-off with entropy regularization from a continuous-time stochastic control perspective and provided theoretical support for Gaussian exploration from the special case of linear-quadratic regulator. Recently [100] and [101] extended this idea to a general control framework beyond the linear-quadratic case. [102] extended the idea of Shannon's entropy regularization to mutual-information regularization and showed its superior performance when actions have significantly different importance. When functional approximations are applied in RL training, the regularization technique is shown to be a powerful, flexible, and principled way to balance approximation and estimation errors [103][104]. The idea of regularization is that one starts from a large function space and controls the solution's complexity by a regularization (or penalization) term. Examples of regularization include the Hilbert norm (for [math]Q[/math]-functions) and the [math]l_1[/math] norm (for parameters).
Reinforcement Learning in Non-stationary Environment
Most existing work on reinforcement learning considers a stationary environment and aims to find the optimal policy or a policy with low (static) regret. In many financial applications, however, the environment is far from being stationary. We introduce the mathematical formulations of non-stationary RL (in both episodic and infinite horizon settings) below.
Non-stationary Episodic RL. For the episodic (or finite-time horizon) setting, an agent interacts with a non-stationary MDP for [math]K[/math] episodes, with each episode containing [math]T[/math] timestamps. Let [math](t, k)[/math] denote a (time,episode) index for the [math]t[/math]-th timestamp in the [math]k[/math]-th episode. The non-stationary environment can be denoted by a tuple [math](\mathcal{S},\mathcal{A}, T, \pmb{P}, \pmb{r})[/math], where [math]\mathcal{S}[/math] is the state space, [math]\mathcal{A}[/math] is the action space, [math]T[/math] is the number of timestamps in one episode, [math]\pmb{P} = \{P_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T}[/math] is the set of transition kernels with [math]P_{k,t}: \mathcal{S}\times \mathcal{A}\rightarrow\mathcal{P}(\mathcal{S})[/math] for all [math]1\leq k\leq K[/math] and [math]0\leq t\leq T[/math], and [math]\pmb{r} = \{r_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T}[/math] is the set of reward functions with [math]r_{k,t}(s,a)[/math] a random variable in [math]\mathbb{R}[/math] (depending on both the time-step and the episode number) for each pair [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math] and for [math]0\leq t\leq T-1[/math]. Similarly, the terminal reward [math]r_{k,T}(s)[/math] is a real-valued random variable for each [math]s\in\mathcal{S}[/math]. Denote by [math]\Pi=\{\pi_{k,t} \}_{1 \leq k\leq K, 0\leq t\leq T-1}[/math] the Markovian policy in which [math]\pi_{k,t}[/math] can be either deterministic or randomized. The value function for this policy at the [math]k[/math]-th episode can be written as
where [math]s_{u+1} \sim P_{k,u}(s_u,a_u)[/math] and [math]a_u\sim \pi_{k,u}(s_u)[/math]. \iffalse
Non-stationary RL with Infinite Time Horizon. For the infinite horizon setting, an agent interacts with a non-stationary MDP starting from an initial timestamp [math]t[/math]. The non-stationary environment can be denoted by a tuple [math](\mathcal{S},\mathcal{A}, \pmb{P}, \pmb{r})[/math], where [math]\mathcal{S}[/math] is the state space, [math]\mathcal{A}[/math] is the action space, [math]\pmb{P} = \{P_{t} \}_{t \ge 0}[/math] is the set of transition kernels with [math]P_{t}: \mathcal{S}\times \mathcal{A}\rightarrow\mathcal{P}(\mathcal{S})[/math], and [math]\pmb{r} = \{r_{t} \}_{t \ge 0}[/math] is the set of reward functions with [math]r_{t}(s,a)[/math] a random variable in [math]\mathbb{R}[/math] (depending on both the time-step and the episode number) for each pair [math](s,a)\in \mathcal{S}\times \mathcal{A}[/math]. Denote by [math]\Pi=\{\pi_{t} \}_{t \ge 0}[/math] the Markovian policy in which [math]\pi_{t}[/math] can be either deterministic or randomized. Now we consider the associated value function with a discount factor [math]\gamma[/math]:
where [math]s_{u+1} \sim P_{u}(\cdot|s_u,a_u)[/math] and [math]a_u\sim \pi_{u}(s_u)[/math].\\ \ben{This repeats too much how about:} \fi
Non-stationary RL with Infinite Time Horizon. For the infinite horizon setting, we drop the index indicating episodes and let the finite horizon [math]T[/math] become infinite. Then the non-stationary environment can be denoted by a tuple [math](\mathcal{S},\mathcal{A}, \pmb{P}, \pmb{r})[/math], with the obvious changes from the episodic case. Again we write [math]\Pi=\{\pi_{t} \}_{t \ge 0}[/math] for the Markovian policy in which [math]\pi_{t}[/math] can be either deterministic or randomized and consider the associated value function with a discount factor [math]\gamma[/math]:
where [math]s_{u+1} \sim P_{u}(\cdot|s_u,a_u)[/math] and [math]a_u\sim \pi_{u}(s_u)[/math].\\
Indeed, there has been a recent surge of theoretical studies on this topic for various settings such as infinite-horizon MDPs with finite state and action spaces [105][106], episodic MDPs with finite state and action spaces [107], and episodic linear MDPs [108]. However, all these papers assume the agent has prior knowledge of the degree of nonstationarity such as the number of changes in the environment, which is not very practical for many applications. To relax this assumption, [109] proposes a Bandit-over-Reinforcement-Learning (BoRL) algorithm to extend their earlier result [106]. However, it introduces extra overheads and leads to suboptimal regret. More recently, [110] proposes a general approach that is applicable to various reinforcement learning settings (including both episodic MDPs and infinite-horizon MDPs) and achieves optimal dynamic regret without any prior knowledge of the degree of non-stationarity.
General references
Hambly, Ben; Xu, Renyuan; Yang, Huining (2023). "Recent Advances in Reinforcement Learning in Finance". arXiv:2112.04553 [q-fin.MF].
Notes
- Sample complexity with respect to the number of episodes has also been adopted in the literature [25][26]. The translation between these two criteria is straightforward.
- The learning rate can depend on the number of iterations, the state-action pair and other parameters in the algorithm. For notational simplicity, we demonstrate the SARSA Algorithm (and the rest of the algorithms mentioned) with a constant learning rate [math]\beta[/math]. In the convergence analysis, we use [math]\beta_n[/math] to denote a learning rate which depends on the number of iterations.
- In the original papers the sample complexity is defined in terms of the number of episodes. Here we multiply the original order in these papers by [math]T[/math] to match the definition of sample complexity in this paper.
References
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedsilver2017mastering
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedvinyals2019grandmaster
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedradford2019language
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedlillicrap2015continuous
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedgoodfellow2016deep
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedmnih2013playing
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedvan2016deep
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedgu2016continuous
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedfan2020theoretical
- 10.0 10.1 10.2 10.3 Cite error: Invalid
<ref>
tag; no text was provided for refs namedputerman2014markov
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedabbasi2019politex
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedwei2020model
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedbradtke1996linear
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedmelo2007q
- 15.0 15.1 15.2 Cite error: Invalid
<ref>
tag; no text was provided for refs namedjin2020provably
- 16.0 16.1 16.2 Cite error: Invalid
<ref>
tag; no text was provided for refs namedyang2019sample
- Cite error: Invalid
<ref>
tag; no text was provided for refs nameddai2018sbeed
- 18.0 18.1 18.2 Cite error: Invalid
<ref>
tag; no text was provided for refs namedsuttonbarto
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedpowell2021reinforcement
- 20.0 20.1 20.2 Cite error: Invalid
<ref>
tag; no text was provided for refs namedkearns2002near
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedkoenig1993complexity
- 22.0 22.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedazar2012sample
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedrobbins1952some
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedberry1985bandit
- 25.0 25.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedDannBrunskill2015
- 26.0 26.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedDannUBEV
- 27.0 27.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedzhang2020sample
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedliu2020improved
- 29.0 29.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedbrafman2002r
- 30.0 30.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedlattimore2012pac
- 31.0 31.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedMoRmax2010
- 32.0 32.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedfazel2018global
- 33.0 33.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedhambly2020policy
- 34.0 34.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedbasei2021logarithmic
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedwatkins1992q
- 36.0 36.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedasadi2017alternative
- 37.0 37.1 37.2 Cite error: Invalid
<ref>
tag; no text was provided for refs namedhaarnoja2017reinforcement
- 38.0 38.1 38.2 Cite error: Invalid
<ref>
tag; no text was provided for refs namedWZZ2018
- Cite error: Invalid
<ref>
tag; no text was provided for refs nameddalal2018finite
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedlakshminarayanan2018linear
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedbhandari2018finite
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedzou2019finite
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedeven2003learning
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedbeck2012error
- 45.0 45.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedstrehl2009reinforcement
- 46.0 46.1 Cite error: Invalid
<ref>
tag; no text was provided for refs nameddong2019q
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedlattimore2020learning
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedhasselt2010double
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedxiong2020finite
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedsewak2019policy
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedyu2020policy
- Cite error: Invalid
<ref>
tag; no text was provided for refs nameddaberius2019deep
- 53.0 53.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedwilliams1992
- 54.0 54.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedkonda2000
- 55.0 55.1 55.2 Cite error: Invalid
<ref>
tag; no text was provided for refs namedschulman2015
- 56.0 56.1 56.2 Cite error: Invalid
<ref>
tag; no text was provided for refs namedschulman2017proximal
- 57.0 57.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedsutton2000policy
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedkonda2002
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedfranccois2019
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedvon2011statistical
- 61.0 61.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedkumar2019sample
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedxu2020improving
- 63.0 63.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedxu2020non
- 64.0 64.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedfu2020single
- 65.0 65.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedxu2021doubly
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedkakade2001natural
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedpeters2008natural
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedsilver2014deterministic
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedwang2016
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedmnih2016asynchronous
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedjia2021policy
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedjia2022policy
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedbhatnagar2010actor
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedzhang2020global
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedpapini2018stochastic
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedxu2020improved
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedxu2019sample
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedshen2019hessian
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedxiong2020non
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedbhandari2019
- 81.0 81.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedcen2020fast
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedshani2020adaptive
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedagarwal2021theory
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedmei2020
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedPSRL2013
- 86.0 86.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedazar2017minimax
- 87.0 87.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedjin2018q
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedyang2020reinforcement
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedli2009rmax
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedStrehl2005
- 91.0 91.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedazar2013minimax
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedagarwal2020model
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedguo2021reinforcement
- Cite error: Invalid
<ref>
tag; no text was provided for refs nameddann2022guarantees
- Cite error: Invalid
<ref>
tag; no text was provided for refs nameddabney2020temporally
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedthompson1933likelihood
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedouyang2017learning
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedgopalan2015thompson
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedgeist2019theory
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedgao2020state
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedtang2021exploratory
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedgrau2018soft
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedmassoud2009regularized
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedfarahmand2008regularized
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedgajane2018sliding
- 106.0 106.1 Cite error: Invalid
<ref>
tag; no text was provided for refs namedcheung2019learning
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedmao2020model
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedtouati2020efficient
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedcheung2020reinforcement
- Cite error: Invalid
<ref>
tag; no text was provided for refs namedwei2021non