Deep Reinforcement Learning
In settings where the state and action spaces are very large, high dimensional or continuous, it is often challenging to apply the classical value-based and policy-based methods. In this context, parametrized value functions ([math]Q(s,a;\theta)[/math] and [math]V(s;\theta)[/math]) or policies ([math]\pi(s,a;\theta)[/math]) are more practical. Here we focus on neural-network-based parametrizations and functional approximations, which have the following advantages:
- Neural networks are well suited to dealing with high-dimensional inputs such as time series data, and, in practice, they do not require an exponential increase in data when extra dimensions are added to the state or action space.
- In addition, they can be trained incrementally, making use of additional samples as they are obtained during learning.
In this section, we will introduce deep RL methods in the setting of an infinite time horizon with discounting, finite state and action spaces ([math]|\mathcal{A}| \lt \infty[/math] and [math]|\mathcal{S}| \lt \infty[/math]), and stationary policies.
Neural Networks
A typical neural network is a nonlinear function, represented by a collection of neurons. These are typically arranged in a number of layers connected by operators such as filters, pooling layers and gates, that map an input variable in [math]\mathbb{R}^{n}[/math] to an output variable in [math]\mathbb{R}^{m}[/math] for some [math]n,m \in \mathbb{Z}^+[/math]. In this subsection, we introduce three popular neural network architectures: fully-connected neural networks, convolutional neural networks [1], and recurrent neural networks [2].
Fully-connected Neural Networks. A fully-connected neural network (FNN) is the simplest neural network architecture, in which any given neuron is connected to all neurons in the previous layer. To describe the setup, we fix the number of layers [math]L \in \mathbb{Z}^{+}[/math] in the neural network and the width [math]n_l \in \mathbb{Z}^+[/math] of the [math]l[/math]-th layer for [math]l=1,2,\ldots, L[/math]. Then for an input variable [math]\mathbf{z} \in \mathbb{R}^{n}[/math], the functional form of the FNN is

[math]F\big(\mathbf{z};(\mathbf{W},\mathbf{b})\big) = \mathbf{W}_L\,\sigma\Big(\cdots\sigma\big(\mathbf{W}_2\,\sigma(\mathbf{W}_1 \mathbf{z}+\mathbf{b}_1)+\mathbf{b}_2\big)\cdots+\mathbf{b}_{L-1}\Big)+\mathbf{b}_L,[/math]
in which [math](\mathbf{W},\mathbf{b})[/math] represents all the parameters in the neural network, with [math]\mathbf{W} = (\mathbf{W}_1,\mathbf{W}_2,\ldots,\mathbf{W}_L)[/math] and [math]\mathbf{b} = (\mathbf{b}_1,\mathbf{b}_2,\ldots,\mathbf{b}_L)[/math]. Here [math]\mathbf{W}_l \in \mathbb{R}^{n_l \times n_{l-1}}[/math] and [math]\mathbf{b}_l \in \mathbb{R}^{n_l \times 1}[/math] for [math]l=1,2,\ldots,L[/math], where [math]n_0 = n[/math] is the dimension of the input variable. The operator [math]\sigma(\cdot)[/math] takes a vector of any dimension as input and applies a function component-wise: for any [math]q\in \mathbb{Z}^+[/math] and any vector [math]\mathbf{u}=(u_1,u_2,\cdots,u_q)^{\top} \in \mathbb{R}^{q}[/math], we have

[math]\sigma(\mathbf{u}) = \big(\sigma(u_1),\sigma(u_2),\ldots,\sigma(u_q)\big)^{\top}.[/math]
In the neural networks literature, the [math]\mathbf{W}_l[/math]'s are often called the weight matrices, the [math]\mathbf{b}_l[/math]'s are called bias vectors, and [math]\sigma(\cdot)[/math] is referred to as the activation function. Several popular choices for the activation function include ReLU with [math]\sigma(u) = \max(u,0)[/math], Leaky ReLU with [math]\sigma(u) = a_1\,\max(u,0) - a_2\, \max(-u,0)[/math] where [math]a_1,a_2 \gt 0[/math], and smooth functions such as [math]\sigma(\cdot) = \tanh(\cdot)[/math]. In \eqref{eq:generator}, information propagates from the input layer to the output layer in a feed-forward manner in the sense that connections between the nodes do not form a cycle. Hence \eqref{eq:generator} is also referred to as fully-connected feed-forward neural network in the deep learning literature.
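As a concrete illustration of \eqref{eq:generator}, the forward pass of an FNN with ReLU hidden layers can be sketched in NumPy (an illustrative sketch; the dimensions and random weights below are arbitrary choices, not taken from the text):

```python
import numpy as np

def relu(u):
    # ReLU activation, applied component-wise
    return np.maximum(u, 0.0)

def fnn_forward(z, Ws, bs):
    """Forward pass of a fully-connected feed-forward network.
    Ws, bs: weight matrices W_l (n_l x n_{l-1}) and bias vectors b_l."""
    h = z
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = relu(W @ h + b)           # hidden layers: affine map + activation
    return Ws[-1] @ h + bs[-1]        # linear output layer

rng = np.random.default_rng(0)
n, n1, m = 4, 8, 2                    # input dim, hidden width, output dim
Ws = [rng.normal(size=(n1, n)), rng.normal(size=(m, n1))]
bs = [np.zeros(n1), np.zeros(m)]
out = fnn_forward(rng.normal(size=n), Ws, bs)
```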
Convolutional Neural Networks. In addition to the FNNs above, convolutional neural networks (CNNs) are another type of feed-forward neural network that are especially popular for image processing. In the finance setting, CNNs have been successfully applied to price prediction based on inputs which are images containing visualizations of price dynamics and trading volumes [3]. CNNs have two main building blocks: convolutional layers and pooling layers. Convolutional layers are used to capture local patterns in the images, while pooling layers are used to reduce the dimension of the problem and improve computational efficiency. Each convolutional layer uses a number of trainable filters to extract features from the data. We start with the simple case of a single filter [math]\pmb{H}[/math], which applies the convolution operation [math]\pmb{z}\ast \pmb{H}:\mathbb{R}^{n_x\times n_y}\times \mathbb{R}^{k_x\times k_y}\rightarrow \mathbb{R}^{(n_x-k_x+1)\times(n_y-k_y+1)}[/math] to the input [math]\pmb{z}[/math] through

[math]\widehat{z}_{i,j} = (\pmb{z}\ast \pmb{H})_{i,j} = \sum_{i^{\prime}=1}^{k_x}\sum_{j^{\prime}=1}^{k_y} \pmb{H}_{i^{\prime},j^{\prime}}\, \pmb{z}_{i+i^{\prime}-1,\,j+j^{\prime}-1}.[/math]
The output of the convolutional layer [math]\widehat{\pmb{z}}[/math] is then passed through the activation function [math]\sigma(\cdot)[/math], applied entry-wise, that is

[math]\sigma(\widehat{\pmb{z}})_{i,j} = \sigma(\widehat{z}_{i,j}).[/math]
The weights and bias introduced for FNNs can also be incorporated in this setting; thus, in summary, the above simple CNN can be represented by

[math]\sigma(\pmb{z}\ast \pmb{H} + \pmb{b}),[/math]

where the bias [math]\pmb{b}[/math] is added entry-wise.
This simple convolutional layer can be generalized to the case of multiple channels with multiple filters, where the input is a 3D tensor [math]\pmb{z}\in\mathbb{R}^{n_x\times n_y\times n_z}[/math] with [math]n_z[/math] the number of channels. Each channel represents a feature of the input; for example, an image as an input typically has three channels, red, green, and blue. Each filter is then also represented by a 3D tensor [math]\pmb{H}_k\in\mathbb{R}^{k_x\times k_y\times n_z}[/math] ([math]k=1,\ldots,K[/math]) with the same third dimension [math]n_z[/math] as the input, where [math]K[/math] is the total number of filters. In this case, the convolution operation becomes

[math](\pmb{z}\ast \pmb{H}_k)_{i,j} = \sum_{l=1}^{n_z}\sum_{i^{\prime}=1}^{k_x}\sum_{j^{\prime}=1}^{k_y} (\pmb{H}_k)_{i^{\prime},j^{\prime},l}\, \pmb{z}_{i+i^{\prime}-1,\,j+j^{\prime}-1,\,l},[/math]
and the output [math]\widehat{\pmb{z}}\in\mathbb{R}^{(n_x-k_x+1)\times(n_y-k_y+1)\times K}[/math] is given by

[math]\widehat{z}_{i,j,k} = \sigma\big((\pmb{z}\ast \pmb{H}_k)_{i,j}\big).[/math]
Pooling layers are used to aggregate the information and reduce the computational cost, typically after the convolutional layers. A commonly used pooling layer is the max pooling layer, which computes the maximum value of a small neighbourhood in the spatial coordinates. Note that in addition, fully-connected layers, as introduced above, are often used in the last few layers of a CNN.
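The convolution and max-pooling operations described above can be sketched in NumPy as follows (a minimal illustration with a hypothetical [math]4\times 4[/math] input and a single [math]2\times 2[/math] filter; real CNN libraries implement the same operations far more efficiently):

```python
import numpy as np

def conv2d(z, H):
    # "valid" convolution as in the text: output is (n_x-k_x+1) x (n_y-k_y+1)
    nx, ny = z.shape
    kx, ky = H.shape
    out = np.empty((nx - kx + 1, ny - ky + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(H * z[i:i + kx, j:j + ky])
    return out

def max_pool(z, p=2):
    # non-overlapping p x p max pooling (trailing rows/cols are trimmed)
    nx, ny = z.shape
    return z[:nx - nx % p, :ny - ny % p].reshape(nx // p, p, ny // p, p).max(axis=(1, 3))

z = np.arange(16.0).reshape(4, 4)     # toy single-channel "image"
H = np.ones((2, 2))                   # a 2x2 filter (unnormalized local sum)
feat = np.maximum(conv2d(z, H), 0)    # convolution followed by ReLU
pooled = max_pool(feat)               # aggregate over 2x2 neighbourhoods
```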
Recurrent Neural Networks. Recurrent neural networks (RNNs) are a family of neural networks that are widely used in processing sequential data, including speech, text and financial time series data. Unlike feed-forward neural networks, RNNs are a class of artificial neural networks where connections between units form a directed cycle, which allows them to use an internal memory to process arbitrary sequences of inputs. For an RNN, the input is a sequence of data [math]\pmb{z}_1,\pmb{z}_2,\ldots,\pmb{z}_T[/math]. An RNN models the internal state [math]\pmb{h}_t[/math] by the recursive relation

[math]\pmb{h}_t = F(\pmb{h}_{t-1},\pmb{z}_t;\theta),[/math]
where [math]F[/math] is a neural network with parameter [math]\theta[/math] (for example [math]\theta=(\pmb{W},\pmb{b})[/math]). The output is then given by

[math]\pmb{y}_t = G(\pmb{h}_t;\phi),[/math]
where [math]G[/math] is another neural network with parameter [math]\phi[/math] ([math]\phi[/math] can be the same as [math]\theta[/math]). There are two important variants of the vanilla RNN introduced above: the long short-term memory (LSTM) network and the gated recurrent unit (GRU). Compared to vanilla RNNs, LSTMs and GRUs have been shown to perform better on sequential data with long-term dependence, due to their flexibility in propagating information flows. Here we introduce the LSTM. Let [math]\odot[/math] denote element-wise multiplication. The LSTM network architecture is given by

[math]\begin{array}{rcl} \pmb{f}_t &=& \sigma_s\big(\pmb{U}^f \pmb{z}_t + \pmb{W}^f \pmb{h}_{t-1} + \pmb{b}^f\big),\\ \pmb{g}_t &=& \sigma_s\big(\pmb{U}^g \pmb{z}_t + \pmb{W}^g \pmb{h}_{t-1} + \pmb{b}^g\big),\\ \pmb{q}_t &=& \sigma_s\big(\pmb{U}^q \pmb{z}_t + \pmb{W}^q \pmb{h}_{t-1} + \pmb{b}^q\big),\\ \pmb{c}_t &=& \pmb{f}_t \odot \pmb{c}_{t-1} + \pmb{g}_t \odot \sigma\big(\pmb{U} \pmb{z}_t + \pmb{W} \pmb{h}_{t-1} + \pmb{b}\big),\\ \pmb{h}_t &=& \pmb{q}_t \odot \tanh(\pmb{c}_t), \end{array}[/math]

where [math]\sigma_s(\cdot)[/math] denotes the sigmoid function applied component-wise,
and [math]\pmb{U}, \pmb{U}^f, \pmb{U}^g, \pmb{U}^q, \pmb{W}, \pmb{W}^f, \pmb{W}^g, \pmb{W}^q[/math] are trainable weight matrices and [math]\pmb{b}, \pmb{b}^f, \pmb{b}^g, \pmb{b}^q[/math] are trainable bias vectors. In addition to the internal state [math]\pmb{h}_t[/math], the LSTM also uses three gates and a cell state [math]\pmb{c}_t[/math]. The forget gate [math]\pmb{f}_t[/math] and the input gate [math]\pmb{g}_t[/math] determine how much information is transmitted from the previous cell state [math]\pmb{c}_{t-1}[/math] and from the update [math]\sigma(\pmb{U}\pmb{z}_t+\pmb{W}\pmb{h}_{t-1}+\pmb{b})[/math] to the current cell state [math]\pmb{c}_t[/math]. The output gate [math]\pmb{q}_t[/math] controls how much [math]\pmb{c}_t[/math] reveals to [math]\pmb{h}_t[/math]. For more details see [2] and [4].
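A single LSTM step following the gate equations above can be sketched in NumPy (an illustrative sketch: sigmoid gates and a tanh cell nonlinearity are the common choices, and the dimensions and random weights are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(z_t, h_prev, c_prev, p):
    """One LSTM step; p holds the trainable weights U*, W* and biases b*."""
    f = sigmoid(p["Uf"] @ z_t + p["Wf"] @ h_prev + p["bf"])  # forget gate
    g = sigmoid(p["Ug"] @ z_t + p["Wg"] @ h_prev + p["bg"])  # input gate
    q = sigmoid(p["Uq"] @ z_t + p["Wq"] @ h_prev + p["bq"])  # output gate
    # cell state mixes the retained past and the new candidate update
    c = f * c_prev + g * np.tanh(p["U"] @ z_t + p["W"] @ h_prev + p["b"])
    h = q * np.tanh(c)                 # internal state revealed by output gate
    return h, c

rng = np.random.default_rng(1)
d, dh = 3, 5                           # input and hidden dimensions
p = {}
for name in ["Uf", "Ug", "Uq", "U"]:
    p[name] = rng.normal(size=(dh, d))
for name in ["Wf", "Wg", "Wq", "W"]:
    p[name] = rng.normal(size=(dh, dh))
for name in ["bf", "bg", "bq", "b"]:
    p[name] = np.zeros(dh)

h, c = np.zeros(dh), np.zeros(dh)
for t in range(4):                     # process a short input sequence
    h, c = lstm_step(rng.normal(size=d), h, c, p)
```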
Training of Neural Networks. Mini-batch stochastic gradient descent is a popular choice for training neural networks due to its sample and computational efficiency. In this approach, the parameters [math]\theta=(\pmb{W},\pmb{b})[/math] are updated in the descent direction of an objective function [math]\mathcal{L}(\theta)[/math] by selecting a mini-batch of samples at random to estimate the gradient [math]\nabla_{\theta}\mathcal{L}(\theta)[/math] of the objective function with respect to the parameter [math]\theta[/math]. For example, in supervised learning we aim to learn the relationship between an input [math]X[/math] and an output [math]Y[/math], and the objective function [math]\mathcal{L}(\theta)[/math] measures the distance between the model prediction and the actual observation [math]Y[/math]. Assume that the dataset contains [math]M[/math] samples of [math](X,Y)[/math]. Then the mini-batch stochastic gradient descent method is given by

[math]\theta^{(n+1)} = \theta^{(n)} - \beta\, \widehat{\nabla_{\theta}\mathcal{L}(\theta^{(n)})},[/math]
where [math]\beta[/math] is a constant learning rate and [math]\widehat{\nabla_{\theta}\mathcal{L}(\theta^{(n)})}[/math] is an estimate of the true gradient [math]\nabla_{\theta}\mathcal{L}(\theta^{(n)})[/math], computed by averaging over [math]m[/math] ([math]m\ll M[/math]) samples of [math](X,Y)[/math]. The method is called (vanilla) stochastic gradient descent when [math]m=1[/math], and (traditional) gradient descent when [math]m=M[/math]. Compared with gradient descent, mini-batch stochastic gradient descent is noisy but computationally efficient since it only uses a subset of the data, which is advantageous when dealing with large datasets. It is also worth noting that, in addition to the standard mini-batch stochastic gradient descent method \eqref{eqn:mini_batch_gd}, momentum methods are popular extensions which take past gradient updates into account in order to accelerate the learning process. We give the updating rules for two examples of gradient descent methods with momentum, the standard momentum and the Nesterov momentum [5]:

[math]\text{(standard momentum)}\qquad \pmb{v}^{(n+1)} = \alpha\, \pmb{v}^{(n)} - \beta\, \widehat{\nabla_{\theta}\mathcal{L}(\theta^{(n)})},\qquad \theta^{(n+1)} = \theta^{(n)} + \pmb{v}^{(n+1)};[/math]

[math]\text{(Nesterov momentum)}\qquad \pmb{v}^{(n+1)} = \alpha\, \pmb{v}^{(n)} - \beta\, \widehat{\nabla_{\theta}\mathcal{L}\big(\theta^{(n)} + \alpha\, \pmb{v}^{(n)}\big)},\qquad \theta^{(n+1)} = \theta^{(n)} + \pmb{v}^{(n+1)},[/math]
where [math]\alpha[/math] and [math]\beta[/math] are constant learning rates. Such methods are particularly useful when the algorithms enter into a region where the gradient changes dramatically and thus the learned parameters can bounce around the region which slows down the progress of the search. Additionally, there are many other variants such as RMSprop [6] and ADAM [7], both of which employ adaptive learning rates.
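A minimal sketch of mini-batch stochastic gradient descent with standard (heavy-ball) momentum, applied to a toy linear regression problem (all data, dimensions and hyperparameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 1000                               # dataset size
X = rng.normal(size=(M, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.01 * rng.normal(size=M)

def grad_estimate(theta, m=32):
    # average gradient of the squared loss over a random mini-batch of size m
    idx = rng.integers(0, M, size=m)
    Xb, Yb = X[idx], Y[idx]
    return 2.0 * Xb.T @ (Xb @ theta - Yb) / m

theta = np.zeros(3)
v = np.zeros(3)
beta, alpha = 0.05, 0.9                # step size and momentum coefficient
for n in range(500):
    v = alpha * v - beta * grad_estimate(theta)   # standard momentum update
    theta = theta + v
```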
Deep Value-based RL Algorithms
In this section, we introduce several [math]Q[/math]-learning algorithms with neural network approximations. We refer the reader to [11] for other deep value-based RL algorithms such as neural TD learning and dueling [math]Q[/math]-networks.
Neural Fitted [math]Q[/math]-learning.
Fitted [math]Q[/math]-learning [12] is a generalization of the classical [math]Q[/math]-learning algorithm to functional approximations. It is applied in an off-line setting with a pre-collected dataset of tuples [math](s, a, r, s')[/math], where [math]s'\sim P(s,a)[/math] and the reward [math]r = r(s,a)[/math] may be random. When the class of approximating functions is constrained to neural networks, the algorithm is referred to as Neural Fitted [math]Q[/math]-learning, and the [math]Q[/math]-function is parameterized by [math]Q(s,a;\theta) = F((s,a);\theta)[/math], with [math]F[/math] a neural network parameterized by [math]\theta[/math] [13]. For example, [math]F[/math] can be taken to be \eqref{eq:generator} with [math]\theta = (\pmb{W},\pmb{b})[/math].
In Neural Fitted [math]Q[/math]-learning, the algorithm starts from a random initialization of the [math]Q[/math]-values [math]Q(s, a; \theta^{(0)})[/math], where [math]\theta^{(0)}[/math] denotes the initial parameters. Then, the approximation of the [math]Q[/math]-values at the [math]n[/math]-th iteration, [math]Q(s,a;\theta^{(n)})[/math], is updated towards the target value

[math]Y_n^Q = r + \gamma \max_{a^{\prime}\in\mathcal{A}} Q\big(s^{\prime}, a^{\prime}; \theta^{(n)}\big),[/math]
where [math]\theta^{(n)}[/math] denotes the neural network parameters at the [math]n[/math]-th iteration, updated by stochastic gradient descent (or a variant) to minimize the square loss

[math]\mathcal{L}\big(\theta^{(n)}\big) = \Big(Q\big(s,a;\theta^{(n)}\big) - Y_n^Q\Big)^2.[/math]
Thus, the [math]Q[/math]-learning update amounts to updating the parameters as

[math]\theta^{(n+1)} = \theta^{(n)} + \beta\Big(Y_n^Q - Q\big(s,a;\theta^{(n)}\big)\Big)\nabla_{\theta} Q\big(s,a;\theta^{(n)}\big),[/math]
where [math]\beta[/math] is a learning rate. This update resembles stochastic gradient descent, updating the current value [math]Q(s, a; \theta^{(n)})[/math] towards the target value [math]Y_n^Q[/math]. When neural networks are applied to approximate the [math]Q[/math]-function, it has been empirically observed that Neural Fitted [math]Q[/math]-learning may suffer from slow convergence or even divergence [13]. In addition, the approximated [math]Q[/math]-values tend to be overestimated due to the max operator [14].
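The update above can be illustrated with a linear-in-features [math]Q[/math]-function standing in for the neural network, on a toy deterministic MDP (an illustrative sketch; the MDP, features and hyperparameters are arbitrary choices, and with one-hot features the gradient of [math]Q[/math] is just the feature vector):

```python
import numpy as np

nS, nA, gamma, beta = 5, 2, 0.9, 0.5

def features(s, a):
    # one-hot features for the (s, a) pair; a linear-in-features Q-function
    # stands in for the neural network F((s, a); theta)
    x = np.zeros(nS * nA)
    x[s * nA + a] = 1.0
    return x

theta = np.zeros(nS * nA)
Q = lambda s, a: features(s, a) @ theta

# toy deterministic MDP: action 0 pays reward 1, action 1 pays 0;
# both actions move the state from s to (s + 1) % nS
batch = [(s, a, 1.0 if a == 0 else 0.0, (s + 1) % nS)
         for s in range(nS) for a in range(nA)]

for _ in range(300):
    for (s, a, r, s_next) in batch:
        target = r + gamma * max(Q(s_next, b) for b in range(nA))  # Y_n^Q
        # gradient step on (Q(s,a;theta) - Y_n^Q)^2; grad_theta Q = features(s,a)
        theta += beta * (target - Q(s, a)) * features(s, a)
```

Here the true values are [math]Q(s,0)=1/(1-\gamma)=10[/math] and [math]Q(s,1)=\gamma/(1-\gamma)=9[/math] for every state, which the iteration recovers.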
Deep [math]Q[/math]-Network (DQN). To overcome the instability and the risk of overestimation mentioned above, [15] proposed the Deep [math]Q[/math]-Network (DQN) algorithm in an online setting, based on two novel ideas: a slowly-updated target network and the use of ‘experience replay’. Both ideas dramatically improve the empirical performance of the algorithm, and DQN has been shown to perform strongly on a variety of ATARI games [16]. We first discuss the use of experience replay [17], in which we introduce a replay memory [math]\mathcal{B}[/math]. At each time [math]t[/math], the tuple [math](s_t, a_t, r_t,s_{t+1})[/math] is stored in [math]\mathcal{B}[/math], and a mini-batch of [math]B[/math] independent samples is randomly selected from [math]\mathcal{B}[/math] to train the neural network via stochastic gradient descent. Since the trajectory of an MDP has strong temporal correlation, the goal of experience replay is to obtain uncorrelated samples that are closer to the i.i.d. data often assumed in optimization algorithms; this gives more accurate gradient estimates for the stochastic optimization problem and better convergence behavior. In practice, the replay memory is usually very large; for example, the replay memory size is [math]10^6[/math] in [15]. Moreover, DQN uses the [math]\varepsilon[/math]-greedy policy, which enables exploration over the state-action space [math]\mathcal{S}\times \mathcal{A}[/math]. Thus, when the replay memory is large, experience replay is close to sampling independent transitions from an explorative policy, which reduces the variance of the gradient estimate.
Thus, experience replay stabilizes the training of DQN and improves its computational efficiency. We now discuss the use of a target network [math]Q(\cdot,\cdot;{\theta}^{-})[/math] with parameter [math]{\theta}^{-}[/math], a periodically updated copy of the current parameter estimate. Given independent samples [math]\{(s_{(i)}, a_{(i)}, r_{(i)}, s_{(i)}^{\prime})\}_{i=0}^{B}[/math] from the replay memory (we write [math]s_{(i)}^{\prime}[/math] rather than [math]s_{(i+1)}[/math] for the state after [math](s_{(i)},a_{(i)})[/math] to avoid notational confusion with the next independent sample [math]s_{(i+1)}[/math] in the state space), to update the parameter [math]\theta[/math] of the [math]Q[/math]-network we compute the target

[math]\widetilde{Y}_i^Q = r_{(i)} + \gamma \max_{a^{\prime}\in\mathcal{A}} Q\big(s_{(i)}^{\prime}, a^{\prime}; \theta^{(n)-}\big),[/math]
and update [math]\theta[/math] using the gradient of

[math]\mathcal{L}_{\rm DQN}\big(\theta^{(n)}\big)=\frac{1}{B}\sum_{i=1}^B \Big(\widetilde{Y}_i^Q - Q\big(s_{(i)},a_{(i)};\theta^{(n)}\big)\Big)^2.[/math]
Meanwhile, the parameter [math]\theta^{(n)-}[/math] is updated once every [math]T_{\rm target}[/math] steps: [math]\theta^{(n)-} = \theta^{(n)}[/math] if [math]n = m T_{\rm target}[/math] for some [math]m\in \mathbb{Z}^{+}[/math], and [math]\theta^{(n)-} = \theta^{(n-1)-}[/math] otherwise. That is, the target network is held fixed for [math]T_{\rm target}[/math] steps and then updated using the current weights of the [math]Q[/math]-network. The introduction of a target network prevents the rapid propagation of instabilities and reduces the risk of divergence, as the target values are kept fixed for [math]T_{\rm target}[/math] iterations. The idea of target networks can be seen as an instantiation of Fitted [math]Q[/math]-learning, where each period between target network updates corresponds to a single Fitted [math]Q[/math]-iteration.
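The two ingredients of DQN, experience replay and a target network, can be sketched as follows (an illustrative sketch; a small table stands in for the [math]Q[/math]-network, and all sizes and the toy transitions are arbitrary):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size experience replay memory storing (s, a, r, s') tuples."""
    def __init__(self, capacity=10**6):
        self.buf = deque(maxlen=capacity)

    def add(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation of the trajectory
        return random.sample(self.buf, batch_size)

def dqn_targets(batch, q_target, gamma=0.99):
    # targets computed with the *target* network parameters theta^-
    return np.array([r + gamma * np.max(q_target(s_next))
                     for (_s, _a, r, s_next) in batch])

# hypothetical tiny "network": a table of Q-values per state, for illustration
theta = np.zeros((4, 2))               # online parameters
theta_minus = theta.copy()             # target parameters, synced every T_target steps
q_target = lambda s: theta_minus[s]

rng = np.random.default_rng(0)
memory = ReplayBuffer()
for t in range(100):                   # fill the buffer from a toy trajectory
    s = t % 4
    memory.add((s, int(rng.integers(2)), 1.0, (s + 1) % 4))

batch = memory.sample(32)
Y = dqn_targets(batch, q_target)       # all targets are r + gamma * 0 here
```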
Double Deep [math]Q[/math]-Network (Double DQN). The max operator in Neural Fitted [math]Q[/math]-learning and DQN, in \eqref{eq:Y_update} and \eqref{eq:Y_update2}, uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overly optimistic value estimates. To prevent this, Double [math]Q[/math]-learning [18] decouples the selection from the evaluation; this idea was further extended to the neural network setting [14]. In double [math]Q[/math]-learning [18] and the double deep [math]Q[/math]-network [14], two value functions are learned by assigning experiences at random to update one of the two, resulting in two sets of weights, [math]\theta[/math] and [math]\eta[/math]. For each update, one set of weights is used to determine the greedy policy and the other to determine its value. For a clear comparison, we can untangle the selection and evaluation in Neural Fitted [math]Q[/math]-learning and rewrite its target \eqref{eq:Y_update} as

[math]Y_n^Q = r + \gamma\, Q\Big(s^{\prime}, \operatorname*{argmax}_{a^{\prime}\in\mathcal{A}} Q\big(s^{\prime},a^{\prime};\theta^{(n)}\big); \theta^{(n)}\Big).[/math]
The target of Double (deep) [math]Q[/math]-learning can then be written as

[math]Y_n^{\rm DoubleQ} = r + \gamma\, Q\Big(s^{\prime}, \operatorname*{argmax}_{a^{\prime}\in\mathcal{A}} Q\big(s^{\prime},a^{\prime};\theta^{(n)}\big); \eta^{(n)}\Big).[/math]
Notice that the selection of the action, in the argmax, is still due to the online weights [math]\theta^{(n)}[/math]. This means that, as in [math]Q[/math]-learning, we are still estimating the value of the greedy policy according to the current values, as defined by [math]\theta^{(n)}[/math]. However, we use the second set of weights [math]\eta^{(n)}[/math] to fairly compute the value of this policy. This second set of weights can be updated symmetrically by switching the roles of [math]\theta[/math] and [math]\eta[/math].
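The overestimation effect, and how decoupling selection from evaluation removes it, can be illustrated numerically (a toy setting with two independent noisy estimates of a [math]Q[/math]-function whose true values are all zero; all sizes are arbitrary):

```python
import numpy as np

def dqn_target(r, s_next, Q, theta, gamma=0.99):
    # selection and evaluation both use the same weights theta
    return r + gamma * np.max(Q(s_next, theta))

def double_dqn_target(r, s_next, Q, theta, eta, gamma=0.99):
    # select the greedy action with theta, evaluate it with eta
    a_star = int(np.argmax(Q(s_next, theta)))
    return r + gamma * Q(s_next, eta)[a_star]

rng = np.random.default_rng(0)
Q = lambda s, w: w[s]                  # "network" = lookup into a noise table
single, double = [], []
for _ in range(2000):
    theta = rng.normal(size=(1, 10))   # noisy estimate 1 (true values are 0)
    eta = rng.normal(size=(1, 10))     # independent noisy estimate 2
    single.append(dqn_target(0.0, 0, Q, theta))
    double.append(double_dqn_target(0.0, 0, Q, theta, eta))

bias_single = np.mean(single)          # positive: gamma * E[max of 10 N(0,1)]
bias_double = np.mean(double)          # approximately zero
```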
Convergence Guarantee.
For DQN, [19] characterized the approximation error of the [math]Q[/math]-function by the sum of a statistical error and an algorithmic error, and the latter decays to zero at a geometric rate as the algorithm proceeds. The statistical error characterizes the bias and variance arising from the [math]Q[/math]-function approximation using the neural network. [10] parametrized the [math]Q[/math]-function by a two-layer neural network and provided a mean-squared sample complexity with sublinear convergence rate for neural TD learning.
The two-layer network in [10] with width [math]m[/math] is given by

[math]Q\big((s,a); \pmb{W}\big) = \frac{1}{\sqrt{m}} \sum_{k=1}^{m} c_k\, \sigma\big(\pmb{W}_k^{\top}(s,a)\big),[/math]
where the activation function [math]\sigma(\cdot)[/math] is the ReLU function, and the parameter [math]\pmb{c}=(c_1,\ldots,c_m)[/math] is fixed at its initial value during training, so that only the weights [math]\pmb{W}[/math] are updated. [20] considered a more challenging setting than [10], in which the input data are non-i.i.d. and the neural network has multiple (more than two) layers, and obtained the same sublinear convergence rate. Furthermore, [21] also employed the two-layer neural network in \eqref{eqn:two_layer_NN} and proved that two algorithms used in practice, projection-free and max-norm regularized [22] neural TD, achieve mean-squared sample complexities of [math]\widetilde{\mathcal{O}}(1/\varepsilon^6)[/math] and [math]\widetilde{\mathcal{O}}(1/\varepsilon^4)[/math], respectively.
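The two-layer parametrization \eqref{eqn:two_layer_NN} with [math]\pmb{c}[/math] fixed at initialization can be sketched in NumPy as follows (illustrative dimensions; the random signs for [math]\pmb{c}[/math] are a common initialization choice in this literature, not a requirement of the text):

```python
import numpy as np

def two_layer_q(x, W, c):
    # Q((s,a); W) = (1/sqrt(m)) * sum_k c_k * ReLU(W_k . x), with c held fixed
    m = W.shape[0]
    return (c * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 4, 64                           # dimension of the (s,a) input, width
W = rng.normal(size=(m, d))            # trainable weights
c = rng.choice([-1.0, 1.0], size=m)    # fixed at initialization during training

x = rng.normal(size=d)                 # a stacked (s, a) feature vector
q = two_layer_q(x, W, c)
```

Note that only `W` would receive gradient updates during training; the ReLU makes the map positively homogeneous in `W`.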
Deep Policy-based RL Algorithms
In this section we focus on deep policy-based methods, which extend policy-based methods using neural network approximations. We parameterize the policy [math]\pi[/math] by a neural network [math]F[/math] with parameter [math]\theta=(\pmb{W}, \pmb{b})[/math], that is, [math]a\sim\pi(s,a;\theta)=f(F((s,a);\theta))[/math] for some function [math]f[/math]. A popular choice of [math]f[/math] is given by

[math]\pi(s,a;\theta) = \frac{\exp\big(F((s,a);\theta)/\tau\big)}{\sum_{a^{\prime}\in\mathcal{A}}\exp\big(F((s,a^{\prime});\theta)/\tau\big)},[/math]
for some parameter [math]\tau[/math], which gives an energy-based policy (see, e.g., [25][26]).
The policy parameter [math]\theta[/math] is updated using the gradient ascent rule

[math]\theta^{(n+1)} = \theta^{(n)} + \beta\, \widehat{\nabla_{\theta}J(\theta^{(n)})},[/math]

with [math]\beta[/math] a constant learning rate,
where [math]\widehat{\nabla_{\theta}J(\theta^{(n)})}[/math] is an estimate of the policy gradient.
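For the energy-based (softmax) policy above, the score-function gradient [math]\nabla\log\pi[/math] has a simple closed form with respect to the scores, which a REINFORCE-style ascent step can use (an illustrative sketch; the scores stand in for the network outputs [math]F((s,a);\theta)[/math], and the reward and step size are arbitrary):

```python
import numpy as np

def softmax_policy(scores, tau=1.0):
    # pi(a|s) proportional to exp(F(s,a)/tau)
    z = scores / tau
    z = z - z.max()                    # numerical stabilization
    p = np.exp(z)
    return p / p.sum()

def log_policy_grad(scores, a, tau=1.0):
    # gradient of log pi(a|s) with respect to the scores F(s, .):
    # (indicator(b == a) - pi(b|s)) / tau
    p = softmax_policy(scores, tau)
    g = -p / tau
    g[a] += 1.0 / tau
    return g

scores = np.array([1.0, 0.0, -1.0])    # hypothetical network outputs
pi = softmax_policy(scores, tau=0.5)

# one REINFORCE-style ascent step on the scores, weighted by the reward
reward, a, beta = 2.0, 0, 0.1
scores = scores + beta * reward * log_policy_grad(scores, a, tau=0.5)
```

A positive reward for action 0 pushes probability mass towards that action.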
In deep/neural Actor-Critic methods, the value function or [math]Q[/math]-function involved in the policy gradient estimation is also represented by a neural network with weights [math]\pmb{U}[/math] and bias [math]\pmb{d}[/math], i.e. [math]V(s;\phi)[/math] or [math]Q(s,a;\phi)[/math] with [math]\phi=(\pmb{U},\pmb{d})[/math]. Using neural networks to parametrize the policy and/or the value functions in the vanilla versions of policy-based methods leads to neural Actor-Critic algorithms [26], neural PPO/TRPO [27], and the deep DPG (DDPG) algorithm [28]. Another popular deep policy-based method is the asynchronous advantage Actor-Critic (A3C) [29], in which multiple actors are trained in parallel to decorrelate the agents' data and speed up training. In addition, since introducing an entropy term in the objective function encourages policy exploration [25] and speeds up the learning process [30][31], there have been recent developments in (off-policy) soft Actor-Critic algorithms [30][32] using neural networks, which solve the RL problem with entropy regularization. Below we introduce the DDPG algorithm, one of the most popular deep policy-based methods, which has been applied to many financial problems.
DDPG. DDPG is a model-free off-policy Actor-Critic algorithm, first introduced in [28], which combines the DQN and DPG algorithms. Since its structure is more complex than that of DQN and DPG, we provide the pseudocode for DDPG in Algorithm. DDPG extends DQN to continuous action spaces by incorporating DPG to learn a deterministic policy. To encourage exploration, DDPG uses the action

[math]a_t = \pi^{D}(s_t;\theta) + \epsilon,[/math]
where [math]\pi^D[/math] is a deterministic policy and [math]\epsilon[/math] is a random variable sampled from some distribution [math]\mathcal{N}[/math], which can be chosen according to the environment. Note that the algorithm requires a small learning rate [math]\bar{\beta}\ll 1[/math] (see line 14 in Algorithm) to improve the stability of learning the target networks. As in DQN, the DDPG algorithm also uses a replay buffer to improve the performance of neural networks.
The critic parameter [math]\phi[/math] is updated by minimizing the loss

[math]\mathcal{L}_{\rm DDPG}(\phi^{(n)})=\frac{1}{B}\sum_{i=1}^B \big(Y_i-Q(s_i,a_i;\phi^{(n)})\big)^2,[/math]

where the targets [math]Y_i[/math] are computed using the target networks.
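The DDPG-specific ingredients, exploration noise added to a deterministic actor and slowly-tracking target networks, can be sketched as follows (an illustrative sketch; the soft update [math]\theta^{-}\leftarrow \bar{\beta}\theta + (1-\bar{\beta})\theta^{-}[/math] with small [math]\bar{\beta}[/math] is the variant used in [28], and the toy actor and sizes are arbitrary):

```python
import numpy as np

beta_bar = 0.005                       # small learning rate beta-bar << 1

def soft_update(target_params, online_params, tau=beta_bar):
    # slowly track the online network: theta^- <- tau*theta + (1-tau)*theta^-
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

def exploration_action(pi_det, s, sigma=0.1, rng=np.random.default_rng(0)):
    # deterministic policy plus sampled noise epsilon
    return pi_det(s) + sigma * rng.normal()

pi_det = lambda s: np.tanh(0.5 * s)    # hypothetical deterministic actor
a = exploration_action(pi_det, 1.0)

# the target parameters drift slowly towards the online parameters
target = [np.zeros(3)]
online = [np.ones(3)]
for _ in range(1000):
    target = soft_update(target, online)
```

With [math]\bar{\beta}=0.005[/math], after 1000 updates the target parameters have moved most, but not all, of the way to the online values.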
Convergence Guarantee. By parameterizing the policy and/or value functions using a two-layer neural network given in \eqref{eqn:two_layer_NN}, [27] provided a mean-squared sample complexity for neural PPO and TRPO algorithms with sublinear convergence rate; [26] studied neural Actor-Critic methods where the actor updates using (1) vanilla policy gradient or (2) natural policy gradient, and in both cases the critic updates using TD(0). They proved that in case (1) the algorithm converges to a stationary point at a sublinear rate and they also established the global optimality of all stationary points under mild regularity conditions. In case (2) the algorithm was proved to achieve a mean-squared sample complexity with sublinear convergence rate. To the best of our knowledge, no convergence guarantee has been established for the DDPG (and DPG) algorithms.
General references
Hambly, Ben; Xu, Renyuan; Yang, Huining (2023). "Recent Advances in Reinforcement Learning in Finance". arXiv:2112.04553 [q-fin.MF].
References
1. LeCun, Y. and Bengio, Y. (1995). Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks. MIT Press.
2. Sutskever, I., Martens, J. and Hinton, G. (2011). Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML).
3. Jiang, J., Kelly, B. and Xiu, D. (2020). (Re-)Imag(in)ing price trends. Working paper, Chicago Booth.
4. Fan, J., Ma, C. and Zhong, Y. (2019). A selective overview of deep learning. arXiv:1904.05526.
5. Nesterov, Y. (1983). A method for solving the convex programming problem with convergence rate O(1/k^2). Doklady Akademii Nauk SSSR, 269, 543-547.
6. Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
7. Kingma, D.P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
8. Arulkumaran, K., Deisenroth, M.P., Brundage, M. and Bharath, A.A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26-38.
9. Fan, J., Wang, Z., Xie, Y. and Yang, Z. (2020). A theoretical analysis of deep Q-learning. In Learning for Dynamics and Control (L4DC), PMLR.
10. Cai, Q., Yang, Z., Lee, J.D. and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems 32.
11. Francois-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G. and Pineau, J. (2018). An introduction to deep reinforcement learning. Foundations and Trends in Machine Learning, 11(3-4), 219-354.
12. Gordon, G.J. (1996). Stable fitted reinforcement learning. In Advances in Neural Information Processing Systems 8.
13. Riedmiller, M. (2005). Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning (ECML).
14. Van Hasselt, H., Guez, A. and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.
15. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
16. Bellemare, M.G., Naddaf, Y., Veness, J. and Bowling, M. (2013). The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253-279.
17. Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8, 293-321.
18. Van Hasselt, H. (2010). Double Q-learning. In Advances in Neural Information Processing Systems 23.
19. Fan, J., Wang, Z., Xie, Y. and Yang, Z. (2020). A theoretical analysis of deep Q-learning. In Learning for Dynamics and Control (L4DC), PMLR.
20. Xu, P. and Gu, Q. (2020). A finite-time analysis of Q-learning with neural network function approximation. In Proceedings of the 37th International Conference on Machine Learning (ICML).
21. Cayci, S., Satpathi, S., He, N. and Srikant, R. (2021). Sample complexity and overparameterization bounds for temporal-difference learning with neural network approximation. arXiv:2103.01391.
22. Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A. and Bengio, Y. (2013). Maxout networks. In Proceedings of the 30th International Conference on Machine Learning (ICML).
23. Cuccu, G. and Gomez, F. (2011). When novelty is not enough. In Applications of Evolutionary Computation (EvoApplications).
24. Gomez, F.J. and Schmidhuber, J. (2005). Co-evolving recurrent neurons learn deep memory POMDPs. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO).
25. Haarnoja, T., Tang, H., Abbeel, P. and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning (ICML).
26. Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv:1909.01150.
27. Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems 32.
28. Lillicrap, T.P., Hunt, J.J., Pritzel, A., et al. (2015). Continuous control with deep reinforcement learning. arXiv:1509.02971.
29. Mnih, V., Badia, A.P., Mirza, M., et al. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML).
30. Haarnoja, T., Zhou, A., Abbeel, P. and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML).
31. Mei, J., Xiao, C., Szepesvari, C. and Schuurmans, D. (2020). On the global convergence rates of softmax policy gradient methods. In Proceedings of the 37th International Conference on Machine Learning (ICML).
32. Haarnoja, T., Zhou, A., Hartikainen, K., et al. (2018). Soft actor-critic algorithms and applications. arXiv:1812.05905.