The availability of data from electronic markets and the recent developments in RL have together led to a rapidly growing body of work applying RL algorithms to decision-making problems in electronic markets. Examples include optimal execution, portfolio optimization, option pricing and hedging, market making, order routing, and robo-advising.
In this section, we start with a brief overview of electronic markets and some discussion of market microstructure in [[#sec:electronic_market |Section]]. We then introduce several applications of RL in finance. In particular, optimal execution for a single asset is introduced in [[#sec:optimal_execution |Section]] and portfolio optimization problems across multiple assets are discussed in [[#sec:portfolio_optimization |Section]]. This is followed by sections on option pricing, robo-advising, and smart order routing. In each case we introduce the underlying problem and basic model before looking at recent RL approaches used to tackle it.
It is worth noting that there are some open-source projects that provide full pipelines for implementing different RL algorithms in financial applications <ref name="liu2021finrl">X.-Y. Liu, H. Yang, J. Gao, and C. D. Wang, ''FinRL: Deep reinforcement learning framework to automate trading in quantitative finance'', in Proceedings of the Second ACM International Conference on AI in Finance, 2021, pp. 1--9.</ref><ref name="liu2022finrl">X.-Y. Liu, Z. Xia, J. Rui, J. Gao, H. Yang, M. Zhu, C. D. Wang, Z. Wang, and J. Guo, ''FinRL-Meta: Market environments and benchmarks for data-driven financial reinforcement learning'', arXiv preprint arXiv:2211.03107, (2022).</ref>.
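As a rough illustration of what such a pipeline wraps together, the sketch below shows a generic agent-environment training loop in Python. The <code>env</code> and <code>agent</code> objects and their methods are hypothetical placeholders and do not correspond to the actual APIs of the projects cited above.

<syntaxhighlight lang="python">
# A schematic RL training loop for a trading task. The environment and agent
# objects and their methods are hypothetical placeholders; they do not
# reflect the actual API of FinRL or any other library.

def train(env, agent, num_episodes: int = 100):
    for episode in range(num_episodes):
        state = env.reset()                              # e.g. prices, inventory, cash
        done, total_reward = False, 0.0
        while not done:
            action = agent.act(state)                    # e.g. order size to submit
            next_state, reward, done = env.step(action)  # e.g. PnL increment
            agent.update(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
        print(f"episode {episode}: total reward {total_reward:.2f}")
</syntaxhighlight>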
===<span id="sec:electronic_market"></span>Preliminary: Electronic Markets and Market Microstructure===
Many recent decision-making problems in finance are centered around electronic markets. We give a brief overview of this type of market and discuss two popular examples -- central limit order books and electronic over-the-counter markets. For a more in-depth discussion of electronic markets and market microstructure, we refer the reader to the books <ref name="cartea2015algorithmic">Á. Cartea, S. Jaimungal, and J. Penalva, ''Algorithmic and high-frequency trading'', Cambridge University Press, 2015.</ref> and <ref name="lehalle2018market">C.-A. Lehalle and S. Laruelle, ''Market microstructure in practice'', World Scientific, 2018.</ref>.


'''Electronic Markets.''' Electronic markets have emerged as popular venues for the trading of a wide variety of financial assets. Stock exchanges in many countries, including Canada, Germany, Israel, and the United Kingdom, have adopted electronic platforms to trade equities, as has Euronext, the market combining several former European stock exchanges. In the United States, electronic communications networks (ECNs) such as Island, Instinet, and Archipelago (now ArcaEx) use an electronic order book structure to trade as much as 45% of the volume in NASDAQ stocks, and electronic platforms also handle a growing share of the trading of currencies. Eurex, the electronic Swiss-German exchange, is now the world's largest futures market, while options have been traded in electronic markets since the opening of the International Securities Exchange in 2000. Many such electronic markets are organized as electronic Limit Order Books (LOBs). In this structure, there is no designated liquidity provider such as a specialist or a dealer. Instead, liquidity arises endogenously from the submitted orders of traders. Traders who submit orders to buy or sell the asset at a particular price are said to “make” liquidity, while traders whose orders execute against existing orders are said to “take” liquidity.
[[File:guide_b5002_lob_MSFT.png | 400px | thumb | A snapshot of the LOB of MSFT (Microsoft) stock at 9:30:08.026 am on 21 June 2012 with ten levels of bid/ask prices. ]]
</div>
A matching engine is used to match the incoming buy and sell orders. This typically follows the price-time priority rule <ref name="preis2011price">T. Preis, ''Price-time priority and pro rata matching in an order book model of financial markets'', in Econophysics of Order-driven Markets, Springer, 2011, pp. 65--72.</ref>, whereby orders are first ranked according to their price. Multiple orders having the same price are then ranked according to the time they were entered. If the price and time are the same for the incoming orders, then the larger order gets executed first. The matching engine uses the LOB to store pending orders that could not be executed upon arrival.
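To make the ranking rule concrete, the following minimal Python sketch ranks resting orders by price, then arrival time, then size, and matches an incoming order against them. The <code>Order</code> type and its field names are hypothetical and not taken from any real matching engine.

<syntaxhighlight lang="python">
from dataclasses import dataclass

# Toy order type for illustration only; field names are hypothetical.
@dataclass
class Order:
    side: str      # "buy" or "sell"
    price: float
    size: int
    time: float    # arrival time

def priority_key(o: Order):
    # Price first (higher is better for bids, lower for asks),
    # then earlier arrival time, then larger size, as described above.
    price_rank = -o.price if o.side == "buy" else o.price
    return (price_rank, o.time, -o.size)

# Resting bids in a toy book, ranked by price-time(-size) priority.
bids = sorted([
    Order("buy", 100.0, 200, time=1.0),
    Order("buy", 100.5, 100, time=2.0),
    Order("buy", 100.5, 300, time=2.0),  # same price and time: larger size first
], key=priority_key)

# Match an incoming sell order for 350 shares against the book.
remaining = 350
for resting in bids:
    if remaining == 0:
        break
    fill = min(remaining, resting.size)
    resting.size -= fill
    remaining -= fill
    print(f"filled {fill} @ {resting.price}")
</syntaxhighlight>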




The client progressively receives the answers to the RFQ and can deal at any time with the dealer who has proposed the best price, or decide not to trade. Each dealer knows whether a deal was done (either with them or with another dealer, whose identity is not disclosed) or not. If a transaction occurred, the best dealer usually knows the cover price (the second best bid price in the RFQ), if there is one. We refer the reader to <ref name="fermanian2015agents">J.-D. Fermanian, O. Guéant, and A. Rachez, ''Agents' Behavior on Multi-Dealer-to-Client Bond Trading Platforms'', CREST, Center for Research in Economics and Statistics, 2015.</ref> for a more in-depth discussion of MD2C bond trading platforms.
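The client's decision at the end of an RFQ can be summarised in a few lines of Python; the quote data and the reservation price below are illustrative assumptions and not part of any platform specification.

<syntaxhighlight lang="python">
# A client asks several dealers for a price to BUY a bond (lower quotes are
# better) and either deals with the best dealer or declines. The numbers and
# the reservation price are made up for illustration.

quotes = {"dealer_A": 101.32, "dealer_B": 101.25, "dealer_C": 101.40}
reservation_price = 101.30   # worst price the client is willing to accept

best_dealer = min(quotes, key=quotes.get)
best_price = quotes[best_dealer]
cover_price = sorted(quotes.values())[1]   # second best answer, if any

if best_price <= reservation_price:
    print(f"deal with {best_dealer} at {best_price}; cover price {cover_price}")
else:
    print("no trade")
</syntaxhighlight>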


'''Market Participants.'''
Optimal execution is a fundamental problem in financial modeling. The simplest version is the case of a trader who wishes to buy or sell a given amount of a single asset within a given time period. The trader seeks strategies that maximize the return from, or alternatively minimize the cost of, executing the transaction.


'''The Almgren--Chriss Model.''' A classical framework for optimal execution is due to Almgren and Chriss <ref name="AC2001">R. Almgren and N. Chriss, ''Optimal execution of portfolio transactions'', Journal of Risk, 3 (2001), pp. 5--40.</ref>. In this setup a trader is required to sell an amount <math>q_0</math> of an asset, with price <math>S_0</math> at time 0, over the time period <math>[0,T]</math>, with trading decisions made at discrete time points <math>t=1,\ldots,T</math>. The final inventory <math>q_T</math> is required to be zero. The goal is therefore to determine the liquidation strategy <math>u_1,u_2,\ldots,u_{T}</math>, where <math>u_t</math> (<math>t=1,2,\ldots,T</math>) denotes the amount of the asset sold at time <math>t</math>. Selling the asset is assumed to have two types of price impact: a temporary impact, which refers to any temporary price movement due to the supply-demand imbalance caused by the selling, and a permanent impact, which is a long-term effect on the ‘equilibrium’ price due to the trading and which persists at least for the trading period.
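A minimal sketch of the model dynamics, assuming linear impact functions and unit time steps as in <ref name="AC2001"/>, with <math>\sigma</math> the volatility, <math>\gamma</math> the permanent impact coefficient, <math>\eta</math> the temporary impact coefficient, <math>\epsilon</math> a fixed cost of trading and <math>\xi_t</math> i.i.d. noise, is:

<math>
\begin{align}
S_t &= S_{t-1} + \sigma \xi_t - \gamma u_t, \\
\tilde{S}_t &= S_{t-1} - \epsilon\,\mathrm{sgn}(u_t) - \eta u_t, \\
q_t &= q_{t-1} - u_t, \qquad q_T = 0,
\end{align}
</math>

where the first line gives the price including the permanent impact, the second the execution price including the temporary impact, and the third the inventory dynamics. The trader chooses <math>u_1,\ldots,u_T</math> to minimize a mean-variance criterion <math>\mathbb{E}[C] + \lambda \mathrm{Var}[C]</math> of the execution cost <math>C = q_0 S_0 - \sum_{t=1}^{T} u_t \tilde{S}_t</math>, with <math>\lambda \geq 0</math> a risk-aversion parameter.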
The above Almgren--Chriss framework for liquidating a single asset can be extended to the case of multiple assets <ref name="AC2001"/>{{rp|at=Appendix A}}. We also note that the general solution to the Almgren--Chriss model had previously been constructed in <ref name="grinold2000active">R. C. Grinold and R. N. Kahn, ''Active portfolio management'', (2000).</ref>{{rp|at=Chapter 16}}.
This simple version of the Almgren--Chriss framework has a closed-form solution, but it relies heavily on the assumed dynamics and on the linear form of the permanent and temporary price impacts. Mis-specification of the dynamics and the market impacts may lead to undesirable strategies and potential losses. In addition, the Almgren--Chriss solution is a pre-planned strategy that does not depend on real-time market conditions, and hence it may miss certain opportunities when the market moves. This motivates the use of an RL approach, which is more flexible and able to incorporate market conditions when making decisions.
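To illustrate the pre-planned nature of this solution, note that in the continuous-time limit of the model, with risk-aversion parameter <math>\lambda>0</math>, volatility <math>\sigma</math> and temporary impact coefficient <math>\eta</math>, the optimal inventory trajectory takes the well-known deterministic form

<math>
q_t^* = q_0\,\frac{\sinh\big(\kappa(T-t)\big)}{\sinh(\kappa T)}, \qquad \kappa = \sqrt{\frac{\lambda\sigma^2}{\eta}},
</math>

so that a more risk-averse trader liquidates faster, and the limit <math>\lambda\to 0</math> recovers a linear, TWAP-like schedule. The trajectory is fixed at time 0 and does not react to the realized price path.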


'''Evaluation Criteria and Benchmark Algorithms.''' Before discussing the RL approach, we introduce several criteria that are widely used in the literature to evaluate the performance of execution algorithms, such as the Profit and Loss (PnL), the Implementation Shortfall, and the Sharpe ratio. The PnL is the ''final'' profit or loss induced by a given execution algorithm over the whole time period, made up of the transactions at all time points.
The Implementation Shortfall <ref name="perold1988implementation">A. F. Perold, ''The implementation shortfall: Paper versus reality'', Journal of Portfolio Management, 14 (1988), p. 4.</ref> for an execution algorithm is defined as the difference between the PnL of the algorithm and the PnL received by trading the entire amount of the asset instantly. The Sharpe ratio <ref name="sharpe1966">W. F. Sharpe, ''Mutual fund performance'', The Journal of Business, 39 (1966), pp. 119--138.</ref> is defined as the ratio of expected return to standard deviation of the return; thus it measures return per unit of risk. Two popular variants of the Sharpe ratio are the differential Sharpe ratio <ref name="moody1998dSharpe">J. Moody, L. Wu, Y. Liao, and M. Saffell, ''Performance functions and reinforcement learning for trading systems and portfolios'', Journal of Forecasting, 17 (1998), pp. 441--470.</ref> and the Sortino ratio <ref name="sortino1994">F. A. Sortino and L. N. Price, ''Performance measurement in a downside risk framework'', The Journal of Investing, 3 (1994), pp. 59--64.</ref>.
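As a concrete illustration, the short sketch below computes the PnL, the Implementation Shortfall, and an (annualisation-free) Sharpe ratio for a simulated sell schedule; all numbers and the sign convention are illustrative assumptions only.

<syntaxhighlight lang="python">
import statistics

# Executed (quantity, price) pairs for selling q0 = 600 shares, plus the
# price S0 at which the whole amount could notionally have been sold at
# time 0. All values are made up for illustration.
fills = [(200, 99.8), (200, 99.5), (200, 99.1)]
S0, q0 = 100.0, 600

pnl = sum(q * p for q, p in fills)     # cash received from the sale

# Implementation Shortfall as defined above: PnL of the algorithm minus the
# PnL of selling the entire amount instantly at S0 (negative = a cost).
implementation_shortfall = pnl - q0 * S0

# Sharpe ratio of a series of per-period returns (risk-free rate taken as 0).
returns = [0.002, -0.001, 0.003, 0.001]
sharpe = statistics.mean(returns) / statistics.stdev(returns)

print(pnl, implementation_shortfall, round(sharpe, 3))
</syntaxhighlight>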
In addition, some classical pre-specified strategies are used as benchmarks to evaluate the performance of a given RL-based execution strategy. Popular choices include execution strategies based on the Time-Weighted Average Price (TWAP) and the Volume-Weighted Average Price (VWAP), as well as the Submit and Leave (SnL) policy, where a trader places a sell order for all shares at a fixed limit order price and goes to the market with any shares remaining unexecuted at time <math>T</math>.
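For instance, a TWAP schedule for selling <math>q_0</math> shares over <math>T</math> periods simply splits the order evenly in time, while a VWAP schedule splits it in proportion to (expected) traded volume. A minimal sketch, with an assumed volume profile for illustration:

<syntaxhighlight lang="python">
q0, T = 600, 4                           # shares to sell, number of periods
expected_volume = [100, 300, 400, 200]   # assumed market volume profile

twap_schedule = [q0 / T] * T
total_vol = sum(expected_volume)
vwap_schedule = [q0 * v / total_vol for v in expected_volume]

print(twap_schedule)   # [150.0, 150.0, 150.0, 150.0]
print(vwap_schedule)   # [60.0, 180.0, 240.0, 120.0]
</syntaxhighlight>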


'''RL Approach.''' We first provide a brief overview of the existing literature on RL for optimal execution. The most popular types of RL methods that have been used in optimal execution problems are <math>Q</math>-learning algorithms and (double) DQN <ref name="HW2014">D. Hendricks and D. Wilcox, ''A reinforcement learning extension to the Almgren-Chriss framework for optimal trade execution'', in 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr), IEEE, 2014, pp. 457--464.</ref><ref name="NLJ2018">B. Ning, F. H. T. Ling, and S. Jaimungal, ''Double deep Q-learning for optimal execution'', arXiv preprint arXiv:1812.06600, (2018).</ref><ref name="ZZR2020">Z. Zhang, S. Zohren, and S. Roberts, ''Deep reinforcement learning for trading'', The Journal of Financial Data Science, 2 (2020), pp. 25--40.</ref><ref name="JK2019">G. Jeong and H. Y. Kim, ''Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning'', Expert Systems with Applications, 117 (2019), pp. 125--138.</ref><ref name="DGK2019">K. Dabérius, E. Granat, and P. Karlsson, ''Deep execution-value and policy based reinforcement learning for trading and beating market benchmarks'', Available at SSRN 3374766, (2019).</ref><ref name="SHYO2014">Y. Shen, R. Huang, C. Yan, and K. Obermayer, ''Risk-averse reinforcement learning for algorithmic trading'', in 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr), IEEE, 2014, pp. 391--398.</ref><ref name="NFK2006">Y. Nevmyvaka, Y. Feng, and M. Kearns, ''Reinforcement learning for optimized trade execution'', in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 673--680.</ref><ref name="cartea2021survey">Á. Cartea, S. Jaimungal, and L. Sánchez-Betancourt, ''Deep reinforcement learning for algorithmic trading'', Available at SSRN, (2021).</ref>. Policy-based algorithms are also popular in this field, including (deep) policy gradient methods <ref name="hambly2020policy">B. Hambly, R. Xu, and H. Yang, ''Policy gradient methods for the noisy linear quadratic regulator over a finite horizon'', SIAM Journal on Control and Optimization, 59 (2021), pp. 3359--3391.</ref><ref name="ZZR2020"/>, A2C <ref name="ZZR2020"/>, PPO <ref name="DGK2019"/><ref name="LB2020">S. Lin and P. A. Beling, ''An end-to-end optimal trade execution framework based on proximal policy optimization'', in IJCAI, 2020, pp. 4548--4554.</ref>, and DDPG <ref name="Ye2020">Z. Ye, W. Deng, S. Zhou, Y. Xu, and J. Guan, ''Optimal trade execution based on deep deterministic policy gradient'', in Database Systems for Advanced Applications, Springer International Publishing, 2020, pp. 638--654.</ref>. The benchmark strategies studied in these papers include the Almgren--Chriss solution <ref name="HW2014"/><ref name="hambly2020policy"/>, the TWAP strategy <ref name="NLJ2018"/><ref name="DGK2019"/><ref name="LB2020"/>, the VWAP strategy <ref name="LB2020"/>, and the SnL policy <ref name="NFK2006"/><ref name="Ye2020"/>.
In some models the trader is allowed to buy or sell the asset at each time point <ref name="JK2019"/><ref name="ZZR2020"/><ref name="WWMD2019">H. Wei, Y. Wang, L. Mangu, and K. Decker, ''Model-based reinforcement learning for predictions and control for limit order books'', arXiv preprint arXiv:1910.03743, (2019).</ref><ref name="Deng2016">Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, ''Deep direct reinforcement learning for financial signal representation and trading'', IEEE Transactions on Neural Networks and Learning Systems, 28 (2017), pp. 653--664.</ref>, whereas there are also many models where only one trading direction is allowed <ref name="NFK2006"/><ref name="HW2014"/><ref name="hambly2020policy"/><ref name="NLJ2018"/><ref name="DGK2019"/><ref name="SHYO2014"/><ref name="Ye2020"/><ref name="LB2020"/>. The state variables are often composed of the time stamp, market attributes such as the (mid-)price of the asset and/or the spread, the inventory process, and past returns. The control variables are typically the amount of the asset to trade (using market orders) and/or the relative price level at which to quote (using limit orders) at each time point. Examples of reward functions include cash inflow or outflow (depending on whether we sell or buy) <ref name="SHYO2014"/><ref name="NFK2006"/>, implementation shortfall <ref name="HW2014"/>, profit <ref name="Deng2016"/>, Sharpe ratio <ref name="Deng2016"/>, return <ref name="JK2019"/>, and PnL <ref name="WWMD2019"/>. Popular choices of performance measure include the Implementation Shortfall <ref name="HW2014"/><ref name="hambly2020policy"/>, PnL (with a penalty term for transaction costs) <ref name="NLJ2018"/><ref name="WWMD2019"/><ref name="Deng2016"/>, trading cost <ref name="NFK2006"/><ref name="SHYO2014"/>, profit <ref name="Deng2016"/><ref name="JK2019"/>, Sharpe ratio <ref name="Deng2016"/><ref name="ZZR2020"/>, Sortino ratio <ref name="ZZR2020"/>, and return <ref name="ZZR2020"/>.
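To fix ideas, the following is a minimal sketch of how such an execution MDP might be set up in Python, with the state taken to be (time remaining, inventory, last price), the action a number of shares to sell by market order, and the reward the cash inflow net of a linear temporary impact; the price model and all coefficients are illustrative assumptions rather than any model used in the papers above.

<syntaxhighlight lang="python">
import random

class ExecutionEnv:
    """Toy liquidation MDP: sell q0 shares over T steps. Illustrative only."""

    def __init__(self, q0=1000, T=10, S0=100.0, sigma=0.1, eta=0.001):
        self.q0, self.T, self.S0 = q0, T, S0
        self.sigma, self.eta = sigma, eta   # volatility and temporary impact

    def reset(self):
        self.t, self.q, self.S = 0, self.q0, self.S0
        return (self.T - self.t, self.q, self.S)

    def step(self, sell):
        sell = min(sell, self.q)                       # cannot sell more than held
        if self.t == self.T - 1:                       # force full liquidation
            sell = self.q
        exec_price = self.S - self.eta * sell          # temporary impact
        reward = sell * exec_price                     # cash inflow
        self.q -= sell
        self.S += self.sigma * random.gauss(0.0, 1.0)  # exogenous price move
        self.t += 1
        done = self.t == self.T
        return (self.T - self.t, self.q, self.S), reward, done

# A TWAP-like agent sells q0/T shares per step.
env = ExecutionEnv()
state, done, cash = env.reset(), False, 0.0
while not done:
    state, reward, done = env.step(env.q0 / env.T)
    cash += reward
print(f"cash received: {cash:.2f}")
</syntaxhighlight>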
We now discuss some more details of the RL algorithms and experimental settings in the above papers.


For value-based algorithms, <ref name="NFK2006"/> provided the first large scale empirical analysis of an RL method applied to optimal execution problems. They focused on a modified <math>Q</math>-learning algorithm to select price levels for limit order trading, which led to significant improvements over simpler forms of optimization, such as SnL policies, in terms of trading costs. <ref name="SHYO2014"/> proposed a risk-averse RL algorithm for optimal liquidation, which can be viewed as a generalization of <ref name="NFK2006"/>. This algorithm achieves substantially lower trading costs over the period of the 2010 flash crash compared with the risk-neutral RL algorithm in <ref name="NFK2006"/>. <ref name="HW2014"/> combined the Almgren--Chriss solution with the <math>Q</math>-learning algorithm and showed that it improves the Implementation Shortfall of the Almgren--Chriss solution by up to 10.3% on average, using LOB data with five price levels. <ref name="NLJ2018"/> proposed a modified double DQN algorithm for optimal execution and showed that it outperforms the TWAP strategy on seven out of nine stocks using PnL as the performance measure. They added a single one-second time step <math>\Delta T</math> at the end of the horizon to guarantee that all shares are liquidated over <math>T+\Delta T</math>. <ref name="JK2019"/> designed a trading system based on DQN which determines both the trading direction and the number of shares to trade. Their approach was shown to increase total profits at least fourfold on four different index stocks compared with a benchmark trading model which trades a fixed number of shares each time. They also used ''transfer learning'' to avoid overfitting, whereby knowledge learned in one situation is reused in a related or similar one.
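The tabular <math>Q</math>-learning update used (in various modified forms) by the value-based papers above can be sketched generically as follows; the state and action discretisation, the learning rate <math>\alpha</math>, the discount factor <math>\gamma</math> and the exploration rate are all illustrative assumptions.

<syntaxhighlight lang="python">
from collections import defaultdict
import random

Q = defaultdict(float)          # Q[(state, action)], initialised to 0
alpha, gamma, eps = 0.1, 0.99, 0.1
actions = [0, 50, 100, 200]     # e.g. shares to sell, a coarse discretisation

def choose_action(state):
    # Epsilon-greedy exploration over the discrete action set.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, done):
    # One-step Q-learning target: r + gamma * max_a' Q(s', a').
    target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
</syntaxhighlight>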


For policy-based algorithms, <ref name="Deng2016"/> combined deep learning with RL to determine whether to sell, hold, or buy at each time point. In the first step of their model, neural networks are used to summarize the market features, and in the second step the RL part makes the trading decisions. The proposed method was shown to outperform several other deep learning and deep RL models in terms of PnL, total profits, and Sharpe ratio. They suggested that, in practice, the Sharpe ratio is a better choice of reward function than total profits. <ref name="Ye2020"/> used the DDPG algorithm for optimal execution over a short time horizon (two minutes) and designed a network to extract features from the market data. Experiments on real LOB data with 10 price levels showed that the proposed approach significantly outperforms the existing methods, including the SnL policy (as a baseline), the <math>Q</math>-learning algorithm, and the method in <ref name="HW2014"/>. <ref name="LB2020"/> proposed an adaptive framework based on PPO with neural networks, including LSTM and fully-connected networks, and showed that the framework outperforms baseline models, including TWAP and VWAP, as well as several deep RL models, on most of the 14 US equities considered. <ref name="hambly2020policy"/> applied the (vanilla) policy gradient method to the LOB data of five stocks in different sectors and showed an improvement in the Implementation Shortfall of around 20% over the Almgren--Chriss solution. <ref name="leal2020learning">L. Leal, M. Laurière, and C.-A. Lehalle, ''Learning a functional control for high-frequency finance'', arXiv preprint arXiv:2006.09611, (2020).</ref> used neural networks to learn the mapping between the risk-aversion parameter and the optimal control, with potential market impacts incorporated.


For a comparison between value-based and policy-based algorithms, <ref name="DGK2019"/> explored double DQN and PPO algorithms in different market environments: when the benchmark TWAP is optimal, PPO is shown to converge to TWAP whereas double DQN may not; when TWAP is not optimal, both algorithms outperform this benchmark. <ref name="ZZR2020"/> showed that DQN, policy gradient, and A2C outperform several baseline models, including classical time-series momentum strategies, on test data of 50 liquid futures contracts. Both continuous and discrete action spaces were considered in their work. They observed that DQN achieves the best performance, with the A2C approach second best.


In addition, model-based RL algorithms have also been used for optimal execution. <ref name="WWMD2019"/> built a profitable electronic trading agent to place buy and sell orders using model-based RL, which outperforms two benchmark strategies in terms of PnL on LOB data. They used a recurrent neural network to learn the state transition probabilities. We note that multi-agent RL has also been used to address the optimal execution problem; see, for example, <ref name="BL2019">W. Bao and X.-Y. Liu, ''Multi-agent deep reinforcement learning for liquidation strategy analysis'', arXiv preprint arXiv:1906.11046, (2019).</ref><ref name="KFMW2020">M. Karpe, J. Fang, Z. Ma, and C. Wang, ''Multi-agent reinforcement learning in a realistic limit order book market simulation'', in Proceedings of the First ACM International Conference on AI in Finance, ICAIF '20, 2020.</ref>.


===<span id="sec:portfolio_optimization"></span>Portfolio Optimization===
In portfolio optimization problems, a trader needs to select and trade the best portfolio of assets in order to maximize some objective function, which typically includes the expected return and some measure of the risk. The benefit of investing in such portfolios is that diversification can achieve a higher return per unit of risk than investing in a single asset alone (see, e.g., <ref name="zivot2017introduction">E. Zivot, ''Introduction to Computational Finance and Financial Econometrics'', Chapman & Hall/CRC, 2017.</ref>).


'''Mean-Variance Portfolio Optimization.'''
The first significant mathematical model for portfolio optimization is the ''Markowitz'' model <ref name="Markowitz1952">H. M. Markowitz, ''Portfolio selection'', Journal of Finance, 7 (1952), pp. 77--91.</ref>, also called the ''mean-variance'' model, where an investor seeks a portfolio that maximizes the expected total return for any given level of risk, measured by the variance. This is a single-period optimization problem, which was subsequently generalized to multi-period portfolio optimization problems in <ref name="mossin1968optimal">J. Mossin, ''Optimal multiperiod portfolio policies'', The Journal of Business, 41 (1968), pp. 215--229.</ref><ref name="hakansson1971multi">N. H. Hakansson, ''Multi-period mean-variance analysis: Toward a general theory of portfolio choice'', The Journal of Finance, 26 (1971), pp. 857--884.</ref><ref name="samuelson1975lifetime">P. A. Samuelson, ''Lifetime portfolio selection by dynamic stochastic programming'', Stochastic Optimization Models in Finance, (1975), pp. 517--524.</ref><ref name="merton1974">R. C. Merton and P. A. Samuelson, ''Fallacy of the log-normal approximation to optimal portfolio decision-making over many periods'', Journal of Financial Economics, 1 (1974), pp. 67--94.</ref><ref name="steinbach1999markowitz">M. Steinbach, ''Markowitz revisited: Mean-variance models in financial portfolio analysis'', SIAM Review, 43 (2001), pp. 31--85.</ref><ref name="li2000multiperiodMV">D. Li and W.-L. Ng, ''Optimal dynamic portfolio selection: Multiperiod mean-variance formulation'', Mathematical Finance, 10 (2000), pp. 387--406.</ref>. In this mean-variance framework, the risk of a portfolio is quantified by the variance of the wealth, and the optimal investment strategy is sought to maximize the final wealth penalized by a variance term. The mean-variance framework is of particular interest because it not only captures both the portfolio return and the risk, but also suffers from the ''time-inconsistency'' problem <ref name="strotz1955myopia">R. H. Strotz, ''Myopia and inconsistency in dynamic utility maximization'', The Review of Economic Studies, 23 (1955), pp. 165--180.</ref><ref name="vigna2016time">E. Vigna, ''On time consistency for mean-variance portfolio selection'', Collegio Carlo Alberto Notebook, 476 (2016).</ref><ref name="xiao2020corre">H. Xiao, Z. Zhou, T. Ren, Y. Bai, and W. Liu, ''Time-consistent strategies for multi-period mean-variance portfolio optimization with the serially correlated returns'', Communications in Statistics - Theory and Methods, 49 (2020), pp. 2831--2868.</ref>. That is, the optimal strategy selected at time <math>t</math> is no longer optimal at time <math>s > t</math>, and the Bellman equation does not hold. A breakthrough was made in <ref name="li2000multiperiodMV"/>, which was the first to derive the analytical solution to the discrete-time multi-period mean-variance problem. The authors applied an ''embedding'' approach, which transforms the mean-variance problem into an LQ problem, for which classical approaches can be used to find the solution. The same approach was later used to solve the continuous-time mean-variance problem <ref name="zhou2000continuous">X. Y. Zhou and D. Li, ''Continuous-time mean-variance portfolio selection: A stochastic LQ framework'', Applied Mathematics and Optimization, 42 (2000), pp. 19--33.</ref>.
In addition to the embedding scheme, other methods, including the consistent planning approach <ref name="basak2010dynamic">S. Basak and G. Chabakauri, ''Dynamic mean-variance asset allocation'', The Review of Financial Studies, 23 (2010), pp. 2970--3016.</ref><ref name="bjork2010general">T. Bjork and A. Murgoci, ''A general theory of Markovian time inconsistent stochastic control problems'', Available at SSRN 1694759, (2010).</ref> and the dynamically optimal strategy <ref name="pedersen2017optimal">J. L. Pedersen and G. Peskir, ''Optimal mean-variance portfolio selection'', Mathematics and Financial Economics, 11 (2017), pp. 137--160.</ref>, have also been applied to resolve the time-inconsistency issue arising in the mean-variance formulation of portfolio optimization.
Here we introduce the multi-period mean-variance portfolio optimization problem as given in <ref name="li2000multiperiodMV"/>. Suppose there are <math>n</math> risky assets in the market and an investor enters the market at time 0 with initial wealth <math>x_0</math>. The goal of the investor is to reallocate their wealth at each time point <math>t=0,1,\ldots,T</math> among the <math>n</math> assets to achieve the optimal trade-off between the return and the risk of the investment. The random rates of return of the assets at time <math>t</math> are denoted by <math>\pmb{e}_t =(e_t^1,\ldots,e_t^n)^\top</math>, where <math>e_t^i</math> (<math>i=1,\ldots,n</math>) is the rate of return of the <math>i</math>-th asset at time <math>t</math>. The vectors <math>\pmb{e}_t</math>, <math>t=0,1,\ldots,T-1</math>, are assumed to be statistically independent (this independence assumption can be relaxed, see, e.g., <ref name="xiao2020corre"/>) with known mean <math>\pmb{\mu}_t=(\mu_t^1,\ldots,\mu_t^n)^\top\in\mathbb{R}^{n}</math> and known standard deviation <math>\sigma_{t}^i</math> for <math>i=1,\ldots,n</math> and <math>t=0,\ldots,T-1</math>. The covariance matrix is denoted by <math>\pmb{\Sigma}_t\in\mathbb{R}^{n\times n}</math>, where <math>[\pmb{\Sigma}_{t}]_{ii}=(\sigma_{t}^i)^2</math> and <math>[\pmb{\Sigma}_t]_{ij}=\rho_{t}^{ij}\sigma_{t}^i\sigma_{t}^j</math> for <math>i,j=1,\ldots,n</math> and <math>i\neq j</math>, where <math>\rho_{t}^{ij}</math> is the correlation between assets <math>i</math> and <math>j</math> at time <math>t</math>. We write <math>x_t</math> for the wealth of the investor at time <math>t</math> and <math>u_t^i</math> (<math>i=1,\ldots,n-1</math>) for the amount invested in the <math>i</math>-th asset at time <math>t</math>. Thus the amount invested in the <math>n</math>-th asset is <math>x_t-\sum_{i=1}^{n-1} u_t^i</math>. An investment strategy is denoted by <math>\pmb{u}_t= (u_t^1,u_t^2,\ldots,u_t^{n-1})^\top</math> for <math>t=0,1,\ldots,T-1</math>, and the goal is to find an optimal strategy such that the portfolio return is maximized while the risk of the investment is minimized.
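Writing <math>\phi>0</math> for a risk-aversion (trade-off) parameter and interpreting <math>e_t^i</math> as the gross return on asset <math>i</math> over period <math>t</math>, one standard way to express this objective, following <ref name="li2000multiperiodMV"/>, is

<math>
\max_{\{\pmb{u}_t\}_{t=0}^{T-1}} \ \mathbb{E}[x_T] - \phi\,\mathrm{Var}(x_T),
</math>

subject to the self-financing wealth dynamics

<math>
x_{t+1} = e_t^n\Big(x_t - \sum_{i=1}^{n-1} u_t^i\Big) + \sum_{i=1}^{n-1} e_t^i u_t^i, \qquad t=0,1,\ldots,T-1.
</math>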


The analytical solution expresses the optimal strategy through coefficients <math>\alpha_t</math> and <math>\beta_t</math>, which are explicit functions of <math>\pmb{\mu}_t</math> and <math>\pmb{\Sigma}_t</math>; the formulas are omitted here and can be found in <ref name="li2000multiperiodMV"/>.
The above framework has been extended in different ways; for example, a risk-free asset can be included in the portfolio, and one can maximize a cumulative version of the mean-variance objective rather than focusing only on the final wealth <math>x_T</math>. For more details about these variants and their solutions, see <ref name="xiao2020corre"/>. In addition to the mean-variance framework, other major paradigms in portfolio optimization are the Kelly Criterion and Risk Parity. We refer to <ref name="Sato2019Survey">Y. Sato, ''Model-free reinforcement learning for financial portfolios: A brief survey'', arXiv preprint arXiv:1904.04973, (2019).</ref> for a review of these optimal control frameworks and of popular model-free RL algorithms for portfolio optimization.
Note that the classical stochastic control approach to portfolio optimization across multiple assets requires both a realistic representation of the temporal dynamics of the individual assets and an adequate representation of their co-movements. This is extremely difficult when the assets belong to different classes (for example, stocks, options, futures, interest rates and their derivatives). The model-free RL approach, on the other hand, does not rely on a specification of the joint dynamics across assets.


'''RL Approach.''' Both value-based methods such as <math>Q</math>-learning <ref name="du2016algorithm">X. Du, J. Zhai, and K. Lv, ''Algorithm trading using Q-learning and recurrent reinforcement learning'', Positions, 1 (2016), p. 1.</ref><ref name="pendharkar2018trading">P. C. Pendharkar and P. Cusatis, ''Trading financial indices with reinforcement learning agents'', Expert Systems with Applications, 103 (2018), pp. 1--13.</ref>, SARSA <ref name="pendharkar2018trading"/>, and DQN <ref name="PSC2020">H. Park, M. K. Sim, and D. G. Choi, ''An intelligent financial portfolio trading strategy using deep Q-learning'', Expert Systems with Applications, 158 (2020), p. 113573.</ref>, and policy-based algorithms such as DPG and DDPG <ref name="xiong2018practical">Z. Xiong, X.-Y. Liu, S. Zhong, H. Yang, and A. Walid, ''Practical deep reinforcement learning approach for stock trading'', arXiv preprint arXiv:1811.07522, (2018).</ref><ref name="JXL2017">Z. Jiang, D. Xu, and J. Liang, ''A deep reinforcement learning framework for the financial portfolio management problem'', arXiv preprint arXiv:1706.10059, (2017).</ref><ref name="YLKSD2019">P. Yu, J. S. Lee, I. Kulyatin, Z. Shi, and S. Dasgupta, ''Model-based deep reinforcement learning for dynamic portfolio optimization'', arXiv preprint arXiv:1901.08740, (2019).</ref><ref name="liang2018adversarial">Z. Liang, H. Chen, J. Zhu, K. Jiang, and Y. Li, ''Adversarial deep reinforcement learning in portfolio management'', arXiv preprint arXiv:1808.09940, (2018).</ref><ref name="aboussalah2020value">A. M. Aboussalah, ''What is the value of the cross-sectional approach to deep reinforcement learning?'', Available at SSRN, (2020).</ref><ref name="cong2021alphaportfolio">L. W. Cong, K. Tang, J. Wang, and Y. Zhang, ''AlphaPortfolio: Direct construction through deep reinforcement learning and interpretable AI'', SSRN Electronic Journal, https://doi.org/10.2139/ssrn.3554486, (2021).</ref> have been applied to solve portfolio optimization problems. The state variables are often composed of time, asset prices, asset past returns, current holdings of assets, and remaining balance. The control variables are typically the amount or proportion of wealth invested in each component of the portfolio. Examples of reward signals include portfolio return <ref name="JXL2017"/><ref name="pendharkar2018trading"/><ref name="YLKSD2019"/>, (differential) Sharpe ratio <ref name="du2016algorithm"/><ref name="pendharkar2018trading"/>, and profit <ref name="du2016algorithm"/>. The benchmark strategies include the Constantly Rebalanced Portfolio (CRP) <ref name="YLKSD2019"/><ref name="JXL2017"/>, where at each period the portfolio is rebalanced to the initial wealth distribution among the assets, and the buy-and-hold or do-nothing strategy <ref name="PSC2020"/><ref name="aboussalah2020value"/>, which does not take any action but rather holds the initial portfolio until the end. The performance measures studied in these papers include the Sharpe ratio <ref name="YLKSD2019"/><ref name="WZ2019">H. Wang and X. Y. Zhou, ''Continuous-time mean-variance portfolio selection: A reinforcement learning framework'', Mathematical Finance, 30 (2020), pp. 1273--1308.</ref><ref name="xiong2018practical"/><ref name="JXL2017"/><ref name="liang2018adversarial"/><ref name="PSC2020"/><ref name="wang2019">H. Wang, ''Large scale continuous-time mean-variance portfolio allocation via reinforcement learning'', Available at SSRN 3428125, (2019).</ref>, the Sortino ratio <ref name="YLKSD2019"/>, portfolio returns <ref name="wang2019"/><ref name="xiong2018practical"/><ref name="liang2018adversarial"/><ref name="PSC2020"/><ref name="YLKSD2019"/><ref name="aboussalah2020value"/>, portfolio values <ref name="JXL2017"/><ref name="xiong2018practical"/><ref name="pendharkar2018trading"/>, and cumulative profits <ref name="du2016algorithm"/>. Some models incorporate transaction costs <ref name="JXL2017"/><ref name="liang2018adversarial"/><ref name="du2016algorithm"/><ref name="PSC2020"/><ref name="YLKSD2019"/><ref name="aboussalah2020value"/> and investments in the risk-free asset <ref name="YLKSD2019"/><ref name="WZ2019"/><ref name="wang2019"/><ref name="du2016algorithm"/><ref name="JXL2017"/>.
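As an illustration of a typical formulation, the sketch below takes the action to be a vector of portfolio weights, the reward to be the resulting portfolio return net of proportional transaction costs, and the next state to include the drifted weights; the numbers and the cost parameter are illustrative assumptions, not taken from any of the papers above.

<syntaxhighlight lang="python">
import numpy as np

def portfolio_step(weights, new_weights, asset_returns, cost_rate=0.001):
    """One period of a toy portfolio MDP.

    weights, new_weights : current and chosen portfolio weights (sum to 1)
    asset_returns        : realised per-period returns of the assets
    Returns the reward (net portfolio return) and the drifted weights that
    form part of the next state.
    """
    turnover = np.abs(new_weights - weights).sum()
    gross_return = float(new_weights @ asset_returns)
    reward = gross_return - cost_rate * turnover        # return net of costs

    # Weights drift with the realised returns before the next rebalancing.
    grown = new_weights * (1.0 + asset_returns)
    next_weights = grown / grown.sum()
    return reward, next_weights

# Example: rebalance an equal-weight two-asset portfolio towards asset 1.
w = np.array([0.5, 0.5])
reward, w_next = portfolio_step(w, np.array([0.7, 0.3]), np.array([0.02, -0.01]))
print(round(reward, 5), w_next.round(4))
</syntaxhighlight>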


For value-based algorithms, <ref name="du2016algorithm"/> considered the portfolio optimization problem with one risky asset and one risk-free asset. They compared the performance of the <math>Q</math>-learning algorithm and a Recurrent RL (RRL) algorithm under three different value functions including the Sharpe ratio, differential Sharpe ratio, and profit. The RRL algorithm is a policy-based method which uses the last action as an input. They concluded that the <math>Q</math>-learning algorithm is more sensitive to the choice of value function and has less stable performance than the RRL algorithm. They also suggested that the (differential) Sharpe ratio is preferable to the profit as the reward function. <ref name="pendharkar2018trading"/> studied a two-asset personal retirement portfolio optimization problem, in which traders who manage retirement funds are restricted from making frequent trades and can only access limited information. They tested the performance of three algorithms: SARSA and <math>Q</math>-learning methods with discrete state and action spaces that maximize either the portfolio return or the differential Sharpe ratio, and a TD learning method with a discrete state space and continuous action space that maximizes the portfolio return. The TD method learns the portfolio returns by using a linear regression model and was shown to outperform the other two methods in terms of portfolio values. <ref name="PSC2020"/> proposed a portfolio trading strategy based on DQN which chooses to either hold, buy or sell a pre-specified quantity of the asset at each time point. In their experiments with two different three-asset portfolios, their trading strategy was superior to four benchmark strategies, including the do-nothing strategy and a random strategy (taking an action in the feasible space at random), in terms of performance measures including the cumulative return and the Sharpe ratio. <ref name="dixon2020g">M.DIXON and I.HALPERIN, ''G-learner and girl: Goal based wealth  management with reinforcement learning'', arXiv preprint arXiv:2002.10990,  (2020).</ref> applied a G-learning-based algorithm, a probabilistic extension of <math>Q</math>-learning
which scales to high-dimensional portfolios while providing a flexible choice of utility functions, to wealth management problems. The authors also extended the G-learning algorithm to the setting of Inverse Reinforcement Learning (IRL), where the rewards collected by the agent are not observed and must instead be inferred.
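As an illustration of the discrete-state, discrete-action value-based setups described above (a toy sketch rather than a reproduction of any cited paper), the following runs tabular <math>Q</math>-learning on a two-asset problem: the state is the sign of the last risky-asset return, the action is the fraction of wealth allocated to the risky asset, and the reward is the one-period portfolio log return. All numerical values are illustrative.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
actions = np.array([0.0, 0.5, 1.0])       # fraction of wealth in the risky asset
risk_free = 0.0001                        # per-period risk-free return
n_states, n_actions = 2, len(actions)     # state: sign of last risky return (0 = down, 1 = up)
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1

def risky_return():
    # illustrative i.i.d. risky-asset return
    return 0.0003 + 0.01 * rng.standard_normal()

state = int(risky_return() > 0)
for t in range(100_000):
    # epsilon-greedy action selection
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[state].argmax())
    r_risky = risky_return()
    reward = np.log(1 + actions[a] * r_risky + (1 - actions[a]) * risk_free)
    next_state = int(r_risky > 0)
    # standard Q-learning update
    Q[state, a] += alpha * (reward + gamma * Q[next_state].max() - Q[state, a])
    state = next_state

print("Learned Q-table:\n", Q)
print("Greedy allocation per state:", actions[Q.argmax(axis=1)])
</syntaxhighlight>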


For policy-based algorithms, <ref name="JXL2017"/> proposed a framework combining neural networks with DPG. They used a so-called Ensemble of Identical Independent Evaluators (EIIE) topology to predict the potential growth of the assets in the immediate future using historical data which includes the highest, lowest, and closing prices of portfolio components. The experiments using real cryptocurrency market data showed that their framework achieves a higher Sharpe ratio and higher cumulative portfolio values compared with three benchmarks including CRP and several published RL models. <ref name="xiong2018practical"/> explored the DDPG algorithm for the portfolio selection of 30 stocks, where at each time point, the agent can choose to buy, sell, or hold each stock. The DDPG algorithm was shown to outperform two classical strategies including the min-variance portfolio allocation method <ref name="yang2018practical">H.YANG, X.-Y. Liu, and Q.WU, ''A practical machine learning approach  for dynamic stock recommendation'', in 2018 17th IEEE International Conference  On Trust, Security And Privacy In Computing And Communications/12th IEEE  International Conference On Big Data Science And Engineering  (TrustCom/BigDataSE), IEEE, 2018, pp.1693--1697.</ref> in terms of several performance measures including final portfolio values, annualized return, and Sharpe ratio, using historical daily prices of the 30 stocks. <ref name="liang2018adversarial"/> considered the DDPG, PPO, and policy gradient method with an adversarial learning scheme which learns the execution strategy using noisy market data. They applied the algorithms for portfolio optimization to five stocks using market data, and the policy gradient method with the adversarial learning scheme achieves a higher daily return and Sharpe ratio than the benchmark CRP strategy. They also showed that the DDPG and PPO algorithms fail to find the optimal policy even in the training set. <ref name="YLKSD2019"/> proposed a model using DDPG, which includes prediction of the assets' future price movements based on historical prices and synthetic market data generation using a ''Generative Adversarial Network'' (GAN) <ref name="goodfellow2014GAN">I.GOODFELLOW, J.POUGET-Abadie, M.MIRZA, B.XU, D.WARDE-Farley,  S.OZAIR, A.COURVILLE, and Y.BENGIO, ''Generative adversarial nets'',  Advances in Neural Information Processing Systems, 27 (2014).</ref>. The model outperforms the benchmark CRP and the model considered in <ref name="JXL2017"/> in terms of several performance measures including Sharpe ratio and Sortino ratio. <ref name="cong2021alphaportfolio"/> embedded alpha portfolio strategies into a deep policy-based method and designed a framework which is easier to interpret.
Using Actor-Critic methods, <ref name="aboussalah2020value"/> combined the mean-variance framework (the actor determines the policy using the mean-variance framework) and the Kelly Criterion framework (the critic evaluates the policy using its growth rate). They studied eight policy-based algorithms including DPG, DDPG, and PPO, among which DPG was shown to achieve the best performance.
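The policy-based viewpoint can be illustrated with a deliberately stripped-down sketch (not the EIIE, DDPG, or adversarial architectures discussed above): a single linear layer maps a window of recent returns to long-only portfolio weights through a softmax, and the parameters are updated by gradient ascent on the realized one-period log return. The data and hyperparameters are synthetic and illustrative.

<syntaxhighlight lang="python">
import torch

torch.manual_seed(0)
n_assets, lookback, T = 4, 5, 2000
# synthetic per-period asset returns, purely illustrative
returns = 0.0005 + 0.01 * torch.randn(T, n_assets)

policy = torch.nn.Linear(n_assets * lookback, n_assets)    # maps recent returns to weight logits
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for t in range(lookback, T):
    feats = returns[t - lookback:t].reshape(-1)             # state: flattened window of recent returns
    weights = torch.softmax(policy(feats), dim=0)           # long-only weights summing to one
    reward = torch.log(1.0 + (weights * returns[t]).sum())  # one-period portfolio log return
    loss = -reward                                          # gradient ascent on the reward
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    print("final weights:", torch.softmax(policy(returns[-lookback:].reshape(-1)), dim=0))
</syntaxhighlight>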


In addition to the above discrete-time models, <ref name="WZ2019"/> studied the entropy-regularized continuous-time mean-variance framework with one risky asset and one risk-free asset. They proved that the optimal policy is Gaussian with decaying variance and proposed an Exploratory Mean-Variance (EMV) algorithm, which consists of three procedures: policy evaluation, policy improvement, and a self-correcting scheme for learning the Lagrange multiplier. They showed that this EMV algorithm outperforms two benchmarks, including the analytical solution with estimated model parameters obtained from the classical maximum likelihood method <ref name="campbell2012econometrics">J.Y. Campbell, A.W. Lo, and A.C. MacKinlay, ''The econometrics of  financial markets'', Princeton University Press, 1997.</ref>{{rp|at=Section 9.3.2}} and a DDPG algorithm, in terms of annualized sample mean and sample variance of the terminal wealth, and Sharpe ratio in their simulations. The continuous-time framework in <ref name="WZ2019"/> was then generalized in <ref name="wang2019"/> to a large-scale portfolio selection setting with <math>d</math> risky assets and one risk-free asset. The optimal policy is shown to be multivariate Gaussian with decaying variance. They tested the performance of the generalized EMV algorithm on price data from stocks in the S&P 500 with <math>d\geq 20</math>, where it outperforms several algorithms including DDPG.
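The qualitative structure of these continuous-time solutions, a Gaussian policy whose exploration variance decays over the investment horizon, is illustrated by the sketch below. The mean and variance schedules, as well as all market parameters, are illustrative placeholders and not the closed-form expressions derived in the cited papers.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
T, n_steps = 1.0, 50
dt = T / n_steps
mu, sigma, r = 0.08, 0.2, 0.02          # illustrative market parameters
target_wealth, x = 1.4, 1.0             # target terminal wealth and initial wealth

def policy_mean(x):
    # illustrative linear feedback on the gap to the target (placeholder, not the paper's formula)
    return 2.0 * (target_wealth - x)

def policy_var(t):
    # exploration variance decaying over the horizon (placeholder schedule)
    return 0.5 * np.exp(-3.0 * t / T)

for k in range(n_steps):
    t = k * dt
    u = rng.normal(policy_mean(x), np.sqrt(policy_var(t)))   # sampled dollar amount in the risky asset
    dW = np.sqrt(dt) * rng.standard_normal()
    x += r * x * dt + u * ((mu - r) * dt + sigma * dW)        # wealth dynamics with a risk-free asset

print("terminal wealth:", round(x, 4))
</syntaxhighlight>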


===Option Pricing and Hedging===
Understanding how to price and hedge financial derivatives is a cornerstone of modern mathematical and computational finance due to its importance in the finance industry.  A financial derivative is a contract that derives its value from the performance of an underlying entity. For example, a call or put ''option'' is a contract which gives the holder the right, but not the obligation, to buy or sell an underlying asset or instrument at a specified strike price prior to or on a specified date called the expiration date. Examples of option types include European options which can only be exercised at expiry, and American options which can be exercised at any time before the option expires.  
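For example, a European call option with strike <math>K</math> and expiration date <math>T</math> pays <math>\max(S_T-K,0)</math> at expiry, where <math>S_T</math> is the price of the underlying asset at time <math>T</math>, while the corresponding European put pays <math>\max(K-S_T,0)</math>.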


'''The Black-Scholes Model.''' One of the most important mathematical models for option pricing is the Black-Scholes or Black-Scholes-Merton (BSM) model <ref name="BS1973">F.BLACK and M.SCHOLES, ''The pricing of options and corporate  liabilities'', Journal of Political Economy, 81 (1973), pp.637--654.</ref><ref name="merton1973">R.C. Merton, ''Theory of rational option pricing'', The Bell  Journal of Economics and Management Science,  (1973), pp.141--183.</ref>, in which we aim to find the price of a European option <math>V(S_t,t)</math> (<math>0\leq t\leq T</math>) with underlying stock price <math>S_t</math>, expiration time <math>T</math>,  
and payoff at expiry <math>P(S_T)</math>. The underlying stock price <math>S_t</math> is assumed to be non-dividend paying and to follow a geometric Brownian motion


<math>
dS_t=\mu S_t\,dt+\sigma S_t\,dW_t,
</math>
where <math>\mu</math> is the drift, <math>\sigma > 0</math> is the volatility, and <math>(W_t)_{t\geq 0}</math> is a standard Brownian motion. Under the BSM assumptions (continuous, frictionless trading and a constant risk-free rate <math>r</math>), the price of a European call option with strike <math>K</math> is given by

<math>
V(s,t)=sN(d_1)-Ke^{-r(T-t)}N(d_2),
</math>
where <math>N(\cdot)</math> denotes the standard normal cumulative distribution function and

<math>
d_1=\frac{1}{\sigma\sqrt{T-t}}\left(\ln\left(\frac{s}{K}\right)+\left(r+\frac{\sigma^2}{2}\right)(T-t)\right),\qquad d_2=d_1-\sigma\sqrt{T-t}.
</math>
We refer to the survey <ref name="broadie2004anniversary">M.BROADIE and J.B. Detemple, ''Anniversary article: Option pricing:  Valuation models and applications'', Management Science, 50 (2004),  pp.1145--1177.</ref> for details about extensions of the BSM model, other classical option pricing models, and numerical methods such as the Monte Carlo method.
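For reference, the call price and delta implied by the formulas above can be computed directly; the expressions are the standard BSM ones and the numerical inputs below are illustrative.

<syntaxhighlight lang="python">
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bsm_call(s, K, r, sigma, tau):
    """BSM price and delta of a European call with time to expiry tau = T - t."""
    d1 = (log(s / K) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    price = s * norm_cdf(d1) - K * exp(-r * tau) * norm_cdf(d2)
    delta = norm_cdf(d1)          # sensitivity of the option price to the asset price
    return price, delta

print(bsm_call(s=100.0, K=100.0, r=0.02, sigma=0.2, tau=0.5))
</syntaxhighlight>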
In a complete market one can ''hedge'' a given derivative contract by buying and selling the underlying asset in the right way to eliminate risk. In the Black-Scholes analysis ''delta'' hedging is used, in which we hedge the risk of a call option by shorting <math>\Delta_t</math> units of the underlying asset with <math>\Delta_t:=\frac{\partial V}{\partial S}(S_t,t)</math> (the sensitivity of the option price with respect to the asset price). It is also possible to use financial derivatives to ''hedge'' against the volatility of given positions in the underlying assets. However, in practice, we can only rebalance portfolios at discrete time points and frequent transactions may incur high costs. Therefore an optimal hedging strategy depends on the tradeoff between the hedging error and the transaction costs. It is worth mentioning that this is in a similar spirit to the mean-variance portfolio optimization framework introduced in [[#sec:portfolio_optimization |Section]]. Much effort has been made to include the transaction costs using classical approaches such as dynamic programming, see, for example <ref name="leland1985option">H.E. Leland, ''Option pricing and replication with transactions  costs'', The Journal of Finance, 40 (1985), pp.1283--1301.</ref><ref name="figlewski1989options">S.FIGLEWSKI, ''Options arbitrage in imperfect markets'', The Journal  of Finance, 44 (1989), pp.1289--1311.</ref><ref name="henrotte1993transaction">P.HENROTTE, ''Transaction costs and duplication strategies'', Graduate  School of Business, Stanford University,  (1993).</ref>. We also refer to <ref name="giurca2021">A.GIURCA and S.BOROVKOVA, ''Delta hedging of derivatives using deep  reinforcement learning'', Available at SSRN 3847272,  (2021).</ref> for the details about delta hedging under different model assumptions.  
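The tradeoff between hedging error and transaction costs can be seen in a simple simulation of discrete delta hedging of a short call under proportional costs. The setup is a sketch rather than any specific paper's experiment, and all parameters are illustrative.

<syntaxhighlight lang="python">
import numpy as np
from math import log, sqrt, exp, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def call_delta(s, K, r, sigma, tau):
    d1 = (log(s / K) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    return norm_cdf(d1)

def hedge_once(n_rebalance, kappa, rng, S0=100.0, K=100.0, r=0.02, sigma=0.2, T=0.25):
    """Delta-hedge a short call at n_rebalance equally spaced dates with
    proportional transaction cost rate kappa; return the terminal hedging P&L."""
    dt = T / n_rebalance
    d1 = (log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    price0 = S0 * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d1 - sigma * sqrt(T))
    S, delta = S0, call_delta(S0, K, r, sigma, T)
    cash = price0 - delta * S - kappa * abs(delta) * S      # receive premium, buy initial hedge
    for i in range(1, n_rebalance):
        S *= exp((r - 0.5 * sigma**2) * dt + sigma * sqrt(dt) * rng.standard_normal())
        cash *= exp(r * dt)
        new_delta = call_delta(S, K, r, sigma, T - i * dt)
        trade = new_delta - delta
        cash -= trade * S + kappa * abs(trade) * S           # rebalance and pay proportional costs
        delta = new_delta
    S *= exp((r - 0.5 * sigma**2) * dt + sigma * sqrt(dt) * rng.standard_normal())
    cash *= exp(r * dt)
    return cash + delta * S - max(S - K, 0.0)                # terminal P&L of the hedged position

rng = np.random.default_rng(3)
for n in (13, 63):                                           # roughly weekly vs daily rebalancing over 3 months
    pnl = np.array([hedge_once(n, kappa=0.001, rng=rng) for _ in range(2000)])
    print(f"{n:3d} rebalances: mean P&L {pnl.mean():+.3f}, std {pnl.std():.3f}")
</syntaxhighlight>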
However, the above discussion addresses option pricing and hedging in a model-based way since strong assumptions are made on the asset price dynamics and the form of the transaction costs. In practice some assumptions of the BSM model are not realistic since: 1) transaction costs due to commissions, market impact, and a non-zero bid-ask spread exist in the real market; 2) the volatility is not constant; 3) short term returns typically have a heavy tailed distribution <ref name="cont2001empirical">R.CONT, ''Empirical properties of asset returns: stylized facts and  statistical issues'', Quantitative Finance, 1 (2001), p.223.</ref><ref name="chakraborti2011econophysics">A.CHAKRABORTI, I.M. Toke, M.PATRIARCA, and F.ABERGEL, ''Econophysics review: I. empirical facts'', Quantitative Finance, 11 (2011),  pp.991--1012.</ref>. The resulting prices and hedges may therefore suffer from model mis-specification when the real asset dynamics are not exactly as assumed and the transaction costs are difficult to model. In what follows we focus on model-free RL approaches that can address some of these issues.


'''RL Approach.''' RL methods including (deep) <math>Q</math>-learning <ref name="halperin2020bs">I.HALPERIN, ''QLBS: Q-learner  in the Black-Scholes (-Merton) worlds'', The Journal of Derivatives, 28  (2020), pp.99--122.</ref><ref name="du2020option">J.DU, M.JIN, P.N. Kolm, G.RITTER, Y.WANG, and B.ZHANG, ''Deep  reinforcement learning for option replication and hedging'', The Journal of Financial Data Science, 2 (2020), pp.44--57.</ref><ref name="cao2021hedging">J.CAO, J.CHEN, J.HULL, and Z.POULOS, ''Deep hedging of derivatives  using reinforcement learning'', The Journal of Financial Data Science, (2021), pp.10--27.</ref><ref name="li2009">Y.LI, C.SZEPESVARI, and D.SCHUURMANS, ''Learning exercise policies  for American options'', in Artificial Intelligence and Statistics, PMLR,  2009, pp.352--359.</ref><ref name="halperin2019qlbs">I.HALPERIN, ''The QLBS Q-learner goes NuQlear: Fitted Q  iteration, inverse RL, and option portfolios'', Quantitative Finance, 19  (2019), pp.1543--1553.</ref>, PPO <ref name="du2020option"/>, and DDPG <ref name="cao2021hedging"/> have been applied to find hedging strategies and price financial derivatives. The state variables often include asset price, current positions, option strikes, and time remaining to expiry. The control variable is often set to be the change in holdings. Examples of reward functions are (risk-adjusted) expected wealth/return <ref name="halperin2020bs"/><ref name="halperin2019qlbs"/><ref name="du2020option"/><ref name="kolm2019">P.N. Kolm and G.RITTER, ''Dynamic replication and hedging: A  reinforcement learning approach'', The Journal of Financial Data Science, 1  (2019), pp.159--171.</ref> (as in the mean-variance portfolio optimization), option payoff <ref name="li2009"/>, and (risk-adjusted) hedging cost <ref name="cao2021hedging"/>. The benchmarks for pricing models are typically the BSM model <ref name="halperin2020bs"/><ref name="halperin2019qlbs"/> and the binomial option pricing model <ref name="dubrov2015">B.DUBROV, ''Monte Carlo simulation with machine learning for  pricing American options and convertible bonds'', Available at SSRN 2684523,  (2015).</ref> (introduced in <ref name="cox1979option">J.C. Cox, S.A. Ross, and M.RUBINSTEIN, ''Option pricing: A  simplified approach'', Journal of Financial Economics, 7 (1979),  pp.229--263.</ref>). The learned hedging strategies are typically compared to a discrete approximation to the delta hedging strategy for the BSM model  <ref name="cao2021hedging"/><ref name="du2020option"/><ref name="buehler2019hedging">H.BUEHLER, L.GONON, J.TEICHMANN, and B.WOOD, ''Deep hedging'',  Quantitative Finance, 19 (2019), pp.1271--1291.</ref><ref name="cannelli2020hedging">L.CANNELLI, G.NUTI, M.SALA, and O.SZEHR, ''Hedging using  reinforcement learning: Contextual $k$-armed bandit versus Q-learning'',  arXiv preprint arXiv:2007.01623,  (2020).</ref>, or the hedging strategy for the Heston model <ref name="buehler2019hedging"/>.  In contrast to the BSM model where the volatility is assumed to be constant, the Heston model assumes that the volatility of the underlying asset follows a particular stochastic process which leads to a semi-analytical solution. 
The performance measures for the hedging strategy used in RL papers include (expected) hedging cost/error/loss <ref name="cao2021hedging"/><ref name="kolm2019"/><ref name="buehler2019hedging"/><ref name="cannelli2020hedging"/>, PnL <ref name="du2020option"/><ref name="kolm2019"/><ref name="cannelli2020hedging"/>, and average payoff <ref name="li2009"/>. Some practical issues have been taken into account in RL models, including transaction costs <ref name="buehler2019hedging"/><ref name="du2020option"/><ref name="cao2021hedging"/><ref name="kolm2019"/> and position constraints such as round lotting <ref name="du2020option"/> (where a round lot is a standard number of securities to be traded, such as 100 shares) and limits on the trading size (for example buy or sell up to 100 shares) <ref name="kolm2019"/>.
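A minimal environment skeleton for the hedging MDPs described above takes the state to be (asset price, current holding, time to expiry), the action to be the change in holding, and a reward of the mean-variance type (step P&L penalized by its square). The specific penalty form, the cost model, and all parameters below are illustrative assumptions rather than any one paper's exact specification.

<syntaxhighlight lang="python">
import numpy as np

class HedgingEnv:
    """Toy episodic hedging environment for a short European call (illustrative)."""

    def __init__(self, S0=100.0, K=100.0, sigma=0.2, T=0.25, n_steps=63,
                 cost_rate=0.001, risk_aversion=0.1, seed=0):
        self.S0, self.K, self.sigma, self.T = S0, K, sigma, T
        self.n_steps, self.dt = n_steps, T / n_steps
        self.cost_rate, self.risk_aversion = cost_rate, risk_aversion
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.S, self.holding, self.step_idx = self.S0, 0.0, 0
        return np.array([self.S, self.holding, self.T])       # state: price, holding, time to expiry

    def step(self, d_holding):
        cost = self.cost_rate * abs(d_holding) * self.S        # proportional transaction cost
        self.holding += d_holding
        S_prev = self.S
        self.S *= np.exp(-0.5 * self.sigma**2 * self.dt
                         + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal())
        self.step_idx += 1
        tau = self.T - self.step_idx * self.dt
        pnl = self.holding * (self.S - S_prev) - cost           # step P&L of the hedge portfolio
        done = self.step_idx == self.n_steps
        if done:
            pnl -= max(self.S - self.K, 0.0)                    # pay the option payoff at expiry
        reward = pnl - self.risk_aversion * pnl**2              # mean-variance style risk adjustment
        return np.array([self.S, self.holding, tau]), reward, done

env = HedgingEnv()
state, done, total = env.reset(), False, 0.0
while not done:
    state, r, done = env.step(d_holding=0.0)                    # placeholder "do nothing" policy
    total += r
print("episode reward of the do-nothing policy:", round(total, 3))
</syntaxhighlight>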
For European options, <ref name="halperin2020bs"/> developed a discrete-time option pricing model called the QLBS (<math>Q</math>-Learner in Black-Scholes) model based on Fitted <math>Q</math>-learning (see [[guide:576dcdd2b6#subsec:deep_value_based |Section]]). This model learns both the option price and the hedging strategy in a similar spirit to the mean-variance portfolio optimization framework.  <ref name="halperin2019qlbs"/> extended the QLBS model in <ref name="halperin2020bs"/> by using Fitted <math>Q</math>-learning. They also investigated the model in a different setting where the agent infers the risk-aversion parameter in the reward function using observed states and actions. However, <ref name="halperin2020bs"/> and <ref name="halperin2019qlbs"/> did not consider transaction costs in their analysis. By contrast, <ref name="buehler2019hedging"/> used deep neural networks to approximate an optimal hedging strategy under market frictions, including transaction costs, and ''convex risk measures'' <ref name="kloppel2007dynamic">S.KLÖPPEL and M.SCHWEIZER, ''Dynamic indifference valuation via  convex risk measures'', Mathematical Finance, 17 (2007), pp.599--627.</ref> such as conditional value at risk. They showed that their method can accurately recover the optimal hedging strategy in the Heston model <ref name="Heston1993">S.HESTON, ''A closed-form solution for options with stochastic  volatility with applications to bond and currency options'', Review of  Financial Studies, 6 (1993), pp.327--343.</ref> without transaction costs and it can be used to numerically study the impact of proportional transaction costs on option prices. The learned hedging strategy is also shown to outperform delta hedging in the (daily recalibrated) BSM model in a simple setting, namely hedging an at-the-money European call option on the S&P 500 index. The method in <ref name="buehler2019hedging"/> was extended in <ref name="carbonneau2021">A.CARBONNEAU and F.GODIN, ''Equal risk pricing of derivatives with  deep hedging'', Quantitative Finance, 21 (2021), pp.593--608.</ref> to price and hedge a very large class of derivatives including
vanilla options and exotic options in more complex environments (e.g. stochastic volatility models and jump processes). <ref name="kolm2019"/> found the optimal hedging strategy by optimizing a simplified version of the mean-variance objective function \eqref{eqn:MV_obj}, subject to discrete trading, round lotting, and nonlinear transaction costs. They showed in their simulations that the learned hedging strategy achieves a much lower cost, with no significant difference in volatility of total PnL, compared to the delta hedging strategy. <ref name="du2020option"/> extended the framework in <ref name="kolm2019"/> and tested the performance of DQN and PPO for European options with different strikes. Their simulation results showed that these models are superior to delta hedging in general, and out of all models, PPO achieves the best performance in terms of PnL, training time, and the amount of data needed for training.  <ref name="cannelli2020hedging"/> formulated the optimal hedging problem as a Risk-averse Contextual <math>k</math>-Armed Bandit (R-CMAB) model (see the discussion of the contextual bandit problem in [[guide:C8c80a2ae8#sec:MDP2learning |Section]]) and proposed a deep CMAB algorithm involving Thompson Sampling <ref name="thompson1933likelihood">W.R. Thompson, ''On the likelihood that one unknown probability  exceeds another in view of the evidence of two samples'', Biometrika, 25  (1933), pp.285--294.</ref>. They showed that their algorithm outperforms DQN in terms of sample efficiency and hedging error when compared to delta hedging (in the setting of the BSM model). Their learned hedging strategy was also shown to converge to delta hedging in the absence of risk adjustment, discretization error and transaction costs. <ref name="cao2021hedging"/> considered <math>Q</math>-learning and DDPG for the problem of hedging a short position in a call option when there are transaction costs. The objective function is set to be a weighted sum of the expected hedging cost and the standard deviation of the hedging cost. They showed that their approach achieves a markedly lower expected hedging cost but with a slightly higher standard deviation of the hedging cost when compared to delta hedging. In their simulations, the stock price is assumed to follow either geometric Brownian motion or a stochastic volatility model.
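Writing <math>C</math> for the total hedging cost incurred over the life of the option, the last objective above (a weighted sum of the expected hedging cost and its standard deviation) can be expressed as minimizing

<math>
\mathbb{E}[C]+c\sqrt{\mathbb{E}[C^2]-\left(\mathbb{E}[C]\right)^2},
</math>
where the weight <math>c\geq 0</math> encodes the tradeoff between the expected cost and its variability.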


For American options, the key challenge is to find the optimal exercise strategy which determines when to exercise the option as this determines the price. <ref name="li2009"/> used the Least-Squares Policy Iteration (LSPI) algorithm <ref name="lagoudakis2003least">M.G. Lagoudakis and R.PARR, ''Least-squares policy iteration'', The  Journal of Machine Learning Research, 4 (2003), pp.1107--1149.</ref> and the Fitted <math>Q</math>-learning algorithm to learn the exercise policy for American options. In their experiments for American put options using both real and synthetic data, the two algorithms gain larger average payoffs than the benchmark Longstaff-Schwartz method <ref name="longstaff2001">F.A. Longstaff and E.S. Schwartz, ''Valuing American options by  simulation: A simple least-squares approach'', The Review of Financial  Studies, 14 (2001), pp.113--147.</ref>, which is the standard Least-Squares Monte Carlo algorithm. <ref name="li2009"/> also analyzed their approach from a theoretical perspective and derived a high probability, finite-time bound on their method. <ref name="dubrov2015"/> then extended the work in  <ref name="li2009"/> by combining random forests, a popular machine learning technique, with Monte Carlo simulation for pricing of both American options and ''convertible bonds'', which are corporate bonds that can be converted to the stock of the issuing company by the bond holder. They showed that the proposed algorithm provides more accurate prices than several other methods including LSPI, Fitted <math>Q</math>-learning, and the Longstaff-Schwartz method.
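For reference, the Longstaff-Schwartz least-squares Monte Carlo benchmark mentioned above can be sketched in a few lines. The geometric Brownian motion paths, the quadratic regression basis, and the numerical inputs are illustrative choices.

<syntaxhighlight lang="python">
import numpy as np

def american_put_lsmc(S0, K, r, sigma, T, n_steps=50, n_paths=100_000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    disc = np.exp(-r * dt)
    # simulate GBM paths under the risk-neutral measure
    Z = rng.standard_normal((n_paths, n_steps))
    S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Z, axis=1))
    S = np.hstack([np.full((n_paths, 1), S0), S])
    # cashflows if the option is held to maturity
    cashflow = np.maximum(K - S[:, -1], 0.0)
    # backward induction over early-exercise dates
    for t in range(n_steps - 1, 0, -1):
        cashflow *= disc                         # discount future cashflows back to time t
        itm = K - S[:, t] > 0
        if not itm.any():
            continue
        x = S[itm, t]
        # regress continuation values on a polynomial basis over in-the-money paths
        coeffs = np.polyfit(x, cashflow[itm], deg=2)
        continuation = np.polyval(coeffs, x)
        exercise_value = K - x
        exercise = exercise_value > continuation
        idx = np.where(itm)[0][exercise]
        cashflow[idx] = exercise_value[exercise]  # exercise early on these paths
    return disc * cashflow.mean()

print(american_put_lsmc(S0=36.0, K=40.0, r=0.06, sigma=0.2, T=1.0))
</syntaxhighlight>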
                            
                            
===Market Making===
A market maker provides liquidity for a given financial instrument by placing buy and sell limit orders in the LOB for that instrument while earning the bid-ask spread.


The objective in market making is different from problems in optimal execution (to target a position) or portfolio optimization (for long-term investing). Instead of profiting from identifying the  correct price movement direction, the objective of a market maker is to profit from earning the bid-ask spread without accumulating undesirably large  positions (known as inventory) <ref name="gueant2012optimal">O.GUÉANT, C.-A. Lehalle, and J.FERNANDEZ-Tapia, ''Optimal  portfolio liquidation with limit orders'', SIAM Journal on Financial  Mathematics, 3 (2012), pp.740--764.</ref>. A market maker faces three major sources of risk <ref name="guilbaud2013optimal">F.GUILBAUD and H.PHAM, ''Optimal high-frequency trading with limit  and market orders'', Quantitative Finance, 13 (2013), pp.79--94.</ref>. The inventory risk <ref name="avellaneda2008high">M.AVELLANEDA and S.STOIKOV, ''High-frequency trading in a limit  order book'', Quantitative Finance, 8 (2008), pp.217--224.</ref> refers to the risk of accumulating an undesirably large net inventory, which significantly increases volatility due to market movements. The execution risk <ref name="kuhn2010optimal">C.KÜHN and M.STROH, ''Optimal portfolios of a small investor in  a limit order market: A shadow price approach'', Mathematics and Financial  Economics, 3 (2010), pp.45--72.</ref> is the risk that limit orders may not get filled over a desired horizon. Finally, the adverse selection risk refers to the situation where there is a directional price movement that sweeps through the limit orders submitted by the market maker such that the price does not revert back by the end of the trading horizon. This may lead to a huge loss as the market maker in general needs to clear their inventory at the end of the horizon (typically the end of the day to avoid overnight inventory).


'''Stochastic Control Approach.''' Traditionally the theoretical study of market making strategies follows a stochastic control approach, where the LOB dynamics are modeled directly by some stochastic process and an optimal market making strategy that maximizes the market maker's expected utility can be obtained by solving the Hamilton--Jacobi--Bellman equation. See <ref name="avellaneda2008high"/>, <ref name="gueant2013dealing">O.GUÉANT, C.-A. Lehalle, and J.FERNANDEZ-Tapia, ''Dealing with the  inventory risk: A solution to the market making problem'', Mathematics and  Financial Economics, 7 (2013), pp.477--507.</ref> and <ref name="obizhaeva2013optimal">A.A. Obizhaeva and J.WANG, ''Optimal trading strategy and  supply/demand dynamics'', Journal of Financial Markets, 16 (2013), pp.1--32.</ref>, and 
<ref name="cartea2015algorithmic"/>{{rp|at=Chapter 10}} for examples.
We follow the framework in <ref name="avellaneda2008high"/> as an example to demonstrate the control formulation in continuous time and discuss its advantages and disadvantages.
Although this formulation admits a (semi) closed-form solution which leads to nice insights about the problem, it all builds on the full analytical specification of the market dynamics. In addition, there are very few utility functions (e.g. exponential (CARA), power (CRRA), and  quadratic) known in the literature that could possibly lead to closed-form solutions. The same issues arise in other work along this line (<ref name="gueant2013dealing"/> and <ref name="obizhaeva2013optimal"/>, and <ref name="cartea2015algorithmic"/>{{rp|at=Chapter 10}}) where strong model assumptions are made about the prices or about the LOB or both. This requirement of a full analytical specification means these papers are quite removed from realistic market making, as financial markets do not conform to any simple parametric model specification with fixed parameters. 


'''RL Approach.''' For market making problems with an RL approach, both value-based methods (such as the <math>Q</math>-learning algorithm  <ref name="abernethy2013adaptive">J.D. Abernethy and S.KALE, ''Adaptive market making via online  learning'', in NIPS, Citeseer, 2013, pp.2058--2066.</ref><ref name="spooner2018market">T.SPOONER, J.FEARNLEY, R.SAVANI, and A.KOUKORINIS, ''Market making  via reinforcement learning'', in International Foundation for Autonomous  Agents and Multiagent Systems, AAMAS '18, 2018, pp.434--442.</ref> and SARSA <ref name="spooner2018market"/>) and policy-based methods (such as the deep policy gradient method <ref name="zhao2021high">M.ZHAO and V.LINETSKY, ''High frequency automated market making  algorithms with adverse selection risk control via reinforcement learning'',  in Proceedings of the Second ACM International Conference on AI in Finance,  2021, pp.1--9.</ref>) have been used. The state variables are often composed of bid and ask prices, current holdings of assets, order-flow imbalance, volatility, and some sophisticated market indices. The control variables are typically set to be the spread to post a pair of limit buy and limit sell orders. Examples of reward include PnL with inventory cost  <ref name="abernethy2013adaptive"/><ref name="spooner2018market"/><ref name="spooner2020robust">T.SPOONER and R.SAVANI, ''Robust Market Making via  Adversarial Reinforcement Learning'', in Proceedings of the 29th  International Joint Conference on Artificial Intelligence, IJCAI-20, 2020, pp.4590--4596.</ref><ref name="zhao2021high"/> or Implementation Shortfall with inventory cost <ref name="gavsperov2021market">B.GAŠPEROV and Z.KOSTANJČAR, ''Market making with  signals through deep reinforcement learning'', IEEE Access, 9 (2021),  pp.61611--61622.</ref>.
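The reward signals listed above can be illustrated with a toy loop in which the market maker quotes symmetric bid/ask prices around the mid-price and collects, at each step, the change in mark-to-market PnL minus a running inventory penalty. The exponential fill-probability model and all parameter values are illustrative modeling assumptions, not taken from the cited papers.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)

def simulate_market_maker(half_spread, phi=0.01, n_steps=1000,
                          mid0=100.0, vol=0.02, base_fill=0.9, fill_decay=5.0):
    """Toy market-making loop: quote mid +/- half_spread each step, get filled
    with a probability decreasing in the quoted depth, and accumulate
    reward_t = (change in cash + inventory value) - phi * inventory**2."""
    mid, cash, inventory, total_reward = mid0, 0.0, 0, 0.0
    for _ in range(n_steps):
        bid, ask = mid - half_spread, mid + half_spread
        fill_prob = min(1.0, base_fill * np.exp(-fill_decay * half_spread))
        wealth_before = cash + inventory * mid
        if rng.random() < fill_prob:        # our bid is hit: we buy one unit
            cash -= bid
            inventory += 1
        if rng.random() < fill_prob:        # our ask is lifted: we sell one unit
            cash += ask
            inventory -= 1
        mid += vol * rng.standard_normal()  # mid-price moves
        pnl_change = cash + inventory * mid - wealth_before
        total_reward += pnl_change - phi * inventory**2   # PnL with an inventory penalty
    return total_reward, inventory

for s in (0.05, 0.10, 0.20):
    reward, inv = simulate_market_maker(half_spread=s)
    print(f"half-spread {s:.2f}: cumulative reward {reward:10.2f}, final inventory {inv:+d}")
</syntaxhighlight>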


The first RL approach to market making was explored in <ref name="chan2001electronic">N.T. Chan and C.SHELTON, ''An electronic market-maker'', Technical  report, MIT,  (2001).</ref>, where the authors applied three RL algorithms and tested them in simulation environments: a Monte Carlo method, a SARSA method, and an actor-critic method. 
For all three methods, the state variables include inventory of the market-maker, order imbalance on the market, and market quality measures.  The actions are the changes in the bid/ask price to post the limit orders and the sizes of the limit buy/sell orders. The reward function is set as a linear combination of profit (to maximize), inventory risk (to minimize), and market qualities (to maximize). The authors found that SARSA and Monte Carlo methods work well in a simple simulation environment. The actor-critic method is more plausible in complex environments and generates stochastic policies that correctly adjust bid/ask prices with respect to order imbalance and effectively control the trade-off between the profit and the quoted spread (defined as the price difference to post limit buy orders and limit sell orders). Furthermore, the stochastic policies are shown to outperform deterministic policies in achieving a lower variance of the resulting spread. Later on, <ref name="abernethy2013adaptive"/> designed a group of “spread-based” market making strategies parametrized by a minimum quoted spread. The strategies bet on the mean-reverting behavior of the mid-price and utilize the opportunities when the mid-price deviates from the price during the previous period. Then an online algorithm is used to pick in each period a minimum quoted spread. The states of the market maker are the current inventory and price data. The actions are the quantities and at what prices to offer in the market. It is assumed that the market maker interacts with a continuous double auction via an order book. The market maker can place both market and limit orders and is able to make and cancel orders after every price fluctuation. The authors provided structural properties of these strategies which allows them to obtain low regret relative to the best such strategy in hindsight which maximizes the realized rewards.  
<ref name="spooner2018market"/> generalized the results in <ref name="chan2001electronic"/> and adopted several reinforcement learning algorithms (including <math>Q</math>-learning and SARSA) to improve the decisions of the market maker. In their framework, the action space contains ten actions: the first nine correspond to a pair of buy and sell limit orders posted at different relative price levels, and the final action allows the agent to clear its inventory with a market order.
<ref name="spooner2020robust"/> applied Adversarial Reinforcement Learning (ARL) to a zero-sum game version of the control formulation  \eqref{eq:MM_dynamics}-\eqref{avalaneda-hjb}. The adversary acts as a proxy for other market participants that would like to profit at the
market maker's expense.  
In addition to designing learning algorithms for the market maker in an unknown financial environment, RL algorithms can also be used to solve high-dimensional control problems for the market maker, or to solve the control problem in the presence of a time-dependent rebate in the full information setting. In particular, <ref name="gueant2019deep">O.GUÉANT and I.MANZIUK, ''Deep reinforcement learning for market  making in corporate bonds: beating the curse of dimensionality'', Applied  Mathematical Finance, 26 (2019), pp.387--452.</ref> focused on a setting where a market maker needs to decide the optimal bid and ask quotes for a given universe of bonds in an OTC market. This problem is high-dimensional and classical numerical methods, such as finite difference schemes, are inapplicable. The authors proposed a
model-based Actor-Critic-like algorithm involving a deep neural network to numerically solve the problem. Similar ideas have been applied to market making problems in dark pools <ref name="baldacci2019market">B.BALDACCI, I.MANZIUK, T.MASTROLIA, and M.ROSENBAUM, ''Market  making and incentives design in the presence of a dark pool: A deep  reinforcement learning approach'', arXiv preprint arXiv:1912.01129,  (2019).</ref> (see the discussion on dark pools in [[#sec:SOR |Section]]). With the presence of a time-dependent rebate, there is no closed-form solution for the associated stochastic control problem of the market maker. Instead, <ref name="zhang2020reinforcement">G.ZHANG and Y.CHEN, ''Reinforcement learning for optimal market  making with the presence of rebate'', Available at SSRN 3646753,  (2020).</ref> proposed a Hamiltonian-guided value function approximation algorithm to solve for the numerical solutions under this scenario.
Multi-agent RL algorithms are also used to improve the strategy for market making with a particular focus on the impact of competition from other market makers or the interaction with other types of market participants. See <ref name="ganesh2019reinforcement">S.GANESH, N.VADORI, M.XU, H.ZHENG, P.REDDY, and M.VELOSO, ''Reinforcement learning for market making in a multi-agent dealer market'',  arXiv preprint arXiv:1911.05892,  (2019).</ref> and <ref name="patel2018optimizing">Y.PATEL, ''Optimizing market making using multi-agent reinforcement  learning'', arXiv preprint arXiv:1812.10252,  (2018).</ref>.


===<span id="sec:robo-advising"></span>Robo-advising===
===<span id="sec:robo-advising"></span>Robo-advising===
Robo-advisors, or automated investment managers, are a class of financial advisers that provide online financial advice or investment management with minimal human intervention. They provide digital financial advice based on mathematical rules or algorithms which can easily take into account different sources of data such as news, social media information, sentiment data and earnings reports. Robo-advisors have gained widespread popularity and emerged prominently as an alternative to traditional human advisers in recent years. The first robo-advisors were launched after the 2008 financial crisis when financial services institutions were facing the ensuing loss of trust from their clients. Examples of pioneering robo-advising firms include Betterment and Wealthfront. As of 2020, the value of assets under robo-management was highest in the United States, where it exceeded \$650 billion <ref name="capponi2021personalized">A.CAPPONI, S.OLAFSSON, and T.ZARIPHOPOULOU, ''Personalized  robo-advising: Enhancing investment through client interaction'', Management  Science,  (2021).</ref>.


The robo-advisor does not know the client's risk preference in advance but learns it while interacting with the client. The robo-advisor then improves its investment decisions as its estimate of the client's risk preference is refined.


'''RL Approach.''' There are only a few references on robo-advising with an RL approach since this is still a relatively new topic. We review each paper in detail. The first RL algorithm for a robo-advisor was proposed by
<ref name="alsabah2021robo"/> where the authors designed an exploration-exploitation algorithm to learn the investor's risk appetite over time by observing her portfolio choices in different market environments. The set of various market environments of interest is formulated as the state space <math>\mathcal{S}</math>. In each period, the robo-advisor places an investor's capital into one of several pre-constructed portfolios which can be viewed as the action space <math>\mathcal{A}</math>. Each portfolio decision reflects the robo-advisor's belief concerning the investor's true risk preference from a discrete set of possible risk aversion parameters <math>
<ref name="alsabah2021robo">H.ALSABAH, A.CAPPONI, O.RUIZLACEDELLI, and M.STERN, ''Robo-advising: Learning investors' risk preferences via portfolio choices'',  Journal of Financial Econometrics, 19 (2021), pp.369--392.</ref> where the authors designed an exploration-exploitation algorithm to learn the investor's risk appetite over time by observing her portfolio choices in different market environments. The set of various market environments of interest is formulated as the state space <math>\mathcal{S}</math>. In each period, the robo-advisor places an investor's capital into one of several pre-constructed portfolios which can be viewed as the action space <math>\mathcal{A}</math>. Each portfolio decision reflects the robo-advisor's belief concerning the investor's true risk preference from a discrete set of possible risk aversion parameters <math>
\Theta = \{\theta_i\}_{1\leq i \leq |\Theta|}</math>. The investor interacts with the robo-advisor by portfolio selection choices, and such interactions are used to update
the robo-advisor's estimate of the investor's risk profile. The authors proved that, with high probability, the proposed exploration-exploitation algorithm performs near-optimally after a number of time steps that depends polynomially on various model parameters.
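
A simplified sketch of how a robo-advisor might update its belief over a discrete set of candidate risk-aversion parameters from observed portfolio choices is given below. The softmax choice likelihood and the mean-variance utility are illustrative assumptions; this is not the exploration-exploitation algorithm of the cited paper.

<syntaxhighlight lang="python">
# Simplified belief update over candidate risk-aversion parameters, based on the
# investor's observed choice among a menu of pre-constructed portfolios.
# The softmax choice model and mean-variance utility are illustrative assumptions.
import numpy as np

def choice_probs(theta, portfolios, mu, Sigma):
    """Softmax over mean-variance utilities mu'w - theta * w' Sigma w."""
    utils = np.array([w @ mu - theta * w @ Sigma @ w for w in portfolios])
    p = np.exp(utils - utils.max())
    return p / p.sum()

def update_belief(belief, thetas, chosen_idx, portfolios, mu, Sigma):
    likes = np.array([choice_probs(th, portfolios, mu, Sigma)[chosen_idx] for th in thetas])
    posterior = belief * likes
    return posterior / posterior.sum()

thetas = np.array([0.5, 1.0, 2.0, 4.0])          # candidate risk-aversion levels
belief = np.ones(len(thetas)) / len(thetas)      # uniform prior
portfolios = [np.array([0.8, 0.2]), np.array([0.5, 0.5]), np.array([0.2, 0.8])]
mu = np.array([0.08, 0.03])
Sigma = np.array([[0.04, 0.0], [0.0, 0.01]])
belief = update_belief(belief, thetas, chosen_idx=2, portfolios=portfolios, mu=mu, Sigma=Sigma)
</syntaxhighlight>
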


<ref name="wang2021robo"/> proposed an investment robo-advising framework consisting of two agents. The first
<ref name="wang2021robo">H.WANG and S.YU, ''Robo-advising: Enhancing investment with inverse  optimization and deep reinforcement learning'', arXiv preprint  arXiv:2105.09264,  (2021).</ref> proposed an investment robo-advising framework consisting of two agents. The first
agent, an inverse portfolio optimization agent, infers an investor's risk preference and expected return directly from historical allocation data using online inverse optimization. The
second agent, a deep RL agent, aggregates the inferred sequence of expected returns to formulate
a new multi-period mean-variance portfolio optimization problem that can be solved using a deep RL approach based on the DDPG method. The proposed investment pipeline was applied to real market data from
April 1, 2016 to February 1, 2021 and was shown to consistently outperform the S\&P 500 benchmark portfolio that represents the aggregate market optimal allocation.
As mentioned earlier in this subsection, learning the client's risk preference is challenging as the preference may depend on multiple factors and may change over time. <ref name="yu2020learning">S.YU, H.WANG, and C.DONG, ''Learning risk preferences from  investment portfolios using inverse optimization'', arXiv preprint  arXiv:2010.01687,  (2020).</ref> was dedicated to learning risk preferences from investment portfolios using an inverse optimization technique. In particular, the proposed inverse optimization approach can be used to measure time-varying risk preferences directly from market signals and portfolios. This approach is developed based on two methodologies: convex-optimization-based modern portfolio theory and learning the decision-making scheme through inverse optimization.
===<span id="sec:SOR"></span>Smart Order Routing===
===<span id="sec:SOR"></span>Smart Order Routing===
In order to execute a trade of a given asset, market participants may have the opportunity to split the trade and submit orders to different venues, including both lit pools and dark pools, where this asset is traded. This could potentially improve the overall execution price and quantity. Both the decision and hence the outcome are influenced by the characteristics of different venues as well as the structure of transaction fees and rebates across different venues.




'''Allocation Across Lit Pools.''' The SOR problem across multiple lit pools (or primary exchanges) was first studied in <ref name="cont2017optimal">R.CONT and A.KUKANOV, ''Optimal order placement in limit order  markets'', Quantitative Finance, 17 (2017), pp.21--39.</ref> where the authors formulated the SOR problem as a convex optimization problem.
Consider a trader who needs to buy <math>S</math> shares of a stock within a short time interval <math>[0, T]</math>. At time <math>0</math>, the trader may submit <math>K</math> limit orders with <math>L_k \ge 0</math> shares to exchanges <math>k = 1,\cdots,K</math> (joining the queue of the best bid price level) and also market orders for <math>M \ge 0</math> shares.
At time <math>T</math> if the total executed quantity is less than <math>S</math> the trader also submits a market order to execute the remaining amount. The trader's order placement decision is thus summarized by a
Finally, the cost function is defined as the sum of all three pieces <math>V(X,\xi): = V_{\rm execution}(X, \xi)+ V_{\rm penalty}(X, \xi) + V_{\rm impact}(X,\xi)</math>. <ref name="cont2017optimal"/>{{rp|at=Proposition 4}} provides optimality conditions for an order allocation <math>X^* = (M^*,L_1^*,\cdots,L_K^*)</math>. In particular, (semi)-explicit model conditions are given for when <math>L_k^* > 0</math> (<math>M^* > 0</math>), i.e., when submitting limit orders to venue <math>k</math> (when submitting market orders) is optimal.


In contrast to the single-period model introduced above, <ref name="baldacci2020adaptive">B.BALDACCI and I.MANZIUK, ''Adaptive trading strategies across  liquidity pools'', arXiv preprint arXiv:2008.07807,  (2020).</ref> formulated the SOR problem as an order allocation problem across multiple lit pools over multiple trading periods. Each venue is characterized by a bid-ask spread process and an imbalance process. The dependencies between the imbalance and spread at the venues are considered through a covariance matrix. A Bayesian learning framework for learning and updating the model parameters is proposed to take into account possibly changing market conditions. Extensions to include short/long trading signals, market impact or hidden liquidity are also discussed.




In a dark pool, the trader can only observe how many of the shares allocated to a given venue were actually executed in that
pool (i.e., <math>\min(s_t^i,v_t^i)</math>), but not how many would have been traded if
more were allocated (i.e., <math>s_t^i</math>).
Based on this property of censored feedback, <ref name="ganchev2010censored">K.GANCHEV, Y.NEVMYVAKA, M.KEARNS, and J.W. Vaughan, ''Censored  exploration and the dark pool problem'', Communications of the ACM, 53 (2010),  pp.99--107.</ref> formulated the allocation problem across dark pools as an online learning problem under the assumption that <math>s_t^i</math> is an i.i.d. sample from some unknown distribution <math>P_i</math> and the total allocation quantity <math>V_t</math> is an i.i.d. sample from an unknown distribution <math>Q</math> with <math>V_t</math> upper bounded by a constant <math>V > 0</math> almost surely. At each iteration <math>t</math>, the learner allocates the orders greedily according to the estimate <math>\widehat{P}^{(t-1)}_i</math> of the distribution <math>P_i</math> for all dark pools <math>i=1,2,\cdots, K</math> derived from the previous iteration <math>t-1</math>. Then the learner can update the estimate <math>\widehat{P}^{(t)}_i</math> of the distribution <math>P_i</math> with a modified version of the Kaplan-Meier estimator (a non-parametric statistic used to estimate the cumulative probability) using the new censored observation <math>\min(s_t^i,v_t^i)</math> from iteration <math>t</math>. The authors then proved that for any <math>\varepsilon  >  0</math> and
<math> \delta >  0</math>, with probability <math>1-\delta</math> (over the randomness
of draws from <math>Q</math> and <math>\{P_i\}_{i=1}^K</math>), after running for a time that is polynomial in the model parameters, the algorithm makes an <math>\varepsilon</math>-optimal allocation at each subsequent time
step with probability at least <math>1-\varepsilon</math>.
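
The greedy allocation step described above can be sketched as follows. The tail-probability estimates are taken as given inputs here; in the cited algorithm they come from the censored (Kaplan-Meier type) estimator, which is omitted in this simplified sketch.

<syntaxhighlight lang="python">
# Simplified sketch of greedy allocation across K dark pools using estimated tail
# probabilities of available liquidity. Each marginal share is sent to the venue with
# the highest estimated probability of filling that additional share.
import numpy as np

def greedy_allocation(V, tail_prob):
    """tail_prob[i](q) ~ estimated P(s^i >= q); allocate V shares one at a time."""
    K = len(tail_prob)
    alloc = np.zeros(K, dtype=int)
    for _ in range(V):
        marginal = [tail_prob[i](alloc[i] + 1) for i in range(K)]
        alloc[int(np.argmax(marginal))] += 1
    return alloc

# Example with two venues: venue 0 usually has at most 3 shares, venue 1 up to 10
tails = [lambda q: max(0.0, 1 - q / 4), lambda q: max(0.0, 1 - q / 11)]
print(greedy_allocation(8, tails))
</syntaxhighlight>
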


The setup of <ref name="ganchev2010censored"/> was generalized in <ref name="agarwal2010optimal">A.AGARWAL, P.BARTLETT, and M.DAMA, ''Optimal allocation strategies  for the dark pool problem'', in Proceedings of the Thirteenth International  Conference on Artificial Intelligence and Statistics, JMLR Workshop and  Conference Proceedings, 2010, pp.9--16.</ref> where the authors assumed that the sequences of volumes <math>V_t</math> and available liquidities <math>\{s_t^i\}_{i=1}^K</math> are chosen by an adversary who knows the previous allocations of their algorithm.
An exponentiated gradient style algorithm was proposed and shown to enjoy an optimal regret guarantee <math>\mathcal{O}(V\sqrt{T \ln K})</math> against the best allocation strategy in hindsight.
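
A generic exponentiated-gradient (multiplicative-weights) update of this kind is sketched below; the learning rate and the use of per-venue fill rates as the gain are illustrative assumptions rather than the exact quantities analysed in the cited paper.

<syntaxhighlight lang="python">
# Generic exponentiated-gradient update for splitting an order of size V_t across K venues.
# Venues that filled a larger fraction of what they were sent get their weight increased.
# The gain definition and learning rate eta are illustrative assumptions.
import numpy as np

def eg_update(weights, filled, allocated, eta=0.1):
    fill_rate = np.where(allocated > 0, filled / np.maximum(allocated, 1e-12), 0.0)
    new_w = weights * np.exp(eta * fill_rate)
    return new_w / new_w.sum()

weights = np.ones(3) / 3                  # start with a uniform split across 3 venues
allocated = 100 * weights                 # send V_t = 100 shares proportionally
filled = np.array([30.0, 10.0, 33.0])     # censored feedback: shares actually filled
weights = eg_update(weights, filled, allocated)
</syntaxhighlight>
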
==General references==

Revision as of 02:22, 13 May 2024

[math] \newcommand*{\rom}[1]{\expandafter\@slowromancap\romannumeral #1@} \newcommand{\vertiii}[1]{{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert #1 \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}} \DeclareMathOperator*{\dprime}{\prime \prime} \DeclareMathOperator{\Tr}{Tr} \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\N}{\mathbb{N}} \DeclareMathOperator{\R}{\mathbb{R}} \DeclareMathOperator{\Sc}{\mathcal{S}} \DeclareMathOperator{\Ac}{\mathcal{A}} \DeclareMathOperator{\Pc}{\mathcal{P}} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\sx}{\underline{\sigma}_{\pmb{X}}} \DeclareMathOperator{\sqmin}{\underline{\sigma}_{\pmb{Q}}} \DeclareMathOperator{\sqmax}{\overline{\sigma}_{\pmb{Q}}} \DeclareMathOperator{\sqi}{\underline{\sigma}_{Q,\textit{i}}} \DeclareMathOperator{\sqnoti}{\underline{\sigma}_{\pmb{Q},-\textit{i}}} \DeclareMathOperator{\sqfir}{\underline{\sigma}_{\pmb{Q},1}} \DeclareMathOperator{\sqsec}{\underline{\sigma}_{\pmb{Q},2}} \DeclareMathOperator{\sru}{\underline{\sigma}_{\pmb{R}}^{u}} \DeclareMathOperator{\srv}{\underline{\sigma}_{\pmb{R}}^v} \DeclareMathOperator{\sri}{\underline{\sigma}_{R,\textit{i}}} \DeclareMathOperator{\srnoti}{\underline{\sigma}_{\pmb{R},\textit{-i}}} \DeclareMathOperator{\srfir}{\underline{\sigma}_{\pmb{R},1}} \DeclareMathOperator{\srsec}{\underline{\sigma}_{\pmb{R},2}} \DeclareMathOperator{\srmin}{\underline{\sigma}_{\pmb{R}}} \DeclareMathOperator{\srmax}{\overline{\sigma}_{\pmb{R}}} \DeclareMathOperator{\HH}{\mathcal{H}} \DeclareMathOperator{\HE}{\mathcal{H}(1/\varepsilon)} \DeclareMathOperator{\HD}{\mathcal{H}(1/\varepsilon)} \DeclareMathOperator{\HCKI}{\mathcal{H}(C(\pmb{K}^0))} \DeclareMathOperator{\HECK}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HECKI}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))} \DeclareMathOperator{\HC}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HCK}{\mathcal{H}(C(\pmb{K}))} \DeclareMathOperator{\HCKR}{\mathcal{H}(1/\varepsilon,1/{\it{r}},C(\pmb{K}))} \DeclareMathOperator{\HCKR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,1/{\it{r}},C(\pmb{K}^0))} \DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))} \newcommand{\mathds}{\mathbb}[/math]

The availability of data from electronic markets and the recent developments in RL together have led to a rapidly growing recent body of work applying RL algorithms to decision-making problems in electronic markets. Examples include optimal execution, portfolio optimization, option pricing and hedging, market making, order routing, as well as robot advising. In this section, we start with a brief overview of electronic markets and some discussion of market microstructure in Section. We then introduce several applications of RL in finance. In particular, optimal execution for a single asset is introduced in Section and portfolio optimization problems across multiple assets is discussed in Section. This is followed by sections on option pricing, robo-advising, and smart order routing. In each we introduce the underlying problem and basic model before looking at recent RL approaches used in tackling them. It is worth noting that there are some open source projects that provide full pipelines for implementing different RL algorithms in financial applications [1][2].

Preliminary: Electronic Markets and Market Microstructure

Many recent decision-making problems in finance are centered around electronic markets. We give a brief overview of this type of market and discuss two popular examples -- central limit order books and electronic over-the-counter markets. For more in-depth discussion of electronic markets and market microstructure, we refer the reader to the books [3] and [4].

Electronic Markets. Electronic markets have emerged as popular venues for the trading of a wide variety of financial assets. Stock exchanges in many countries, including Canada, Germany, Israel, and the United Kingdom, have adopted electronic platforms to trade equities, as has Euronext, the market combining several former European stock exchanges. In the United States, electronic communications networks (ECNs) such as Island, Instinet, and Archipelago (now ArcaEx) use an electronic order book structure to trade as much as 45\% of the volume in NASDAQ stocks, and electronic platforms also handle much of the trading of currencies. Eurex, the electronic Swiss-German exchange, is now the world's largest futures market, while options have been traded in electronic markets since the opening of the International Securities Exchange in 2000. Many such electronic markets are organized as electronic Limit Order Books (LOBs). In this structure, there is no designated liquidity provider such as a specialist or a dealer. Instead, liquidity arises endogenously from the submitted orders of traders. Traders who submit orders to buy or sell the asset at a particular price are said to “make” liquidity, while traders who choose to hit existing orders are said to “take” liquidity. Another major type of electronic market without a central LOB is the Over-the-Counter (OTC) market where the orders are executed directly between two parties, the dealer and the client, without the supervision of an exchange. Examples include BrokerTec and MTS for trading corporate bonds and government bonds.

Limit Order Books. An LOB is a list of orders that a trading venue, for example the NASDAQ exchange, uses to record the interest of buyers and sellers in a particular financial asset or instrument. There are two types of orders the buyers (sellers) can submit: a limit buy (sell) order with a preferred price for a given volume of the asset or a market buy (sell) order with a given volume which will be immediately executed with the best available limit sell (buy) orders. Therefore limit orders have a price guarantee but are not guaranteed to be executed, whereas market orders are executed immediately at the best available price. The lowest price of all sell limit orders is called the ask price, and the highest price of all buy limit orders is called the bid price. The difference between the ask and bid is known as the spread and the average of the ask and bid is known as the mid-price. A snapshot of an LOB[a] with 10 levels of bid/ask prices is shown in Figure.

A snapshot of the LOB of MSFT (Microsoft) stock at 9:30:08.026 am on 21 June 2012, with ten levels of bid/ask prices.

A matching engine is used to match the incoming buy and sell orders. This typically follows the price-time priority rule [5], whereby orders are first ranked according to their price. Multiple orders having the same price are then ranked according to the time they were entered. If the price and time are the same for the incoming orders, then the larger order gets executed first. The matching engine uses the LOB to store pending orders that could not be executed upon arrival.
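
A minimal sketch of price-time priority matching for an incoming market buy order is given below; the order representation is an illustrative assumption, and the tie-break by order size at equal price and time is omitted.

<syntaxhighlight lang="python">
# Minimal sketch of price-time priority matching of an incoming market buy order
# against resting sell limit orders: best price first, then earliest arrival.
def match_market_buy(quantity, asks):
    """asks: list of dicts with 'price', 'time', 'size'; returns the executed trades."""
    trades = []
    for order in sorted(asks, key=lambda o: (o["price"], o["time"])):
        if quantity == 0:
            break
        fill = min(quantity, order["size"])
        trades.append((order["price"], fill))
        order["size"] -= fill
        quantity -= fill
    return trades

book = [{"price": 100.1, "time": 2, "size": 50},
        {"price": 100.0, "time": 5, "size": 30},
        {"price": 100.0, "time": 1, "size": 40}]
print(match_market_buy(60, book))   # fills 40 @ 100.0 (earlier order), then 20 @ 100.0
</syntaxhighlight>
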


Over-the-counter Markets. OTC or off-exchange trading is done directly between two parties, without the supervision of an exchange. Many OTC markets organized around dealers, including corporate bond markets in many countries, have also undergone upheaval linked to electronification over the last ten years. The electronification process is dominated by Multi-dealer-to-client (MD2C) platforms enabling clients to send the same request for a quote (RFQ) to several dealers simultaneously and therefore instantly put the dealers into competition with one another. These dealers can then each provide a price to the client for the transaction (not necessarily the price the dealer has streamed). Dealers know the identity of the client (which differs from most of the systems organized around a central LOB) and the number of requested dealer prices. However, they do not see the prices that are streamed by the other dealers. They only see a composite price at the bid and offer, based on some of the best streamed prices. The client progressively receives the answers to the RFQ and can deal at any time with the dealer who has proposed the best price, or decide not to trade. Each dealer knows whether a deal was done (with her/him, but also with another dealer - without knowing the identity of this dealer) or not. If a transaction occurred, the best dealer usually knows the cover price (the second best bid price in the RFQ), if there is one. We refer the reader to [6] for a more in-depth discussion of MD2C bond trading platforms.

Market Participants. When considering different market participants it is sometimes helpful to classify them based on their objectives and trading strategies. Three primary classes are the following [3]:

  • Fundamental (or noise or liquidity) traders: those who are driven by economic fundamentals outside the exchange.
  • Informed traders: traders who profit from leveraging information not reflected in market prices by trading assets in anticipation of their appreciation or depreciation.
  • Market makers: professional traders who profit from facilitating exchange in a particular asset and exploit their skills in executing trades.

The effect of the interactions amongst all these traders is one of the key issues studied in the field of market microstructure. How to improve trading decisions from the perspective of one class of market participants, while strategically interacting with the others, is one of the big challenges in the field. The recent literature has seen many attempts to exploit RL techniques to tackle these problems.

Optimal Execution

Optimal execution is a fundamental problem in financial modeling. The simplest version is the case of a trader who wishes to buy or sell a given amount of a single asset within a given time period. The trader seeks strategies that maximize their return from, or alternatively, minimize the cost of, the execution of the transaction.

The Almgren--Chriss Model. A classical framework for optimal execution is due to Almgren--Chriss [7]. In this setup a trader is required to sell an amount [math]q_0[/math] of an asset, with price [math]S_0[/math] at time 0, over the time period [math][0,T][/math] with trading decisions made at discrete time points [math]t=1,\ldots,T[/math]. The final inventory [math]q_T[/math] is required to be zero. Therefore the goal is to determine the liquidation strategy [math]u_1,u_2,\ldots,u_{T}[/math], where [math]u_t[/math] ([math]t=1,2,\ldots,T[/math]) denotes the amount of the asset to sell at time [math]t[/math]. It is assumed that selling quantities of the asset will have two types of price impact -- a temporary impact which refers to any temporary price movement due to the supply-demand imbalance caused by the selling, and a permanent impact, which is a long-term effect on the ‘equilibrium’ price due to the trading, that will remain at least for the trading period. We write [math]S_t[/math] for the asset price at time [math]t[/math]. The Almgren--Chriss model assumes that the asset price evolves according to the discrete arithmetic random walk

[[math]] S_t = S_{t-1}+\sigma\xi_{t}-g(u_t),\qquad t=1,2,\ldots,T [[/math]]

where [math]\sigma[/math] is the (constant) volatility parameter, [math]\xi_t[/math] are independent random variables drawn from a distribution with zero mean and unit variance, and [math]g[/math] is a function of the trading strategy [math]u_t[/math] that measures the permanent impact. The inventory process [math]\{q_t\}_{t=0}^T[/math] records the current holding in the asset, [math]q_t[/math], at each time [math]t[/math]. Thus we have

[[math]] q_t = q_{t-1}-u_t. [[/math]]

Selling the asset may cause a temporary price impact, that is, a drop in the average price per share; thus the actual price per share received is

[[math]] \widetilde{S}_t = S_{t-1}-h(u_t), [[/math]]

where [math]h[/math] is a function which quantifies this temporary price impact. The cost of this trading trajectory is defined to be the difference between the initial book value and the revenue, given by [math]C = q_0S_0 - \sum_{t=1}^T u_t\widetilde{S}_t[/math] with mean and variance

[[math]] \mathbb{E}[C] = \sum_{t=1}^T \big(q_tg(u_t)+u_th(u_t)\big),\qquad {\rm Var}(C)=\sigma^2\sum_{t=1}^Tq_t^2. [[/math]]

A trader thus would like to minimize her expected cost as well as the variance of the cost, which gives the optimal execution problem in this model as

[[math]] \min_{\{u_t\}_{t=1}^{T}}\big(\mathbb{E}[C]+\lambda {\rm Var}(C)\big), [[/math]]

where [math]\lambda\in\mathbb{R}[/math] is a measure of risk-aversion. When we assume that both the price impacts are linear, that is

[[math]] g(u_t) = \gamma u_t, \qquad h(u_t) = \eta u_t, [[/math]]

where [math]\gamma[/math] and [math]\eta[/math] are the permanent and temporary price impact parameters, the authors of [7] derive the general solution for the Almgren--Chriss model. This is given by

[[math]] \begin{equation*} u_j = \frac{2\,{\rm sinh}\big(\frac{1}{2}\kappa \big)}{{\rm sinh} (\kappa T)}{\rm cosh}\left(\kappa\left(T-t_{j-\frac{1}{2}}\right)\right)q_0, \qquad j=1,2,\ldots,T, \end{equation*} [[/math]]

with

[[math]] \kappa = {\rm cosh}^{-1}\left(\frac{\tilde{\kappa}^2}{2}+1\right), \qquad \tilde{\kappa}^2 = \frac{\lambda\sigma^2}{\eta(1-\frac{\gamma}{2\eta})}. [[/math]]

The corresponding optimal inventory trajectory is

[[math]] \begin{equation}\label{eq:amgren-chriss} q_j = \frac{{\rm sinh}\big(\kappa (T-t_j)\big)}{{\rm sinh} (\kappa T)}q_0, \qquad j=0,1,\ldots,T. \end{equation} [[/math]]

The above Almgren--Chriss framework for liquidating a single asset can be extended to the case of multiple assets [7](Appendix A). We also note that the general solution to the Almgren-Chriss model had been constructed previously in [8](Chapter 16). This simple version of the Almgren--Chriss framework has a closed-form solution but it relies heavily on the assumptions of the dynamics and the linear form of the permanent and temporary price impact. The mis-specification of the dynamics and market impacts may lead to undesirable strategies and potential losses. In addition, the solution in \eqref{eq:amgren-chriss} is a pre-planned strategy that does not depend on real-time market conditions. Hence this strategy may miss certain opportunities when the market moves. This motivates the use of an RL approach which is more flexible and able to incorporate market conditions when making decisions.
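
For reference, the closed-form trajectory \eqref{eq:amgren-chriss} and the associated trades can be evaluated directly; the following is a minimal sketch with purely illustrative parameter values and unit time steps [math]t_j = j[/math].

<syntaxhighlight lang="python">
# Sketch of the closed-form Almgren-Chriss inventory trajectory with unit time steps.
# Parameter values are illustrative only.
import numpy as np

def almgren_chriss_inventory(q0, T, sigma, eta, gamma, lam):
    kappa_tilde_sq = lam * sigma**2 / (eta * (1 - gamma / (2 * eta)))
    kappa = np.arccosh(kappa_tilde_sq / 2 + 1)
    t = np.arange(T + 1)
    return np.sinh(kappa * (T - t)) / np.sinh(kappa * T) * q0

q = almgren_chriss_inventory(q0=1e6, T=10, sigma=0.3, eta=2e-6, gamma=2.5e-7, lam=1e-6)
trades = -np.diff(q)   # u_j = q_{j-1} - q_j, the number of shares sold at each step
</syntaxhighlight>
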

Evaluation Criteria and Benchmark Algorithms. Before discussing the RL approach, we introduce several widely-used criteria to evaluate the performance of execution algorithms in the literature such as the Profit and Loss (PnL), Implementation Shortfall, and the Sharpe ratio. The PnL is the final profit or loss induced by a given execution algorithm over the whole time period, which is made up of transactions at all time points. The Implementation Shortfall [9] for an execution algorithm is defined as the difference between the PnL of the algorithm and the PnL received by trading the entire amount of the asset instantly. The Sharpe ratio [10] is defined as the ratio of expected return to standard deviation of the return; thus it measures return per unit of risk. Two popular variants of the Sharpe ratio are the differential Sharpe ratio [11] and the Sortino ratio [12]. In addition, some classical pre-specified strategies are used as benchmarks to evaluate the performance of a given RL-based execution strategy. Popular choices include execution strategies based on Time-Weighted Average Price (TWAP) and Volume-Weighted Average Price (VWAP) as well as the Submit and Leave (SnL) policy where a trader places a sell order for all shares at a fixed limit order price, and goes to the market with any unexecuted shares remaining at time [math]T[/math].
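
The evaluation quantities above can be computed as in the following sketch for a sell schedule; the sign convention for the Implementation Shortfall and the choice of per-period returns for the Sharpe ratio are illustrative assumptions.

<syntaxhighlight lang="python">
# Sketch of the evaluation quantities above for a sell schedule: PnL, Implementation
# Shortfall relative to instant execution at the arrival price, the Sharpe ratio of
# per-period returns, and a TWAP schedule. Inputs are illustrative.
import numpy as np

def pnl(exec_prices, exec_sizes):
    return float(np.sum(np.asarray(exec_prices) * np.asarray(exec_sizes)))

def implementation_shortfall(exec_prices, exec_sizes, arrival_price):
    """Difference between the schedule's PnL and the PnL of trading everything
    instantly at the arrival price (sign conventions vary across papers)."""
    total_shares = float(np.sum(exec_sizes))
    return pnl(exec_prices, exec_sizes) - arrival_price * total_shares

def sharpe_ratio(returns):
    returns = np.asarray(returns, dtype=float)
    return returns.mean() / returns.std()

def twap_schedule(total_shares, n_periods):
    return np.full(n_periods, total_shares / n_periods)   # equal slices over time
</syntaxhighlight>
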

RL Approach. We first provide a brief overview of the existing literature on RL for optimal execution. The most popular types of RL methods that have been used in optimal execution problems are [math]Q[/math]-learning algorithms and (double) DQN [13][14][15][16][17][18][19][20]. Policy-based algorithms are also popular in this field, including (deep) policy gradient methods [21][15], A2C [15], PPO [17][22], and DDPG [23]. The benchmark strategies studied in these papers include the Almgren--Chriss solution [13][21], the TWAP strategy [14][17][22], the VWAP strategy [22], and the SnL policy [19][23]. In some models the trader is allowed to buy or sell the asset at each time point [16][15][24][25], whereas there are also many models where only one trading direction is allowed [19][13][21][14][17][18][23][22]. The state variables are often composed of time stamp, the market attributes including (mid-)price of the asset and/or the spread, the inventory process and past returns. The control variables are typically set to be the amount of asset (using market orders) to trade and/or the relative price level (using limit orders) at each time point. Examples of reward functions include cash inflow or outflow (depending on whether we sell or buy) [18][19], implementation shortfall [13], profit [25], Sharpe ratio [25], return [16], and PnL [24]. Popular choices of performance measure include Implementation Shortfall [13][21], PnL (with a penalty term of transaction cost) [14][24][25], trading cost [19][18], profit [25][16], Sharpe ratio [25][15], Sortino ratio [15], and return [15]. We now discuss some more details of the RL algorithms and experimental settings in the above papers.

For value-based algorithms, [19] provided the first large-scale empirical analysis of an RL method applied to optimal execution problems. They focused on a modified [math]Q[/math]-learning algorithm to select price levels for limit order trading, which leads to significant improvements over simpler forms of optimization such as the SnL policies in terms of trading costs. [18] proposed a risk-averse RL algorithm for optimal liquidation, which can be viewed as a generalization of [19]. This algorithm achieves substantially lower trading costs over the period when the 2010 flash crash happened compared with the risk-neutral RL algorithm in [19]. [13] combined the Almgren--Chriss solution and the [math]Q[/math]-learning algorithm and showed that they are able to improve the Implementation Shortfall of the Almgren--Chriss solution by up to 10.3% on average, using LOB data which includes five price levels. [14] proposed a modified double DQN algorithm for optimal execution and showed that the algorithm outperforms the TWAP strategy on seven out of nine stocks using PnL as the performance measure. They added a single one-second time step [math]\Delta T[/math] at the end of the horizon to guarantee that all shares are liquidated over [math]T+\Delta T[/math]. [16] designed a trading system based on DQN which determines both the trading direction and the number of shares to trade. Their approach was shown to increase the total profits by at least four times in four different index stocks compared with a benchmark trading model which trades a fixed number of shares each time. They also used transfer learning to avoid overfitting, where some knowledge/information is reused when learning in a related or similar situation.
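
A generic tabular [math]Q[/math]-learning update for an execution agent, in the spirit of the value-based approaches above, is sketched below; the state discretization (time remaining, inventory remaining), action grid, and learning parameters are illustrative assumptions rather than the settings of any cited paper.

<syntaxhighlight lang="python">
# Generic tabular Q-learning for an execution agent: state = (time step, inventory in lots),
# action = index on a grid of sell quantities. Discretization and parameters are illustrative.
import numpy as np

T, INV0, n_actions = 10, 20, 5               # horizon, initial inventory (lots), action grid
Q = np.zeros((T + 1, INV0 + 1, n_actions))   # Q[t, inventory, action]

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=1.0, terminal=False):
    t, inv = s
    target = reward if terminal else reward + gamma * Q[s_next].max()
    Q[t, inv, a] += alpha * (target - Q[t, inv, a])
    return Q

def epsilon_greedy(Q, s, eps=0.1):
    t, inv = s
    return np.random.randint(n_actions) if np.random.rand() < eps else int(Q[t, inv].argmax())
</syntaxhighlight>
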

For policy-based algorithms, [25] combined deep learning with RL to determine whether to sell, hold, or buy at each time point. In the first step of their model, neural networks are used to summarize the market features and in the second step the RL part makes trading decisions. The proposed method was shown to outperform several other deep learning and deep RL models in terms of PnL, total profits, and Sharpe ratio. They suggested that, in practice, the Sharpe ratio is preferable to total profits as the reward function. [23] used the DDPG algorithm for optimal execution over a short time horizon (two minutes) and designed a network to extract features from the market data. Experiments on real LOB data with 10 price levels show that the proposed approach significantly outperforms the existing methods, including the SnL policy (as baseline), the [math]Q[/math]-learning algorithm, and the method in [13]. [22] proposed an adaptive framework based on PPO with neural networks including LSTM and fully-connected networks, and showed that the framework outperforms the baseline models including TWAP and VWAP, as well as several deep RL models on most of the 14 US equities. [21] applied the (vanilla) policy gradient method to the LOB data of five stocks in different sectors and showed that they improve the Implementation Shortfall of the Almgren--Chriss solution by around 20%. [26] used neural networks to learn the mapping between the risk-aversion parameter and the optimal control with potential market impacts incorporated.

For comparison between value-based and policy-based algorithms, [17] explored double DQN and PPO algorithms in different market environments -- when the benchmark TWAP is optimal, PPO is shown to converge to TWAP whereas double DQN may not; when TWAP is not optimal, both algorithms outperform this benchmark. [15] showed that DQN, policy gradient, and A2C outperform several baseline models including classical time-series momentum strategies, on test data of 50 liquid futures contracts. Both continuous and discrete action spaces were considered in their work. They observed that DQN achieves the best performance and the second best is the A2C approach.

In addition, model-based RL algorithms have also been used for optimal execution. [24] built a profitable electronic trading agent to place buy and sell orders using model-based RL, which outperforms two benchmark strategies in terms of PnL on LOB data. They used a recurrent neural network to learn the state transition probability. We note that multi-agent RL has also been used to address the optimal execution problem, see for example [27][28].

Portfolio Optimization

In portfolio optimization problems, a trader needs to select and trade the best portfolio of assets in order to maximize some objective function, which typically includes the expected return and some measure of the risk. The benefit of investing in such portfolios is that the diversification of investments achieves higher return per unit of risk than only investing in a single asset (see, e.g. [29]).

Mean-Variance Portfolio Optimization. The first significant mathematical model for portfolio optimization is the Markowitz model [30], also called the mean-variance model, where an investor seeks a portfolio to maximize the expected total return for any given level of risk measured by variance. This is a single-period optimization problem and is then generalized to multi-period portfolio optimization problems in [31][32][33][34][35][36]. In this mean-variance framework, the risk of a portfolio is quantified by the variance of the wealth and the optimal investment strategy is then sought to maximize the final wealth penalized by a variance term. The mean-variance framework is of particular interest because it not only captures both the portfolio return and the risk, but also suffers from the time-inconsistency problem [37][38][39]. That is, the optimal strategy selected at time [math]t[/math] is no longer optimal at time [math]s \gt t[/math] and the Bellman equation does not hold. A breakthrough was made in [36], where the authors were the first to derive the analytical solution to the discrete-time multi-period mean-variance problem. They applied an embedding approach, which transforms the mean-variance problem into an LQ problem where classical approaches can be used to find the solution. The same approach was then used to solve the continuous-time mean-variance problem [40]. In addition to the embedding scheme, other methods including the consistent planning approach [41][42] and the dynamically optimal strategy [43] have also been applied to solve the time-inconsistency problem arising in the mean-variance formulation of portfolio optimization. Here we introduce the multi-period mean-variance portfolio optimization problem as given in [36]. Suppose there are [math]n[/math] risky assets in the market and an investor enters the market at time 0 with initial wealth [math]x_0[/math]. The goal of the investor is to reallocate his wealth at each time point [math]t=0,1,\ldots,T[/math] among the [math]n[/math] assets to achieve the optimal trade-off between the return and the risk of the investment. The random rates of return of the assets at [math]t[/math] are denoted by [math]\pmb{e}_t =(e_t^1,\ldots,e_t^n)^\top[/math], where [math]e_t^i[/math] ([math]i=1,\ldots,n[/math]) is the rate of return of the [math]i[/math]-th asset at time [math]t[/math]. The vectors [math]\pmb{e}_t[/math], [math]t=0,1,\ldots,T-1[/math], are assumed to be statistically independent (this independence assumption can be relaxed, see, e.g. [39]) with known mean [math]\pmb{\mu}_t=(\mu_t^1,\ldots,\mu_t^n)^\top\in\mathbb{R}^{n}[/math] and known standard deviation [math]\sigma_{t}^i[/math] for [math]i=1,\ldots,n[/math] and [math]t=0,\ldots,T-1[/math]. The covariance matrix is denoted by [math]\pmb{\Sigma}_t\in\mathbb{R}^{n\times n}[/math], where [math][\pmb{\Sigma}_{t}]_{ii}=(\sigma_{t}^i)^2[/math] and [math][\pmb{\Sigma}_t]_{ij}=\rho_{t}^{ij}\sigma_{t}^i\sigma_{t}^j[/math] for [math]i,j=1,\ldots,n[/math] and [math]i\neq j[/math], where [math]\rho_{t}^{ij}[/math] is the correlation between assets [math]i[/math] and [math]j[/math] at time [math]t[/math]. We write [math]x_t[/math] for the wealth of the investor at time [math]t[/math] and [math]u_t^i[/math] ([math]i=1,\ldots,n-1[/math]) for the amount invested in the [math]i[/math]-th asset at time [math]t[/math]. Thus the amount invested in the [math]n[/math]-th asset is [math]x_t-\sum_{i=1}^{n-1} u_t^i[/math].
An investment strategy is denoted by [math]\pmb{u}_t= (u_t^1,u_t^2,\ldots,u_t^{n-1})^\top[/math] for [math]t=0,1,\ldots,T-1[/math], and the goal is to find an optimal strategy such that the portfolio return is maximized while minimizing the risk of the investment, that is,

[[math]] \begin{eqnarray}\label{eqn:MV_obj} \max_{\{\pmb{u}_t\}_{t=0}^{T-1}}\mathbb{E}[x_T] - \phi {\rm Var}(x_T), \end{eqnarray} [[/math]]

subject to

[[math]] \begin{equation}\label{eqn:MV_state} x_{t+1} = \sum_{i=1}^{n-1} e_t^i u_t^i + \left(x_t-\sum_{i=1}^{n-1} u_t^i\right)e_t^n,\qquad t=0,1,\ldots,T-1, \end{equation} [[/math]]

where [math]\phi[/math] is a weighting parameter balancing risk, represented by the variance of [math]x_T[/math], and return. As mentioned above, this framework is embedded into an LQ problem in [36], and the solution to the LQ problem gives the solution to the above problem \eqref{eqn:MV_obj}-\eqref{eqn:MV_state}. The analytical solution derived in [36] is of the following form

[[math]] \pmb{u}_t^* (x_t) = \alpha_t x_t + \beta_t, [[/math]]

where [math]\alpha_t[/math] and [math]\beta_t[/math] are explicit functions of [math]\pmb{\mu}_t[/math] and [math]\pmb{\Sigma}_t[/math], which are omitted here and can be found in [36]. The above framework has been extended in different ways, for example, the risk-free asset can also be involved in the portfolio, and one can maximize the cumulative form of \eqref{eqn:MV_obj} rather than only focusing on the final wealth [math]x_T[/math]. For more details about these variants and solutions, see [39]. In addition to the mean-variance framework, other major paradigms in portfolio optimization are the Kelly Criterion and Risk Parity. We refer to [44] for a review of these optimal control frameworks and popular model-free RL algorithms for portfolio optimization. Note that the classical stochastic control approach for portfolio optimization problems across multiple assets requires both a realistic representation of the temporal dynamics of individual assets, as well as an adequate representation of their co-movements. This is extremely difficult when the assets belong to different classes (for example, stocks, options, futures, interest rates and their derivatives). On the other hand, the model-free RL approach does not rely on the specification of the joint dynamics across assets.
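
The wealth dynamics \eqref{eqn:MV_state} under a linear feedback policy of the form [math]\pmb{u}_t (x_t) = \alpha_t x_t + \beta_t[/math] can be simulated as in the following sketch; the policy coefficients and return distribution used here are illustrative placeholders rather than the closed-form coefficients of [36].

<syntaxhighlight lang="python">
# Sketch of the wealth dynamics under a linear feedback policy u_t(x) = alpha_t * x + beta_t.
# The coefficients and the return distribution are illustrative placeholders.
import numpy as np

def simulate_wealth(x0, alphas, betas, returns):
    """returns[t] is the length-n vector e_t of gross asset returns at time t."""
    x = x0
    for t, e_t in enumerate(returns):
        u = alphas[t] * x + betas[t]                    # amounts in the first n-1 assets
        x = e_t[:-1] @ u + (x - u.sum()) * e_t[-1]      # remainder goes into asset n
    return x

rng = np.random.default_rng(0)
T, n = 4, 3
returns = [1.0 + 0.05 * rng.standard_normal(n) for _ in range(T)]
alphas = [np.full(n - 1, 0.2) for _ in range(T)]
betas = [np.zeros(n - 1) for _ in range(T)]
final_wealth = simulate_wealth(1.0, alphas, betas, returns)
</syntaxhighlight>
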

RL Approach. Both value-based methods such as [math]Q[/math]-learning [45][46], SARSA [46], and DQN [47], and policy-based algorithms such as DPG and DDPG [48][49][50][51][52][53] have been applied to solve portfolio optimization problems. The state variables are often composed of time, asset prices, asset past returns, current holdings of assets, and remaining balance. The control variables are typically set to be the amount/proportion of wealth invested in each component of the portfolio. Examples of reward signals include portfolio return [49][46][50], (differential) Sharpe ratio [45][46], and profit [45]. The benchmark strategies include Constantly Rebalanced Portfolio (CRP) [50][49] where at each period the portfolio is rebalanced to the initial wealth distribution among the assets, and the buy-and-hold or do-nothing strategy [47][52] which does not take any action but rather holds the initial portfolio until the end. The performance measures studied in these papers include the Sharpe ratio [50][54][48][49][51][47][55], the Sortino ratio [50], portfolio returns [55][48][51][47][50][52], portfolio values [49][48][46], and cumulative profits [45]. Some models incorporate the transaction costs [49][51][45][47][50][52] and investments in the risk-free asset [50][54][55][45][49].
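
An illustrative environment step for portfolio RL, with portfolio weights as the action and the period log-return net of proportional transaction costs as the reward, is sketched below; the cost rate and reward definition are assumptions and differ across the papers cited above.

<syntaxhighlight lang="python">
# Illustrative environment step for portfolio RL: the action is a vector of portfolio
# weights, the reward is the period log-return net of proportional transaction costs.
# The cost rate and reward definition are assumptions for illustration.
import numpy as np

def portfolio_step(wealth, prev_weights, new_weights, asset_returns, cost_rate=0.001):
    turnover = np.abs(new_weights - prev_weights).sum()
    cost = cost_rate * turnover * wealth
    gross = wealth * (1.0 + new_weights @ asset_returns)
    new_wealth = gross - cost
    reward = np.log(new_wealth / wealth)               # log-return reward signal
    return new_wealth, reward

w0 = np.array([0.5, 0.3, 0.2])
w1 = np.array([0.4, 0.4, 0.2])
wealth, r = portfolio_step(1_000_000.0, w0, w1, np.array([0.01, -0.005, 0.002]))
</syntaxhighlight>
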

For value-based algorithms, [45] considered the portfolio optimization problems of a risky asset and a risk-free asset. They compared the performance of the [math]Q[/math]-learning algorithm and a Recurrent RL (RRL) algorithm under three different value functions including the Sharpe ratio, differential Sharpe ratio, and profit. The RRL algorithm is a policy-based method which uses the last action as an input. They concluded that the [math]Q[/math]-learning algorithm is more sensitive to the choice of value function and has less stable performance than the RRL algorithm. They also suggested that the (differential) Sharpe ratio is preferred rather than the profit as the reward function. [46] studied a two-asset personal retirement portfolio optimization problem, in which traders who manage retirement funds are restricted from making frequent trades and they can only access limited information. They tested the performance of three algorithms: SARSA and [math]Q[/math]-learning methods with discrete state and action space that maximize either the portfolio return or differential Sharpe ratio, and a TD learning method with discrete state space and continuous action space that maximizes the portfolio return. The TD method learns the portfolio returns by using a linear regression model and was shown to outperform the other two methods in terms of portfolio values. [47] proposed a portfolio trading strategy based on DQN which chooses to either hold, buy or sell a pre-specified quantity of the asset at each time point. In their experiments with two different three-asset portfolios, their trading strategy is superior to four benchmark strategies including the do-nothing strategy and a random strategy (take an action in the feasible space randomly) using performance measures including the cumulative return and the Sharpe ratio. [56] applies a G-learning-based algorithm, a probabilistic extension of [math]Q[/math]-learning which scales to high dimensional portfolios while providing a flexible choice of utility functions, for wealth management problems. In addition, the authors also extend the G-learning algorithm to the setting of Inverse Reinforcement Learning (IRL) where rewards collected by the agent are not observed, and should instead be inferred.

For policy-based algorithms, [49] proposed a framework combining neural networks with DPG. They used a so-called Ensemble of Identical Independent Evaluators (EIIE) topology to predict the potential growth of the assets in the immediate future using historical data which includes the highest, lowest, and closing prices of portfolio components. The experiments using real cryptocurrency market data showed that their framework achieves higher Sharpe ratio and cumulative portfolio values compared with three benchmarks including CRP and several published RL models. [48] explored the DDPG algorithm for the portfolio selection of 30 stocks, where at each time point, the agent can choose to buy, sell, or hold each stock. The DDPG algorithm was shown to outperform two classical strategies including the min-variance portfolio allocation method [57] in terms of several performance measures including final portfolio values, annualized return, and Sharpe ratio, using historical daily prices of the 30 stocks. [51] considered the DDPG, PPO, and policy gradient method with an adversarial learning scheme which learns the execution strategy using noisy market data. They applied the algorithms for portfolio optimization to five stocks using market data and the policy gradient method with adversarial learning scheme achieves higher daily return and Sharpe ratio than the benchmark CRP strategy. They also showed that the DDPG and PPO algorithms fail to find the optimal policy even in the training set. [50] proposed a model using DDPG, which includes prediction of the assets future price movements based on historical prices and synthetic market data generation using a Generative Adversarial Network (GAN) [58]. The model outperforms the benchmark CRP and the model considered in [49] in terms of several performance measures including Sharpe ratio and Sortino ratio. [53] embedded alpha portfolio strategies into a deep policy-based method and designed a framework which is easier to interpret. Using Actor-Critic methods, [52] combined the mean-variance framework (the actor determines the policy using the mean-variance framework) and the Kelly Criterion framework (the critic evaluates the policy using their growth rate). They studied eight policy-based algorithms including DPG, DDPG, and PPO, among which DPG was shown to achieve the best performance.

In addition to the above discrete-time models, [54] studied the entropy-regularized continuous-time mean-variance framework with one risky asset and one risk-free asset. They proved that the optimal policy is Gaussian with decaying variance and proposed an Exploratory Mean-Variance (EMV) algorithm, which consists of three procedures: policy evaluation, policy improvement, and a self-correcting scheme for learning the Lagrange multiplier. They showed that this EMV algorithm outperforms two benchmarks, including the analytical solution with estimated model parameters obtained from the classical maximum likelihood method [59](Section 9.3.2) and a DDPG algorithm, in terms of annualized sample mean and sample variance of the terminal wealth, and Sharpe ratio in their simulations. The continuous-time framework in [54] was then generalized in [55] to large scale portfolio selection setting with [math]d[/math] risky assets and one risk-free asset. The optimal policy is shown to be multivariate Gaussian with decaying variance. They tested the performance of the generalized EMV algorithm on price data from the stocks in the S\&P 500 with [math]d\geq 20[/math] and it outperforms several algorithms including DDPG.

Option Pricing and Hedging

Understanding how to price and hedge financial derivatives is a cornerstone of modern mathematical and computational finance due to its importance in the finance industry. A financial derivative is a contract that derives its value from the performance of an underlying entity. For example, a call or put option is a contract which gives the holder the right, but not the obligation, to buy or sell an underlying asset or instrument at a specified strike price prior to or on a specified date called the expiration date. Examples of option types include European options which can only be exercised at expiry, and American options which can be exercised at any time before the option expires.

The Black-Scholes Model. One of the most important mathematical models for option pricing is the Black-Scholes or Black-Scholes-Merton (BSM) model [60][61], in which we aim to find the price of a European option [math]V(S_t,t)[/math] ([math]0\leq t\leq T[/math]) with underlying stock price [math]S_t[/math], expiration time [math]T[/math], and payoff at expiry [math]P(S_T)[/math]. The underlying stock price [math]S_t[/math] is assumed to be non-dividend paying and to follow a geometric Brownian motion

[[math]] d S_t = \mu S_t dt+\sigma S_tdW_t, [[/math]]

where [math]\mu[/math] and [math]\sigma[/math] are called the drift and volatility parameters of the underlying asset, and [math]W = \{W_t\}_{0\leq t \leq T}[/math] is a standard Brownian motion defined on a filtered probability space [math](\Omega,\mathcal{F},\{\mathcal{F}_t\}_{0\leq t \leq T},\mathbb{P})[/math]. The key idea of the BSM is that European options can be perfectly replicated by a continuously rebalanced portfolio of the underlying asset and a riskless asset, under the assumption that there are no market frictions, and that trading can take place continuously and in arbitrarily small quantities. If the derivative can be replicated, by analysing the cost of the hedging strategy, the derivative's price must satisfy the following Black-Scholes partial differential equation

[[math]] \begin{equation}\label{eqn:BS_equation} \frac{\partial V}{\partial t}(s,t)+\frac{1}{2}\sigma^2s^2\frac{\partial^2 V}{\partial s^2}(s,t)+rs\,\frac{\partial V}{\partial s}(s,t) -rV(s,t)=0, \end{equation} [[/math]]

with terminal condition [math]V(s,T)=P(s)[/math] and where [math]r[/math] is the known (constant) risk-free interest rate. When we have a call option with payoff [math]P(s)=\max(s-K,0)[/math], giving the option buyer the right to buy the underlying asset at the strike price [math]K[/math], the solution to the Black-Scholes equation is given by

[[math]] \begin{equation*} V(s,t) = N(d_1) s - N(d_2)Ke^{-r(T-t)}, \end{equation*} [[/math]]

where [math]N(\cdot)[/math] is the standard normal cumulative distribution function, and [math]d_1[/math] and [math]d_2[/math] are given by

[[math]] d_1=\frac{1}{\sigma\sqrt{T-t}}\left(\ln\left(\frac{s}{K}\right)+\left(r+\frac{\sigma^2}{2}\right)(T-t)\right),\qquad d_2=d_1-\sigma\sqrt{T-t}. [[/math]]
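
For concreteness, the closed-form call price above can be computed in a few lines. The following sketch uses only the Python standard library (the normal CDF via the error function); the parameter values in the example are illustrative.
<syntaxhighlight lang="python">
from math import erf, exp, log, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function N(x)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call_price(s: float, K: float, r: float, sigma: float, tau: float) -> float:
    """Black-Scholes price of a European call with spot s, strike K, risk-free
    rate r, volatility sigma and time to expiry tau = T - t."""
    if tau <= 0.0:
        return max(s - K, 0.0)                    # at expiry the price equals the payoff
    d1 = (log(s / K) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return norm_cdf(d1) * s - norm_cdf(d2) * K * exp(-r * tau)

# Example: an at-the-money one-year call
print(bs_call_price(s=100.0, K=100.0, r=0.01, sigma=0.2, tau=1.0))
</syntaxhighlight>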

We refer to the survey [62] for details about extensions of the BSM model, other classical option pricing models, and numerical methods such as the Monte Carlo method. In a complete market one can hedge a given derivative contract by buying and selling the underlying asset in the right way to eliminate risk. In the Black-Scholes analysis, delta hedging is used, in which we hedge the risk of a call option by shorting [math]\Delta_t[/math] units of the underlying asset with [math]\Delta_t:=\frac{\partial V}{\partial S}(S_t,t)[/math] (the sensitivity of the option price with respect to the asset price). It is also possible to use financial derivatives to hedge against the volatility of given positions in the underlying assets. However, in practice, we can only rebalance portfolios at discrete time points and frequent transactions may incur high costs. Therefore an optimal hedging strategy depends on the tradeoff between the hedging error and the transaction costs. It is worth mentioning that this is in a similar spirit to the mean-variance portfolio optimization framework introduced in Section. Much effort has been made to include transaction costs using classical approaches such as dynamic programming; see, for example, [63][64][65]. We also refer to [66] for the details about delta hedging under different model assumptions. However, the above discussion addresses option pricing and hedging in a model-based way since there are strong assumptions made on the asset price dynamics and the form of transaction costs. In practice some assumptions of the BSM model are not realistic since: 1) transaction costs due to commissions, market impact, and non-zero bid-ask spread exist in the real market; 2) the volatility is not constant; 3) short term returns typically have a heavy tailed distribution [67][68]. Thus the resulting prices and hedges may suffer from model mis-specification when the real asset dynamics are not exactly as assumed, and the transaction costs are difficult to model. Hence we will focus on a model-free RL approach that can address some of these issues.
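
To illustrate the tradeoff between hedging error and transaction costs under discrete rebalancing, the following is a minimal simulation of delta hedging a short call under geometric Brownian motion with a proportional cost rate; the cost parameter kappa and all numerical values are illustrative assumptions rather than calibrated quantities.
<syntaxhighlight lang="python">
import numpy as np
from math import erf, exp, log, sqrt

def bs_call(s, K, r, sigma, tau):
    """Black-Scholes price and delta of a European call."""
    if tau <= 0.0:
        return max(s - K, 0.0), (1.0 if s > K else 0.0)
    d1 = (log(s / K) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    N = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    return N(d1) * s - N(d2) * K * exp(-r * tau), N(d1)

def hedge_pnl(n_steps, kappa, S0=100.0, K=100.0, mu=0.05, r=0.01,
              sigma=0.2, T=1.0, seed=0):
    """Terminal P&L of selling one call at the BS price and delta hedging it at
    n_steps equally spaced dates, paying a proportional cost kappa per unit of
    stock value traded."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    s = S0
    price0, delta = bs_call(s, K, r, sigma, T)
    cash = price0 - delta * s - kappa * delta * s          # sell the option, buy delta shares
    for i in range(1, n_steps + 1):
        s *= exp((mu - 0.5 * sigma**2) * dt + sigma * sqrt(dt) * rng.standard_normal())
        cash *= exp(r * dt)                                 # the cash account earns r
        if i < n_steps:                                     # rebalance at intermediate dates only
            _, new_delta = bs_call(s, K, r, sigma, T - i * dt)
            cash -= (new_delta - delta) * s + kappa * abs(new_delta - delta) * s
            delta = new_delta
    return cash + delta * s - max(s - K, 0.0)               # liquidate the hedge, pay the payoff

for n in (12, 52, 252):                                     # monthly, weekly, daily rebalancing
    pnl = [hedge_pnl(n, kappa=0.001, seed=k) for k in range(2000)]
    print(n, round(float(np.mean(pnl)), 4), round(float(np.std(pnl)), 4))
</syntaxhighlight>
More frequent rebalancing reduces the standard deviation of the hedging error but increases the total transaction cost, which is exactly the tradeoff discussed above.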

RL Approach. RL methods including (deep) [math]Q[/math]-learning [69][70][71][72][73], PPO [70], and DDPG [71] have been applied to find hedging strategies and price financial derivatives. The state variables often include asset price, current positions, option strikes, and time remaining to expiry. The control variable is often set to be the change in holdings. Examples of reward functions are (risk-adjusted) expected wealth/return [69][73][70][74] (as in the mean-variance portfolio optimization), option payoff [72], and (risk-adjusted) hedging cost [71]. The benchmarks for pricing models are typically the BSM model [69][73] and the binomial option pricing model [75] (introduced in [76]). The learned hedging strategies are typically compared to a discrete approximation to the delta hedging strategy for the BSM model [71][70][77][78], or the hedging strategy for the Heston model [77]. In contrast to the BSM model where the volatility is assumed to be constant, the Heston model assumes that the volatility of the underlying asset follows a particular stochastic process which leads to a semi-analytical solution. The performance measures for the hedging strategy used in RL papers include (expected) hedging cost/error/loss [71][74][77][78], PnL [70][74][78], and average payoff [72]. Some practical issues have been taken into account in RL models, including transaction costs [77][70][71][74] and position constraints such as round lotting [70] (where a round lot is a standard number of securities to be traded, such as 100 shares) and limits on the trading size (for example buy or sell up to 100 shares) [74]. For European options, [69] developed a discrete-time option pricing model called the QLBS ([math]Q[/math]-Learner in Black-Scholes) model based on Fitted [math]Q[/math]-learning (see Section). This model learns both the option price and the hedging strategy in a similar spirit to the mean-variance portfolio optimization framework. [73] extended the QLBS model in [69] by using Fitted [math]Q[/math]-learning. They also investigated the model in a different setting where the agent infers the risk-aversion parameter in the reward function using observed states and actions. However, [69] and [73] did not consider transaction costs in their analysis. By contrast, [77] used deep neural networks to approximate an optimal hedging strategy under market frictions, including transaction costs, and convex risk measures [79] such as conditional value at risk. They showed that their method can accurately recover the optimal hedging strategy in the Heston model [80] without transaction costs and it can be used to numerically study the impact of proportional transaction costs on option prices. The learned hedging strategy is also shown to outperform delta hedging in the (daily recalibrated) BSM model in a simple setting, namely hedging an at-the-money European call option on the S\&P 500 index. The method in [77] was extended in [81] to price and hedge a very large class of derivatives including vanilla options and exotic options in more complex environments (e.g. stochastic volatility models and jump processes). [74] found the optimal hedging strategy by optimizing a simplified version of the mean-variance objective function \eqref{eqn:MV_obj}, subject to discrete trading, round lotting, and nonlinear transaction costs. They showed in their simulations that the learned hedging strategy achieves a much lower cost, with no significant difference in volatility of total PnL, compared to the delta hedging strategy.
[70] extended the framework in [74] and tested the performance of DQN and PPO for European options with different strikes. Their simulation results showed that these models are superior to delta hedging in general, and out of all models, PPO achieves the best performance in terms of PnL, training time, and the amount of data needed for training. [78] formulated the optimal hedging problem as a Risk-averse Contextual [math]k[/math]-Armed Bandit (R-CMAB) model (see the discussion of the contextual bandit problem in Section) and proposed a deep CMAB algorithm involving Thompson Sampling [82]. They showed that their algorithm outperforms DQN in terms of sample efficiency and hedging error when compared to delta hedging (in the setting of the BSM model). Their learned hedging strategy was also shown to converge to delta hedging in the absence of risk adjustment, discretization error and transaction costs. [71] considered [math]Q[/math]-learning and DDPG for the problem of hedging a short position in a call option when there are transaction costs. The objective function is set to be a weighted sum of the expected hedging cost and the standard deviation of the hedging cost. They showed that their approach achieves a markedly lower expected hedging cost but with a slightly higher standard deviation of the hedging cost when compared to delta hedging. In their simulations, the stock price is assumed to follow either geometric Brownian motion or a stochastic volatility model.
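
The papers above differ in their precise choices of state, action, and reward. The following is one minimal, hypothetical formulation of such a hedging environment (state = (price, holding, time to expiry), action = the new holding, reward = the wealth increment net of proportional costs and a quadratic risk penalty). It is a sketch in the spirit of the formulations above, not a reproduction of any particular paper's environment, and all parameter values are illustrative.
<syntaxhighlight lang="python">
import numpy as np

class DeltaHedgingEnv:
    """Minimal episodic environment for hedging a short European call.

    State  : (stock price, current stock holding, time to expiry).
    Action : the new stock holding in [0, 1] (shares per option sold).
    Reward : the step change in portfolio value, net of proportional transaction
             costs and a quadratic risk penalty (a simple stand-in for the
             risk-adjusted objectives discussed above)."""

    def __init__(self, S0=100.0, K=100.0, sigma=0.2, T=1.0, n_steps=50,
                 kappa=1e-3, risk_aversion=0.1, seed=0):
        self.S0, self.K, self.sigma, self.T = S0, K, sigma, T
        self.n_steps, self.dt = n_steps, T / n_steps
        self.kappa, self.risk_aversion = kappa, risk_aversion
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t, self.s, self.holding = 0, self.S0, 0.0
        return np.array([self.s, self.holding, self.T])

    def step(self, action):
        new_holding = float(np.clip(action, 0.0, 1.0))
        cost = self.kappa * abs(new_holding - self.holding) * self.s   # proportional cost
        self.holding = new_holding
        z = self.rng.standard_normal()                                  # driftless GBM step
        s_new = self.s * np.exp(-0.5 * self.sigma**2 * self.dt
                                + self.sigma * np.sqrt(self.dt) * z)
        self.t += 1
        dw = self.holding * (s_new - self.s) - cost                     # wealth increment
        done = self.t == self.n_steps
        if done:
            dw -= max(s_new - self.K, 0.0)     # pay the call payoff at expiry (premium ignored)
        reward = dw - self.risk_aversion * dw**2
        self.s = s_new
        return np.array([self.s, self.holding, self.T - self.t * self.dt]), reward, done

# A random-policy rollout, e.g. to check the interface
env = DeltaHedgingEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    state, reward, done = env.step(env.rng.uniform(0.0, 1.0))
    total_reward += reward
print(total_reward)
</syntaxhighlight>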

For American options, the key challenge is to find the optimal exercise strategy which determines when to exercise the option as this determines the price. [72] used the Least-Squares Policy Iteration (LSPI) algorithm [83] and the Fitted [math]Q[/math]-learning algorithm to learn the exercise policy for American options. In their experiments for American put options using both real and synthetic data, the two algorithms gain larger average payoffs than the benchmark Longstaff-Schwartz method [84], which is the standard Least-Squares Monte Carlo algorithm. [72] also analyzed their approach from a theoretical perspective and derived a high probability, finite-time bound on their method. [75] then extended the work in [72] by combining random forests, a popular machine learning technique, with Monte Carlo simulation for pricing of both American options and convertible bonds, which are corporate bonds that can be converted to the stock of the issuing company by the bond holder. They showed that the proposed algorithm provides more accurate prices than several other methods including LSPI, Fitted [math]Q[/math]-learning, and the Longstaff-Schwartz method.
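
Since the Longstaff-Schwartz least-squares Monte Carlo method serves as the benchmark in these comparisons, we include a compact sketch of it below; the regression basis (simple polynomials), the number of paths, and the parameter values are illustrative choices.
<syntaxhighlight lang="python">
import numpy as np

def american_put_lsm(S0=36.0, K=40.0, r=0.06, sigma=0.2, T=1.0,
                     n_steps=50, n_paths=100_000, degree=3, seed=0):
    """Longstaff-Schwartz least-squares Monte Carlo price of an American put.

    Backward induction: at each exercise date, regress the discounted future
    cash flow on polynomials of the stock price over the in-the-money paths to
    estimate the continuation value, and exercise where the immediate payoff
    exceeds it."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    disc = np.exp(-r * dt)
    # simulate GBM paths under the risk-neutral measure
    z = rng.standard_normal((n_paths, n_steps))
    increments = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    S = S0 * np.exp(np.cumsum(increments, axis=1))
    payoff = np.maximum(K - S, 0.0)

    cashflow = payoff[:, -1]              # exercise value at maturity
    for t in range(n_steps - 2, -1, -1):
        cashflow = cashflow * disc        # discount one step back
        itm = payoff[:, t] > 0
        if itm.sum() > degree:            # need enough points to regress
            coeffs = np.polyfit(S[itm, t], cashflow[itm], degree)
            continuation = np.polyval(coeffs, S[itm, t])
            exercise = payoff[itm, t] > continuation
            idx = np.where(itm)[0][exercise]
            cashflow[idx] = payoff[idx, t]
    return disc * np.mean(cashflow)       # discount from the first exercise date to time 0

print(american_put_lsm())   # around 4.47 for these parameters (cf. the classical example in [84])
</syntaxhighlight>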

===Market Making===

A market maker in a financial instrument is an individual trader or an institution that provides liquidity to the market by placing buy and sell limit orders in the LOB for that instrument while earning the bid-ask spread.

The objective in market making is different from problems in optimal execution (to target a position) or portfolio optimization (for long-term investing). Instead of profiting from identifying the correct price movement direction, the objective of a market maker is to profit from earning the bid-ask spread without accumulating undesirably large positions (known as inventory) [85]. A market maker faces three major sources of risk [86]. The inventory risk [87] refers to the risk of accumulating an undesirably large net inventory, which significantly increases volatility due to market movements. The execution risk [88] is the risk that limit orders may not get filled over a desired horizon. Finally, the adverse selection risk refers to the situation where there is a directional price movement that sweeps through the limit orders submitted by the market maker such that the price does not revert back by the end of the trading horizon. This may lead to a huge loss as the market maker in general needs to clear their inventory at the end of the horizon (typically the end of the day to avoid overnight inventory).

Stochastic Control Approach. Traditionally the theoretical study of market making strategies follows a stochastic control approach, where the LOB dynamics are modeled directly by some stochastic process and an optimal market making strategy that maximizes the market maker's expected utility can be obtained by solving the Hamilton--Jacobi--Bellman equation. See [87], [89] and [90], and [3](Chapter 10) for examples. We follow the framework in [87] as an example to demonstrate the control formulation in continuous time and discuss its advantages and disadvantages. Consider a high-frequency market maker trading on a single stock over a finite horizon [math]T[/math]. Suppose the mid-price of this stock follows an arithmetic Brownian motion in that

[[math]] \begin{eqnarray}\label{eq:MM_dynamics} d S_t = \sigma d W_t, \end{eqnarray} [[/math]]

where [math]\sigma[/math] is a constant and [math]W = \{W_t\}_{0\leq t \leq T}[/math] is a standard Brownian motion defined on a filtered probability space [math](\Omega,\mathcal{F},\{\mathcal{F}_t\}_{0\leq t \leq T},\mathbb{P})[/math]. The market maker will continuously propose bid and ask prices, denoted by [math]S^b_t[/math] and [math]S^a_t[/math] respectively, and hence will buy and sell shares according to the rate of arrival of market orders at the quoted prices. Her inventory, which is the number of shares she holds, is therefore given by

[[math]] \begin{eqnarray} q_t = N_t^b - N_t^a, \end{eqnarray} [[/math]]

where [math]N^b[/math] and [math]N^a[/math] are the point processes (independent of [math]W[/math]) giving the number of shares the market maker respectively bought and sold. Arrival rates obviously depend on the prices [math]S^b_t[/math] and [math]S^a_t[/math] quoted by the market maker and we assume that the intensities [math]\lambda^b[/math] and [math]\lambda^a[/math] associated respectively with [math]N^b[/math] and [math]N^a[/math] depend on the difference between the quoted prices and the reference price (i.e. [math]\delta^b_t = S_t - S^b_t[/math] and [math]\delta^a_t = S^a_t - S_t[/math]) and are of the following form

[[math]] \begin{eqnarray} \lambda^b(\delta^b)=A\exp(-k\delta^b) \mbox{ and } \lambda^a(\delta^a)=A\exp(-k\delta^a), \end{eqnarray} [[/math]]

where [math]A[/math] and [math]k[/math] are positive constants that characterize the liquidity of the stock. Consequently the cash process of the market maker follows

[[math]] \begin{eqnarray} d X_t = (S_t+\delta_t^a) d N_t^a - (S_t-\delta_t^b) d N_t^b. \end{eqnarray} [[/math]]

Finally, the market maker optimizes a constant absolute risk aversion (CARA) utility function:

[[math]] \begin{eqnarray} V(s,x,q,t) = \sup_{\{\delta_u^a,\delta_{u}^b\}_{t\leq u \leq T}\in \mathcal{U}} \mathbb{E} \bigg[ -\exp (-\gamma(X_T+q_T S_T))\bigg| X_t = x, S_t = s, \mbox{ and } q_t = q\bigg], \end{eqnarray} [[/math]]

where [math]\mathcal{U}[/math] is the set of predictable processes bounded from below, [math]\gamma[/math] is the absolute risk aversion coefficient characterizing the market maker, [math]X_T[/math] is the amount of cash at time [math]T[/math] and [math]q_T S_T[/math] is the value of the (signed) remaining inventory at time [math]T[/math], evaluated at the mid-price. By applying the dynamic programming principle, the value function [math]V[/math] solves the following Hamilton--Jacobi--Bellman equation:

[[math]] \begin{eqnarray}\label{avalaneda-hjb} \begin{cases} &\partial_t V +\frac{1}{2}\sigma^2\, \partial_{ss} V + \max_{\delta^b} \lambda^b (\delta^b) [V(s,x-s+\delta^b,q+1,t) - V(s,x,q,t)]\\ &\qquad\qquad\qquad + \max_{\delta^a} \lambda^a (\delta^a) [V(s,x+s+\delta^a,q-1,t) - V(s,x,q,t)] = 0,\\ &V(s,x,q,T) = -\exp(-\gamma(x+qs)). \end{cases} \end{eqnarray} [[/math]]


While \eqref{avalaneda-hjb}, derived in [87], admits a (semi) closed-form solution which leads to nice insights about the problem, the approach relies on a full analytical specification of the market dynamics. In addition, there are very few utility functions (e.g. exponential (CARA), power (CRRA), and quadratic) known in the literature that could possibly lead to closed-form solutions. The same issues arise in other work along this line ([89] and [90], and [3](Chapter 10)) where strong model assumptions are made about the prices or about the LOB or both. This requirement of full analytical specification means these papers are quite removed from realistic market making, as financial markets do not conform to any simple parametric model specification with fixed parameters.
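
For reference, the widely quoted approximate solution of \eqref{avalaneda-hjb} derived in [87] posts quotes symmetrically around a reservation price [math]s - q\gamma\sigma^2(T-t)[/math] with a total spread of [math]\gamma\sigma^2(T-t) + \frac{2}{\gamma}\ln(1+\frac{\gamma}{k})[/math]. The sketch below computes these quotes and simulates the resulting fills under the exponential intensities above; all parameter values are illustrative.
<syntaxhighlight lang="python">
import numpy as np

def as_quotes(s, q, t, T, gamma, sigma, k):
    """Approximate optimal quotes: a reservation price s - q*gamma*sigma^2*(T-t)
    and a total spread gamma*sigma^2*(T-t) + (2/gamma)*log(1 + gamma/k),
    split symmetrically around the reservation price."""
    tau = T - t
    reservation = s - q * gamma * sigma**2 * tau
    spread = gamma * sigma**2 * tau + (2.0 / gamma) * np.log(1.0 + gamma / k)
    return reservation - spread / 2.0, reservation + spread / 2.0   # (bid, ask)

def simulate(T=1.0, n_steps=1000, s0=100.0, sigma=2.0, gamma=0.1, k=1.5, A=140.0, seed=0):
    """Simulate the mid-price, fills, inventory and cash of a market maker posting
    the quotes above; a buy (sell) fill arrives with probability lambda(delta)*dt."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    s, q, x = s0, 0, 0.0                            # mid-price, inventory, cash
    for i in range(n_steps):
        bid, ask = as_quotes(s, q, i * dt, T, gamma, sigma, k)
        delta_b, delta_a = s - bid, ask - s         # may cross the mid for large |q|
        if rng.random() < min(1.0, A * np.exp(-k * delta_b) * dt):
            q += 1                                  # our bid is hit: buy one share
            x -= bid
        if rng.random() < min(1.0, A * np.exp(-k * delta_a) * dt):
            q -= 1                                  # our ask is lifted: sell one share
            x += ask
        s += sigma * np.sqrt(dt) * rng.standard_normal()   # arithmetic Brownian mid-price
    return x + q * s, q        # terminal wealth (inventory marked to mid) and final inventory

wealth, inventory = simulate()
print(wealth, inventory)
</syntaxhighlight>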

RL Approach. For market making problems with an RL approach, both value-based methods (such as the [math]Q[/math]-learning algorithm [91][92] and SARSA [92]) and policy-based methods (such as the deep policy gradient method [93]) have been used. The state variables are often composed of bid and ask prices, current holdings of assets, order-flow imbalance, volatility, and some sophisticated market indices. The control variables are typically set to be the spread to post a pair of limit buy and limit sell orders. Examples of reward functions include PnL with inventory cost [91][92][94][93] or Implementation Shortfall with inventory cost [95].

The first RL method for market making was explored in [96], where the authors applied three RL algorithms and tested them in simulation environments: a Monte Carlo method, a SARSA method and an actor-critic method. For all three methods, the state variables include inventory of the market-maker, order imbalance on the market, and market quality measures. The actions are the changes in the bid/ask price to post the limit orders and the sizes of the limit buy/sell orders. The reward function is set as a linear combination of profit (to maximize), inventory risk (to minimize), and market qualities (to maximize). The authors found that SARSA and Monte Carlo methods work well in a simple simulation environment. The actor-critic method is more plausible in complex environments and generates stochastic policies that correctly adjust bid/ask prices with respect to order imbalance and effectively control the trade-off between the profit and the quoted spread (defined as the price difference to post limit buy orders and limit sell orders). Furthermore, the stochastic policies are shown to outperform deterministic policies in achieving a lower variance of the resulting spread. Later on, [91] designed a group of “spread-based” market making strategies parametrized by a minimum quoted spread. The strategies bet on the mean-reverting behavior of the mid-price and utilize the opportunities when the mid-price deviates from the price during the previous period. Then an online algorithm is used to pick in each period a minimum quoted spread. The states of the market maker are the current inventory and price data. The actions are the quantities and at what prices to offer in the market. It is assumed that the market maker interacts with a continuous double auction via an order book. The market maker can place both market and limit orders and is able to make and cancel orders after every price fluctuation. The authors established structural properties of these strategies which allow them to obtain low regret relative to the best such strategy in hindsight, i.e., the one maximizing the realized rewards. [92] generalized the results in [96] and adopted several reinforcement learning algorithms (including [math]Q[/math]-learning and SARSA) to improve the decisions of the market maker. In their framework, the action space contains ten actions. The first nine actions correspond to a pair of orders with a particular spread and bias in their prices. The final action allows the agent to clear their inventory using a market order. The states can be divided into two groups: agent states and market states. Agent states include inventory level and active quoting distances and market states include market (bid-ask) spread, mid-price move, book/queue imbalance, signed volume, volatility, and relative strength index. The reward function is designed as the sum of a symmetrically dampened PnL and an inventory cost. The idea of the symmetric damping is to disincentivize the agent from trend chasing and direct the agent towards spread capturing. A simulator of a financial market is constructed via direct reconstruction of the LOB from historical data. Note, however, that since the market is reconstructed from historical data, simulated orders placed by the agent cannot impact the market. The authors compared their algorithm to a modified version of [91] and showed a significant empirical improvement over [91].
To address the adverse selection risk that market makers are often faced with in a high-frequency environment, [93] proposes a high-frequency feature Book Exhaustion Rate (BER) and shows theoretically and empirically that the BER can serve as a direct measurement of the adverse selection risk from an equilibrium point of view. The authors train a market making algorithm via a deep policy-based RL algorithm using three years of LOB data for the Chicago Mercantile Exchange (CME) S\&P 500 and 10-year Treasury note futures. The numerical performance demonstrates that utilizing the BER allows the algorithm to avoid large losses due to adverse selection and achieve stable performance.
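
A common thread in the reward designs above is PnL net of an inventory-dependent damping or penalty term. The following toy function illustrates the damping idea only; the coefficient eta and the exact functional form are our own illustrative choices and not the precise definitions used in [92] or [95].
<syntaxhighlight lang="python">
def dampened_reward(cash_change, inventory, mid_move, eta=0.5):
    """Illustrative one-step market making reward: the raw PnL (cash change plus
    the mark-to-market gain on the inventory) with the speculative inventory
    component damped by a factor eta, so that the agent is steered towards
    capturing the spread rather than holding inventory to bet on price moves."""
    raw_pnl = cash_change + inventory * mid_move
    return raw_pnl - eta * inventory * mid_move

# Example: 0.02 earned from spread capture while holding 5 units through a +0.01 mid-price move
print(dampened_reward(cash_change=0.02, inventory=5, mid_move=0.01))
</syntaxhighlight>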

Some recent papers have focused on improving the robustness of market makers' strategies with respect to adversarial and volatile market conditions. [94] used perturbations by an opposing agent (the adversary) to render the market maker more robust to model uncertainty and consequently improve generalization. Meanwhile, [95] incorporated additional predictive signals in the states to improve the learning outcomes of market makers. In particular, the states include the agent's inventory, price range and trend predictions. In a similar way to the setting in [91] and [92], actions are represented by a certain choice of bid/ask orders relative to the current best bid/ask. The reward function both incentivizes spread-capturing (making round-trips) and discourages holding inventory. The first part of the reward is inspired by utility functions containing a running inventory-based penalty, with an absolute-value inventory penalty term which has a convenient Value at Risk (VaR) interpretation. The second part is a symmetrically dampened PnL which follows [92]. Experimental results on historical data demonstrate the superior reward-to-risk performance of the proposed framework over several standard market making benchmarks. More specifically, in these experiments, the resulting reinforcement learning agent achieves between 20% and 30% higher terminal wealth than the benchmarks while being exposed to only around 60% of the risk. [94] applied Adversarial Reinforcement Learning (ARL) to a zero-sum game version of the control formulation \eqref{eq:MM_dynamics}-\eqref{avalaneda-hjb}. The adversary acts as a proxy for other market participants that would like to profit at the market maker's expense.

In addition to designing learning algorithms for the market maker in an unknown financial environment, RL algorithms can also be used to solve high-dimensional control problems for the market maker or to solve the control problem with the presence of a time-dependent rebate in the full information setting. In particular, [97] focused on a setting where a market maker needs to decide the optimal bid and ask quotes for a given universe of bonds in an OTC market. This problem is high-dimensional and other classical numerical methods including finite differences are inapplicable. The authors proposed a model-based Actor-Critic-like algorithm involving a deep neural network to numerically solve the problem. Similar ideas have been applied to market making problems in dark pools [98] (see the discussion on dark pools in Section). With the presence of a time-dependent rebate, there is no closed-form solution for the associated stochastic control problem of the market maker. Instead, [99] proposed a Hamiltonian-guided value function approximation algorithm to solve for the numerical solutions under this scenario. Multi-agent RL algorithms are also used to improve the strategy for market making with a particular focus on the impact of competition from other market makers or the interaction with other types of market participant. See [100] and [101].

===Robo-advising===

Robo-advisors, or automated investment managers, are a class of financial advisers that provide online financial advice or investment management with minimal human intervention. They provide digital financial advice based on mathematical rules or algorithms which can easily take into account different sources of data such as news, social media information, sentiment data and earnings reports. Robo-advisors have gained widespread popularity and emerged prominently as an alternative to traditional human advisers in recent years. The first robo-advisors were launched after the 2008 financial crisis when financial services institutions were facing the ensuing loss of trust from their clients. Examples of pioneering robo-advising firms include Betterment and Wealthfront. As of 2020, the value of assets under robo-management was highest in the United States, where it exceeded \$650 billion [102].

The robo-advisor does not know the client's risk preference in advance but learns it while interacting with the client. The robo-advisor then improves its investment decisions based on its current estimate of the client's risk preference. There are several challenges in the application of robo-advising. Firstly, the client's risk preference may change over time and may depend on the market returns and economic conditions. Therefore the robo-advisor needs to determine a frequency of interaction with the client that ensures a high level of consistency in the risk preference when adjusting portfolio allocations. Secondly, the robo-advisor usually faces a dilemma when it comes to either catering to the client's wishes, that is, investing according to the client's risk preference, or going against the client's wishes in order to seek better investment performance. Finally, there is also a subtle trade-off between the rate of information acquisition from the client and the accuracy of the acquired information. On the one hand, if the interaction does not occur at all times, the robo-advisor may not always have access to up-to-date information about the client's profile. On the other hand, information communicated to the robo-advisor may not be representative of the client's true risk aversion as the client is subject to behavioral biases.


Stochastic Control Approach. To address the above-mentioned challenges, [102] proposed a stochastic control framework with four components: (i) a regime switching model of market returns, (ii) a mechanism of interaction between the client and the robo-advisor, (iii) a dynamic model (i.e., risk aversion process) for the client's risk preferences, and (iv) an optimal investment criterion. In this framework, the robo-advisor interacts repeatedly with the client and learns about changes in her risk profile whose evolution is specified by (iii). The robo-advisor adopts a multi-period mean-variance investment criterion with a finite investment horizon based on the estimate of the client's risk aversion level. The authors showed that the stock market allocation resulting from the dynamic mean-variance optimization consists of two components: the first component is akin to the standard single-period Markowitz strategy, whereas the second component is an intertemporal hedging demand which depends on the relationship between the current market return and future portfolio returns. Note that although [102] focused on the stochastic control approach to tackle the robo-advising problem, the framework is general enough that some components (for example the mean-variance optimization step) may be replaced by an RL algorithm and the dependence on model specification can be potentially relaxed.

RL Approach. There are only a few references on robo-advising with an RL approach since this is still a relatively new topic. We review each of them in detail. The first RL algorithm for a robo-advisor was proposed in [103], where the authors designed an exploration-exploitation algorithm to learn the investor's risk appetite over time by observing her portfolio choices in different market environments. The set of various market environments of interest is formulated as the state space [math]\mathcal{S}[/math]. In each period, the robo-advisor places an investor's capital into one of several pre-constructed portfolios which can be viewed as the action space [math]\mathcal{A}[/math]. Each portfolio decision reflects the robo-advisor's belief concerning the investor's true risk preference from a discrete set of possible risk aversion parameters [math] \Theta = \{\theta_i\}_{1\leq i \leq |\Theta|}[/math]. The investor interacts with the robo-advisor by portfolio selection choices, and such interactions are used to update the robo-advisor's estimate of the investor's risk profile. The authors proved that, with high probability, the proposed exploration-exploitation algorithm performs near-optimally, with the number of time steps required depending polynomially on various model parameters.
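
To illustrate the underlying idea of learning a risk-aversion level from observed portfolio choices, the following hypothetical sketch keeps a score over a discrete set of candidate risk-aversion parameters and upweights those consistent with the client's choices in the current market environment. It is an illustration of the exploration-exploitation idea, not the algorithm of [103], and all names and numbers are illustrative.
<syntaxhighlight lang="python">
import numpy as np

def implied_choice(theta, mus, variances):
    """Portfolio a mean-variance investor with risk aversion theta would pick:
    the argmax of mu - 0.5 * theta * variance over the pre-constructed menu."""
    return int(np.argmax(mus - 0.5 * theta * variances))

def update_scores(scores, thetas, mus, variances, observed_choice):
    """Upweight the risk-aversion candidates consistent with the observed choice."""
    for i, theta in enumerate(thetas):
        if implied_choice(theta, mus, variances) == observed_choice:
            scores[i] += 1.0
    return scores

# Candidate risk-aversion levels Theta and a menu of three pre-constructed portfolios
thetas = np.array([1.0, 3.0, 5.0, 10.0])
scores = np.zeros_like(thetas)
mus = np.array([0.10, 0.07, 0.04])          # expected returns in this market environment
variances = np.array([0.05, 0.02, 0.005])   # corresponding variances

# The client repeatedly picks the middle portfolio; update the belief after each interaction
for _ in range(5):
    scores = update_scores(scores, thetas, mus, variances, observed_choice=1)

theta_hat = thetas[np.argmax(scores)]
print("estimated risk aversion:", theta_hat)
print("robo-advisor's next allocation:", implied_choice(theta_hat, mus, variances))
</syntaxhighlight>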

[104] proposed an investment robo-advising framework consisting of two agents. The first agent, an inverse portfolio optimization agent, infers an investor's risk preference and expected return directly from historical allocation data using online inverse optimization. The second agent, a deep RL agent, aggregates the inferred sequence of expected returns to formulate a new multi-period mean-variance portfolio optimization problem that can be solved using a deep RL approach based on the DDPG method. The proposed investment pipeline was applied to real market data from April 1, 2016 to February 1, 2021 and was shown to consistently outperform the S\&P 500 benchmark portfolio that represents the aggregate market optimal allocation. As mentioned earlier in this subsection, learning the client's risk preference is challenging as the preference may depend on multiple factors and may change over time. [105] was dedicated to learning the risk preferences from investment portfolios using an inverse optimization technique. In particular, the proposed inverse optimization approach can be used to measure time varying risk preferences directly from market signals and portfolios. This approach is developed based on two methodologies: convex optimization based modern portfolio theory and learning the decision-making scheme through inverse optimization.

===Smart Order Routing===

In order to execute a trade of a given asset, market participants may have the opportunity to split the trade and submit orders to different venues, including both lit pools and dark pools, where this asset is traded. This could potentially improve the overall execution price and quantity. Both the decision and hence the outcome are influenced by the characteristics of different venues as well as the structure of transaction fees and rebates across different venues.

Dark Pools vs. Lit Pools. Dark pools are private exchanges for trading securities that are not accessible by the investing public. Also known as “dark pools of liquidity”, the name of these exchanges is a reference to their complete lack of transparency. Dark pools were created in order to facilitate block trading by institutional investors who did not wish to impact the markets with their large orders and obtain adverse prices for their trades. According to recent Securities and Exchange Commission (SEC) data, there were 59 Alternative Trading Systems (a.k.a. “dark pools”) registered with the SEC as of May 2021, of which there are three types: (1) Broker-Dealer-Owned Dark Pools, (2) Agency Broker or Exchange-Owned Dark Pools, and (3) Electronic Market Makers Dark Pools. Lit pools are effectively the opposite of dark pools. Unlike dark pools, where prices at which participants are willing to trade are not revealed, lit pools do display bid and ask offers in different stocks. Primary exchanges operate in such a way that available liquidity is displayed at all times and form the bulk of the lit pools available to traders. For smart order routing (SOR) problems, the most important characteristics of different dark pools are the chances of being matched with a counterparty and the price (dis)advantages whereas the relevant characteristics of lit pools include the order flows, queue sizes, and cancellation rates. There are only a few references on using a data-driven approach to tackle SOR problems for dark pool allocations and for allocations across lit pools. We will review each of them in detail.


Allocation Across Lit Pools. The SOR problem across multiple lit pools (or primary exchanges) was first studied in [106] where the authors formulated the SOR problem as a convex optimization problem. Consider a trader who needs to buy [math]S[/math] shares of a stock within a short time interval [math][0, T][/math]. At time [math]0[/math], the trader may submit [math]K[/math] limit orders with [math]L_k \ge 0[/math] shares to exchanges [math]k = 1,\cdots,K[/math] (joining the queue of the best bid price level) and also market orders for [math]M \ge 0[/math] shares. At time [math]T[/math] if the total executed quantity is less than [math]S[/math] the trader also submits a market order to execute the remaining amount. The trader's order placement decision is thus summarized by a vector [math]X=(M,L_1,\cdots,L_K) \in \mathbb{R}^{K+1}_+[/math] of order sizes. It is assumed for simplicity that a market order of any size up to [math]S[/math] can be filled immediately at any single exchange. Thus a trader chooses the cheapest venue for his market orders. Limit orders with quantities [math](L_1,\cdots, L_K)[/math] join queues of [math](Q_1,\cdots, Q_K)[/math] orders in [math]K[/math] limit order books, where [math]Q_k \ge 0[/math]. Then the filled amounts at the end of the horizon [math]T[/math] can be written as a function of their initial queue position and future order flow:

[[math]] \begin{eqnarray}\label{eq:firfull_amnt} \min(\max(\xi_k - Q_k, 0), L_k) = (\xi_k-Q_k)_+ - (\xi_k - Q_k - L_k)_+ \end{eqnarray} [[/math]]

where the notation [math](x)_+ = \max(0,x)[/math]. Here [math]\xi_k:=D_k+C_k[/math] is the total outflow from the front of the [math]k[/math]-th queue which consists of order cancellations [math]C_k[/math] that occurred before time [math]T[/math] from queue positions in front of an order and market orders [math]D_k[/math] that reach the [math]k[/math]-th exchange before [math]T[/math]. Note that the fulfilled amounts are random because they depend on queue outflows [math]\xi=(\xi_1,\cdots,\xi_K)[/math] during [math][0, T][/math], which are modeled as random variables with a distribution [math]F[/math]. Using the mid-quote price as a benchmark, the execution cost relative to the mid-quote for an order allocation [math]X = (M, L_1,\cdots, L_K)[/math] is defined as:

[[math]] \begin{eqnarray} V_{\rm execution}(X, \xi) := (h + f)M -\sum_{k=1}^K (h + r_k)((\xi_k - Q_k)_+ - (\xi_k - Q_k - L_k)_+), \end{eqnarray} [[/math]]

where [math]h[/math] is one-half of the bid-ask spread at time [math]0[/math], [math]f[/math] is a fee for market orders and [math]r_k[/math] are effective rebates for providing liquidity by submitting limit orders on exchanges [math]k = 1,\cdots,K[/math]. Penalties for violations of the target quantity in both directions are included:

[[math]] \begin{eqnarray} V_{\rm penalty}(X, \xi) := \lambda_u (S-A(X, \xi))_{+}+\lambda_o (A(X, \xi)-S)_{+}, \end{eqnarray} [[/math]]

where [math]\lambda_o \ge 0[/math] and [math]\lambda_u \ge 0[/math] are, respectively, the penalties for overshooting and undershooting, and [math]A(X,\xi) = M+ \sum_{k=1}^K ((\xi_k - Q_k)_+ - (\xi_k - Q_k - L_k)_+)[/math] is the total number of shares bought by the trader during [math][0, T)[/math]. The impact cost is paid on all orders placed at times 0 and [math]T[/math], irrespective of whether they are filled, leading to the following total impact:

[[math]] \begin{eqnarray} V_{\rm impact}(X,\xi) = \theta \Big(M+\sum_{k=1}^K L_k +(S-A(X,\xi))_+\Big) \end{eqnarray} [[/math]]

where [math]\theta \gt 0[/math] is the impact coefficient. Finally, the cost function is defined as the sum of all three pieces: [math]V(X,\xi):= V_{\rm execution}(X, \xi)+ V_{\rm penalty}(X, \xi) + V_{\rm impact}(X,\xi)[/math]. [106](Proposition 4) provides optimality conditions for an order allocation [math]X^* = (M^*,L_1^*,\cdots,L_K^*)[/math]. In particular, (semi)-explicit model conditions are given for when [math]L_k^* \gt 0[/math] ([math]M^* \gt 0[/math]), i.e., when submitting limit orders to venue [math]k[/math] (when submitting market orders) is optimal.
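
The cost function above is straightforward to evaluate by simulation, which is useful for experimenting with candidate allocations before appealing to the optimality conditions of [106]. The following sketch implements [math]V(X,\xi)[/math] as defined above and estimates its expectation by Monte Carlo; the choice of an exponential distribution for the outflows and all numerical values are illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np

def order_placement_cost(X, xi, Q, S, h, f, r, lam_u, lam_o, theta):
    """Total cost V(X, xi) = execution + penalty + impact for an allocation
    X = (M, L_1, ..., L_K), queue outflows xi and initial queue sizes Q,
    following the single-period model described above."""
    M, L = X[0], np.asarray(X[1:], dtype=float)
    xi, Q, r = map(np.asarray, (xi, Q, r))
    pos = lambda z: np.maximum(z, 0.0)
    filled = pos(xi - Q) - pos(xi - Q - L)           # limit-order fills at each venue
    A = M + filled.sum()                             # total number of shares bought
    v_exec = (h + f) * M - np.sum((h + r) * filled)  # execution cost relative to the mid-quote
    v_pen = lam_u * pos(S - A) + lam_o * pos(A - S)  # undershooting/overshooting penalties
    v_imp = theta * (M + L.sum() + pos(S - A))       # impact on all orders placed
    return v_exec + v_pen + v_imp

def expected_cost(X, S, Q, h, f, r, lam_u, lam_o, theta, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E[V(X, xi)] when the outflows xi are, say,
    exponentially distributed (the distribution F is an assumption here)."""
    rng = np.random.default_rng(seed)
    K = len(Q)
    costs = [order_placement_cost(X, rng.exponential(scale=2000.0, size=K),
                                  Q, S, h, f, r, lam_u, lam_o, theta)
             for _ in range(n_samples)]
    return float(np.mean(costs))

# Split a 1000-share target between a market order and limit orders at two venues
X = np.array([200.0, 600.0, 600.0])    # (M, L_1, L_2); over-posting limit orders is allowed
print(expected_cost(X, S=1000.0, Q=[1500.0, 800.0], h=0.005, f=0.003,
                    r=[0.002, 0.0025], lam_u=0.01, lam_o=0.01, theta=0.0005))
</syntaxhighlight>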

In contrast to the single-period model introduced above, [107] formulated the SOR problem as an order allocation problem across multiple lit pools over multiple trading periods. Each venue is characterized by a bid-ask spread process and an imbalance process. The dependencies between the imbalance and spread at the venues are considered through a covariance matrix. A Bayesian learning framework for learning and updating the model parameters is proposed to take into account possibly changing market conditions. Extensions to include short/long trading signals, market impact or hidden liquidity are also discussed.


Allocation Across Dark Pools. As discussed at the beginning of Section, dark pools are a type of stock exchange that is designed to facilitate large transactions. A key aspect of dark pools is the censored feedback that the trader receives. At every iteration the trader has a certain number [math]V_t[/math] of shares to allocate amongst [math]K[/math] different dark pools with [math]v_t^i[/math] the volume allocated to dark pool [math]i[/math]. The dark pool [math]i[/math] trades as many of the allocated shares [math]v_t^i[/math] as it can with the available liquidity [math]s_t^i[/math]. The trader only finds out how many of these allocated shares were successfully traded at each dark pool (i.e., [math]\min(s_t^i,v_t^i)[/math]), but not how many would have been traded if more were allocated (i.e., [math]s_t^i[/math]). Based on this property of censored feedback, [108] formulated the allocation problem across dark pools as an online learning problem under the assumption that [math]s_t^i[/math] is an i.i.d. sample from some unknown distribution [math]P_i[/math] and the total allocation quantity [math]V_t[/math] is an i.i.d. sample from an unknown distribution [math]Q[/math] with [math]V_t[/math] upper bounded by a constant [math]V \gt 0[/math] almost surely. At each iteration [math]t[/math], the learner allocates the orders greedily according to the estimate [math]\widehat{P}^{(t-1)}_i[/math] of the distribution [math]P_i[/math] for all dark pools [math]i=1,2,\cdots, K[/math] derived from the previous iteration [math]t-1[/math]. Then the learner updates the estimate [math]\widehat{P}^{(t)}_i[/math] of the distribution [math]P_i[/math] using a modified version of the Kaplan-Meier estimator (a non-parametric estimator of the cumulative distribution function) that incorporates the new censored observation [math]\min(s_t^i,v_t^i)[/math] from iteration [math]t[/math]. The authors then proved that for any [math]\varepsilon \gt 0[/math] and [math] \delta \gt 0[/math], with probability [math]1-\delta[/math] (over the randomness of draws from Q and [math]\{P_i\}_{i=1}^K[/math]), after running for a time polynomial in [math]K, V , 1/\varepsilon[/math], and [math]\ln(1/\delta)[/math], the algorithm makes an [math]\varepsilon[/math]-optimal allocation on each subsequent time step with probability at least [math]1-\varepsilon[/math].
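
A simplified sketch of this censored-feedback loop is given below: a crude censoring-aware empirical estimate of the tail probabilities stands in for the modified Kaplan-Meier estimator of [108], and shares are allocated greedily, one unit at a time, to the venue with the highest estimated probability of filling one more share. All distributions and numbers are illustrative.
<syntaxhighlight lang="python">
import numpy as np

def tail_estimate(records, v):
    """Estimate P(s >= v) from past (allocated, filled) pairs, using only the rounds
    where at least v shares were allocated, so that a fill of v or more was observable.
    A crude censoring-aware stand-in for the modified Kaplan-Meier estimator of [108];
    levels with no data default to 1 (optimism), which drives exploration."""
    eligible = [filled >= v for allocated, filled in records if allocated >= v]
    return 1.0 if not eligible else sum(eligible) / len(eligible)

def greedy_allocation(V, records):
    """Allocate V shares one at a time, each to the venue with the highest
    estimated probability of filling one more (marginal) share."""
    K = len(records)
    alloc = [0] * K
    for _ in range(V):
        scores = [tail_estimate(records[i], alloc[i] + 1) for i in range(K)]
        alloc[int(np.argmax(scores))] += 1
    return alloc

# Simulated run: two dark pools with unknown liquidity distributions
rng = np.random.default_rng(0)
liquidity_draws = [lambda: int(rng.poisson(30)), lambda: int(rng.poisson(5))]
records = [[] for _ in liquidity_draws]

for t in range(200):
    V_t = int(rng.integers(10, 40))                        # total shares to place this round
    alloc = greedy_allocation(V_t, records)
    for i, draw in enumerate(liquidity_draws):
        s_i = draw()                                       # hidden available liquidity
        records[i].append((alloc[i], min(s_i, alloc[i])))  # censored feedback only

print("final-round allocation:", alloc)                    # most volume should end up at pool 0
</syntaxhighlight>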

The setup of [108] was generalized in [109] where the authors assumed that the sequences of volumes [math]V_t[/math] and available liquidities [math]\{s_t^i\}_{i=1}^K[/math] are chosen by an adversary who knows the previous allocations of their algorithm. An exponentiated gradient style algorithm was proposed and shown to enjoy an optimal regret guarantee [math]\mathcal{O}(V\sqrt{T \ln K})[/math] against the best allocation strategy in hindsight.

==General references==

Hambly, Ben; Xu, Renyuan; Yang, Huining (2023). "Recent Advances in Reinforcement Learning in Finance". arXiv:2112.04553 [q-fin.MF].

==Notes==

  1. We use NASDAQ ITCH data taken from Lobster https://lobsterdata.com/.

==References==

  1. X.-Y. Liu, H.YANG, J.GAO, and C.D. Wang, Finrl: Deep reinforcement learning framework to automate trading in quantitative finance, in Proceedings of the Second ACM International Conference on AI in Finance, 2021, pp.1--9.
  2. X.-Y. Liu, Z.XIA, J.RUI, J.GAO, H.YANG, M.ZHU, C.D. Wang, Z.WANG, and J.GUO, Finrl-meta: Market environments and benchmarks for data-driven financial reinforcement learning, arXiv preprint arXiv:2211.03107, (2022).
  3. 3.0 3.1 3.2 3.3 Á. Cartea, S. Jaimungal, and J. Penalva, Algorithmic and high-frequency trading, Cambridge University Press, 2015.
  4. C.-A. Lehalle and S.LARUELLE, Market microstructure in practice, World Scientific, 2018.
  5. T.PREIS, Price-time priority and pro rata matching in an order book model of financial markets, in Econophysics of Order-driven Markets, Springer, 2011, pp.65--72.
  6. J.-D. Fermanian, O. Guéant, and A. Rachez, Agents' Behavior on Multi-Dealer-to-Client Bond Trading Platforms, CREST, Center for Research in Economics and Statistics, 2015.
  7. 7.0 7.1 7.2 R.ALMGREN and N.CHRISS, Optimal execution of portfolio transactions, Journal of Risk, 3 (2001), pp.5--40.
  8. R.C. Grinold and R.N. Kahn, Active portfolio management, (2000).
  9. A.F. Perold, The implementation shortfall: Paper versus reality, Journal of Portfolio Management, 14 (1988), p.4.
  10. W.F. Sharpe, Mutual fund performance, The Journal of Business, 39 (1966), pp.119--138.
  11. J.MOODY, L.WU, Y.LIAO, and M.SAFFELL, Performance functions and reinforcement learning for trading systems and portfolios, Journal of Forecasting, 17 (1998), pp.441--470.
  12. F.A. Sortino and L.N. Price, Performance measurement in a downside risk framework, The Journal of Investing, 3 (1994), pp.59--64.
  13. 13.0 13.1 13.2 13.3 13.4 13.5 13.6 D.HENDRICKS and D.WILCOX, A reinforcement learning extension to the Almgren-Chriss framework for optimal trade execution, in 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr), IEEE, 2014, pp.457--464.
  14. 14.0 14.1 14.2 14.3 14.4 B.NING, F.H.T. Ling, and S.JAIMUNGAL, Double deep Q-learning for optimal execution, arXiv preprint arXiv:1812.06600, (2018).
  15. 15.0 15.1 15.2 15.3 15.4 15.5 15.6 15.7 Z.ZHANG, S.ZOHREN, and S.ROBERTS, Deep reinforcement learning for trading, The Journal of Financial Data Science, 2 (2020), pp.25--40.
  16. 16.0 16.1 16.2 16.3 16.4 G.JEONG and H.Y. Kim, Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning, Expert Systems with Applications, 117 (2019), pp.125--138.
  17. 17.0 17.1 17.2 17.3 17.4 Deep execution-value and policy based reinforcement learning for trading and beating market benchmarks, Available at SSRN 3374766, (2019).
  18. 18.0 18.1 18.2 18.3 18.4 Y.SHEN, R.HUANG, C.YAN, and K.OBERMAYER, Risk-averse reinforcement learning for algorithmic trading, in 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr), IEEE, 2014, pp.391--398.
  19. 19.0 19.1 19.2 19.3 19.4 19.5 19.6 19.7 Y.NEVMYVAKA, Y.FENG, and M.KEARNS, Reinforcement learning for optimized trade execution, in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp.673--680.
  20. Á. Cartea, S. Jaimungal, and L. Sánchez-Betancourt, Deep reinforcement learning for algorithmic trading, Available at SSRN, (2021).
  21. 21.0 21.1 21.2 21.3 21.4 B.HAMBLY, R.XU, and H.YANG, Policy gradient methods for the noisy linear quadratic regulator over a finite horizon, SIAM Journal on Control and Optimization, 59 (2021), pp.3359--3391.
  22. 22.0 22.1 22.2 22.3 22.4 S.LIN and P.A. Beling, An end-to-end optimal trade execution framework based on proximal policy optimization, in IJCAI, 2020, pp.4548--4554.
  23. 23.0 23.1 23.2 23.3 Z.YE, W.DENG, S.ZHOU, Y.XU, and J.GUAN, Optimal trade execution based on deep deterministic policy gradient, in Database Systems for Advanced Applications, Springer International Publishing, 2020, pp.638--654.
  24. 24.0 24.1 24.2 24.3 H.WEI, Y.WANG, L.MANGU, and K.DECKER, Model-based reinforcement learning for predictions and control for limit order books, arXiv preprint arXiv:1910.03743, (2019).
  25. 25.0 25.1 25.2 25.3 25.4 25.5 25.6 Y.Deng, F.Bao, Y.Kong, Z.Ren, and Q.Dai, Deep direct reinforcement learning for financial signal representation and trading, IEEE Transactions on Neural Networks and Learning Systems, 28 (2017), pp.653--664.
  26. L. Leal, M. Laurière, and C.-A. Lehalle, Learning a functional control for high-frequency finance, arXiv preprint arXiv:2006.09611, (2020).
  27. W.BAO and X.-y. Liu, Multi-agent deep reinforcement learning for liquidation strategy analysis, arXiv preprint arXiv:1906.11046, (2019).
  28. M.KARPE, J.FANG, Z.MA, and C.WANG, Multi-agent reinforcement learning in a realistic limit order book market simulation, in Proceedings of the First ACM International Conference on AI in Finance, ICAIF '20, 2020.
  29. E.ZIVOT, Introduction to Computational Finance and Financial Econometrics, Chapman & Hall Crc, 2017.
  30. H.M. Markowitz, Portfolio selection, Journal of Finance, 7 (1952), pp.77--91.
  31. J.MOSSIN, Optimal multiperiod portfolio policies, The Journal of Business, 41 (1968), pp.215--229.
  32. N.H. Hakansson, Multi-period mean-variance analysis: Toward a general theory of portfolio choice, The Journal of Finance, 26 (1971), pp.857--884.
  33. P.A. Samuelson, Lifetime portfolio selection by dynamic stochastic programming, Stochastic Optimization Models in Finance, (1975), pp.517--524.
  34. R.C. Merton and P.A. Samuelson, Fallacy of the log-normal approximation to optimal portfolio decision-making over many periods, Journal of Financial Economics, 1 (1974), pp.67--94.
  35. M.STEINBACH, Markowitz revisited: Mean-variance models in financial portfolio analysis, Society for Industrial and Applied Mathematics, 43 (2001), pp.31--85.
  36. 36.0 36.1 36.2 36.3 36.4 36.5 D.LI and W.-L. Ng, Optimal dynamic portfolio selection: Multiperiod mean-variance formulation, Mathematical Finance, 10 (2000), pp.387--406.
  37. R.H. Strotz, Myopia and inconsistency in dynamic utility maximization, The Review of Economic Studies, 23 (1955), pp.165--180.
  38. E.VIGNA, On time consistency for mean-variance portfolio selection, Collegio Carlo Alberto Notebook, 476 (2016).
  39. 39.0 39.1 39.2 H.XIAO, Z.ZHOU, T.REN, Y.BAI, and W.LIU, Time-consistent strategies for multi-period mean-variance portfolio optimization with the serially correlated returns, Communications in Statistics-Theory and Methods, 49 (2020), pp.2831--2868.
  40. X.Y. Zhou and D.LI, Continuous-time mean-variance portfolio selection: A stochastic LQ framework, Applied Mathematics and Optimization, 42 (2000), pp.19--33.
  41. S.BASAK and G.CHABAKAURI, Dynamic mean-variance asset allocation, The Review of Financial Studies, 23 (2010), pp.2970--3016.
  42. T.BJORK and A.MURGOCI, A general theory of Markovian time inconsistent stochastic control problems, Available at SSRN 1694759, (2010).
  43. J.L. Pedersen and G.PESKIR, Optimal mean-variance portfolio selection, Mathematics and Financial Economics, 11 (2017), pp.137--160.
  44. Y.SATO, Model-free reinforcement learning for financial portfolios: A brief survey, arXiv preprint arXiv:1904.04973, (2019).
  45. 45.0 45.1 45.2 45.3 45.4 45.5 45.6 X.DU, J.ZHAI, and K.LV, Algorithm trading using Q-learning and recurrent reinforcement learning, Positions, 1 (2016), p.1.
  46. 46.0 46.1 46.2 46.3 46.4 46.5 P.C. Pendharkar and P.CUSATIS, Trading financial indices with reinforcement learning agents, Expert Systems with Applications, 103 (2018), pp.1--13.
  47. 47.0 47.1 47.2 47.3 47.4 47.5 H.PARK, M.K. Sim, and D.G. Choi, An intelligent financial portfolio trading strategy using deep Q-learning, Expert Systems with Applications, 158 (2020), p.113573.
  48. 48.0 48.1 48.2 48.3 48.4 Z.XIONG, X.-Y. Liu, S.ZHONG, H.YANG, and A.WALID, Practical deep reinforcement learning approach for stock trading, arXiv preprint arXiv:1811.07522, (2018).
  49. 49.0 49.1 49.2 49.3 49.4 49.5 49.6 49.7 49.8 Z.JIANG, D.XU, and J.LIANG, A deep reinforcement learning framework for the financial portfolio management problem, arXiv preprint arXiv:1706.10059, (2017).
  50. 50.0 50.1 50.2 50.3 50.4 50.5 50.6 50.7 50.8 P.YU, J.S. Lee, I.KULYATIN, Z.SHI, and S.DASGUPTA, Model-based deep reinforcement learning for dynamic portfolio optimization, arXiv preprint arXiv:1901.08740, (2019).
  51. 51.0 51.1 51.2 51.3 51.4 Z.LIANG, H.CHEN, J.ZHU, K.JIANG, and Y.LI, Adversarial deep reinforcement learning in portfolio management, arXiv preprint arXiv:1808.09940, (2018).
  52. 52.0 52.1 52.2 52.3 52.4 A.M. Aboussalah, What is the value of the cross-sectional approach to deep reinforcement learning?, Available at SSRN, (2020).
  53. 53.0 53.1 L.W. Cong, K. Tang, J. Wang, and Y. Zhang, Alphaportfolio: Direct construction through deep reinforcement learning and interpretable AI, SSRN Electronic Journal, https://doi.org/10.2139/ssrn.3554486, (2021).
  54. 54.0 54.1 54.2 54.3 H.WANG and X.Y. Zhou, Continuous-time mean--variance portfolio selection: A reinforcement learning framework, Mathematical Finance, 30 (2020), pp.1273--1308.
  55. 55.0 55.1 55.2 55.3 H.WANG, Large scale continuous-time mean-variance portfolio allocation via reinforcement learning, Available at SSRN 3428125, (2019).
  56. M.DIXON and I.HALPERIN, G-learner and girl: Goal based wealth management with reinforcement learning, arXiv preprint arXiv:2002.10990, (2020).
  57. H.YANG, X.-Y. Liu, and Q.WU, A practical machine learning approach for dynamic stock recommendation, in 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), IEEE, 2018, pp.1693--1697.
  58. I.GOODFELLOW, J.POUGET-Abadie, M.MIRZA, B.XU, D.WARDE-Farley, S.OZAIR, A.COURVILLE, and Y.BENGIO, Generative adversarial nets, Advances in Neural Information Processing Systems, 27 (2014).
  59. J.Y. Campbell, A.W. Lo, and A.C. MacKinlay, The econometrics of financial markets, Princeton University Press, 1997.
  60. F.BLACK and M.SCHOLES, The pricing of options and corporate liabilities, Journal of Political Economy, 81 (1973), pp.637--654.
  61. R.C. Merton, Theory of rational option pricing, The Bell Journal of Economics and Management Science, (1973), pp.141--183.
  62. M.BROADIE and J.B. Detemple, Anniversary article: Option pricing: Valuation models and applications, Management Science, 50 (2004), pp.1145--1177.
  63. H.E. Leland, Option pricing and replication with transactions costs, The Journal of Finance, 40 (1985), pp.1283--1301.
  64. S.FIGLEWSKI, Options arbitrage in imperfect markets, The Journal of Finance, 44 (1989), pp.1289--1311.
  65. P.HENROTTE, Transaction costs and duplication strategies, Graduate School of Business, Stanford University, (1993).
  66. A.GIURCA and S.BOROVKOVA, Delta hedging of derivatives using deep reinforcement learning, Available at SSRN 3847272, (2021).
  67. R.CONT, Empirical properties of asset returns: stylized facts and statistical issues, Quantitative Finance, 1 (2001), p.223.
  68. A.CHAKRABORTI, I.M. Toke, M.PATRIARCA, and F.ABERGEL, Econophysics review: I. empirical facts, Quantitative Finance, 11 (2011), pp.991--1012.
  69. 69.0 69.1 69.2 69.3 69.4 69.5 I. Halperin, QLBS: Q-learner in the Black-Scholes (-Merton) worlds, The Journal of Derivatives, 28 (2020), pp.99--122.
  70. 70.0 70.1 70.2 70.3 70.4 70.5 70.6 70.7 J.DU, M.JIN, P.N. Kolm, G.RITTER, Y.WANG, and B.ZHANG, Deep reinforcement learning for option replication and hedging, The Journal of Financial Data Science, 2 (2020), pp.44--57.
  71. 71.0 71.1 71.2 71.3 71.4 71.5 71.6 J.CAO, J.CHEN, J.HULL, and Z.POULOS, Deep hedging of derivatives using reinforcement learning, The Journal of Financial Data Science, 3 (2021), pp.10--27.
  72. 72.0 72.1 72.2 72.3 72.4 72.5 Y.LI, C.SZEPESVARI, and D.SCHUURMANS, Learning exercise policies for American options, in Artificial Intelligence and Statistics, PMLR, 2009, pp.352--359.
  73. 73.0 73.1 73.2 73.3 73.4 I.HALPERIN, The QLBS Q-learner goes NuQlear: Fitted Q iteration, inverse RL, and option portfolios, Quantitative Finance, 19 (2019), pp.1543--1553.
  74. 74.0 74.1 74.2 74.3 74.4 74.5 74.6 P.N. Kolm and G.RITTER, Dynamic replication and hedging: A reinforcement learning approach, The Journal of Financial Data Science, 1 (2019), pp.159--171.
  75. 75.0 75.1 B.DUBROV, Monte Carlo simulation with machine learning for pricing American options and convertible bonds, Available at SSRN 2684523, (2015).
  76. J.C. Cox, S.A. Ross, and M.RUBINSTEIN, Option pricing: A simplified approach, Journal of Financial Economics, 7 (1979), pp.229--263.
  77. 77.0 77.1 77.2 77.3 77.4 77.5 H.BUEHLER, L.GONON, J.TEICHMANN, and B.WOOD, Deep hedging, Quantitative Finance, 19 (2019), pp.1271--1291.
  78. 78.0 78.1 78.2 78.3 L.CANNELLI, G.NUTI, M.SALA, and O.SZEHR, Hedging using reinforcement learning: Contextual $k$-armed bandit versus Q-learning, arXiv preprint arXiv:2007.01623, (2020).
  79. S.KL{\"o}ppel and M.SCHWEIZER, Dynamic indifference valuation via convex risk measures, Mathematical Finance, 17 (2007), pp.599--627.
  80. S.HESTON, A closed-form solution for options with stochastic volatility with applications to bond and currency options, Review of Financial Studies, 6 (1993), pp.327--343.
  81. A.CARBONNEAU and F.GODIN, Equal risk pricing of derivatives with deep hedging, Quantitative Finance, 21 (2021), pp.593--608.
  82. W.R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, 25 (1933), pp.285--294.
  83. M.G. Lagoudakis and R.PARR, Least-squares policy iteration, The Journal of Machine Learning Research, 4 (2003), pp.1107--1149.
  84. F.A. Longstaff and E.S. Schwartz, Valuing American options by simulation: A simple least-squares approach, The Review of Financial Studies, 14 (2001), pp.113--147.
  85. O. Guéant, C.-A. Lehalle, and J. Fernandez-Tapia, Optimal portfolio liquidation with limit orders, SIAM Journal on Financial Mathematics, 3 (2012), pp.740--764.
  86. F.GUILBAUD and H.PHAM, Optimal high-frequency trading with limit and market orders, Quantitative Finance, 13 (2013), pp.79--94.
  87. 87.0 87.1 87.2 87.3 M.AVELLANEDA and S.STOIKOV, High-frequency trading in a limit order book, Quantitative Finance, 8 (2008), pp.217--224.
  88. C.K{\"u}hn and M.STROH, Optimal portfolios of a small investor in a limit order market: A shadow price approach, Mathematics and Financial Economics, 3 (2010), pp.45--72.
  89. 89.0 89.1 O. Guéant, C.-A. Lehalle, and J. Fernandez-Tapia, Dealing with the inventory risk: A solution to the market making problem, Mathematics and Financial Economics, 7 (2013), pp.477--507.
  90. 90.0 90.1 A.A. Obizhaeva and J.WANG, Optimal trading strategy and supply/demand dynamics, Journal of Financial Markets, 16 (2013), pp.1--32.
  91. 91.0 91.1 91.2 91.3 91.4 91.5 J.D. Abernethy and S.KALE, Adaptive market making via online learning, in NIPS, Citeseer, 2013, pp.2058--2066.
  92. 92.0 92.1 92.2 92.3 92.4 92.5 T.SPOONER, J.FEARNLEY, R.SAVANI, and A.KOUKORINIS, Market making via reinforcement learning, in International Foundation for Autonomous Agents and Multiagent Systems, AAMAS '18, 2018, pp.434--442.
  93. 93.0 93.1 93.2 M.ZHAO and V.LINETSKY, High frequency automated market making algorithms with adverse selection risk control via reinforcement learning, in Proceedings of the Second ACM International Conference on AI in Finance, 2021, pp.1--9.
  94. 94.0 94.1 T. Spooner and R. Savani, Robust market making via adversarial reinforcement learning, in Proceedings of the 29th International Joint Conference on Artificial Intelligence, IJCAI-20, 2020, pp.4590--4596.
  95. 95.0 95.1 95.2 B. Gašperov and Z. Kostanjčar, Market making with signals through deep reinforcement learning, IEEE Access, 9 (2021), pp.61611--61622.
  96. 96.0 96.1 N.T. Chan and C.SHELTON, An electronic market-maker, Technical report, MIT, (2001).
  97. O. Guéant and I. Manziuk, Deep reinforcement learning for market making in corporate bonds: beating the curse of dimensionality, Applied Mathematical Finance, 26 (2019), pp.387--452.
  98. B.BALDACCI, I.MANZIUK, T.MASTROLIA, and M.ROSENBAUM, Market making and incentives design in the presence of a dark pool: A deep reinforcement learning approach, arXiv preprint arXiv:1912.01129, (2019).
  99. G.ZHANG and Y.CHEN, Reinforcement learning for optimal market making with the presence of rebate, Available at SSRN 3646753, (2020).
  100. S.GANESH, N.VADORI, M.XU, H.ZHENG, P.REDDY, and M.VELOSO, Reinforcement learning for market making in a multi-agent dealer market, arXiv preprint arXiv:1911.05892, (2019).
  101. Y.PATEL, Optimizing market making using multi-agent reinforcement learning, arXiv preprint arXiv:1812.10252, (2018).
  102. 102.0 102.1 102.2 A.CAPPONI, S.OLAFSSON, and T.ZARIPHOPOULOU, Personalized robo-advising: Enhancing investment through client interaction, Management Science, (2021).
  103. H.ALSABAH, A.CAPPONI, O.RUIZLACEDELLI, and M.STERN, Robo-advising: Learning investors' risk preferences via portfolio choices, Journal of Financial Econometrics, 19 (2021), pp.369--392.
  104. H.WANG and S.YU, Robo-advising: Enhancing investment with inverse optimization and deep reinforcement learning, arXiv preprint arXiv:2105.09264, (2021).
  105. S.YU, H.WANG, and C.DONG, Learning risk preferences from investment portfolios using inverse optimization, arXiv preprint arXiv:2010.01687, (2020).
  106. 106.0 106.1 R.CONT and A.KUKANOV, Optimal order placement in limit order markets, Quantitative Finance, 17 (2017), pp.21--39.
  107. B.BALDACCI and I.MANZIUK, Adaptive trading strategies across liquidity pools, arXiv preprint arXiv:2008.07807, (2020).
  108. 108.0 108.1 K.GANCHEV, Y.NEVMYVAKA, M.KEARNS, and J.W. Vaughan, Censored exploration and the dark pool problem, Communications of the ACM, 53 (2010), pp.99--107.
  109. A.AGARWAL, P.BARTLETT, and M.DAMA, Optimal allocation strategies for the dark pool problem, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2010, pp.9--16.