
Further Developments for Mathematical Finance and Reinforcement Learning

[math] \newcommand*{\rom}[1]{\expandafter\@slowromancap\romannumeral #1@} \newcommand{\vertiii}[1]{{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert #1 \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}} \DeclareMathOperator*{\dprime}{\prime \prime} \DeclareMathOperator{\Tr}{Tr} \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\N}{\mathbb{N}} \DeclareMathOperator{\R}{\mathbb{R}} \DeclareMathOperator{\Sc}{\mathcal{S}} \DeclareMathOperator{\Ac}{\mathcal{A}} \DeclareMathOperator{\Pc}{\mathcal{P}} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\sx}{\underline{\sigma}_{\pmb{X}}} \DeclareMathOperator{\sqmin}{\underline{\sigma}_{\pmb{Q}}} \DeclareMathOperator{\sqmax}{\overline{\sigma}_{\pmb{Q}}} \DeclareMathOperator{\sqi}{\underline{\sigma}_{Q,\textit{i}}} \DeclareMathOperator{\sqnoti}{\underline{\sigma}_{\pmb{Q},-\textit{i}}} \DeclareMathOperator{\sqfir}{\underline{\sigma}_{\pmb{Q},1}} \DeclareMathOperator{\sqsec}{\underline{\sigma}_{\pmb{Q},2}} \DeclareMathOperator{\sru}{\underline{\sigma}_{\pmb{R}}^{u}} \DeclareMathOperator{\srv}{\underline{\sigma}_{\pmb{R}}^v} \DeclareMathOperator{\sri}{\underline{\sigma}_{R,\textit{i}}} \DeclareMathOperator{\srnoti}{\underline{\sigma}_{\pmb{R},\textit{-i}}} \DeclareMathOperator{\srfir}{\underline{\sigma}_{\pmb{R},1}} \DeclareMathOperator{\srsec}{\underline{\sigma}_{\pmb{R},2}} \DeclareMathOperator{\srmin}{\underline{\sigma}_{\pmb{R}}} \DeclareMathOperator{\srmax}{\overline{\sigma}_{\pmb{R}}} \DeclareMathOperator{\HH}{\mathcal{H}} \DeclareMathOperator{\HE}{\mathcal{H}(1/\varepsilon)} \DeclareMathOperator{\HD}{\mathcal{H}(1/\varepsilon)} \DeclareMathOperator{\HCKI}{\mathcal{H}(C(\pmb{K}^0))} \DeclareMathOperator{\HECK}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HECKI}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))} \DeclareMathOperator{\HC}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HCK}{\mathcal{H}(C(\pmb{K}))} 
\DeclareMathOperator{\HCKR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))} \DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))} \newcommand{\mathds}{\mathbb}[/math]

The general RL algorithms developed in the machine learning literature are good starting points for financial applications. A possible drawback is that such general-purpose algorithms tend to overfit, using more information than is actually required for a particular application. The stochastic control approach to many financial decision-making problems, on the other hand, carries the risk of model mis-specification, but it does capture the essential features of a given financial application from a modeling perspective, in terms of the dynamics and the reward function. One promising direction for RL in finance is therefore a closer integration of the modeling techniques (the domain knowledge) from the stochastic control literature and the key components of a given financial application (for example, adverse selection risk for market-making problems and execution risk for optimal liquidation problems) with the learning power of RL algorithms. Such an integrated framework is interesting from both the theoretical and the applications perspectives. From the application point of view, an RL algorithm tailored to a particular financial application could deliver better empirical performance, which can be verified by comparison with existing algorithms on available datasets. In addition, financial applications motivate new frameworks and testbeds for RL algorithms. Carrying out convergence and sample complexity analysis for these modified algorithms would also be a meaningful direction in which to proceed. Many of the papers referenced in this review provide good initial steps in this direction. We list below some future directions that the reader may find interesting.


Risk-aware or Risk-sensitive RL. Risk arises from the uncertainties associated with future events, and is inevitable since the consequences of actions are uncertain when a decision is made. Many decision-making problems in finance lead to trading strategies, and it is important to account for the risk of the proposed strategies (which could be measured, for instance, by the maximum draw-down, the variance, or the 5% percentile of the PnL distribution). Hence it would be interesting to include risk measures in the design of RL algorithms for financial applications. The challenge of risk-sensitive RL lies both in the non-linearity of the objective function with respect to the reward and in designing a risk-aware exploration mechanism. RL with risk-sensitive utility functions has been studied in several papers without regard to specific financial applications. The work of [1] proposes TD(0) and [math]Q[/math]-learning-style algorithms that transform temporal differences instead of cumulative rewards, and proves their convergence. Risk-sensitive RL with a general family of utility functions is studied in [2], which also proposes a [math]Q[/math]-learning algorithm with convergence guarantees. The work of [3] studies a risk-sensitive policy gradient algorithm, though with no theoretical guarantees. [4] considers risk-sensitive RL with exponential utility and proposes two efficient model-free algorithms, Risk-sensitive Value Iteration (RSVI) and Risk-sensitive [math]Q[/math]-learning (RSQ), with near-optimal sample complexity guarantees. [5] develops a martingale approach to learn policies that are sensitive to the uncertainty of the rewards and are meaningful under some market scenarios. Another line of work focuses on constrained RL problems with different risk criteria [6][7][8][9][10][11]. Very recently, [12] proposed a robust risk-aware reinforcement learning framework via robust optimization with a rank-dependent expected utility function; financial applications such as statistical arbitrage and portfolio optimization are discussed with detailed numerical examples. [13] develops a framework combining a policy-gradient-based RL method with dynamic convex risk measures for solving time-consistent risk-sensitive stochastic optimization problems. However, neither [12] nor [13] provides a sample complexity or asymptotic convergence analysis for the proposed algorithms.
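To illustrate the idea of transforming temporal differences rather than cumulative rewards, the following is a minimal sketch (not any of the cited papers' implementations) of tabular risk-sensitive Q-learning on a toy MDP. All names and parameter values are illustrative; the asymmetric scaling of the TD error by a parameter kappa is one simple way to make updates risk-averse.

```python
import numpy as np

def risk_sensitive_q_learning(P, R, kappa=0.5, gamma=0.9, alpha=0.1, steps=2000, seed=0):
    """Tabular Q-learning in which the TD error (not the cumulative reward) is
    passed through an asymmetric transform: positive errors are damped and
    negative errors amplified when kappa > 0, giving risk-averse updates."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(steps):
        # epsilon-greedy behaviour policy
        a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(Q[s]))
        s_next = rng.choice(n_states, p=P[s, a])
        delta = R[s, a] + gamma * np.max(Q[s_next]) - Q[s, a]
        # asymmetric risk transform of the TD error, kappa in (-1, 1);
        # kappa = 0 recovers ordinary (risk-neutral) Q-learning
        delta = (1 - kappa) * delta if delta > 0 else (1 + kappa) * delta
        Q[s, a] += alpha * delta
        s = s_next
    return Q
```

Setting kappa = 0 recovers the standard risk-neutral update, so the risk sensitivity enters only through the reweighting of good versus bad surprises.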

Offline Learning and Online Exploration. Online learning requires updating algorithm parameters in real time, which is impractical for many financial decision-making problems, especially in the high-frequency regime. A more plausible setting is to collect data with a pre-specified exploration scheme during trading hours and to update the algorithm with the newly collected data after the close of trading. This is closely related to the translation of online learning to offline regression [14] and to RL with batch data [15][16][17][18]. However, these developments focus on general methodologies and are not specifically tailored to financial applications.
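To make the batch setting concrete, here is a minimal sketch of offline fitted Q-iteration over a fixed dataset of transitions for a small tabular problem. The function name and setup are illustrative rather than drawn from the cited works; the point is that the entire update uses only logged data, matching the overnight-update regime described above.

```python
import numpy as np

def fitted_q_iteration(batch, n_states, n_actions, gamma=0.95, n_iters=50):
    """Offline (batch) RL sketch: repeatedly fit Q to empirical Bellman targets
    computed from a fixed set of (s, a, r, s_next) transitions, with no online
    interaction needed during the fitting."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        targets = np.zeros_like(Q)
        counts = np.zeros_like(Q)
        for s, a, r, s_next in batch:
            targets[s, a] += r + gamma * np.max(Q[s_next])
            counts[s, a] += 1
        mask = counts > 0
        Q[mask] = targets[mask] / counts[mask]  # average Bellman target per (s, a)
    return Q
```

State-action pairs never visited in the batch are simply left at their initial value, which is exactly the coverage problem that makes the choice of exploration scheme during trading hours important.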


Learning with a Limited Exploration Budget. Exploration can help agents find new policies that improve their future cumulative rewards. However, too much exploration can be both time-consuming and computationally expensive, and it may be very costly for some financial applications. In addition, exploring black-box trading strategies may require substantial justification within a financial institution, so investors tend to limit the effort put into exploration and to improve performance as much as possible within a given exploration budget. This idea is similar in spirit to conservative RL, where agents explore new strategies to maximize revenue while maintaining their revenue above a fixed baseline, uniformly over time [19]. It is also related to the problem of costly information acquisition, which has been studied for economic commodities [20] and in operations management [21]. It would be interesting to investigate such costs for decision-making problems in financial markets.
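The notion of a hard exploration budget can be illustrated with a toy two-armed bandit in which every exploratory pull is charged against a fixed budget, after which the agent commits to the empirically best arm. All names and parameters here are illustrative assumptions, not a method from the cited literature.

```python
import numpy as np

def budgeted_bandit(means, horizon=5000, budget=100.0, explore_cost=1.0, seed=0):
    """Bandit sketch: exploratory (uniformly random) pulls each consume
    explore_cost from a finite budget; once the budget is spent the agent
    exploits the arm with the best running-mean estimate."""
    rng = np.random.default_rng(seed)
    n = len(means)
    counts = np.zeros(n)
    values = np.zeros(n)
    spent = 0.0
    total_reward = 0.0
    for _ in range(horizon):
        if spent + explore_cost <= budget and rng.random() < 0.5:
            a = rng.integers(n)          # exploratory pull, charged to the budget
            spent += explore_cost
        else:
            a = int(np.argmax(values))   # exploit the current estimate
        r = rng.normal(means[a], 1.0)    # noisy reward from the chosen arm
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]  # incremental running mean
        total_reward += r
    return total_reward, spent
```

The interesting design questions, of course, are how to size the budget and how to spend it non-uniformly, which is where the conservative-RL and costly-information-acquisition literatures cited above come in.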

Learning with Multiple Objectives. A common problem in finance is to choose a portfolio in the presence of two conflicting objectives: the desire for the expected portfolio return to be as high as possible, and the desire for the risk, often measured by the standard deviation of portfolio returns, to be as low as possible. This problem is often represented by a graph in which the efficient frontier shows the best available combinations of risk and expected return, and indifference curves show the investor's preferences over risk-return combinations. Decision makers sometimes combine both criteria into a single objective function consisting of the difference between the expected reward and a scalar multiple of the risk. However, it may not be in the best interest of a decision maker to combine the relevant criteria linearly for certain applications. For example, market makers in the OTC market tend to view criteria such as turnaround time, balance sheet constraints, inventory cost, and profit and loss as separate objective functions. The study of multi-objective RL is still at a preliminary stage; relevant references include [22] and [23].
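The linear scalarization mentioned above can be made concrete in the Markowitz setting: maximizing [math]\mu^{\top}w - \lambda\, w^{\top}\Sigma w[/math] over fully-invested weights ([math]\mathbf{1}^{\top}w = 1[/math], shorting allowed) has a closed-form solution, and sweeping the risk-aversion weight [math]\lambda[/math] traces out the efficient frontier. The sketch below uses illustrative names and assumes an invertible covariance matrix.

```python
import numpy as np

def mean_variance_frontier(mu, Sigma, lambdas):
    """For each risk-aversion weight lam, solve
        max_w  mu'w - lam * w' Sigma w   s.t.  ones'w = 1
    via the Lagrangian first-order condition mu - 2 lam Sigma w - nu * ones = 0,
    and record the resulting (expected return, variance) pair."""
    n = len(mu)
    ones = np.ones(n)
    frontier = []
    for lam in lambdas:
        A_inv = np.linalg.inv(2 * lam * Sigma)
        # multiplier nu chosen so that the budget constraint ones'w = 1 holds
        nu = (ones @ A_inv @ mu - 1) / (ones @ A_inv @ ones)
        w = A_inv @ (mu - nu * ones)
        frontier.append((float(mu @ w), float(w @ Sigma @ w)))
    return frontier
```

Each scalarization weight recovers one point of the frontier; the OTC market-making example in the text is precisely a setting where no single such weighting adequately represents the decision maker's preferences.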


Learning to Allocate Across Lit Pools and Dark Pools. The online optimization methods explored in [24] and [25] for dark pool allocation can be viewed as single-period RL algorithms, and the Bayesian framework developed in [26] for allocation across lit pools may be classified as a model-based RL approach. However, there is currently no work applying multi-period, model-free RL methods to learn how to route orders across both dark pools and lit pools. This is an interesting direction to explore, since agents sometimes have access to both types of venue, and the two contrasting pool types have quite different information structures and matching mechanisms.
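As a flavor of what a learning-based router could look like, here is a toy multiplicative-weights allocation across hypothetical venues with unknown fill probabilities. The update rule is an exponentiated-gradient-style heuristic written for illustration; it is not the algorithm of [24], [25], or [26], and all names and parameters are assumptions.

```python
import numpy as np

def allocate_exp_gradient(fill_probs, total=100, rounds=200, eta=0.05, seed=0):
    """Repeatedly split an order of `total` shares across venues in proportion
    to a weight vector, observe random (censored) fills, and shift weight
    multiplicatively toward venues that filled more."""
    rng = np.random.default_rng(seed)
    n = len(fill_probs)
    w = np.ones(n) / n                   # start with a uniform split
    for _ in range(rounds):
        alloc = total * w
        fills = rng.binomial(alloc.astype(int), fill_probs)  # simulated executions
        grad = fills / max(total, 1)     # fraction of the order filled per venue
        w = w * np.exp(eta * grad)       # exponentiated-gradient-style update
        w /= w.sum()                     # project back onto the simplex
    return w
```

A genuinely multi-period, model-free treatment would additionally have to track order-book state in the lit venues and the censored-feedback structure of the dark venues, which is what makes the combined routing problem interesting.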


Robo-advising in a Model-free Setting. As introduced in Section, [27] considered learning within a set of [math]m[/math] pre-specified investment portfolios, while [28] and [29] developed learning algorithms and procedures to infer risk preferences under the framework of Markowitz mean-variance portfolio optimization. It would be interesting to consider a model-free RL approach in which the robo-advisor has the freedom to learn and improve decisions beyond a pre-specified set of strategies or the Markowitz framework.


Sample Efficiency in Learning Trading Strategies. In recent years, sample complexity has been studied extensively as a way to understand modern reinforcement learning algorithms (see Sections \ref{sec:rl_basics}-\ref{sec:deep_value_based}). However, most RL algorithms still require a large number of samples to train a decent trading algorithm, which may exceed the amount of relevant historical data available. Financial time series are known to be non-stationary [30], so historical data from further back in time may not help in training efficient learning algorithms for the current market environment. This leads to the important questions of designing more sample-efficient RL algorithms for financial applications and of developing good market simulators that can generate (unlimited) realistic market scenarios [31].


Transfer Learning and Cold Start for Learning New Assets. Financial institutions or individuals may change the baskets of assets they trade over time. Possible reasons are that new assets (for example, corporate bonds) are issued from time to time, or that investors switch their interest from one sector to another. Two interesting research directions arise from this situation. First, when an investor has a good trading strategy, trained by an RL algorithm for one asset, how should they transfer that experience to train a trading algorithm for a “similar” asset with fewer samples? This is closely related to transfer learning [32][33]. To the best of our knowledge, no study of financial applications has been carried out along this direction. Second, there is the cold-start problem for newly issued assets: with very limited data for a new asset, how should we initialize an RL algorithm and learn a decent strategy using the limited available data and our experience (i.e., trained RL algorithms or data) with other long-standing assets?
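One of the simplest instances of the cold-start idea is to warm-start the value estimates for a new asset from a trained table for a similar asset, shrunk toward zero to reflect reduced confidence in the transfer. This is only an illustrative sketch under that assumption, not a method from the literature cited above.

```python
import numpy as np

def warm_start_q(Q_source, n_states, n_actions, shrink=0.5):
    """Initialize a new asset's Q-table from a 'similar' asset's trained table,
    copying the overlapping (state, action) block and scaling it by `shrink`;
    any extra states/actions of the new asset start at zero."""
    Q_new = np.zeros((n_states, n_actions))
    rows = min(n_states, Q_source.shape[0])
    cols = min(n_actions, Q_source.shape[1])
    Q_new[:rows, :cols] = shrink * Q_source[:rows, :cols]
    return Q_new
```

How much to shrink, and how to measure "similarity" between assets in the first place, are exactly the open questions raised in this direction.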

Acknowledgement

We thank Xuefeng Gao, Anran Hu, Xiao-Yang Liu, Wenpin Tang, Ziyi Xia, Zhuoran Yang, Junzi Zhang and Zeyu Zheng for helpful discussions and comments on this survey.

Potential Danger: Algorithmic Collusion

  • Calvano, E., Calzolari, G., Denicolo, V. and Pastorello, S. (2020). Artificial intelligence, algorithmic pricing, and collusion. American Economic Review, 110(10), pp. 3267-97.
  • Calvano, E., Calzolari, G., Denicolo, V. and Pastorello, S. (2021). Algorithmic collusion with imperfect monitoring. International Journal of Industrial Organization, p. 102712.

General references

Hambly, Ben; Xu, Renyuan; Yang, Huining (2023). "Recent Advances in Reinforcement Learning in Finance". arXiv:2112.04553 [q-fin.MF].

References

  1. Cite error: Invalid <ref> tag; no text was provided for refs named mihatsch2002risk
  2. Cite error: Invalid <ref> tag; no text was provided for refs named shen2014risk
  3. Cite error: Invalid <ref> tag; no text was provided for refs named eriksson2019epistemic
  4. Cite error: Invalid <ref> tag; no text was provided for refs named fei2020risk
  5. Cite error: Invalid <ref> tag; no text was provided for refs named vadori2020risk
  6. Cite error: Invalid <ref> tag; no text was provided for refs named achiam2017constrained
  7. Cite error: Invalid <ref> tag; no text was provided for refs named chow2017risk
  8. Cite error: Invalid <ref> tag; no text was provided for refs named chow2015risk
  9. Cite error: Invalid <ref> tag; no text was provided for refs named ding2021provably
  10. Cite error: Invalid <ref> tag; no text was provided for refs named tamar2015policy
  11. Cite error: Invalid <ref> tag; no text was provided for refs named zheng2020constrained
  12. Cite error: Invalid <ref> tag; no text was provided for refs named jaimungal2021robust
  13. Cite error: Invalid <ref> tag; no text was provided for refs named coache2021reinforcement
  14. Cite error: Invalid <ref> tag; no text was provided for refs named simchi2020bypassing
  15. Cite error: Invalid <ref> tag; no text was provided for refs named chen2019information
  16. Cite error: Invalid <ref> tag; no text was provided for refs named gao2019batched
  17. Cite error: Invalid <ref> tag; no text was provided for refs named garcelon2020conservative
  18. Cite error: Invalid <ref> tag; no text was provided for refs named ren2020dynamic
  19. Cite error: Invalid <ref> tag; no text was provided for refs named wu2016conservative
  20. Cite error: Invalid <ref> tag; no text was provided for refs named pomatto2018cost
  21. Cite error: Invalid <ref> tag; no text was provided for refs named ke2016search
  22. Cite error: Invalid <ref> tag; no text was provided for refs named zhou2020provable
  23. Cite error: Invalid <ref> tag; no text was provided for refs named yang2019generalized
  24. Cite error: Invalid <ref> tag; no text was provided for refs named agarwal2010optimal
  25. Cite error: Invalid <ref> tag; no text was provided for refs named ganchev2010censored
  26. Cite error: Invalid <ref> tag; no text was provided for refs named baldacci2020adaptive
  27. Cite error: Invalid <ref> tag; no text was provided for refs named alsabah2021robo
  28. Cite error: Invalid <ref> tag; no text was provided for refs named wang2021robo
  29. Cite error: Invalid <ref> tag; no text was provided for refs named yu2020learning
  30. Cite error: Invalid <ref> tag; no text was provided for refs named huang2003applications
  31. Cite error: Invalid <ref> tag; no text was provided for refs named wiese2020quant
  32. Cite error: Invalid <ref> tag; no text was provided for refs named torrey2010transfer
  33. Cite error: Invalid <ref> tag; no text was provided for refs named pan2009survey