<div class="d-none"><math>
\newcommand*{\rom}[1]{\expandafter\@slowromancap\romannumeral #1@}
\newcommand{\vertiii}[1]{{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert #1
    \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}}
\DeclareMathOperator*{\dprime}{\prime \prime}
\DeclareMathOperator{\Tr}{Tr}
\DeclareMathOperator{\E}{\mathbb{E}}
\DeclareMathOperator{\N}{\mathbb{N}}
\DeclareMathOperator{\R}{\mathbb{R}}
\DeclareMathOperator{\Sc}{\mathcal{S}}
\DeclareMathOperator{\Ac}{\mathcal{A}}
\DeclareMathOperator{\Pc}{\mathcal{P}}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator{\sx}{\underline{\sigma}_{\pmb{X}}}
\DeclareMathOperator{\sqmin}{\underline{\sigma}_{\pmb{Q}}}
\DeclareMathOperator{\sqmax}{\overline{\sigma}_{\pmb{Q}}}
\DeclareMathOperator{\sqi}{\underline{\sigma}_{Q,\textit{i}}}
\DeclareMathOperator{\sqnoti}{\underline{\sigma}_{\pmb{Q},-\textit{i}}}
\DeclareMathOperator{\sqfir}{\underline{\sigma}_{\pmb{Q},1}}
\DeclareMathOperator{\sqsec}{\underline{\sigma}_{\pmb{Q},2}}
\DeclareMathOperator{\sru}{\underline{\sigma}_{\pmb{R}}^{u}}
\DeclareMathOperator{\srv}{\underline{\sigma}_{\pmb{R}}^v}
\DeclareMathOperator{\sri}{\underline{\sigma}_{R,\textit{i}}}
\DeclareMathOperator{\srnoti}{\underline{\sigma}_{\pmb{R},\textit{-i}}}
\DeclareMathOperator{\srfir}{\underline{\sigma}_{\pmb{R},1}}
\DeclareMathOperator{\srsec}{\underline{\sigma}_{\pmb{R},2}}
\DeclareMathOperator{\srmin}{\underline{\sigma}_{\pmb{R}}}
\DeclareMathOperator{\srmax}{\overline{\sigma}_{\pmb{R}}}
\DeclareMathOperator{\HH}{\mathcal{H}}
\DeclareMathOperator{\HE}{\mathcal{H}(1/\varepsilon)}
\DeclareMathOperator{\HD}{\mathcal{H}(1/\varepsilon)}
\DeclareMathOperator{\HCKI}{\mathcal{H}(C(\pmb{K}^0))}
\DeclareMathOperator{\HECK}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))}
\DeclareMathOperator{\HECKI}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))}
\DeclareMathOperator{\HC}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))}
\DeclareMathOperator{\HCK}{\mathcal{H}(C(\pmb{K}))}
\DeclareMathOperator{\HCKR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}))}
\DeclareMathOperator{\HCKIR}{\mathcal{H}(1/\varepsilon,C(\pmb{K}^0))}
\newcommand{\mathds}{\mathbb}</math></div>
The general RL algorithms developed in the machine learning literature are good starting points for use in financial applications. A possible drawback is that such general RL algorithms tend to overfit, using more information than is actually required for a particular application. On the other hand, the stochastic control approach to many financial decision-making problems may suffer from the risk of model mis-specification. However, it may capture the essential features of a given financial application from a modeling perspective, in terms of the dynamics and the reward function.
One promising direction for RL in finance is to develop an even closer integration of the modeling techniques (the domain knowledge) from the stochastic control literature and key components of a given financial application (for example the adverse selection risk for market-making problems and the execution risk for optimal liquidation problems) with the learning power of the RL algorithms. This line of developing a more integrated framework is interesting from both theoretical and applications perspectives. From the application point of view, a modified RL algorithm, with designs tailored to one particular financial application, could lead to better empirical performance. This could be verified by comparison with existing algorithms on the available datasets. In addition, financial applications motivate potential new frameworks and playgrounds for RL algorithms. Carrying out the convergence and sample complexity analysis for these modified algorithms would also be a meaningful direction in which to proceed. Many of the papers referenced in this review provide great initial steps in this direction. We list the following future directions that the reader may find interesting.

'''Risk-aware or Risk-sensitive RL.''' Risk arises from the uncertainties associated with future events, and is inevitable since the consequences of actions are uncertain at the time when a decision is made. Many decision-making problems in finance lead to trading strategies and it is important to account for the risk of the proposed strategies (which could be measured, for instance, by the maximum draw-down, the variance or the 5% value-at-risk).
Hence it would be interesting to include risk measures in the design of RL algorithms for financial applications. The challenge of risk-sensitive RL lies both in the non-linearity of the objective function with respect to the reward and in designing a risk-aware exploration mechanism.
RL with risk-sensitive utility functions has been studied in several papers without regard to specific financial applications.  The work of <ref name="mihatsch2002risk">O.MIHATSCH and R.NEUNEIER, ''Risk-sensitive reinforcement learning'',  Machine Learning, 49 (2002), pp.267--290.</ref> proposes TD(0) and <math>Q</math>-learning-style algorithms that transform temporal differences
instead of cumulative rewards, and proves their convergence. Risk-sensitive RL with a general
family of utility functions is studied in <ref name="shen2014risk">Y.SHEN, M.J. Tobia, T.SOMMER, and K.OBERMAYER, ''Risk-sensitive  reinforcement learning'', Neural Computation, 26 (2014), pp.1298--1328.</ref>, which also proposes a <math>Q</math>-learning algorithm with convergence guarantees. The work of <ref name="eriksson2019epistemic">H.ERIKSSON and C.DIMITRAKAKIS, ''Epistemic risk-sensitive  reinforcement learning'', arXiv preprint arXiv:1906.06273,  (2019).</ref> studies a risk-sensitive policy gradient algorithm, though
with no theoretical guarantees. <ref name="fei2020risk">Y.FEI, Z.YANG, Y.CHEN, Z.WANG, and Q.XIE, ''Risk-sensitive  reinforcement learning: Near-optimal risk-sample tradeoff in regret'', in  NeurIPS, 2020.</ref>  considers the problem of risk-sensitive RL with exponential utility and proposes
two efficient model-free algorithms, Risk-sensitive Value Iteration (RSVI) and Risk-sensitive <math>Q</math>-learning (RSQ), with a near-optimal sample complexity guarantee.  
<ref name="vadori2020risk"/> developed a martingale approach to  learn policies that are sensitive to the uncertainty of the rewards and are meaningful under some market scenarios.
<ref name="vadori2020risk">N.VADORI, S.GANESH, P.REDDY, and M.VELOSO, ''Risk-sensitive  reinforcement learning: A martingale approach to reward uncertainty'', arXiv  preprint arXiv:2006.12686,  (2020).</ref> developed a martingale approach to  learn policies that are sensitive to the uncertainty of the rewards and are meaningful under some market scenarios.
Another line of work focuses on constrained RL problems with different risk criteria <ref name="achiam2017constrained">J.ACHIAM, D.HELD, A.TAMAR, and P.ABBEEL, ''Constrained policy  optimization'', in International Conference on Machine Learning, PMLR, 2017,  pp.22--31.</ref><ref name="chow2017risk">Y.CHOW, M.GHAVAMZADEH, L.JANSON, and M.PAVONE, ''Risk-constrained  reinforcement learning with percentile risk criteria'', The Journal of Machine  Learning Research, 18 (2017), pp.6070--6120.</ref><ref name="chow2015risk">Y.CHOW, A.TAMAR, S.MANNOR, and M.PAVONE, ''Risk-sensitive and  robust decision-making: A CVaR optimization approach'', in NIPS'15, MIT  Press, 2015, pp.1522--1530.</ref><ref name="ding2021provably">D.DING, X.WEI, Z.YANG, Z.WANG, and M.JOVANOVIC, ''Provably  efficient safe exploration via primal-dual policy optimization'', in  International Conference on Artificial Intelligence and Statistics, PMLR,  2021, pp.3304--3312.</ref><ref name="tamar2015policy">A.TAMAR, Y.CHOW, M.GHAVAMZADEH, and S.MANNOR, ''Policy gradient  for coherent risk measures'', in Advances in Neural Information  Processing Systems, vol.28, 2015.</ref><ref name="zheng2020constrained">L.ZHENG and L.RATLIFF, ''Constrained upper confidence reinforcement  learning'', in Learning for Dynamics and Control, PMLR, 2020, pp.620--629.</ref>. Very recently, <ref name="jaimungal2021robust">S.JAIMUNGAL, S.M. Pesenti, Y.S. Wang, and H.TATSAT, ''Robust  risk-aware reinforcement learning'', Available at SSRN 3910498,  (2021).</ref> proposes a robust risk-aware reinforcement learning framework via robust optimization with a rank-dependent expected utility function. Financial applications such as statistical arbitrage and portfolio optimization are discussed with detailed numerical examples. <ref name="coache2021reinforcement">A.COACHE and S.JAIMUNGAL, ''Reinforcement learning with dynamic  convex risk measures'', arXiv preprint arXiv:2112.13414,  (2021).</ref> develops a framework combining policy-gradient-based RL methods with dynamic convex risk measures for solving time-consistent risk-sensitive stochastic optimization problems. However, no sample complexity or asymptotic convergence analysis is provided for the algorithms proposed in <ref name="jaimungal2021robust"/><ref name="coache2021reinforcement"/>.
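
To illustrate the first of these challenges, the sketch below implements a tabular <math>Q</math>-learning update in which the temporal difference, rather than the cumulative reward, is passed through an asymmetric transform, broadly in the spirit of the TD-transformation idea discussed above. The environment interface (<code>env.reset</code>, <code>env.step</code>) and the particular piecewise-linear transform are illustrative assumptions, not the exact algorithm of any of the cited works.
<syntaxhighlight lang="python">
import numpy as np

def risk_sensitive_q_learning(env, n_states, n_actions, kappa=0.5,
                              alpha=0.1, gamma=0.99, eps=0.1,
                              n_episodes=500, rng=None):
    """Tabular Q-learning in which the temporal difference, not the
    cumulative reward, is transformed; kappa in (-1, 1) controls risk
    sensitivity (kappa > 0 weights negative surprises more heavily).

    Assumed interface: env.reset() -> state, env.step(a) -> (state, reward, done)."""
    rng = np.random.default_rng() if rng is None else rng
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * np.max(Q[s_next])) - Q[s, a]
            # asymmetric transform of the TD error: downside surprises
            # are scaled by (1 + kappa), upside surprises by (1 - kappa)
            delta = (1.0 - kappa) * delta if delta > 0 else (1.0 + kappa) * delta
            Q[s, a] += alpha * delta
            s = s_next
    return Q
</syntaxhighlight>
Setting <code>kappa = 0</code> recovers standard <math>Q</math>-learning, while <code>kappa > 0</code> penalises downside surprises more heavily and hence tends to produce more risk-averse policies.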


'''Offline Learning and Online Exploration.'''
Online learning requires updating algorithm parameters in real time, which is impractical for many financial decision-making problems, especially in the high-frequency regime. The most plausible setting is to collect data with a pre-specified exploration scheme during trading hours and update the algorithm with the newly collected data after the close of trading. This is closely
related to the translation of online learning to offline regression <ref name="simchi2020bypassing">D.SIMCHI-Levi and Y.XU, ''Bypassing the monster: A faster and  simpler optimal algorithm for contextual bandits under realizability'',  Available at SSRN 3562765,  (2020).</ref> and RL with batch data <ref name="chen2019information">J.CHEN and N.JIANG, ''Information-theoretic considerations in batch  reinforcement learning'', in International Conference on Machine Learning,  PMLR, 2019, pp.1042--1051.</ref><ref name="gao2019batched">Z.GAO, Y.HAN, Z.REN, and Z.ZHOU, ''Batched multi-armed bandits  problem'', in Advances in Neural Information Processing Systems,  vol.32, 2019.</ref><ref name="garcelon2020conservative">E.GARCELON, M.GHAVAMZADEH, A.LAZARIC, and M.PIROTTA, ''Conservative exploration in reinforcement learning'', in International  Conference on Artificial Intelligence and Statistics, PMLR, 2020,  pp.1431--1441.</ref><ref name="ren2020dynamic">Z.REN and Z.ZHOU, ''Dynamic batch learning in high-dimensional  sparse linear contextual bandits'', arXiv preprint arXiv:2008.11918,  (2020).</ref>. However, these  developments focus on general methodologies without being specifically tailored to financial applications.
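
As a minimal sketch of this offline setting, the snippet below performs fitted <math>Q</math> iteration with a linear function approximator on a batch of logged transitions; the transition format and the feature map are assumptions made purely for illustration.
<syntaxhighlight lang="python">
import numpy as np

def fitted_q_iteration(batch, featurize, n_actions, gamma=0.99, n_iters=50, reg=1e-3):
    """Offline (batch) RL: repeatedly regress Bellman targets onto features
    of (state, action) pairs using only logged transitions.

    batch: list of (s, a, r, s_next, done) tuples collected, for example,
    during the trading day under a pre-specified exploration scheme.
    featurize(s, a): returns a 1-D feature vector."""
    S, A, R, S_next, D = zip(*batch)
    X = np.array([featurize(s, a) for s, a in zip(S, A)])
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # greedy one-step lookahead value under the current weight vector
        next_q = np.array([
            0.0 if d else max(featurize(s2, a2) @ w for a2 in range(n_actions))
            for s2, d in zip(S_next, D)
        ])
        y = np.array(R) + gamma * next_q
        # ridge regression for the new Q-function weights
        w = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)
    return w
</syntaxhighlight>
In practice the regression step could use a more flexible model (for example gradient-boosted trees or a neural network), and the batch would be refreshed after each trading day.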




'''Learning with a Limited Exploration Budget.''' Exploration can help agents to find new policies to improve their future cumulative rewards. However, too much exploration can be both time-consuming and computationally demanding, and in particular, it may be very costly for some financial applications. Additionally, exploring black-box trading strategies may need a lot of justification within a financial institution, and hence investors tend to limit the effort put into exploration and try to improve performance as much as possible within a given budget for exploration. This idea is similar in spirit to conservative RL, where agents explore new strategies to maximize revenue whilst simultaneously maintaining their revenue above a fixed baseline, uniformly over time <ref name="wu2016conservative">Y.WU, R.SHARIFF, T.LATTIMORE, and C.SZEPESVÁRI, ''Conservative  bandits'', in International Conference on Machine Learning, PMLR, 2016,  pp.1254--1262.</ref>. This is also related to the problem of information acquisition with a cost, which has been studied for economic commodities <ref name="pomatto2018cost">L.POMATTO, P.STRACK, and O.TAMUZ, ''The cost of information'', arXiv  preprint arXiv:1812.04211,  (2018).</ref> and in operations management <ref name="ke2016search">T.T. Ke, Z.-J.M. Shen, and J.M. Villas-Boas, ''Search for  information on multiple products'', Management Science, 62 (2016),  pp.3576--3603.</ref>. It may also be interesting to investigate such costs for decision-making problems in financial markets.
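
A minimal sketch of this conservative idea, in a multi-armed bandit setting, is given below: the agent explores only while its realised cumulative reward stays above a fixed fraction of what a trusted baseline strategy would be expected to have earned. The bandit abstraction, the <code>pull</code> interface and the specific budget rule are simplifying assumptions rather than the algorithm of the cited work.
<syntaxhighlight lang="python">
import numpy as np

def budgeted_exploration(pull, baseline_arm, baseline_mean, n_arms,
                         horizon, alpha=0.05, eps=0.1, rng=None):
    """Explore only while cumulative reward stays above (1 - alpha) times
    the expected cumulative reward of a trusted baseline arm.

    pull(a): user-supplied callable returning the reward of arm a."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    cum_reward = 0.0
    for t in range(1, horizon + 1):
        within_budget = cum_reward >= (1.0 - alpha) * baseline_mean * (t - 1)
        if within_budget and rng.random() < eps:
            a = int(rng.integers(n_arms))      # affordable exploration
        elif within_budget:
            a = int(np.argmax(means))          # exploit current estimates
        else:
            a = baseline_arm                   # fall back to the baseline
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        cum_reward += r
    return means
</syntaxhighlight>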


'''Learning with Multiple Objectives.''' In finance, a common problem is to choose a portfolio when there are two conflicting objectives: the desire to have the expected value of portfolio returns be as high as possible, and the desire to have risk, often measured by the standard deviation of portfolio returns, be as low as possible. This problem is often represented by a graph in which the efficient frontier shows the best combinations of risk and expected return that are available, and in which indifference curves show the investor's preferences for various combinations of risk and expected return. Decision makers sometimes combine both criteria into a single objective function consisting of the difference of the expected reward and a scalar multiple of the risk. However, it may well not be in the best interest of a decision maker to combine relevant criteria in a linear format for certain applications. For example, market makers in OTC markets tend to view criteria such as turnaround time, balance sheet constraints, inventory cost, and profit and loss as separate objective functions. The study of multi-objective RL is still at a preliminary stage and relevant references include <ref name="zhou2020provable">D.ZHOU, J.CHEN, and Q.GU, ''Provable multi-objective reinforcement  learning with generative models'', arXiv preprint arXiv:2011.10134,  (2020).</ref> and <ref name="yang2019generalized">R.YANG, X.SUN, and K.NARASIMHAN, ''A generalized algorithm for  multi-objective reinforcement learning and policy adaptation'', in Advances in  Neural Information Processing Systems, vol.32, 2019.</ref>.
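
The scalarisation mentioned above can be made concrete as follows: for a risk-aversion weight <math>\lambda</math>, one maximises the single objective <math>w^\top \mu - \lambda\, w^\top \Sigma w</math>, whose unconstrained maximiser is <math>w^*(\lambda) = \Sigma^{-1}\mu/(2\lambda)</math>; sweeping <math>\lambda</math> traces out different risk/return trade-offs. The snippet below does exactly this, with purely illustrative numbers.
<syntaxhighlight lang="python">
import numpy as np

def scalarized_portfolios(mu, Sigma, lambdas):
    """For each risk-aversion weight lam, maximise the scalarised objective
    E[return] - lam * Var[return] (unconstrained), i.e.
    w*(lam) = Sigma^{-1} mu / (2 lam), and report both objectives."""
    Sigma_inv = np.linalg.inv(Sigma)
    frontier = []
    for lam in lambdas:
        w = Sigma_inv @ mu / (2.0 * lam)
        frontier.append((float(w @ mu), float(w @ Sigma @ w)))  # (mean, variance)
    return frontier

# toy example with two assets (illustrative numbers only)
mu = np.array([0.08, 0.12])
Sigma = np.array([[0.04, 0.01], [0.01, 0.09]])
print(scalarized_portfolios(mu, Sigma, lambdas=[1.0, 2.0, 5.0]))
</syntaxhighlight>
A genuinely multi-objective treatment would instead keep the criteria separate and search for Pareto-optimal policies, which is what the multi-objective RL references above aim at.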




'''Learning to Allocate Across Lit Pools and Dark Pools.''' The online optimization methods explored in <ref name="agarwal2010optimal">A.AGARWAL, P.BARTLETT, and M.DAMA, ''Optimal allocation strategies  for the dark pool problem'', in Proceedings of the Thirteenth International  Conference on Artificial Intelligence and Statistics, JMLR Workshop and  Conference Proceedings, 2010, pp.9--16.</ref> and <ref name="ganchev2010censored">K.GANCHEV, Y.NEVMYVAKA, M.KEARNS, and J.W. Vaughan, ''Censored  exploration and the dark pool problem'', Communications of the ACM, 53 (2010),  pp.99--107.</ref> for dark pool allocations can be viewed as single-period RL algorithms, and the Bayesian framework developed in <ref name="baldacci2020adaptive">B.BALDACCI and I.MANZIUK, ''Adaptive trading strategies across  liquidity pools'', arXiv preprint arXiv:2008.07807,  (2020).</ref> for allocations across lit pools may be classified as a model-based RL approach. However, there is currently no existing work on applying multi-period and model-free RL methods to learn how to route orders across both dark pools and lit pools. We think this might be an interesting direction to explore, as agents sometimes have access to both lit pools and dark pools, and these two contrasting pool types have quite different information structures and matching mechanisms.
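
To illustrate the kind of information structure involved, the toy sketch below repeatedly splits a parent order across several venues, observes only the censored fill <code>min(allocation, available liquidity)</code>, and tilts future allocations towards venues that filled more. This is only a stylised single-period allocation loop under assumed liquidity distributions, not the algorithms of the works cited above.
<syntaxhighlight lang="python">
import numpy as np

def allocate_with_censored_feedback(venue_liquidity, total_order=1000,
                                    n_rounds=200, eta=0.1, rng=None):
    """Toy allocation across venues (e.g. several dark pools): split the
    parent order in proportion to exponential weights, observe only the
    censored fill min(allocation, liquidity), and update the weights."""
    rng = np.random.default_rng() if rng is None else rng
    n_venues = len(venue_liquidity)
    weights = np.ones(n_venues)
    for _ in range(n_rounds):
        alloc = total_order * weights / weights.sum()
        liquidity = np.array([draw(rng) for draw in venue_liquidity])
        fills = np.minimum(alloc, liquidity)          # censored observation
        fill_rate = fills / np.maximum(alloc, 1e-12)
        weights *= np.exp(eta * fill_rate)            # multiplicative update
    return weights / weights.sum()

# two venues with different hidden liquidity distributions (illustrative only)
venues = [lambda rng: rng.exponential(300.0), lambda rng: rng.exponential(800.0)]
print(allocate_with_censored_feedback(venues))
</syntaxhighlight>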




'''Robo-advising in a Model-free Setting.'''
As introduced in [[guide:812d89983e#sec:robo-advising |the section on robo-advising]], <ref name="alsabah2021robo">H.ALSABAH, A.CAPPONI, O.RUIZLACEDELLI, and M.STERN, ''Robo-advising: Learning investors' risk preferences via portfolio choices'',  Journal of Financial Econometrics, 19 (2021), pp.369--392.</ref> considers learning within a set of <math>m</math> pre-specified investment portfolios, while <ref name="wang2021robo">H.WANG and S.YU, ''Robo-advising: Enhancing investment with inverse  optimization and deep reinforcement learning'', arXiv preprint  arXiv:2105.09264,  (2021).</ref> and <ref name="yu2020learning">S.YU, H.WANG, and C.DONG, ''Learning risk preferences from  investment portfolios using inverse optimization'', arXiv preprint  arXiv:2010.01687,  (2020).</ref> develop, respectively, learning algorithms and procedures to infer risk preferences under the framework of Markowitz mean-variance portfolio optimization. It would be interesting to consider a model-free RL approach where the robo-advisor has the freedom to learn and improve decisions beyond a pre-specified set of strategies or the Markowitz framework.
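
As a stylised example of what inferring risk preferences can look like in the Markowitz setting, the snippet below recovers a single risk-aversion parameter from observed portfolio weights under the unconstrained mean-variance objective <math>\max_w \; w^\top\mu - \tfrac{\gamma}{2} w^\top\Sigma w</math>, for which the optimal weights are <math>w^* = \Sigma^{-1}\mu/\gamma</math>. This is a toy inverse-optimization calculation with illustrative numbers, not the procedure of the cited works.
<syntaxhighlight lang="python">
import numpy as np

def infer_risk_aversion(w_observed, mu, Sigma):
    """Under the unconstrained Markowitz objective
        max_w  w' mu - (gamma / 2) * w' Sigma w,
    the optimal weights are w* = Sigma^{-1} mu / gamma, so gamma can be
    recovered from observed weights by a one-parameter least-squares fit."""
    b = np.linalg.solve(Sigma, mu)          # Sigma^{-1} mu
    inv_gamma = (b @ w_observed) / (b @ b)  # least-squares scale of w on b
    return 1.0 / inv_gamma

# toy example: an investor with gamma = 4 (illustrative numbers only)
mu = np.array([0.06, 0.10, 0.04])
Sigma = np.diag([0.03, 0.08, 0.02])
w_obs = np.linalg.solve(Sigma, mu) / 4.0
print(infer_risk_aversion(w_obs, mu, Sigma))   # approximately 4.0
</syntaxhighlight>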




'''Sample Efficiency in Learning Trading Strategies.'''
In recent years, sample complexity has been studied extensively  to understand modern reinforcement learning algorithms (see [[guide:C8c80a2ae8|The Basics of Reinforcement Learning]]-[[guide:576dcdd2b6#Deep Value-based RL Algorithms|Deep Value-based RL Algorithms]]). However, most RL algorithms still require a large number of samples to train a decent trading algorithm, which may exceed the amount of relevant available historical data. Financial time series are known to be non-stationary <ref name="huang2003applications">N.E. Huang, M.-L. Wu, W.QU, S.R. Long, and S.S. Shen, ''Applications of Hilbert--Huang transform to non-stationary financial time  series analysis'', Applied Stochastic Models in Business and Industry, 19  (2003), pp.245--268.</ref>, and hence historical data that are further away in time may not be helpful in training efficient learning algorithms for the current market environment. This leads to important questions of designing more sample-efficient RL algorithms for financial applications or developing good market simulators that could generate (unlimited) realistic market scenarios <ref name="wiese2020quant">M.WIESE, R.KNOBLOCH, R.KORN, and P.KRETSCHMER, ''Quant GANs:  Deep generation of financial time series'', Quantitative Finance, 20 (2020),  pp.1419--1440.</ref>.
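
As a placeholder for the market simulators mentioned above, the snippet below generates synthetic price paths from geometric Brownian motion on which an RL agent could be trained when relevant historical data are scarce; the dynamics and parameter values are purely illustrative, and a realistic simulator (for example a GAN-based one) would replace this toy generator.
<syntaxhighlight lang="python">
import numpy as np

def simulate_gbm_paths(s0=100.0, mu=0.05, sigma=0.2, dt=1/252,
                       n_steps=252, n_paths=1000, rng=None):
    """Toy market simulator: geometric Brownian motion price paths that an
    RL agent could be trained on when historical data are limited."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal((n_paths, n_steps))
    log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.cumsum(log_returns, axis=1))

paths = simulate_gbm_paths()
print(paths.shape)   # (1000, 252)
</syntaxhighlight>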




'''Transfer Learning and Cold Start for Learning New Assets.'''
Financial institutions or individuals may change their baskets of assets to trade over time. Possible reasons may be that new assets (for example corporate bonds) are issued from time to time, or that investors may switch their interest from one sector to another. There are two interesting research directions related to this situation. When an investor has a good trading strategy, trained by an RL algorithm for one asset, how should they transfer this experience to train a trading algorithm for a “similar” asset with fewer samples? This is closely related to transfer learning <ref name="torrey2010transfer">L.TORREY and J.SHAVLIK, ''Transfer learning'', in Handbook of  Research on Machine Learning Applications and Trends: Algorithms, Methods,  and Techniques, IGI Global, 2010, pp.242--264.</ref><ref name="pan2009survey">S.J. Pan and Q.YANG, ''A survey on transfer learning'', IEEE  Transactions on Knowledge and Data Engineering, 22 (2009), pp.1345--1359.</ref>. To the best of our knowledge, no study for financial applications has been carried out along this direction. Another question is the cold-start problem for newly issued assets. When we have very limited data for a new asset, how should we initialize an RL algorithm and learn a decent strategy using the limited available data and our experience (i.e., the trained RL algorithm or data) with other longstanding assets?
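
One simple way to transfer experience between assets is to warm-start the learner for the new asset from a network trained on a similar one and fine-tune only part of it on the limited new data, as in the generic PyTorch sketch below. The network architecture, the assumption that both assets share the same state and action dimensions, and the choice of which layers to freeze are all illustrative.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

def make_q_network(state_dim, n_actions):
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )

def warm_start_for_new_asset(source_net, state_dim, n_actions, freeze_shared=True):
    """Initialise a Q-network for a newly traded asset from one trained on a
    'similar' asset: copy all weights, optionally freeze the shared feature
    layers, and fine-tune only the final layer on the limited new data."""
    target_net = make_q_network(state_dim, n_actions)
    target_net.load_state_dict(source_net.state_dict())
    if freeze_shared:
        for layer in list(target_net)[:-1]:
            for p in layer.parameters():
                p.requires_grad = False
    # only the unfrozen parameters are passed to the optimiser
    trainable = [p for p in target_net.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-4)
    return target_net, optimizer

source = make_q_network(state_dim=10, n_actions=3)   # stands in for a trained network
new_net, opt = warm_start_for_new_asset(source, state_dim=10, n_actions=3)
</syntaxhighlight>
The cold-start problem could be approached similarly, initialising from a network (or pooled data) trained across longstanding assets before fine-tuning on the new one.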


'''Acknowledgement''' We thank Xuefeng Gao, Anran Hu, Xiao-Yang Liu, Wenpin Tang, Ziyi Xia, Zhuoran Yang,  Junzi Zhang and Zeyu Zheng  for helpful discussions and comments on this survey.


===Potential Danger: Algorithmic Collusion===
* E. Calvano, G. Calzolari, V. Denicolo, and S. Pastorello, ''Artificial intelligence, algorithmic pricing, and collusion'', American Economic Review, 110 (2020), pp. 3267--3297.
* E. Calvano, G. Calzolari, V. Denicolo, and S. Pastorello, ''Algorithmic collusion with imperfect monitoring'', International Journal of Industrial Organization, (2021), p. 102712.

===General references===

Hambly, Ben; Xu, Renyuan; Yang, Huining (2023). "Recent Advances in Reinforcement Learning in Finance". arXiv:2112.04553 [q-fin.MF].

===References===
<references/>
