## Abstract

Momentum strategies are an important part of alternative investments and are at the heart of the work of commodity trading advisors. These strategies have, however, been found to have difficulties adjusting to rapid changes in market conditions, such as during the 2020 market crash. In particular, immediately after momentum turning points, when a trend reverses from an uptrend (downtrend) to a downtrend (uptrend), time-series momentum strategies are prone to making bad bets. To improve the responsiveness to regime change, the authors introduce a novel approach, in which they insert an online changepoint detection (CPD) module into a deep momentum network pipeline, which uses a long short-term memory deep-learning architecture to simultaneously learn both trend estimation and position sizing. Furthermore, their model is able to optimize the way in which it balances (1) a slow momentum strategy that exploits persisting trends but does not overreact to localized price moves and (2) a fast mean-reversion strategy regime by quickly flipping its position and then swapping back again to exploit localized price moves. The CPD module outputs a changepoint location and severity score, allowing the model to learn to respond to varying degrees of disequilibrium, or smaller and more localized changepoints, in a data-driven manner. The authors back test their model over the period 1995–2020, and the addition of the CPD module leads to a 33% improvement in the Sharpe ratio. The module is especially beneficial in periods of significant nonstationarity; in particular, over the most recent years tested (2015–2020), the performance boost is approximately 66%. This is especially interesting because traditional momentum strategies underperformed in this period.

**Key Findings**

▪ Momentum strategies, including deep learning–based deep momentum networks, have underperformed in recent years owing to difficulties in adjusting to rapid changes in the market, such as when a trend reverses from an uptrend to a downtrend, or vice versa.

▪ Inserting an online changepoint detection module into a deep momentum network pipeline leads to large performance gains, especially during periods of significant nonstationarity, as observed in recent years.

▪ The model achieves superior risk-adjusted returns by blending a slow momentum strategy with a fast mean-reversion strategy, with the changepoint detection module helping to balance the two in a data-driven manner.

Time-series momentum (TSMOM) (Moskowitz, Ooi, and Pedersen 2012) strategies are derived from the philosophy that strong price trends tend to persist. These trends have been observed to hold across a range of timescales, asset classes, and time periods (Lempérière et al. 2014; Baz et al. 2015; Hurst, Ooi, and Pedersen 2017). Momentum strategies are often referred to as *follow the winner* because it is assumed that winners will continue to be winners in the subsequent period.

Momentum strategies are an important part of alternative investments and are at the heart of the work of commodity trading advisors. Much effort goes into quantifying the magnitude of trends (Bruder et al. 2013; Baz et al. 2015; Levine and Pedersen 2016) and sizing traded positions accordingly (Kim, Tse, and Wald 2016; Baltas and Kosowski 2017; Harvey et al. 2018). Rather than using handcrafted techniques to identify trends and select positions, Lim, Zohren, and Roberts (2019) introduced *deep momentum networks* (DMNs), in which a long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) deep learning architecture achieves this task by directly optimizing on the Sharpe ratio of the signal. Deep learning has been widely used for time-series forecasting (Lim and Zohren 2020), achieving a high level of accuracy across various fields, including the field of finance for daily data (Bao, Yue, and Rao 2017; Gu, Kelly, and Xiu 2017; Lim, Zohren, and Roberts 2019; Kim 2019; Poh et al. 2021) and in a high-frequency setting, using limit order book data (Sirignano and Cont 2018; Zhang, Zohren, and Roberts 2019). In recent years, implementation of such deep learning models has been made accessible via extensive open-source frameworks such as TensorFlow (Abadi et al. 2015) and PyTorch (Paszke et al. 2017).

Momentum strategies aim to capitalize on persisting price trends; however, occasionally these trends break down, which we label *momentum turning points*. At these turning points, momentum strategies are prone to performing poorly because they are unable to adapt quickly to this abrupt change in regime. This concept is explored by Garg et al. (2021) who blended a slow momentum signal based on a long lookback window (LBW), such as 12 months, with a fast momentum signal based on a short LBW, such as 1 month. This approach is a balancing act between reducing noise and being quick enough to respond to turning points. Adopting the terminology from Garg et al. (2021), a bull or bear market is when the two momentum signals agree on a long or short position, respectively. If slow momentum suggests a long (short) position and fast momentum a short (long) position, we term this a *correction (rebound) phase*.

Correction and rebound phases, in which the momentum assumption breaks down, are examples of mean reversion (De Bondt and Thaler 1985; Poterba and Summers 1988; Jegadeesh 1991) regimes. Mean-reversion trading strategies, often referred to as *follow the loser* strategies, assume losers (winners) over some LBW will be winners (losers) in the subsequent period. If we observe the positions taken by a DMN, alongside exploiting persisting trends, the model also exploits fluctuations in return data at a shorter time horizon by regularly flipping its position and then quickly changing back again. We argue that the high Sharpe ratio achieved by DMNs can be largely attributed to this fast mean-reversion property.

Changepoint detection (CPD) is a field that involves the identification of abrupt changes in sequential data, in which the generative parameters for our model after the changepoint are independent of those that come before. The nonstationarity of real-world time series in fields such as finance, robotics, and sensor data has led to a plethora of research in this field. To respond to changepoints in real time, we require an online algorithm, which processes each data point as it becomes available, as opposed to offline algorithms that consider the entire dataset at once and detect changepoints retrospectively. First introduced by Adams and MacKay (2007), Bayesian approaches to online CPD, which naturally accommodate noisy, uncertain, and incomplete time-series data, have proven to be very successful. Assuming a changepoint model of the parameters, the Bayesian approach integrates out the uncertainty for these parameters as opposed to using a point estimate. Gaussian processes (GPs) (Williams and Rasmussen 1996; Rasmussen 2003), which are collections of random variables any finite number of which have joint Gaussian distributions, are well suited to time-series modeling (Roberts et al. 2013). GPs are often referred to as a Bayesian nonparametric model and have the ability to handle changepoints (Garnett et al. 2010; Saatçi, Turner, and Rasmussen 2010; Lloyd et al. 2014). Rather than comparing slow and fast momentum signals to detect regime change, we use GPs as a more principled method for detecting momentum turning points. For our experiments, we use the Python package GPflow (Matthews et al. 2017), which leverages the TensorFlow framework, to build Gaussian process models.

In this article, we introduce a novel approach, in which we add an online CPD module to a DMN pipeline to improve overall strategy returns. By incorporating the CPD module, we optimize our response to momentum turning points in a data-driven manner by passing outputs from the module as inputs to a DMN, which in turn learns trading rules and optimizes positions based on some finance value function, such as the Sharpe ratio (Sharpe 1994). This approach helps to correctly identify when we are in a bull or bear market and select the momentum strategy accordingly. With the addition of the CPD module, the new model learns how to exploit, but not overreact to, noise at a shorter time scale. Our strategy is able to exploit the fast reversion we observe in DMNs but effectively balance this with a slow momentum strategy and improve returns across an entire bull or bear regime. Effectively, the new pipeline has more knowledge on how to respond to abrupt changes, or a lack of changes, in a data-driven way.

We argue that the CPD is an artificial construct that can have varying degrees of severity and is dependent on choices such as the length of the lookback horizon. Rather than specifying regimes based on some criterion or threshold, we use our CPD module to quantify, or score, the level of disequilibrium, allowing the model to consider smaller or more localized regime changes. The length of the LBW is the most sensitive design choice for the CPD module—if the lookback horizon is too long, we miss smaller but still potentially significant regime changes, and if the horizon is too short, the data become too noisy and are of little value. We introduce the LBW length as a structural hyperparameter that we optimize using the outer optimization loop of our model. This allows the module to be more tightly coupled with our LSTM module, thus helping us to maximize the efficiency of the CPD and allowing us to tweak the LSTM hyperparameters in conjunction with the LBW.

It can be noted that the performance of DMNs, without CPD, deteriorates in more recent years. The deterioration in performance is especially notable in the 2015–2020 period, which exhibits a greater degree of turbulence, or disequilibrium, than the preceding years. One possible explanation for deterioration in momentum strategies in recent years is the concept of *factor crowding*, which is discussed in depth by Baltas (2019), who argued that arbitrageurs inflict negative externalities on one another. By using the same models, and hence taking the same positions, a coordination problem is created, pushing the price away from fundamentals. It is argued that momentum strategies are susceptible to this scenario. Impressively, the addition of a CPD module helps to alleviate the deterioration in performance, and our model significantly outperforms the standard DMN model during the 2015–2020 period. A similar phenomenon can be observed from around 2003, when electronic trading was becoming more common, where the deep learning–based strategies start to significantly outperform classic TSMOM strategies.

## CHANGEPOINT DETECTION USING GAUSSIAN PROCESSES

A classic univariate regression problem of the form *y*(*x*) = *f*(*x*) + ϵ, where ϵ is an additive noise process, has the goal of evaluating the function *f* and the probability distribution *p*(*y*_{*}|*x*_{*}) of some point *y*_{*} given some *x*_{*}. Our daily time-series data, for asset *i*, consist of a sequence of observations for (closing) price $p_t^{(i)}$, up to time *T*. Because financial time series are nonstationary in the mean, for each time *t* we take the first difference of the time series, otherwise known as the arithmetic returns

$$r_{t-1,t}^{(i)} = \frac{p_t^{(i)} - p_{t-1}^{(i)}}{p_{t-1}^{(i)}}, \tag{1}$$

in an attempt to remove any linear trend in the mean. Throughout this article, for brevity, we will refer to *r*_{t−1,t} simply as *r*_{t}. For the purposes of CPD, it is not computationally feasible, nor is it necessary, to consider the entire time series; hence, we consider the series $\{r_t^{(i)}\}_{t=T-l}^{T}$, with lookback horizon *l* from time *T*. For every CPD window, where $\mathcal{T} = \{T - l, T - l + 1, \ldots, T\}$, we standardize our returns as

$$\hat{r}_t^{(i)} = \frac{r_t^{(i)} - \mathbb{E}_{\mathcal{T}}\big[r_t^{(i)}\big]}{\sqrt{\operatorname{Var}_{\mathcal{T}}\big[r_t^{(i)}\big]}}. \tag{2}$$

This step is taken for two reasons: We can assume that the mean over our window is zero, and with unit variance, we have more consistency across all windows when we run our CPD module.
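
A minimal NumPy sketch of the differencing and per-window standardization described above is given here for illustration. It assumes that the window holds the last *l* + 1 returns and that the population variance is used; the function names `arithmetic_returns` and `standardize_window` are hypothetical and the prices are made-up values.

```python
import numpy as np

def arithmetic_returns(prices):
    """Arithmetic returns r_t = (p_t - p_{t-1}) / p_{t-1}."""
    prices = np.asarray(prices, dtype=float)
    return prices[1:] / prices[:-1] - 1.0

def standardize_window(returns, lookback):
    """Standardize the most recent `lookback + 1` returns (assumed window size)
    to zero mean and unit (population) variance."""
    window = np.asarray(returns, dtype=float)[-(lookback + 1):]
    std = window.std()
    if std == 0.0:
        raise ValueError("window has zero variance; cannot standardize")
    return (window - window.mean()) / std

# Illustrative closing prices (not real data).
prices = [100.0, 101.0, 99.0, 102.0, 103.5, 101.2, 104.0]
window = standardize_window(arithmetic_returns(prices), lookback=4)
```

Standardizing each window separately means that kernel hyperparameters fitted on different windows are expressed on a comparable scale.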

Our approach to changepoint detection involves a curve-fitting approach for input–output pairs via the use of GP regression (Rasmussen 2003). GP regression is a probabilistic, nonparametric method, popular in the fields of machine learning and time-series analysis (Roberts et al. 2013). It is a kernel-based technique in which the GP is specified by a covariance function *k*_{ξ}(·), which is in turn parameterized by a set of hyperparameters ξ. In its common guise, the GP has a stationary kernel; however, it should be noted that GPs can readily work well even when the time series is nonstationary (Brahim-Belhouari and Bermak 2004). We define the GP as a distribution over functions where, for the standardized returns $\hat{r}_t^{(i)}$,

$$\hat{r}_t^{(i)} = f(t) + \epsilon_t, \qquad f \sim \mathcal{GP}\big(0, k_{\xi}(\cdot)\big), \qquad \epsilon_t \sim \mathcal{N}\big(0, \sigma_n^2\big), \tag{3}$$

given noise variance $\sigma_n^2$, which helps to deal with noisy outputs that are uncorrelated.

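Rizvi (2018) and Liu, Kiskin, and Roberts (2020) demonstrated that a Matérn 3/2 kernel is a good choice of covariance function for noisy financial data, which tend to be highly nonsmooth and not infinitely differentiable. This problem setting favors the least smooth of the Matérn family of kernels, which is the 3/2 kernel. We parameterize our Matérn 3/2 kernel as

$$k_{\xi_M}(x, x') = \sigma_h^2 \left(1 + \frac{\sqrt{3}\,|x - x'|}{\lambda}\right) \exp\left(-\frac{\sqrt{3}\,|x - x'|}{\lambda}\right), \tag{4}$$

with kernel hyperparameters ξ_{M} = (λ, σ_{h}, σ_{n}), where λ is the input scale and σ_{h} the output scale. We define our covariance matrix for a set of locations x = [*x*_{1}, *x*_{2}, …, *x*_{n}] as

$$\mathbf{K}(\mathbf{x}, \mathbf{x}) = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix}. \tag{5}$$

Using the vector of standardized returns $\hat{\mathbf{r}} = [\hat{r}_{T-l}, \ldots, \hat{r}_{T}]^{\top}$, we integrate out the function variables to give $p(\hat{\mathbf{r}} \mid \xi) = \mathcal{N}(\mathbf{0}, \mathbf{V})$, with $\mathbf{V} = \mathbf{K} + \sigma_n^2 \mathbf{I}$. Because $p(\xi \mid \hat{\mathbf{r}})$ is intractable, we instead apply Bayes’ rule

$$p(\xi \mid \hat{\mathbf{r}}) = \frac{p(\hat{\mathbf{r}} \mid \xi)\, p(\xi)}{p(\hat{\mathbf{r}})} \tag{6}$$

and perform type II maximum likelihood on $p(\hat{\mathbf{r}} \mid \xi)$. We minimize the negative log marginal likelihood:

$$\mathrm{nlml}_{\xi} = \min_{\xi} \left( \frac{1}{2}\, \hat{\mathbf{r}}^{\top} \mathbf{V}^{-1} \hat{\mathbf{r}} + \frac{1}{2} \log |\mathbf{V}| + \frac{l + 1}{2} \log 2\pi \right). \tag{7}$$

We use the GPflow framework to compute the hyperparameters ξ, which in turn uses the L-BFGS-B optimization algorithm (Zhu et al. 1997) via the scipy.optimize.minimize package.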
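
To make the objective concrete, the sketch below evaluates the negative log marginal likelihood of a zero-mean GP with a Matérn 3/2 kernel for fixed hyperparameters. It uses plain NumPy rather than GPflow, assumes the standard form of the Matérn 3/2 kernel, and does not perform the L-BFGS-B optimization; the function names are illustrative.

```python
import numpy as np

def matern32(x1, x2, lengthscale, sigma_h):
    """Matérn 3/2 covariance between two sets of 1-D inputs (standard form)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    scaled = np.sqrt(3.0) * np.abs(x1[:, None] - x2[None, :]) / lengthscale
    return sigma_h**2 * (1.0 + scaled) * np.exp(-scaled)

def neg_log_marginal_likelihood(y, x, lengthscale, sigma_h, sigma_n):
    """NLML of zero-mean GP regression with Gaussian observation noise."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Kernel matrix plus white-noise term on the diagonal.
    V = matern32(x, x, lengthscale, sigma_h) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(V)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # V^{-1} y
    data_fit = 0.5 * y @ alpha
    complexity = np.log(np.diag(L)).sum()                # equals 0.5 * log|V|
    return data_fit + complexity + 0.5 * n * np.log(2.0 * np.pi)
```

The Cholesky factorization is used because it gives both the linear solve and the log determinant in a numerically stable way; an optimizer would call this function repeatedly while varying the three hyperparameters.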

Garnett et al. (2010) and Roberts et al. (2013) assumed that our function of interest is well behaved, except for a drastic change, or changepoint, at *c* ∈ {*t* − *l* + 1, *t* − *l* + 2, …, *t* − 1}, after which all observations before *c* are completely uninformative about the observations after this point. It is important to note that the LBW *l* for this approach needs to be prespecified, and it is assumed that it contains a single changepoint. Each of the two regions is described by different covariance functions *k*_{ξ1}, *k*_{ξ2}, in our case Matérn 3/2 kernels, which are parameterized by hyperparameters ξ_{1} and ξ_{2}, respectively. The region-switching kernel is

$$k_{\xi_R}(x, x') = \begin{cases} k_{\xi_1}(x, x') & \text{if } x, x' < c, \\ k_{\xi_2}(x, x') & \text{if } x, x' \geq c, \\ 0 & \text{otherwise}, \end{cases} \tag{8}$$

with a full set of hyperparameters ξ_{R} = {ξ_{1}, ξ_{2}, *c*, σ_{n}}. Here, a changepoint can take multiple forms, with these cases being a drastic change in covariance, a sudden change in the input scale, or a sudden change in the output scale. In the context of financial time series, we can think of these cases as a change in correlation length, a change in mean-reversion length, or a change in volatility.

It is computationally inefficient to fit 2(*l* − 1) GPs, to minimize nlml_{ξR} as in Equation 7, owing to the introduction of the discrete hyperparameter *c*. We instead borrow an idea from Lloyd et al. (2014) and approximate the abrupt change of covariance in Equation 8 using a sigmoid function σ(*x*) = 1/(1 + *e*^{−s(x−c)}), which has the properties σ(*x*, *x*′) = σ(*x*)σ(*x*′) and $\bar{\sigma}(x, x') = (1 - \sigma(x))(1 - \sigma(x'))$. Here, *c* ∈ (*t* − *l*, *t*) is the changepoint location, and *s* > 0 is the steepness parameter. Our changepoint kernel is

$$k_{\xi_C}(x, x') = k_{\xi_1}(x, x')\, \bar{\sigma}(x, x') + k_{\xi_2}(x, x')\, \sigma(x, x'), \tag{9}$$

with a full set of hyperparameters ξ_{C} = {ξ_{1}, ξ_{2}, *c*, *s*, σ_{n}}. We can compute nlml_{ξC} by optimizing the parameters of a single GP, which is significantly more efficient than computing nlml_{ξR}, despite having additional hyperparameters. This new kernel has the added benefit of capturing more gradual transitions from one covariance function to another, owing to the addition of the steepness parameter *s*. We implement Equation 9 in GPflow via the gpflow.kernels.ChangePoints class, adding the constraint *c* ∈ (*t* − *l*, *t*), which is not enforced by default.
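
The sigmoid blending of two covariance functions can be sketched in NumPy as follows. This is an illustration only and not the GPflow class itself: it assumes that the first Matérn 3/2 kernel governs inputs before *c* and the second governs inputs after *c*, and the tuples `xi1` and `xi2` (input scale, output scale) are hypothetical containers.

```python
import numpy as np

def matern32(x1, x2, lengthscale, sigma_h):
    """Matérn 3/2 covariance between two sets of 1-D inputs (standard form)."""
    scaled = np.sqrt(3.0) * np.abs(x1[:, None] - x2[None, :]) / lengthscale
    return sigma_h**2 * (1.0 + scaled) * np.exp(-scaled)

def sigmoid(x, c, s):
    """Smooth step from 0 (well before c) to 1 (well after c) with steepness s."""
    return 1.0 / (1.0 + np.exp(-s * (x - c)))

def changepoint_kernel(x1, x2, xi1, xi2, c, s):
    """Blend two Matérn 3/2 kernels around changepoint location c.

    Assumption: xi1 applies before c and xi2 after c.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    after1, after2 = sigmoid(x1, c, s), sigmoid(x2, c, s)
    k_before = matern32(x1, x2, *xi1) * np.outer(1.0 - after1, 1.0 - after2)
    k_after = matern32(x1, x2, *xi2) * np.outer(after1, after2)
    return k_before + k_after

x = np.arange(21.0)
K = changepoint_kernel(x, x, xi1=(2.0, 1.0), xi2=(5.0, 3.0), c=10.0, s=5.0)
```

With a large steepness the two regions are almost uncorrelated, which recovers the hard region-switching behavior; a smaller steepness gives a gradual transition between the two covariance functions.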

To quantify the level of disequilibrium, we look at the reduction in negative log marginal likelihood achieved via the introduction of the changepoint kernel hyperparameters through comparison to nlml_{ξM}. If the introduction of additional hyperparameters leads to no reduction in negative log marginal likelihood, then the level of disequilibrium is low. Conversely, a large reduction indicates significant disequilibrium, or a stronger changepoint, because the data are better described by two covariance functions. Our changepoint score and location are

$$\nu_t^{(i)} = 1 - \frac{1}{1 + e^{-(\mathrm{nlml}_{\xi_C} - \mathrm{nlml}_{\xi_M})}}, \qquad \gamma_t^{(i)} = \frac{c - (t - l)}{l}, \tag{10}$$

both of which are normalized values, which helps to improve the stability and performance of our LSTM module.

Exhibit 1 shows plots of daily returns for the S&P 500 composite ratio-adjusted continuous futures contract during the first quarter of 2020, in which returns have been standardized as per Equation 2. The top plot fits a GP using the Matérn 3/2 kernel, and the bottom uses the changepoint kernel specified in Equation 9. The shaded blue region covers ±2 standard deviations from the mean, and we can see that the top plot is dominated by the white noise term σ_{n} ≈ 1. The black dotted line indicates the location of the changepoint hyperparameter *c* after minimizing negative log marginal likelihood, which aligns with the COVID-19 market crash. The negative log marginal likelihood is reduced from 88.0 to 47.9, which corresponds to a changepoint score of approximately 1.
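
As a worked illustration of how the two fitted likelihood values can be turned into normalized module outputs, the sketch below assumes that the severity score is a logistic transform of the reduction in negative log marginal likelihood and that the location is the changepoint position expressed as a fraction of the lookback window. Both formulas and both function names are assumptions made for this example.

```python
import math

def changepoint_score(nlml_changepoint, nlml_matern):
    """Severity in (0, 1): assumed logistic transform of the NLML reduction."""
    delta = nlml_changepoint - nlml_matern  # negative when the changepoint kernel fits better
    # Equivalent to 1 - 1 / (1 + exp(-delta)), written to avoid overflow.
    if delta >= 0.0:
        e = math.exp(-delta)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(delta))

def changepoint_location(c, t, lookback):
    """Changepoint position as a fraction of the window (t - lookback, t)."""
    return (c - (t - lookback)) / lookback

# NLML values quoted for the S&P 500 example: 88.0 (Matérn) and 47.9 (changepoint).
score = changepoint_score(47.9, 88.0)
```

Under this assumed transform, a reduction of roughly 40 in negative log marginal likelihood saturates the score at a value very close to 1, whereas no reduction at all gives a score of 0.5.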

## MOMENTUM STRATEGIES REVIEW

### Classical Strategies

In this article, we focus on univariate time-series approaches (Moskowitz, Ooi, and Pedersen 2012), as opposed to cross-sectional (Jegadeesh and Titman 1993) strategies, which trade assets against each other and select a portfolio based on relative ranking. Volatility scaling (Kim, Tse, and Wald 2016; Harvey et al. 2018) has been proven to play a crucial role in the positive performance of TSMOM strategies, including deep learning strategies (Lim, Zohren, and Roberts 2019). We scale the returns of each asset by its volatility so that each asset has a similar contribution to the overall portfolio returns, ensuring that our strategy targets a consistent amount of risk. The consistency over time and across assets has the added benefit of allowing us to benchmark strategies. Targeting an annualized volatility σ_{tgt}, which we take to be 15% in this article, the realized return of our strategy from day *t* to *t* + 1 is

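$$R_{t+1}^{\mathrm{TSMOM}} = \frac{1}{N} \sum_{i=1}^{N} R_{t+1}^{(i)}, \qquad R_{t+1}^{(i)} = X_t^{(i)}\, \frac{\sigma_{\mathrm{tgt}}}{\sigma_t^{(i)}}\, r_{t+1}^{(i)}, \tag{11}$$

where $X_t^{(i)}$ is our position size, *N* the number of assets in our portfolio, and $\sigma_t^{(i)}$ the ex ante volatility estimate of the *i*th asset. We compute $\sigma_t^{(i)}$ using a 60-day exponentially weighted moving standard deviation.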
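
The volatility-scaling step can be sketched as follows. The sketch assumes that the ex ante volatility is annualized with 252 trading days so that it is comparable with the 15% target, and it uses one common recursive form of the exponentially weighted variance with a 60-day span; other weighting conventions exist and the function names are illustrative.

```python
import numpy as np

TRADING_DAYS = 252  # assumed annualization factor

def ewm_volatility(daily_returns, span=60):
    """Annualized exponentially weighted standard deviation (recursive form)."""
    r = np.asarray(daily_returns, dtype=float)
    alpha = 2.0 / (span + 1.0)
    mean, var = r[0], 0.0
    for value in r[1:]:
        diff = value - mean
        mean += alpha * diff
        var = (1.0 - alpha) * (var + alpha * diff**2)
    return np.sqrt(var * TRADING_DAYS)

def portfolio_return(positions, ex_ante_vol, next_returns, target_vol=0.15):
    """Average volatility-scaled return across assets for one day."""
    positions = np.asarray(positions, dtype=float)
    ex_ante_vol = np.asarray(ex_ante_vol, dtype=float)
    next_returns = np.asarray(next_returns, dtype=float)
    scaled = positions * (target_vol / ex_ante_vol) * next_returns
    return scaled.mean()

# Two hypothetical assets: long the first, short the second.
day_return = portfolio_return([1.0, -1.0], [0.15, 0.30], [0.01, 0.02])
```

In this example the second asset is twice as volatile as the target, so its exposure is halved and its contribution offsets that of the first asset.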

The simplest trading strategy for which we benchmark performance is long only, for which we always select the maximum position $X_t^{(i)} = 1$. The original article on time-series momentum (Moskowitz, Ooi, and Pedersen 2012), which we will refer to as *Moskowitz*, selects a position as $X_t^{(i)} = \operatorname{sgn}\big(r_{t-252,t}^{(i)}\big)$, where we are using the volatility scaling framework and *r*_{t−252,t} is the annual return. In an attempt to react more quickly to momentum turning points, Garg et al. (2021) blended a slow signal based on annual returns and a fast signal based on monthly returns to give an intermediate strategy:

$$X_t^{(i)} = (1 - w)\, \operatorname{sgn}\big(r_{t-252,t}^{(i)}\big) + w\, \operatorname{sgn}\big(r_{t-21,t}^{(i)}\big). \tag{12}$$

We control the relative contribution of the fast and slow signal via *w* ∈ [0, 1], with the case *w* = 0 corresponding to the Moskowitz strategy. We additionally use moving average convergence/divergence (MACD) (Baz et al. 2015) as a benchmark; for details on the implementation, we invite the reader to see Lim, Zohren, and Roberts (2019).
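
The blend of slow and fast signals can be illustrated with a short sketch. It assumes a convex combination of the signs of the annual (252-day) and monthly (21-day) returns, which is consistent with *w* = 0 reducing to the Moskowitz rule; the function name and the synthetic price path are made up for the example.

```python
import numpy as np

def intermediate_position(prices, w, slow=252, fast=21):
    """Blend of slow and fast momentum signs; w = 0 is slow only, w = 1 is fast only."""
    prices = np.asarray(prices, dtype=float)
    r_slow = prices[-1] / prices[-1 - slow] - 1.0
    r_fast = prices[-1] / prices[-1 - fast] - 1.0
    return (1.0 - w) * np.sign(r_slow) + w * np.sign(r_fast)

# Synthetic path: a long rally followed by a one-month pullback (a correction phase).
prices = np.concatenate([np.linspace(100.0, 160.0, 240), np.linspace(159.0, 150.0, 21)])
slow_only = intermediate_position(prices, w=0.0)
blended = intermediate_position(prices, w=0.5)
fast_only = intermediate_position(prices, w=1.0)
```

During a correction phase such as this one, the two signals disagree, so an equal blend takes no position while the pure slow and pure fast rules take opposite sides.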

### Deep Learning

We adopt a number of key choices that lead to the improved performance of DMNs.

**LSTM architecture.** Of the deep-learning architectures assessed by Lim, Zohren, and Roberts (2019), the LSTM (Hochreiter and Schmidhuber 1997) architecture yields the best results. LSTM is a special kind of recurrent neural network (RNN) (Goodfellow, Bengio, and Courville 2016), initially proposed to address the vanishing and exploding gradient problem (Bengio, Simard, and Frasconi 1994). An RNN takes an input sequence and, through the use of a looping mechanism in which information can flow from one step to another, can be used to transform this into an output sequence while taking into account contextual information in a flexible way. An LSTM operates with cells, which store both short-term memory and long-term memory, using gating mechanisms to summarize and filter information. Internal memory states are sequentially updated with new observations at each step. The resulting model has fewer trainable parameters, is able to learn representations of long-term relationships, and typically achieves better generalization results.

**Trading signal and position sizing.** Trading signals are learned directly by DMNs, removing the need to manually specify both the trend estimator and the rule that maps it into a position. The output of the LSTM is followed by a time-distributed, fully connected layer with a tanh(·) activation function, which is a squashing function that directly outputs positions $X_t^{(i)} \in (-1, 1)$. The advantage of this approach is that we learn trading rules and position sizing directly from the data. Once our parameters θ have been trained via backpropagation (LeCun et al. 2012), our LSTM architecture *g*(·; θ) takes input features $u_{T-\tau+1:T}^{(i)}$ for all time steps in the LSTM looking back from time *T* with τ steps and directly outputs a sequence of positions:

$$X_{T-\tau+1:T}^{(i)} = g\big(u_{T-\tau+1:T}^{(i)}; \theta\big). \tag{13}$$

In an online prediction setting, only the final position in the sequence is of relevance to our strategy.
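
The output stage described above can be sketched in NumPy. The random array below is only a stand-in for LSTM hidden states, and the sequence length, hidden size, and weights are hypothetical; the point is that the same dense layer and tanh are applied at every time step and only the last position is used online.

```python
import numpy as np

def position_head(hidden_states, weights, bias):
    """Time-distributed dense layer with tanh: one position per time step."""
    hidden_states = np.asarray(hidden_states, dtype=float)  # shape (tau, hidden_dim)
    return np.tanh(hidden_states @ weights + bias)          # shape (tau,)

rng = np.random.default_rng(0)
tau, hidden_dim = 63, 8                       # hypothetical sequence length and width
hidden = rng.normal(size=(tau, hidden_dim))   # stand-in for LSTM outputs
weights = rng.normal(size=hidden_dim)
positions = position_head(hidden, weights, bias=0.0)
latest_position = positions[-1]               # the only value traded in an online setting
```

Because tanh is bounded, every output can be read directly as a position between fully short and fully long without any further clipping.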

**Loss function.** It has been observed (Potters and Bouchaud 2016) that correctly predicting the direction of a stock move does not translate directly into a positive strategy return, because the driving moves can often be large but infrequent. Furthermore, we want to account for trade-offs between risk and reward; hence, we explicitly optimize networks for risk-adjusted performance metrics. One such metric used by DMNs is the Sharpe ratio (Sharpe 1994), which calculates the return per unit of volatility. Our Sharpe loss function is

$$\mathcal{L}_{\mathrm{Sharpe}}(\theta) = -\frac{\sqrt{252}\; \mathbb{E}_{\Omega}\big[R_t^{(i)}\big]}{\sqrt{\operatorname{Var}_{\Omega}\big[R_t^{(i)}\big]}}, \tag{14}$$

where Ω is the set of all asset–time pairs {(*i*, *t*)|*i* ∈ {1, 2, …, *N*}, *t* ∈ {*T* − τ + 1, …, *T*}}. Automatic differentiation is used to compute gradients for backpropagation (Goodfellow, Bengio, and Courville 2016), which explicitly optimizes networks for our chosen performance metric.
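
For illustration, a negative annualized Sharpe ratio over a batch of captured returns can be computed as below. This NumPy version only evaluates the loss; in a deep learning framework the same expression would be written with differentiable operations. The annualization factor of 252 and the pooling of all asset–time pairs into one sample are assumptions of the sketch.

```python
import numpy as np

def sharpe_loss(captured_returns, periods_per_year=252):
    """Negative annualized Sharpe ratio over all asset-time pairs in the batch."""
    r = np.asarray(captured_returns, dtype=float).ravel()
    return -np.sqrt(periods_per_year) * r.mean() / r.std()

# Illustrative captured returns for 2 assets over 3 days.
batch = np.array([[0.01, 0.02, 0.03],
                  [0.02, 0.01, 0.03]])
loss = sharpe_loss(batch)
```

Minimizing this quantity is the same as maximizing the Sharpe ratio, so gradient descent pushes the network toward positions with higher return per unit of volatility.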

**Model inputs.** For each time step, our model can benefit from inputting signals from various time scales. We normalize returns to be $r_{t-t',t}^{(i)} \big/ \big(\sigma_t^{(i)} \sqrt{t'}\big)$, given a time offset of *t*′ days. We use offsets *t*′ ∈ {1, 21, 63, 126, 252}, corresponding to daily, monthly, quarterly, biannual, and annual returns. We also encode additional information by inputting MACD indicators (Baz et al. 2015). MACD is a volatility-normalized moving-average convergence–divergence signal, defining the relationship between a short and long signal. For implementation details, please refer to Lim, Zohren, and Roberts (2019). We use pairs of short and long timescales in {(8, 24), (16, 48), (32, 96)}. We can think of these indicators as performing a function similar to a convolutional layer.
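
The multi-timescale return features can be sketched as follows. The sketch assumes that each return is divided by the daily volatility scaled by the square root of the offset, and it takes the annual offset to be 252 trading days, in line with the annual return *r*_{t−252,t} used for the benchmark strategies; the function name is illustrative and the MACD inputs are not covered here.

```python
import numpy as np

def normalized_returns(prices, daily_vol, offsets=(1, 21, 63, 126, 252)):
    """Return over each offset, divided by daily volatility times sqrt(offset)."""
    prices = np.asarray(prices, dtype=float)
    features = []
    for k in offsets:
        r = prices[-1] / prices[-1 - k] - 1.0
        features.append(r / (daily_vol * np.sqrt(k)))
    return np.array(features)

# Illustrative price path with at least 253 observations.
prices = 100.0 * np.cumprod(1.0 + 0.001 * np.ones(300))
features = normalized_returns(prices, daily_vol=0.01)
```

Scaling by the square root of the offset puts returns measured over different horizons on a comparable footing before they are passed to the network.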

## TRADING STRATEGY

### Strategy Definition

Because we are using a data-driven approach, we split our training data as a first step, setting aside the first 90% for training and the last 10% for validation for each asset. We calibrate our model using the training data by optimizing on the Sharpe loss function (Equation 14) via minibatch stochastic gradient descent (SGD), using the Adam (Kingma and Ba 2015) optimizer. We observe validation loss after each epoch, which is a full pass of the data, to determine convergence. We also use the validation set for the outer optimization loop, in which we tune our model hyperparameters. The hyperparameter optimization process is detailed in Appendix B.

We precompute the CPD location and severity parameters, as detailed by Equation 10, for each time–asset pair in our training and validation set. We do this for a chosen *l* ∈ {10, 21, 63, 126, 252}, corresponding to two weeks, a month, a quarter, half a year, and a full year. We selected these LBW sizes to correspond to input return timescales, with the exception of the 10-day LBW, which was selected to be as close to daily return data as reasonably possible. We reinitialize our Matérn 3/2 kernel for each time step, with all hyperparameters set to 1. This approach was found to be more stable than borrowing parameters from the previous time step. For our changepoint kernel, we initialize the hyperparameters as $c = t - \frac{l}{2}$ and *s* = 1. All other parameters are initialized as the equivalent parameter from fitting the Matérn 3/2 kernel, initializing *k*_{ξ1} and *k*_{ξ2} with the same values. In the rare case this process fails, we try again by reinitializing all changepoint kernel parameters to 1, with the exception of setting $c = t - \frac{l}{2}$. In the event the module still fails for a given time step, we fill the severity and location outputs using the outputs from the previous time step, noting that we need to increment the changepoint location by an additional step.

For each LSTM input, we pass in the normalized returns from the different time scales, our MACD indicators, and CPD severity and location for a chosen *l*. We can either fix *l* for our strategy or introduce it as a structural hyperparameter, which is tuned by the outer optimization loop. By doing this, we have information exchange from our CPD module all the way through to our Sharpe ratio loss function and traded positions. Once our model has been fully trained, we can run it online by computing the CPD module for the most recent data points and then using our LSTM module to select positions to hold for the next day for each asset.

### Experiments via Backtesting

For all of our experiments, we used a portfolio of 50 liquid, continuous futures contracts over the period 1990–2020. The combination of commodities, equities, fixed income, and FX futures was selected to make up a well-balanced portfolio. The data were extracted from the Pinnacle Data Corp. CLC database (Pinnacle Data Corp. 2021), and the selected futures contracts are listed in Appendix A. All of the selected assets have less than 10% of data missing.

To back test our model, we use an expanding window approach, in which we start by using 1990–1995 for training/validation and then test out of sample on the period 1995–2000. With each successive iteration, we expand the training/validation window by an additional five years, perform the hyperparameter optimization again, and test on the subsequent five-year period. Data were not available from 1990 for every asset, and we only use an asset if there is enough data available in the validation set for at least one LSTM sequence. All of our results are recorded as an average of the test windows. We test our LSTM with the CPD strategy using an LBW *l* ∈ {10, 21, 63, 126, 252} and then with the optimized *l* for each window, based on validation loss.
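
The expanding-window schedule can be written out explicitly, as in the sketch below. The year boundaries follow the description above; the 90%/10% chronological split inside each training/validation window follows the strategy definition, and both function names are illustrative.

```python
def expanding_windows(start=1990, first_test=1995, end=2020, step=5):
    """List of ((train_start, train_end), (test_start, test_end)) year pairs."""
    windows = []
    test_start = first_test
    while test_start < end:
        test_end = min(test_start + step, end)
        windows.append(((start, test_start), (test_start, test_end)))
        test_start += step
    return windows

def train_valid_split(n_observations, train_fraction=0.9):
    """Chronological split: first 90% for training, last 10% for validation."""
    cut = int(round(n_observations * train_fraction))
    return range(0, cut), range(cut, n_observations)

schedule = expanding_windows()
for (train_from, train_to), (test_from, test_to) in schedule:
    print(f"train/validate {train_from}-{train_to}, test {test_from}-{test_to}")
```

Each iteration reuses all earlier data for training and validation, so later test windows are evaluated by models that have seen progressively longer histories.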

We benchmark our strategy against those we have discussed, in which we choose *w* ∈ {0, 0.5, 1} for the intermediate strategy. We also compare our strategy to a DMN that does not have the CPD module. To maintain consistency with previous work by Lim, Zohren, and Roberts (2019), we benchmark strategy

1. **profitability** through annualized returns and percentage of positive captured returns;
2. **risk** through annualized volatility, annualized downside deviation, and maximum drawdown (MDD); and
3. **risk-adjusted performance** through annualized Sharpe, Sortino, and Calmar ratios.
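
These metrics can be computed from a series of daily captured returns, as sketched below. The conventions used here are assumptions of the example: arithmetic annualization with 252 trading days, a zero target for downside deviation, drawdown measured on compounded wealth starting from 1, and a zero risk-free rate.

```python
import numpy as np

def performance_metrics(daily_returns, periods_per_year=252):
    """Profitability, risk, and risk-adjusted metrics from daily returns."""
    r = np.asarray(daily_returns, dtype=float)
    ann_return = r.mean() * periods_per_year
    ann_vol = r.std() * np.sqrt(periods_per_year)
    downside_dev = np.sqrt(np.mean(np.minimum(r, 0.0) ** 2) * periods_per_year)
    wealth = np.cumprod(1.0 + r)
    # Running peak of wealth, including the starting value of 1.
    running_peak = np.maximum.accumulate(np.maximum(wealth, 1.0))
    max_drawdown = np.max(1.0 - wealth / running_peak)
    return {
        "annual_return": ann_return,
        "pct_positive": np.mean(r > 0.0),
        "annual_vol": ann_vol,
        "downside_deviation": downside_dev,
        "max_drawdown": max_drawdown,
        "sharpe": ann_return / ann_vol,
        "sortino": ann_return / downside_dev,
        "calmar": ann_return / max_drawdown,
    }

metrics = performance_metrics([0.1, -0.5, 0.2])
```

A series with no losing days has zero downside deviation and zero drawdown, so the Sortino and Calmar ratios are not finite in that case and would need separate handling.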

We provide results for both the raw signal output and then with an additional layer of volatility rescaling to the target of 15%, for ease of comparison between strategies. It should be noted that this article selects a more realistic 50-asset portfolio instead of the full 88 assets previously selected by Lim, Zohren, and Roberts (2019). We focus on the raw predictive power of the model and do not account for transaction costs at this stage; however, this is a simple adjustment and can easily be incorporated into the loss function. We have included some details and analysis of transaction costs in Appendix C. For further information on the implementation and the effects of transaction costs, please refer to Lim, Zohren, and Roberts (2019).

## RESULTS AND DISCUSSION

Our aggregated out-of-sample prediction results, averaged across all five-year windows from 1995–2020, are recorded in Exhibit 3 and again in Exhibit 4 using volatility rescaling. We plot the effect of CPD LBW size on the average Sharpe ratio in Exhibit 2 and demonstrate how optimizing on this as a hyperparameter can improve overall performance. Impressively, due to our GP framework for CPD, we are able to achieve superior results with limited data and hence very small LBWs. There is a notable performance boost from only a two-week LBW, and performance almost maxes out after only one month, with an LBW of one quarter leading to the highest Sharpe ratio. As we approach an LBW of one year, we lose the benefit of the CPD module because it places too much emphasis on larger changepoints that are further in the past. We also note that the CPD computation becomes more intensive for *l* ∈ {126, 252}. If we introduce LBW as a hyperparameter to be reevaluated as the training window continues to expand, we observe an additional 4% increase in the Sharpe ratio, leading to a total increase of 33% over the LSTM baseline.

Another idea involved passing in outputs from multiple CPD modules with different LBWs in parallel as inputs to the LSTM. This was not found to improve the model and actually resulted in degraded performance. Multiple LBWs could be useful if using a more complex deep learning architecture than LSTM.

In Exhibit 5, we observe slow momentum and fast reversion strategies happening simultaneously. By introducing CPD, we are able to achieve superior returns because we are better able to learn the timing of these strategies and when to place more emphasis on one of them, using a data-driven approach. These plots examine the positions our DMN takes for single assets during periods of regime change, providing a comparison of a DMN with and without the CPD module. The top plots track the daily closing price, with the alternating white and gray regions indicating regimes separated by significant changepoints. CPD is performed online with a 63-day LBW, and a changepoint is flagged when the changepoint severity exceeds a threshold, which is set separately for the left and right plots. Each case uses a 63-day burn-in time before we can classify a subsequent changepoint. The middle plots compare the moving averages of position size taken over a long timescale of one year, indicated by the solid lines, and a shorter timescale of one month, indicated by the dashed lines. The bottom plots indicate cumulative returns for each strategy. The plots on the left look at the FTSE 100 Index during the lead-up to the 2008 financial crash and its aftermath. With the addition of CPD, our strategy is able to exploit persisting trends with better timing. It is quicker to react to the first dip in 2008, taking short positions to exploit the bear market with a slow momentum strategy, and is similarly able to adapt to the bull market established in 2009 by more quickly moving to a long strategy. Both approaches exhibit a fast reverting strategy; however, after the addition of CPD, the strategy is slightly less aggressive with positions taken in response to localized changes. The plots on the right look at the British pound exchange rate in the lead-up to the Brexit vote in 2016 and its aftermath. Here, the bull and bear regimes are both less defined, and there is a higher level of nonstationarity. With the addition of the CPD module, our model takes a much more conservative slow momentum strategy and instead opts to focus more on achieving positive returns via a fast mean-reverting strategy.

Our results demonstrate that, via the introduction of the CPD module, we outperform the standard DMN in all performance metrics. Our model correctly classifies the direction of the return more often and has a higher average profit-to-loss ratio. We can see that the CPD module helps to reduce risk, thus reducing volatility, downside deviation, and MDD while still achieving slightly higher raw returns. This translates to an improvement in risk-adjusted performance, improving the Sortino ratio by 35% and the Calmar ratio by 25%. These metrics suggest that the CPD module makes our model more robust to market crashes. We observe an improvement in the Sharpe ratio, our target metric, of 33%, which translates to an improvement of 130% in comparison to the best-performing TSMOM strategy.

We plot the raw and rescaled signals against benchmark strategies in Exhibit 6. The plot on the left, of raw signal output, demonstrates that via the introduction of the CPD module, we are able to reduce the strategy volatility, especially during the market nonstationarity of more recent years. With the exception of long only, we omit the reference strategies in this plot to avoid clutter. The plot on the right, of signal with rescaled volatility, demonstrates that our strategy outperforms all benchmarks on risk-adjusted performance. We show intermediate strategy output for *w* ∈ {0, 0.5, 1}. We can see the difficulties of trying to address regime change with handcrafted techniques such as the intermediate *w* = 0.5, which in our experiments actually fails to outperform the *w* = 0 Moskowitz strategy on all risk-adjusted performance ratios.

We note that up until about 2003, when the uptake of electronic trading was becoming much more widespread, the traditional TSMOM and MACD strategies achieve results comparable to those of the LSTM DMN architecture. From this point, the LSTM significantly outperforms these traditional strategies until more recent years, when we see volatility increase and performance, especially risk-adjusted performance, drop significantly. This drop in performance can be largely attributed to increased market nonstationarity. Impressively, with the addition of the CPD module, our DMN pipeline continues to perform well even during the market nonstationarity of the 2015–2020 period. Using five repeated trials of the entire experiment, with and without CPD, the average improvement for the Sharpe ratio in this period is 70%, for LBW *l* = 21.

## CONCLUSIONS

We have demonstrated that the introduction of an online CPD module is a simple, yet effective, way to significantly improve model performance, specifically that of DMNs. Our model is able to blend different strategies at different timescales, learning to do so in a data-driven manner directly based on our desired risk-adjusted performance metric. In periods of stability, our model is able to achieve superior returns by focusing on slow momentum while exploiting but not overreacting to local mean reversion. The impressive performance increase in periods of nonstationarity, such as recent years, can be attributed to the fact that we (1) can effectively incorporate CPD online with a very short LBW because we do so using Gaussian processes (GPs) and (2) pass the changepoint score from our CPD module to the DMN, helping our model learn how to respond to varying degrees of disequilibrium. As a result, we enhance performance in such conditions, in which we observe a more conservative slow momentum strategy with a focus on fast mean reversion.

Future work includes incorporating a CPD module into other deep learning architectures or performing CPD on a model representation as opposed to model inputs. The work in this article has natural parallels to the field of continual learning, which is a paradigm whereby an agent sequentially learns new tasks. Another direction of work will involve using continual learning for momentum trading, in which CPD is used to determine task boundaries.

## ACKNOWLEDGMENT

We would like to thank the Oxford-Man Institute of Quantitative Finance for financial and computing support.

## APPENDIX B

### EXPERIMENT DETAILS

We split our data into training and validation datasets using a 90%/10% split. We winsorize our data by limiting them to be within five times their exponentially weighted moving (EWM) standard deviations from their EWM average, using a 252-day half-life. We calibrate our model using the training data by optimizing on the Sharpe loss function via minibatch SGD, using the Adam optimizer. We limit our training to 300 epochs, with an early stopping patience of 25 epochs, meaning training is terminated if there is no decrease in validation loss during this time period. The model is implemented via the Keras API in TensorFlow. Our LSTM sequence length was set to 63 for all experiments. For training and validation, in an attempt to prevent overfitting, we split our data into non-overlapping sequences, rather than using a sliding window approach. A stateless LSTM is used, meaning the last state from the previous batch is not used as the initial state for the subsequent batch. Keeping the order of each individual sequence intact, we shuffle the order in which each sequence appears in an epoch. We employ dropout regularization (Srivastava et al. 2014) as another technique to avoid overfitting, applying it to LSTM inputs and outputs.
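The preprocessing steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the EWM statistics are assumed to include the current observation, and any incomplete sequence at the end of the data is assumed to be dropped (the text does not say how the tail is handled).

```python
import numpy as np
import pandas as pd


def winsorize_ewm(series, halflife=252, n_sigma=5.0):
    """Clip each value to its EWM mean +/- n_sigma EWM standard deviations."""
    ewm = series.ewm(halflife=halflife)
    mean, std = ewm.mean(), ewm.std()
    # The EWM std is undefined for the first observation; leave it unclipped.
    lower = (mean - n_sigma * std).fillna(-np.inf)
    upper = (mean + n_sigma * std).fillna(np.inf)
    return series.clip(lower=lower, upper=upper)


def split_non_overlapping(values, seq_len=63):
    """Cut a time-ordered array into consecutive, non-overlapping sequences."""
    values = np.asarray(values)
    n_seq = len(values) // seq_len
    # Assumption: observations that do not fill a whole sequence are dropped.
    trimmed = values[: n_seq * seq_len]
    return trimmed.reshape(n_seq, seq_len, *values.shape[1:])


def shuffle_sequences(sequences, seed=0):
    """Shuffle the order of sequences while keeping each sequence intact."""
    rng = np.random.default_rng(seed)
    return rng.permutation(sequences)  # permutes along the first axis only
```

Because shuffling acts only on the first axis, the time ordering inside each 63-step sequence is preserved, which is compatible with the stateless LSTM setup described above.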

We tune our hyperparameters, with options listed in Exhibit 7, using an outer optimization loop. We achieve this via 50 iterations of random grid search to identify the optimal model. We perform the full experiment for each choice of CPD LBW length and then use the model that achieved the lowest validation loss for the optimized CPD model.
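The outer optimization loop can be illustrated with a short sketch of random search over a discrete grid. The search space below is a placeholder (the actual options are those in Exhibit 7), and `toy_validation_loss` is a hypothetical stand-in for training a model with the sampled hyperparameters and returning its validation loss.

```python
import random

# Placeholder search space; the real options are those listed in Exhibit 7.
SEARCH_SPACE = {
    "dropout_rate": [0.1, 0.2, 0.3],
    "hidden_size": [16, 32, 64],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}


def random_grid_search(search_space, validation_loss, n_iter=50, seed=0):
    """Sample n_iter random grid points and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(options) for name, options in search_space.items()}
        loss = validation_loss(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def toy_validation_loss(params):
    # Hypothetical stand-in for "train the model, return its validation loss".
    return abs(params["dropout_rate"] - 0.2) + abs(params["hidden_size"] - 32) / 32


best_params, best_loss = random_grid_search(SEARCH_SPACE, toy_validation_loss)
```

Selecting the CPD LBW then amounts to repeating this search once per candidate LBW length and keeping the configuration with the lowest validation loss overall.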

## APPENDIX C

### TRANSACTION COSTS

In Exhibit 8, we demonstrate the impact of transaction costs on our raw signal, in which we increase the average transaction cost from 0 to 5 bps. The black dotted line indicates the long-only reference. Our strategy outperforms classical strategies for transaction costs of up to 2 bps, at which point it rapidly deteriorates owing to the fast reverting component. We note that a larger CPD LBW size becomes favorable as we increase *C*. We suspect this is because the model focuses on larger long-term changepoints and favors slow momentum over fast reversion. For average transaction costs greater than 1 bp, we suggest incorporating turnover-adjusted returns into the loss function (Equation 14). This adjustment is detailed by Lim, Zohren, and Roberts (2019), who demonstrated that it works well when transaction costs are high. Assuming an average transaction cost of *C*, we calculate turnover-adjusted returns as

- © 2021 Pageant Media Ltd