## Abstract

Although time-series momentum is a well-studied phenomenon in finance, common strategies require the explicit definition of both a trend estimator and a position sizing rule. In this article, the authors introduce deep momentum networks, a hybrid approach that injects deep learning–based trading rules into the volatility scaling framework of time-series momentum. The model simultaneously learns both trend estimation and position sizing in a data-driven manner, with networks directly trained by optimizing the Sharpe ratio of the signal. Backtesting on a portfolio of 88 continuous futures contracts, the authors demonstrate that the Sharpe-optimized long short-term memory (LSTM) network improved on traditional methods by more than two times in the absence of transaction costs and continued to outperform when considering transaction costs up to 2–3 bps. To account for more illiquid assets, the authors also propose a turnover regularization term that trains the network to factor in costs at run-time.

**TOPICS:** Statistical methods, simulations, big data/machine learning

**Key Findings**

• While time-series momentum strategies have been extensively studied in finance, common strategies require the explicit specification of a trend estimator and position sizing rule.

• In this article, the authors introduce deep momentum networks, a hybrid approach that injects deep learning–based trading rules into the volatility scaling framework of time-series momentum.

• Backtesting on a portfolio of continuous futures contracts, Deep Momentum Networks were shown to outperform traditional methods for transaction costs of up to 2–3 bps, with a turnover regularisation term proposed for more illiquid assets.

Momentum as a risk premium in finance has been extensively documented in the academic literature, with evidence of persistent abnormal returns demonstrated across a range of asset classes, prediction horizons, and time periods (Lempérière et al. 2014; Baz et al. 2015; Hurst, Ooi, and Pedersen 2017). Based on the philosophy that strong price trends have a tendency to persist, time-series momentum strategies are typically designed to increase position sizes with large directional moves and reduce positions at other times. Although the intuition underpinning the strategy is clear, specific implementation details can vary widely between signals; a plethora of methods are available to estimate the magnitude of price trends (Bruder et al. 2013; Baz et al. 2015; Levine and Pedersen 2016) and map them to actual traded positions (Kim, Tse, and Wald 2016; Baltas and Kosowski 2017; Harvey et al. 2018).

In recent times, deep neural networks have been increasingly used for time-series prediction and have outperformed traditional benchmarks in applications such as demand forecasting (Laptev et al. 2017), medicine (Lim and van der Schaar 2018), and finance (Zhang, Zohren, and Roberts 2019). With the development of modern architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) (Goodfellow, Bengio, and Courville 2016), deep learning models have been favored for their ability to build representations of a given dataset (Bengio, Courville, and Vincent 2013), capturing temporal dynamics and cross-sectional relationships in a purely data-driven manner. The adoption of deep neural networks has also been facilitated by powerful open-source frameworks such as TensorFlow (Abadi et al. 2015) and PyTorch (Paszke et al. 2017), which use automatic differentiation to compute gradients for backpropagation without having to explicitly derive them in advance. In turn, this flexibility has allowed deep neural networks to go beyond standard classification and regression models. For instance, hybrid methods that combine traditional time-series models with neural network components have been observed to outperform pure methods in either category (Makridakis, Spiliotis, and Assimakopoulos 2018), while also making outputs easier to interpret by practitioners; examples include the exponential smoothing RNN (Smyl, Ranganathan, and Pasqua 2018), autoregressive CNNs (Binkowski, Marti, and Donnat 2018), and Kalman filter variants (Fraccaro et al. 2017; Rangapuram et al. 2018). Furthermore, these frameworks have enabled the development of new loss functions for training neural networks, such as adversarial loss functions in generative adversarial networks (GANs) (Goodfellow et al. 2014).

Although numerous papers have investigated the use of machine learning for financial time-series prediction, they typically focus on casting the underlying prediction problem as a standard regression or classification task (Bao, Yue, and Rao 2017; Gu, Kelly, and Xiu 2017; Binkowski, Marti, and Donnat 2018; Ghoshal and Roberts 2018; Sirignano and Cont 2018; Kim 2019; Zhang, Zohren, and Roberts 2019) with regression models forecasting expected returns and classification models predicting the direction of future price movements. This approach, however, could lead to suboptimal performance in the context of time-series momentum for several reasons. First, sizing positions based on expected returns alone does not take risk characteristics (e.g., the volatility or skew of the predictive returns distribution) into account, which could inadvertently expose signals to large downside moves. This is particularly relevant because raw momentum strategies without adequate risk adjustments, such as volatility scaling (Kim, Tse, and Wald 2016), are susceptible to large crashes during periods of market panic (Barroso and Santa-Clara 2015; Daniel and Moskowitz 2016). Furthermore, even with volatility scaling—which leads to positively skewed returns distributions and long option–like behavior (Martins and Zou 2012; Jusselin et al. 2017)—trend-following strategies can place more losing trades than winning ones and still be profitable on the whole because they size up only into large but infrequent directional moves. As such, Potters and Bouchaud (2016) argued that the fraction of winning trades is a meaningless metric of performance, given that it cannot be evaluated independently from the trading style of the strategy. Similarly, high classification accuracies may not necessarily translate into positive strategy performance because profitability also depends on the magnitude of returns in each class. 
This is also echoed in betting strategies such as the Kelly criterion (Rotando and Thorp 1992), which requires both win/loss probabilities and betting odds for optimal sizing in binomial games. In light of the deficiencies of standard supervised learning techniques, new loss functions and training methods would need to be explored for position sizing—accounting for trade-offs between risk and reward.

In this article, we introduce a novel class of hybrid models that combines deep learning–based trading signals with the volatility scaling framework used in time-series momentum strategies (Moskowitz, Ooi, and Pedersen 2012; Baltas and Kosowski 2017), which we refer to as *deep momentum networks* (DMNs). This improves existing methods from several angles. First, by using deep neural networks to directly generate trading signals, we remove the need to manually specify both the trend estimator and position sizing methodology—allowing them to be learned directly using modern time-series prediction architectures. Second, by using automatic differentiation in existing backpropagation frameworks, we explicitly optimize networks for risk-adjusted performance metrics—that is, the Sharpe ratio (Sharpe 1994)—improving the risk profile of the signal on the whole. Lastly, retaining a consistent framework with other momentum strategies also allows us to retain desirable attributes from previous works—specifically volatility scaling, which plays a critical role in the positive performance of time-series momentum strategies (Harvey et al. 2018). This consistency also helps when making comparisons to existing methods and facilitates the interpretation of different components of the overall signal by practitioners.

## RELATED WORK

### Classical Momentum Strategies

Momentum strategies are traditionally divided into two categories: (multivariate) cross-sectional momentum (Jegadeesh and Titman 1993; Kim 2019) and (univariate) time-series momentum (Moskowitz, Ooi, and Pedersen 2012; Baltas and Kosowski 2017). Cross-sectional momentum strategies focus on the relative performance of securities against each other, buying relative winners and selling relative losers. By ranking a universe of stocks based on their past return and trading the top decile against the bottom decile, Jegadeesh and Titman (1993) found that securities that recently outperformed their peers over the past 3 to 12 months continue to outperform on average over the next month. The performance of cross-sectional momentum has also been shown to be stable across time (Jegadeesh and Titman 2001) and across a variety of markets and asset classes (Baz et al. 2015).

Time-series momentum extends the idea to focus on an asset’s own past returns, building portfolios comprising all securities under consideration. This was initially proposed by Moskowitz, Ooi, and Pedersen (2012), who described a concrete strategy that uses volatility scaling and trades positions based on the sign of returns over the past year; they demonstrated profitability across 58 liquid instruments individually over 25 years of data. Since then, numerous trading rules have been proposed, with various trend estimation techniques and methods mapping them to traded positions. For instance, Bruder et al. (2013) documented a wide range of linear and nonlinear filters to measure trends and a statistic to test for their significance, although they did not directly discuss methods to size positions with these estimates. Baltas and Kosowski (2017) adopted an approach similar to that of Moskowitz, Ooi, and Pedersen (2012), regressing the log price over the past 12 months against time and using the regression coefficient *t*-statistics to determine the direction of the traded position. Although Sharpe ratios were comparable between the two, *t*-statistic–based trend estimation led to a 66% reduction in portfolio turnover and consequently trading costs. More sophisticated trading rules were proposed by Baz et al. (2015) and Rohrbach, Suremann, and Osterrieder (2017), taking volatility-normalized moving average convergence divergence (MACD) indicators as inputs. Despite the diversity of options, few comparisons have been made of the trading rules themselves, offering little clear evidence or intuitive reasoning to favor one rule over the next. We hence propose the use of deep neural networks to generate these rules directly, avoiding the need for explicit specification. Because the networks are trained on risk-adjusted performance metrics, they learn optimal trading rules directly from the data itself.

### Deep Learning in Finance

Machine learning has long been used for financial time-series prediction, with recent deep learning applications studying mid-price prediction using daily data (Ghoshal and Roberts 2018) or using limit order book data in a high-frequency trading setting (Sirignano and Cont 2018; Zhang, Zohren, and Roberts 2018, 2019). Although a variety of CNN and RNN models have been proposed, they typically frame the forecasting task as a classification problem, demonstrating the improved accuracy of their method in predicting the direction of the next price movement. Trading rules are then manually defined in relation to class probabilities—either by using thresholds on classification probabilities to determine when to initiate positions (Ghoshal and Roberts 2018) or incorporating these thresholds into the classification problem itself by dividing price movements into buy, hold, and sell classes depending on magnitude (Zhang, Zohren, and Roberts 2018, 2019). In addition to restricting the universe of strategies to those that rely on high accuracy, further gains might be made by learning trading rules directly from the data and removing the need for manual specification, both of which are addressed in our proposed method.

Deep learning regression methods have also been considered in cross-sectional strategies (Gu, Kelly, and Xiu 2017; Kim 2019), ranking assets on the basis of expected returns over the next time period. Using a variety of linear, tree-based, and neural network models, Gu, Kelly, and Xiu (2017) demonstrated the outperformance of nonlinear methods, with deep neural networks (specifically three-layer multilayer perceptrons, or MLPs) having the best out-of-sample predictive *R*^{2}. Machine learning portfolios were then built by ranking stocks on a monthly basis using model predictions, with the best strategy coming from a four-layer MLP that trades the top decile against the bottom decile of predictions. In other works, Kim (2019) adopted a similar approach using autoencoder and denoising autoencoder architectures, incorporating volatility scaling into this model as well. Although the results with basic deep neural networks are promising, they do not consider more modern architectures for time-series prediction, such as long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) and WaveNet (van den Oord et al. 2016), architectures that we evaluate for the DMN. Moreover, to the best of our knowledge, our article is the first to consider the use of deep learning within the context of time-series momentum strategies, thus opening up possibilities in an alternate class of signals.

Popularized by the success of DeepMind’s AlphaGo Zero (Silver et al. 2017), deep reinforcement learning (RL) has also gained much attention in recent times. Prized for its ability to recommend path-dependent actions in dynamic environments, RL is of particular interest within the context of optimal execution and automated hedging (Bühler et al. 2018; Kolm and Ritter 2019), for example, where actions taken can have an impact on future states of the world (e.g., market impact). However, deep RL methods generally require a realistic simulation environment (for Q-learning or policy gradient methods) or model of the world (for model-based RL) to provide feedback to agents during training, both of which are difficult to obtain in practice.

## STRATEGY DEFINITION

Adopting the framework of Baltas and Kosowski (2017), the combined returns of a time-series momentum (*TSMOM*) strategy can be expressed as follows, characterized by a trading rule or signal *X*_{t} ∈ [−1, 1]:

$$r_{t,t+1}^{TSMOM} = \frac{1}{N_t} \sum_{i=1}^{N_t} X_t^{(i)} \, \frac{\sigma_{tgt}}{\sigma_t^{(i)}} \, r_{t,t+1}^{(i)} \tag{1}$$

Here $r_{t,t+1}^{TSMOM}$ is the realized return of the strategy from day *t* to *t* + 1, *N*_{t} is the number of included assets at *t*, and $r_{t,t+1}^{(i)}$ is the one-day return of asset *i*. This allows us to retain consistency with previous work on time-series momentum strategies, putting signals together in an equal-volatility-weighted portfolio to cap the returns volatility of each individual asset at a target level. We set the annualized volatility target σ_{tgt} to be 15% and scale asset returns with an ex ante volatility estimate $\sigma_t^{(i)}$, computed using an exponentially weighted moving standard deviation with a 60-day span on $r_{t,t+1}^{(i)}$.
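As a rough illustration of the volatility scaling described above, the following sketch computes an exponentially weighted standard deviation with a 60-day span and then the equal-volatility-weighted portfolio return for a single day. The function names, the recursive (non-bias-corrected) variance estimator, and the annualization of daily volatility with 252 trading days are assumptions made for this example rather than details stated in the article.

```python
import math

def ewm_std(returns, span=60):
    """Exponentially weighted std of daily returns, alpha = 2 / (span + 1).

    Assumption: simple recursive estimator without bias correction.
    """
    alpha = 2.0 / (span + 1.0)
    mean = returns[0]
    var = 0.0
    for r in returns[1:]:
        diff = r - mean
        incr = alpha * diff
        mean += incr                          # update the EW mean
        var = (1.0 - alpha) * (var + diff * incr)  # update the EW variance
    return math.sqrt(var)

def tsmom_return(signals, daily_vols, next_returns, sigma_tgt=0.15, days=252):
    """Average over assets of X * (sigma_tgt / sigma) * r for one day."""
    total = 0.0
    for x, vol, r in zip(signals, daily_vols, next_returns):
        annual_vol = vol * math.sqrt(days)    # assumption: annualize daily vol
        total += x * (sigma_tgt / annual_vol) * r
    return total / len(signals)
```

With signals of +1 and −1 on two assets, the more volatile asset receives the smaller scaled position, so each contributes a comparable amount of risk to the portfolio.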

### Standard Trading Rules

In traditional financial time-series momentum strategies, the construction of a trading signal *X*_{t} is typically divided into two steps: (1) estimating future trends based on past information and (2) computing the actual positions to hold. We illustrate this in this section using two examples from the academic literature (Moskowitz, Ooi, and Pedersen 2012; Baz et al. 2015), which we also include as benchmarks for our tests.

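**Moskowitz, Ooi, and Pedersen.** In their original paper on time-series momentum, a simple trading rule was adopted as follows:

$$\text{Trend estimation:} \quad Y_t^{(i)} = r_{t-252,t}^{(i)} \tag{2}$$

$$\text{Position sizing:} \quad X_t^{(i)} = \operatorname{sgn}\left(Y_t^{(i)}\right) \tag{3}$$

This broadly uses the past year’s returns as a trend estimate for the next time step, taking a maximum long position when the expected trend is positive (i.e., $Y_t^{(i)} > 0$) and a maximum short position when negative.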

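**Baz et al.** In practice, more sophisticated methods can be used to compute $Y_t^{(i)}$ and $X_t^{(i)}$, such as the model of Baz et al. (2015) described in the following:

$$Y_t^{(i)} = \frac{q_t^{(i)}}{\operatorname{std}\left(q_{t-252:t}^{(i)}\right)} \tag{4}$$

$$q_t^{(i)} = \frac{\mathrm{MACD}(i,t,S,L)}{\operatorname{std}\left(p_{t-63:t}^{(i)}\right)} \tag{5}$$

$$\mathrm{MACD}(i,t,S,L) = m(i,S) - m(i,L) \tag{6}$$

Here $\operatorname{std}(p_{t-63:t}^{(i)})$ is the 63-day rolling standard deviation of asset *i* prices $p_{t-63:t}^{(i)} = [p_{t-63}^{(i)}, \ldots, p_t^{(i)}]$, and *m*(*i*,*S*) is the exponentially weighted moving average of asset *i* prices with a time scale *S* that translates into a half-life of $HL = \log(0.5)/\log(1 - 1/S)$. The MACD signal is defined in relation to a short and a long time scale *S* and *L*, respectively.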

The volatility-normalized MACD signal hence measures the strength of the trend, which is then translated into a position size as follows:

$$X_t^{(i)} = \phi\left(Y_t^{(i)}\right) \tag{7}$$

where $\phi(y) = \frac{y \exp(-y^2/4)}{0.89}$. Plotting ϕ(*y*) in Exhibit 1, we can see that positions are increased until $|Y_t^{(i)}| = \sqrt{2} \approx 1.41$ before decreasing back to zero for larger moves. This allows the signal to reduce positions in instances in which assets are overbought or oversold, defined as when $|q_t^{(i)}|$ is observed to be larger than 1.41 times its past year’s standard deviation.

Increasing the complexity even further, multiple signals with different time scales can also be combined to give a final position:

$$\tilde{Y}_t^{(i)} = \sum_{k=1}^{3} Y_t^{(i)}(S_k, L_k) \tag{8}$$

where $Y_t^{(i)}(S_k, L_k)$ is as per Equation 4 with explicitly defined short and long time scales, using *S*_{k} ∈ {8, 16, 32} and *L*_{k} ∈ {24, 48, 96} as defined by Baz et al. (2015).
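The MACD-based rule can be sketched in a few lines. The example below is illustrative only: it assumes the exponentially weighted average uses a decay of 1/*S* for time scale *S*, and that the response function takes the form ϕ(*y*) = *y* exp(−*y*²/4)/0.89, which peaks at |*y*| = √2 ≈ 1.41. The function names are hypothetical, and the two volatility normalization steps are omitted for brevity.

```python
import math

def ewma(prices, scale):
    """Exponentially weighted moving average with decay alpha = 1 / scale."""
    alpha = 1.0 / scale
    avg = prices[0]
    for p in prices[1:]:
        avg = alpha * p + (1.0 - alpha) * avg
    return avg

def macd(prices, short, long):
    """Raw MACD: short-scale EWMA minus long-scale EWMA of prices."""
    return ewma(prices, short) - ewma(prices, long)

def phi(y):
    """Response function: grows until |y| = sqrt(2), then decays to zero."""
    return y * math.exp(-y * y / 4.0) / 0.89

# Raw MACD values for the three (S_k, L_k) pairs on a toy rising price path.
TIME_SCALES = [(8, 24), (16, 48), (32, 96)]
toy_prices = [100.0 + 0.5 * t for t in range(300)]
raw_signals = [macd(toy_prices, s, l) for s, l in TIME_SCALES]
```

On a steadily rising price path the short-scale average sits above the long-scale one, so every raw MACD value is positive; the response function then caps the position once the normalized signal becomes extreme.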

### Machine Learning Extensions

As described, many arbitrary design decisions are required to define a sophisticated time-series momentum strategy. We hence start by considering how machine learning methods can be used to learn these relationships directly from data—alleviating the need for manual specification.

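**Standard supervised learning.** In line with numerous previous investigations (see “Related Work” section), we can cast trend estimation as a standard regression or binary classification problem, with outputs:

$$Y_t^{(i)} = f\left(\mathbf{u}_t^{(i)}; \theta\right) \tag{9}$$

where *f*(·) is the output of the machine learning model, which takes in a vector of input features $\mathbf{u}_t^{(i)}$ and model parameters θ to generate predictions. Taking volatility-normalized returns as targets, the following mean-squared error and binary cross-entropy losses can be used for training:

$$\mathcal{L}_{reg}(\theta) = \frac{1}{M} \sum_{\Omega} \left( Y_t^{(i)} - \frac{r_{t,t+1}^{(i)}}{\sigma_t^{(i)}} \right)^2 \tag{10}$$

$$\mathcal{L}_{binary}(\theta) = -\frac{1}{M} \sum_{\Omega} \left\{ \mathbb{I} \log\left(Y_t^{(i)}\right) + (1 - \mathbb{I}) \log\left(1 - Y_t^{(i)}\right) \right\} \tag{11}$$

where $\Omega = \left\{ \left(Y_1^{(1)}, r_{1,2}^{(1)}/\sigma_1^{(1)}\right), \ldots, \left(Y_{T-1}^{(N)}, r_{T-1,T}^{(N)}/\sigma_{T-1}^{(N)}\right) \right\}$ is the set of all *M* = *NT* possible prediction and target tuples across all *N* assets and *T* time steps. For the binary classification case, 𝕀 is the indicator function $\mathbb{I}\left(r_{t,t+1}^{(i)}/\sigma_t^{(i)} > 0\right)$, making $Y_t^{(i)}$ the estimated probability of a positive return.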

This still leaves us to specify how trend estimates map to positions, and we do so using a similar form to Equation 3 as follows:

$$\text{Regression:} \quad X_t^{(i)} = \operatorname{sgn}\left(Y_t^{(i)}\right)$$

$$\text{Classification:} \quad X_t^{(i)} = \operatorname{sgn}\left(Y_t^{(i)} - 0.5\right)$$

As such, we take a maximum long position when the expected returns are positive in the regression case or when the probability of a positive return is greater than 0.5 in the classification case. This formulation maintains consistency with past work on time-series momentum and volatility scaling, allowing us to make direct comparisons with previous methods and to evaluate the performance of sophisticated trend estimators as opposed to simply using the previous year’s return.

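**Direct outputs.** An alternative approach is to use machine learning models to generate positions directly, simultaneously learning both trend estimation and position sizing in the same function:

$$X_t^{(i)} = f\left(\mathbf{u}_t^{(i)}; \theta\right) \tag{15}$$

Given the lack of direct information on the optimal positions to hold at each step (which is required to produce labels for standard regression and classification models), calibration would hence need to be performed by directly optimizing performance metrics. Specifically, we focus on optimizing the average return and the Sharpe ratio via the loss functions as follows:

$$\mathcal{L}_{returns}(\theta) = -\mu_R = -\frac{1}{M} \sum_{\Omega} R(i,t) = -\frac{1}{M} \sum_{\Omega} X_t^{(i)} \, \frac{\sigma_{tgt}}{\sigma_t^{(i)}} \, r_{t,t+1}^{(i)} \tag{16}$$

$$\mathcal{L}_{sharpe}(\theta) = -\frac{\mu_R \times \sqrt{252}}{\sqrt{\left(\sum_{\Omega} R(i,t)^2\right)/M - \mu_R^2}} \tag{17}$$

where $\mu_R$ is the average return over Ω and *R*(*i*, *t*) is the return captured by the trading rule for asset *i* at time *t*.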
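To make the two performance-based objectives concrete, the sketch below evaluates them on a flat list of captured returns *R*(*i*, *t*). It is written in plain Python for readability rather than in an automatic differentiation framework, and it assumes an annualization factor of √252 and a volatility estimate expressed on the same scale as the target. The function names are illustrative, not taken from the article.

```python
import math

def captured_returns(positions, vols, next_returns, sigma_tgt=0.15):
    """R(i, t) = X * (sigma_tgt / sigma) * r for each (asset, time) pair.

    Assumption: vols are on the same (annualized) scale as sigma_tgt.
    """
    return [x * sigma_tgt / s * r
            for x, s, r in zip(positions, vols, next_returns)]

def returns_loss(captured):
    """Negative average captured return."""
    return -sum(captured) / len(captured)

def sharpe_loss(captured, days=252):
    """Negative annualized Sharpe ratio of the captured returns."""
    m = len(captured)
    mean = sum(captured) / m
    var = sum(r * r for r in captured) / m - mean * mean
    return -mean * math.sqrt(days) / math.sqrt(var)
```

Minimizing the second loss rewards positions that earn returns steadily rather than through a few large moves, since any increase in the dispersion of captured returns raises the denominator.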

## DEEP MOMENTUM NETWORKS

In this section, we examine a variety of architectures that can be used in DMNs, all of which can be easily reconfigured to generate the predictions described in the previous section. This is achieved by implementing the models using the Keras API in TensorFlow (Abadi et al. 2015), in which output activation functions can be flexibly interchanged to generate predictions of different types (e.g., expected returns, binary probabilities, or direct positions). Arbitrary loss functions can also be defined for direct outputs, with gradients for backpropagation being easily computed using the built-in libraries for automatic differentiation.

### Network Architectures

**Lasso regression.** In the simplest case, a standard linear model could be used to generate predictions as follows:

$$Z_t^{(i)} = g\left(\mathbf{w}^T \mathbf{u}_{t-\tau:t}^{(i)} + b\right) \tag{18}$$

where $Z_t^{(i)} \in \{X_t^{(i)}, Y_t^{(i)}\}$ depending on the prediction task, **w** is a weight vector for the linear model, and *b* is a bias term. Here *g*(·) is an activation function that depends on the specific prediction type: linear for standard regression, sigmoid for binary classification, and tanh for direct outputs.

Additional regularization is also provided during training by augmenting the various loss functions to include an additional *L*_{1} regularizer as follows:

$$\tilde{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \alpha \lVert \mathbf{w} \rVert_1 \tag{19}$$

where 𝓛(θ) corresponds to one of the loss functions described in the previous section, ∥**w**∥_{1} is the *L*_{1} norm of **w**, and α is a constant term that we treat as an additional hyperparameter. To incorporate recent history into predictions as well, we concatenate inputs over the past τ days into a single input vector, that is, $\mathbf{u}_{t-\tau:t}^{(i)} = \left[\mathbf{u}_{t-\tau}^{(i)T}, \ldots, \mathbf{u}_t^{(i)T}\right]^T$. This was fixed to be τ = 5 days in our experiments.

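**Multilayer perceptron.** Increasing the degree of model complexity slightly, a two-layer neural network can be used to incorporate nonlinear effects:

$$\mathbf{h}_t^{(i)} = \tanh\left(\mathbf{W}_h \mathbf{u}_{t-\tau:t}^{(i)} + \mathbf{b}_h\right) \tag{20}$$

$$Z_t^{(i)} = g\left(\mathbf{W}_z \mathbf{h}_t^{(i)} + \mathbf{b}_z\right) \tag{21}$$

where $\mathbf{h}_t^{(i)}$ is the hidden state of the MLP using an internal tanh activation function, tanh(·), and **W**_{.} and **b**_{.} are layer weight matrixes and biases, respectively.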

**WaveNet.** More modern techniques such as CNNs have been used in the domain of time-series prediction, particularly in the form of autoregressive architectures (e.g., Binkowski, Marti, and Donnat 2018). These typically take the form of one-dimensional causal convolutions, sliding convolutional filters across time to extract useful representations that are then aggregated in higher layers of the network. To increase the size of the receptive field—or the length of history fed into the CNN—dilated CNNs such as WaveNet (van den Oord et al. 2016) have been proposed; these skip over inputs at intermediate levels with a predetermined dilation rate and thus can effectively increase the amount of historical information used by the CNN without a large increase in computational cost.

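Let us consider a dilated convolutional layer with residual connections that takes the following form:

$$\psi(\mathbf{u}) = \underbrace{\tanh(\mathbf{W}\mathbf{u}) \odot \sigma(\mathbf{V}\mathbf{u})}_{\text{gated activation}} + \underbrace{\mathbf{A}\mathbf{u} + \mathbf{b}}_{\text{skip connection}} \tag{22}$$

Here **W** and **V** are weight matrixes associated with the gated activation function, and **A** and **b** are the weights and biases used to transform the input **u** to match the dimensionality of the layer outputs for the skip connection. The equations for the WaveNet architecture used in our investigations can then be expressed as

$$\mathbf{s}_{weekly}^{(i)}(t) = \psi\left(\mathbf{u}_{t-5:t}^{(i)}\right) \tag{23}$$

$$\mathbf{s}_{monthly}^{(i)}(t) = \psi\left(\left[\mathbf{s}_{weekly}^{(i)}(t);\ \mathbf{s}_{weekly}^{(i)}(t-5);\ \mathbf{s}_{weekly}^{(i)}(t-10);\ \mathbf{s}_{weekly}^{(i)}(t-15)\right]\right) \tag{24}$$

$$\mathbf{s}_{quarterly}^{(i)}(t) = \psi\left(\left[\mathbf{s}_{monthly}^{(i)}(t);\ \mathbf{s}_{monthly}^{(i)}(t-21);\ \mathbf{s}_{monthly}^{(i)}(t-42)\right]\right) \tag{25}$$

Here each intermediate layer $\mathbf{s}_{\cdot}^{(i)}(t)$ aggregates representations at weekly, monthly, and quarterly frequencies, respectively. Intermediate layers are then concatenated before passing through a two-layer MLP to generate outputs:

$$\mathbf{s}_t^{(i)} = \left[\mathbf{s}_{weekly}^{(i)}(t);\ \mathbf{s}_{monthly}^{(i)}(t);\ \mathbf{s}_{quarterly}^{(i)}(t)\right]$$

$$\mathbf{h}_t^{(i)} = \tanh\left(\mathbf{W}_h \mathbf{s}_t^{(i)} + \mathbf{b}_h\right)$$

$$Z_t^{(i)} = g\left(\mathbf{W}_z \mathbf{h}_t^{(i)} + \mathbf{b}_z\right)$$

State sizes for each intermediate layer $\mathbf{s}_{weekly}^{(i)}(t)$, $\mathbf{s}_{monthly}^{(i)}(t)$, $\mathbf{s}_{quarterly}^{(i)}(t)$ and the MLP hidden state $\mathbf{h}_t^{(i)}$ are fixed to be the same, allowing us to use a single hyperparameter to define the architecture. To independently evaluate the performance of CNN and RNN architectures, the preceding also excludes the LSTM block (i.e., the context stack) described by van den Oord et al. (2016) and focuses purely on the merits of the dilated CNN model.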
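The gated layer with a skip connection described above can be illustrated with a small dependency-free sketch. It assumes the common gated form, in which a tanh branch is multiplied elementwise by a sigmoid branch and a linear transform of the input is added back; the helper names and the list-of-lists matrix layout are choices made for this example only.

```python
import math

def matvec(matrix, vector):
    """Multiply a row-major list-of-lists matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_layer(u, W, V, A, b):
    """tanh(W u) * sigmoid(V u) + A u + b, computed elementwise.

    W and V parameterize the gated activation; A and b map the input to
    the output dimension for the skip connection.
    """
    gated = [math.tanh(w) * sigmoid(v)
             for w, v in zip(matvec(W, u), matvec(V, u))]
    skip = [a + bias for a, bias in zip(matvec(A, u), b)]
    return [g + s for g, s in zip(gated, skip)]
```

Stacking such layers, with each one reading the outputs of the layer below at widely spaced time offsets, is what lets a dilated network cover a long history with few parameters.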

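**Long short-term memory.** Traditionally used in sequence prediction for natural language processing, RNNs, and specifically LSTM architectures (Hochreiter and Schmidhuber 1997), have been increasingly used in time-series prediction tasks. The equations for the LSTM in our model are as follows:

$$\mathbf{f}_t^{(i)} = \sigma\left(\mathbf{W}_f \mathbf{u}_t^{(i)} + \mathbf{V}_f \mathbf{h}_{t-1}^{(i)} + \mathbf{b}_f\right)$$

$$\mathbf{i}_t^{(i)} = \sigma\left(\mathbf{W}_i \mathbf{u}_t^{(i)} + \mathbf{V}_i \mathbf{h}_{t-1}^{(i)} + \mathbf{b}_i\right)$$

$$\mathbf{o}_t^{(i)} = \sigma\left(\mathbf{W}_o \mathbf{u}_t^{(i)} + \mathbf{V}_o \mathbf{h}_{t-1}^{(i)} + \mathbf{b}_o\right)$$

$$\mathbf{c}_t^{(i)} = \mathbf{f}_t^{(i)} \odot \mathbf{c}_{t-1}^{(i)} + \mathbf{i}_t^{(i)} \odot \tanh\left(\mathbf{W}_c \mathbf{u}_t^{(i)} + \mathbf{V}_c \mathbf{h}_{t-1}^{(i)} + \mathbf{b}_c\right)$$

$$\mathbf{h}_t^{(i)} = \mathbf{o}_t^{(i)} \odot \tanh\left(\mathbf{c}_t^{(i)}\right)$$

$$Z_t^{(i)} = g\left(\mathbf{W}_z \mathbf{h}_t^{(i)} + \mathbf{b}_z\right)$$

where ⊙ is the Hadamard (elementwise) product; σ(.) is the sigmoid activation function; **W**_{.} and **V**_{.} are weight matrixes for the different layers; $\mathbf{f}_t^{(i)}$, $\mathbf{i}_t^{(i)}$, $\mathbf{o}_t^{(i)}$ correspond to the forget, input, and output gates, respectively; $\mathbf{c}_t^{(i)}$ is the cell state; and $\mathbf{h}_t^{(i)}$ is the hidden state of the LSTM. From these equations, we can see that the LSTM uses the cell state as a compact summary of past information, controlling memory retention with the forget gate and incorporating new information via the input gate. As such, the LSTM is able to learn representations of long-term relationships relevant to the prediction task, sequentially updating its internal memory states with new observations at each step.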
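For readers who prefer code to gate equations, the following sketch performs a single step of a standard LSTM cell using plain Python lists. It follows the textbook formulation (forget, input, and output gates plus a tanh candidate state); the parameter dictionary layout and function names are assumptions for illustration and do not reflect the Keras implementation used by the authors.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def _affine(W, u, V, h, b):
    """Compute W u + V h + b row by row."""
    return [sum(w * x for w, x in zip(w_row, u))
            + sum(v * x for v, x in zip(v_row, h)) + bias
            for w_row, v_row, bias in zip(W, V, b)]

def lstm_step(u, h_prev, c_prev, params):
    """One LSTM update; params maps 'f', 'i', 'o', 'c' to (W, V, b)."""
    pre = {k: _affine(W, u, V, h_prev, b) for k, (W, V, b) in params.items()}
    f = [_sigmoid(z) for z in pre["f"]]      # forget gate
    i = [_sigmoid(z) for z in pre["i"]]      # input gate
    o = [_sigmoid(z) for z in pre["o"]]      # output gate
    cand = [math.tanh(z) for z in pre["c"]]  # candidate cell update
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, cand)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

Calling `lstm_step` repeatedly over a sequence, feeding each returned `h` and `c` into the next call, reproduces the sequential memory update described in the text.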

### Training Details

Model calibration was undertaken using minibatch stochastic gradient descent with the Adam optimizer (Kingma and Ba 2015), based on the previously defined loss functions. Backpropagation was performed up to a maximum of 100 training epochs using 90% of a given block of training data with the most recent 10% retained as a validation dataset. Validation data are then used to determine convergence—with early stopping triggered when the validation loss has not improved for 25 epochs—and to identify the optimal model across hyperparameter settings. Hyperparameter optimization was conducted using 50 iterations of random search (full details are provided in the Appendix). For additional information on the deep neural network calibration, please refer to Goodfellow, Bengio, and Courville (2016).
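The validation split and early stopping rule described above can be sketched as follows. The callback `run_epoch` is a hypothetical stand-in for one epoch of training that returns the validation loss; the defaults mirror the 90%/10% split, the 100-epoch cap, and the 25-epoch patience mentioned in the text.

```python
def split_train_valid(block, valid_frac=0.1):
    """Keep the most recent valid_frac of a time-ordered block for validation."""
    n_valid = int(round(len(block) * valid_frac))
    cut = len(block) - n_valid
    return block[:cut], block[cut:]

def train_with_early_stopping(run_epoch, max_epochs=100, patience=25):
    """Stop once validation loss has not improved for `patience` epochs.

    run_epoch(epoch) is a hypothetical callback that trains for one epoch
    and returns the validation loss.
    """
    best_loss, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        loss = run_epoch(epoch)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss
```

In practice the model weights from the best epoch would also be saved and restored, a step omitted here to keep the sketch short.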

Dropout regularization (Srivastava et al. 2014) was a key feature to avoid overfitting in the neural network models, with dropout rates included as hyperparameters during training. This was applied to the inputs and hidden state for the MLP, as well as the inputs (Equation 23) and outputs (Equation 27) of the convolutional layers in the WaveNet architecture. For the LSTM, we adopted the same dropout masks used by Gal and Ghahramani (2016)—applying dropout to the RNN inputs, recurrent states, and outputs.

## PERFORMANCE EVALUATION

### Overview of Dataset

The predictive performance of the different architectures was evaluated via a backtest using 88 ratio-adjusted continuous futures contracts downloaded from the Pinnacle Data Corp. CLC Database.^{1} These contracts spanned a variety of asset classes—including commodities, fixed income, and currency futures—and contained prices from 1990 to 2015. A full breakdown of the dataset can be found in the Appendix.

In contrast to simple panama-stitching, which permits negative prices, the positivity ensured by ratio adjustments would allow us to adopt the geometric returns formulation used by traditional definitions of time-series momentum, maintaining consistency with previous works.

### Backtest Description

Throughout our backtest, the models were recalibrated from scratch every five years using an expanding window of data, rerunning the entire hyperparameter optimization procedure using all data available up to the recalibration point. Model weights were then fixed for signals generated over the next five-year period, ensuring that tests were performed out of sample (i.e., from 1995–2015, with 1990–1995 used for training only). In addition, we also report the average expected return per five-year out-of-sample block in the online supplement, detailing means and standard deviations across all blocks.
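The expanding-window recalibration schedule can be written out explicitly. The helper below is a hypothetical illustration that works at yearly granularity and treats each end year as exclusive; it simply enumerates which years are used for training and which for out-of-sample testing at each recalibration point.

```python
def recalibration_windows(first_year=1990, last_year=2015, step=5):
    """List (train_start, train_end, test_start, test_end) year boundaries.

    Training always starts at first_year (expanding window); each test
    block covers the `step` years after the recalibration point.
    """
    windows = []
    test_start = first_year + step
    while test_start < last_year:
        test_end = min(test_start + step, last_year)
        windows.append((first_year, test_start, test_start, test_end))
        test_start = test_end
    return windows

for window in recalibration_windows():
    print(window)
```

With the default arguments this yields four out-of-sample blocks, starting in 1995, 2000, 2005, and 2010, each trained on all data from 1990 up to the start of the block.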

For the DMNs, we incorporate a series of useful features adopted by standard time-series momentum strategies (see “Standard Trading Rules”) in $\mathbf{u}_t^{(i)}$ to generate predictions at each step:

1. *Normalized returns*: Returns over the past day, one-month, three-month, six-month, and one-year periods are used, normalized by a measure of daily volatility scaled to an appropriate time scale. For instance, normalized annual returns were taken to be $r_{t-252,t}^{(i)} / \left(\sigma_t^{(i)} \sqrt{252}\right)$.

2. *MACD indicators*: We also include the MACD indicators (i.e., trend estimates $Y_t^{(i)}$), as in Equation 4, using the same short time scales *S*_{k} ∈ {8, 16, 32} and long time scales *L*_{k} ∈ {24, 48, 96}.
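The normalized return features can be sketched as below. The trading-day counts assigned to each horizon (21, 63, 126, and 252 days) are assumptions for this example, as is the use of simple rather than log returns; the scaling divides each return by the daily volatility multiplied by the square root of the horizon length.

```python
import math

# Assumed trading-day counts for each lookback horizon.
HORIZONS = {"1d": 1, "1m": 21, "3m": 63, "6m": 126, "1y": 252}

def normalized_returns(prices, daily_vol):
    """Return over each horizon divided by daily_vol * sqrt(horizon).

    `prices` must contain at least 253 observations, most recent last.
    """
    features = {}
    for name, horizon in HORIZONS.items():
        ret = prices[-1] / prices[-1 - horizon] - 1.0
        features[name] = ret / (daily_vol * math.sqrt(horizon))
    return features
```

Scaling by the square root of the horizon puts short and long lookbacks on a comparable footing, so that no single horizon dominates the network inputs purely because of its length.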

For comparisons against traditional time-series momentum strategies, we also incorporate the following reference benchmarks:

1. Long only with volatility scaling

2. Sgn(returns) (Moskowitz, Ooi, and Pedersen 2012)

3. MACD signal (Baz et al. 2015)

Finally, performance was judged based on the following metrics:

1. *Profitability*: expected returns (E[Returns]) and the percentage of positive returns observed across the test period

2. *Risk*: daily volatility (Vol.), downside deviation, and the maximum drawdown (MDD) of the overall portfolio

3. *Performance ratios*: Risk-adjusted performance was measured by the Sharpe ratio (E[Returns]/Vol.), Sortino ratio (E[Returns]/Downside deviation), and Calmar ratio (E[Returns]/MDD), as well as the average profit over the average loss (Ave. P/Ave. L).
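The evaluation metrics listed above can be computed from a series of daily strategy returns roughly as follows. This is a sketch under stated assumptions: returns are annualized with 252 trading days, downside deviation is measured against a zero threshold, and drawdown is computed on additively cumulated returns. The article's exact conventions may differ.

```python
import math

def _ratio(a, b):
    """Safe division that returns NaN when the denominator is zero."""
    return a / b if b else float("nan")

def performance_metrics(daily_returns, days=252):
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    ann_ret = mean * days
    vol = math.sqrt(sum((r - mean) ** 2 for r in daily_returns) / n * days)
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in daily_returns) / n * days)
    cum = peak = mdd = 0.0
    for r in daily_returns:              # maximum drawdown of cumulated returns
        cum += r
        peak = max(peak, cum)
        mdd = max(mdd, peak - cum)
    gains = [r for r in daily_returns if r > 0]
    losses = [-r for r in daily_returns if r < 0]
    return {
        "expected_return": ann_ret,
        "pct_positive": len(gains) / n,
        "vol": vol,
        "downside_deviation": downside,
        "mdd": mdd,
        "sharpe": _ratio(ann_ret, vol),
        "sortino": _ratio(ann_ret, downside),
        "calmar": _ratio(ann_ret, mdd),
        "profit_over_loss": _ratio(_ratio(sum(gains), len(gains)),
                                   _ratio(sum(losses), len(losses))),
    }
```

Reporting the percentage of positive returns alongside the profit-to-loss ratio is useful because, as discussed earlier, a strategy can lose on most days and still be profitable if its gains are larger than its losses.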

### Results and Discussion

Aggregating the out-of-sample predictions from 1995 to 2015 (1990–1995 was used for training only), we compute performance metrics for both the strategy returns based on Equation 1 (Exhibit 2) and for portfolios with an additional layer of volatility scaling, which brings overall strategy returns to match the 15% volatility target (Exhibit 3). Given the large differences in returns volatility seen in Exhibit 2, this rescaling also helps to facilitate comparisons between the cumulative returns of different strategies, which are plotted for various loss functions in Exhibit 4. We note that strategy returns in this section are computed in the absence of transaction costs, allowing us to focus on the raw predictive ability of the models themselves. The impact of transaction costs is explored further in the next section, where we undertake a deeper analysis of signal turnover.

Focusing on the raw signal outputs, the Sharpe ratio–optimized LSTM outperforms all benchmarks as expected, improving the best neural network model (Sharpe-optimized MLP) by 44% and the best reference benchmark (Sgn(Returns)) by more than two times. In conjunction with Sharpe ratio improvements to both the linear and MLP models, this highlights the benefits of using models that capture nonlinear relationships and have access to more time history via an internal memory state. Additional model complexity, however, does not necessarily lead to better predictive performance, as demonstrated by the underperformance of WaveNet compared to both the reference benchmarks and simple linear models. Part of this can be attributed to the difficulties in tuning models with multiple design parameters; for instance, better results could possibly be achieved by using alternative dilation rates, numbers of convolutional layers, and hidden state sizes in Equations 23 to 25 for the WaveNet. In contrast, only a single design parameter is sufficient to specify the hidden state size in both the MLP and LSTM models. Analyzing the relative performance within each model class, we can see that models that directly generate positions perform the best, demonstrating the benefits of simultaneously learning both trend estimation and position sizing functions. In addition, with the exception of a slight decrease in the MLP, Sharpe-optimized models outperform returns-optimized ones, with standard regression and classification benchmarks taking third and fourth place, respectively.

From Exhibit 3, although the addition of volatility scaling at the portfolio level improved performance ratios on the whole, it had a larger beneficial effect on machine learning models compared to the reference benchmarks, propelling Sharpe-optimized MLPs to outperform returns-optimized ones and even leading to Sharpe-optimized linear models beating reference benchmarks. From a risk perspective, we can see that both volatility and downside deviation also become a lot more comparable, with the former hovering close to 15.5% and the latter around 10%. However, Sharpe-optimized LSTMs still retained the lowest MDD across all models, with superior risk-adjusted performance ratios across the board.

Referring to the cumulative returns plots for the rescaled portfolios in Exhibit 4, the benefits of direct outputs with Sharpe ratio optimization can also be observed, with larger cumulative returns observed for linear, MLP, and LSTM models compared to the reference benchmarks. Furthermore, we note the general underperformance of models that use standard regression and classification methods for trend estimation, hinting at the difficulties faced in selecting an appropriate position sizing function and in optimizing models to generate positions without accounting for risk. This is particularly relevant for binary classification methods, which produce relatively flat equity lines and underperform reference benchmarks in general. Some of these poor results can be explained by the implicit decision threshold adopted. From the percentage of positive returns captured in Exhibit 3, most binary classification models have about a 50% accuracy, which, although expected of a classifier with a 0.5 probability threshold, is far below the accuracy seen in other benchmarks. Furthermore, performance is made worse by the fact that the model’s magnitude of gains versus losses is much smaller than that of competing methods, with average loss magnitudes even outweighing profits for the MLP classifier. As such, these observations lend support to the direct generation of position sizes with machine learning methods, given the multiple considerations (e.g., decision thresholds and profit/loss magnitudes) that would be required to incorporate standard supervised learning methods into a profitable trading strategy.

Strategy performance could also be aided by diversification across a range of assets, particularly when the correlation between signals is low. Hence, to evaluate the raw quality of the underlying signal, we investigate the performance constituents of the time-series momentum portfolios using box plots for a variety of performance metrics, plotting the minimum, lower quartile, median, upper quartile, and maximum values across individual futures contracts. We present in Exhibit 5 plots of key performance metrics, with similar results observed in other performance ratios and documented in the online supplement. In general, the Sharpe ratio plots in Exhibit 5, Panel A, echo previous findings, with direct output methods performing better than indirect trend estimation models. However, as seen in Exhibit 5, Panel C, this is mainly attributable to a significant reduction in signal volatility for the Sharpe-optimized methods, despite a comparable range of average returns in Exhibit 5, Panel B. The benefits of retaining the volatility scaling can also be observed, with individual signal volatility capped near the target across all methods—even with a naive sgn(.) position sizer. As such, the combination of volatility scaling, direct outputs, and Sharpe ratio optimization was key to performance gains in DMNs.

**Feature importance.** Although multiple methods for model interpretability have been proposed (Ribeiro, Singh, and Guestrin 2016; Shrikumar, Greenside, and Kundaje 2017; Lundberg and Lee 2017), they are typically formulated to independently assess the impact of exogenous inputs on model predictions at each time point. However, RNN predictions are also driven by an internal state vector, representing a compact summary of all preceding inputs. As such, feature importance at each time slice can differ depending on the history of inputs leading up to that point, even if the corresponding inputs at that time point are identical, making it difficult to apply standard techniques for interpretability. Therefore, to assess the importance of a given input feature across the entire length of history, we rerun the backtest, setting that feature to zero. This can be viewed as treating the feature as missing during test time, and any consequent performance degradation is reflective of its importance in generating quality predictions. From Exhibit 6, which shows the Sharpe ratio decay when a given feature is removed, we can see that the removal of daily returns results in the largest performance reduction (>2 Sharpe) across all features. This indicates that the daily returns are the most important feature driving predictions, demonstrating that the LSTM is able to build good representations directly from the raw returns. Furthermore, we also observe a Sharpe ratio reduction of around 1–1.5 when each of the other features is removed. This shows that the model is using all inputs and indicates its ability to learn meaningful relationships from the entire input data.
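The ablation procedure can be illustrated with a minimal sketch. It assumes a hypothetical `position_fn` that maps the full input history to one position per time step (so a stateful model such as an LSTM sees the zeroed feature over the whole history), a zero risk-free rate, and 252 trading days per year. The function names are illustrative and are not taken from the original implementation.

```python
import math
from statistics import mean, stdev


def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a daily return series (zero risk-free rate assumed)."""
    return mean(returns) / stdev(returns) * math.sqrt(periods_per_year)


def zero_feature(rows, index):
    """Copy the inputs with one feature column set to zero, i.e. treated as missing."""
    return [[0.0 if j == index else x for j, x in enumerate(row)] for row in rows]


def sharpe_decay(position_fn, rows, next_returns, index):
    """Drop in Sharpe ratio when feature `index` is zeroed out at test time.

    position_fn  : hypothetical callable, full input history -> list of positions
    rows         : rows[t] is the feature vector observed at time t
    next_returns : next_returns[t] is the asset return from t to t+1
    """
    def strategy_sharpe(inputs):
        positions = position_fn(inputs)
        return annualized_sharpe([p * r for p, r in zip(positions, next_returns)])

    return strategy_sharpe(rows) - strategy_sharpe(zero_feature(rows, index))
```

A larger decay suggests that the model relies more heavily on that feature. A decay of zero means the positions did not change when the feature was removed.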

## TURNOVER ANALYSIS

To investigate how transaction costs affect strategy performance, we first analyze the daily position changes of the signal, characterized for asset *i* by daily turnover as defined by Baltas and Kosowski (2017):

which is broadly proportional to the volume of asset *i* traded on day *t* with reference to the updated portfolio weights.
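As a rough sketch of this turnover measure, we assume that it equals the target volatility multiplied by the absolute day-over-day change in the volatility-scaled position. This functional form, the 15% default target, and the function name are assumptions made for illustration and should be checked against Baltas and Kosowski (2017).

```python
def daily_turnover(positions, vols, target_vol=0.15):
    """Turnover per day for one asset, starting from the second observation.

    positions  : trading signal for each day (e.g. values in [-1, 1])
    vols       : ex-ante volatility estimate for each day
    target_vol : annualized volatility target (assumed default of 15%)
    """
    # Volatility-scaled positions, i.e. the quantity that is actually traded.
    scaled = [x / s for x, s in zip(positions, vols)]
    # Absolute change between consecutive days, scaled by the volatility target.
    return [target_vol * abs(curr - prev) for prev, curr in zip(scaled, scaled[1:])]
```

For example, flipping from fully long to fully short at a constant volatility estimate of 10% changes the scaled position by 20 units, which corresponds to a turnover of about 3 under the 15% target.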

Exhibit 7, Panel A, shows the average strategy turnover across all assets from 1995 to 2015, focusing on positions generated by the raw signal outputs. As the box plots are charted on a logarithmic scale, we note that although the machine learning–based models have a similar turnover, they also trade significantly more than the reference benchmarks, approximately 10 times more than the long-only benchmark. This is also reflected in Exhibit 7, Panel B, which compares the average daily returns against the average daily turnover, with ratios from machine learning models lying close to the *x*-axis.

To concretely quantify the impact of transaction costs on performance, we also compute the ex-cost Sharpe ratios using the rebalancing costs defined by Baltas and Kosowski (2017) to adjust our returns for a variety of transaction cost assumptions. For the results in Exhibit 8, the top of each bar chart marks the maximum cost-free Sharpe ratio of the strategy, with each colored block denoting the Sharpe ratio reduction for the corresponding cost assumption. In line with the turnover analysis, the reference benchmarks demonstrate the most resilience to high transaction costs (up to 5 bps), with the profitability across most machine learning models persisting only up to 4 bps. However, we still obtain higher cost-adjusted Sharpe ratios with the Sharpe-optimized LSTM for up to 2–3 bps, demonstrating its suitability for trading more liquid instruments.
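The cost sweep can be sketched as follows, under the simplifying assumption that the cost-adjusted daily return is the gross return minus the cost per unit of turnover times that day's turnover. The helper names and the default cost grid are hypothetical.

```python
import math
from statistics import mean, stdev


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio (zero risk-free rate assumed)."""
    return mean(returns) / stdev(returns) * math.sqrt(periods_per_year)


def ex_cost_sharpe_by_cost(gross_returns, turnover, costs_bps=(0, 1, 2, 3, 4, 5)):
    """Map each cost assumption (in basis points) to the ex-cost Sharpe ratio."""
    result = {}
    for bps in costs_bps:
        cost = bps * 1e-4  # 1 bp = 0.01% per unit of turnover
        net = [r - cost * z for r, z in zip(gross_returns, turnover)]
        result[bps] = sharpe(net)
    return result
```

The difference between consecutive entries of the returned dictionary corresponds to the Sharpe ratio reduction attributed to each additional basis point of cost.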

### Turnover Regularization

One simple way to account for transaction costs is to use cost-adjusted returns directly during training, augmenting the strategy returns defined in Equation 1 as follows:

where *c* is a constant reflecting transaction cost assumptions. As such, using these cost-adjusted returns in Sharpe ratio loss functions during training corresponds to optimizing the ex-cost risk-adjusted returns, and the cost term can also be interpreted as a regularization term for turnover.
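A minimal sketch of such a loss is given below. It assumes that the cost-adjusted portfolio return on each day is the cross-sectional average of the volatility-scaled position times the next-day return, less *c* times the absolute change in the volatility-scaled position, all multiplied by the target volatility. The exact form, the 15% target, and the function names are assumptions. In practice the loss would be written in an automatic differentiation framework; plain Python is used here only to make the arithmetic explicit.

```python
import math
from statistics import mean, pstdev


def cost_adjusted_returns(positions, vols, asset_returns, cost, target_vol=0.15):
    """Portfolio returns net of turnover costs, for t = 1 .. T-1.

    positions[t][i], vols[t][i] : signal and ex-ante volatility of asset i on day t
    asset_returns[t][i]         : return of asset i from day t to day t+1
    cost                        : cost per unit of turnover (10 bps = 0.001)
    """
    out = []
    for t in range(1, len(positions)):
        per_asset = []
        for i in range(len(positions[t])):
            scaled_now = positions[t][i] / vols[t][i]
            scaled_prev = positions[t - 1][i] / vols[t - 1][i]
            gross = scaled_now * asset_returns[t][i]
            penalty = cost * abs(scaled_now - scaled_prev)  # turnover regularizer
            per_asset.append(gross - penalty)
        out.append(target_vol * mean(per_asset))
    return out


def negative_sharpe_loss(returns, periods_per_year=252):
    """Negative annualized Sharpe ratio of the supplied returns, to be minimized."""
    return -mean(returns) / pstdev(returns) * math.sqrt(periods_per_year)
```

With `cost=0` the loss reduces to the plain Sharpe ratio objective, so the penalty only changes the solution when positions change from one day to the next.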

Given that the Sharpe-optimized LSTM is still profitable in the presence of small transaction costs, we seek to quantify the effectiveness of turnover regularization when costs are prohibitively high, considering the extreme case in which *c* = 10 bps in our investigation. Tests were focused on the Sharpe-optimized LSTM with and without the turnover regularizer (LSTM + Reg. for the former), including the additional portfolio-level volatility scaling to bring signal volatilities to the same level. Based on the results in Exhibit 9, we can see that the turnover regularization does help improve the LSTM in the presence of large costs, leading to slightly better performance ratios when compared to the reference benchmarks.

## CONCLUSIONS

We introduce DMNs, a hybrid class of deep learning models that retains the volatility scaling framework of time-series momentum strategies while using deep neural networks to output position targeting trading signals. Two approaches to position generation were evaluated here. First, we cast trend estimation as a standard supervised learning problem—using machine learning models to forecast the expected asset returns or probability of a positive return at the next time step—and apply a simple maximum long–short trading rule based on the direction of the next return. Second, trading rules were directly generated as outputs from the model, which we calibrate by maximizing the Sharpe ratio or average strategy return. Testing this on a universe of continuous futures contracts, we demonstrate clear improvements in risk-adjusted performance by calibrating models with the Sharpe ratio, with the LSTM achieving the best results. Furthermore, we note the general underperformance of models that use standard regression and classification methods for trend estimation, hinting at the difficulties faced in selecting an appropriate position sizing function and optimizing models to generate positions without accounting for risk.

Incorporating transaction costs, the Sharpe-optimized LSTM outperforms benchmarks up to 2–3 bps of costs, demonstrating its suitability for trading more liquid assets. To accommodate high-cost settings, we introduce a turnover regularizer to use during training, which was shown to be effective even in extreme scenarios (i.e., *c* = 10 bps).

Future work includes extensions of the framework presented here to incorporate ways to deal better with nonstationarity in the data, such as using the recently introduced recurrent neural filters (Lim, Zohren, and Roberts 2019). Another direction of future work focuses on the study of time-series momentum at the microstructure level.

## ADDITIONAL READING

**A Century of Evidence on Trend-Following Investing**

Brian Hurst, Yao Hua Ooi, and Lasse Heje Pedersen

*The Journal of Portfolio Management*

**https://jpm.pm-research.com/content/44/1/15**

**ABSTRACT:** *In this article, the authors study the performance of trend-following investing across global markets since 1880, extending the existing evidence by more than 100 years using a novel data set. They find that in each decade since 1880, time-series momentum has delivered positive average returns with low correlations to traditional asset classes. Further, time-series momentum has performed well in 8 out of 10 of the largest crisis periods over the century, defined as the largest drawdowns for a 60/40 stock/bond portfolio. Lastly, the authors find that time-series momentum has performed well across different macro environments, including recessions and booms, war and peace, high- and low-interest-rate regimes, and high- and low-inflation periods.*

**The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality**

David H. Bailey and Marcos López de Prado

*The Journal of Portfolio Management*

**https://jpm.pm-research.com/content/40/5/94**

**ABSTRACT:** *With the advent in recent years of large financial data sets, machine learning, and high-performance computing, analysts can back test millions (if not billions) of alternative investment strategies. Backtest optimizers search for combinations of parameters that maximize the simulated historical performance of a strategy, leading to back test overfitting. The problem of performance inflation extends beyond back testing. More generally, researchers and investment managers tend to report only positive outcomes, a phenomenon known as selection bias. Not controlling for the number of trials involved in a particular discovery leads to overly optimistic performance expectations. The deflated Sharpe ratio (DSR) corrects for two leading sources of performance inflation: Selection bias under multiple testing and non-normally distributed returns. In doing so, DSR helps separate legitimate empirical findings from statistical flukes.*

**On Default Correlation: A Copula Function Approach**

David X. Li

*The Journal of Fixed Income*

**https://jfi.pm-research.com/content/9/4/43**

**ABSTRACT:** *This article studies the problem of default correlation. It introduces a random variable called “time-until-default” to denote the survival time of each defaultable entity or financial instrument, and defines the default correlation between two credit risks as the correlation coefficient between their survival times. The author explains why a copula function approach should be used to specify the joint distribution of survival times after marginal distributions of survival times are derived from market information, such as risky bond prices or asset swap spreads. He shows that the current approach to default correlation through asset correlation is equivalent to using a normal copula function. Numerical examples illustrate the use of copula functions in the valuation of some credit derivatives, such as credit default swaps and first-to-default contracts.*

## ACKNOWLEDGMENTS

We would like to thank Anthony Ledford, James Powrie, and Thomas Flury for their interesting comments, as well as the Oxford-Man Institute of Quantitative Finance for financial support.

## APPENDIX

### DATASET DETAILS

From the full 98 ratio-adjusted continuous futures contracts in the Pinnacle Data Corp. CLC Database, we extract the 88 that have less than 10% of their data missing, with a breakdown by asset class described in the following.

To reduce the impact of outliers, we also winsorize the data by capping/flooring it to be within five times its exponentially weighted moving (EWM) standard deviations from its EWM average, computed using a 252-day half-life.
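The winsorization step can be sketched as follows. The sketch assumes that the EWM statistics are computed from the raw series up to and including the current observation, and that the EWM variance is the biased, weighted estimate. Both choices, and the function name, are assumptions; a library implementation such as pandas applies a bias correction by default and will give slightly different bands.

```python
import math


def ewm_winsorize(values, halflife=252.0, n_std=5.0):
    """Clip each value to within n_std EWM standard deviations of its EWM average."""
    decay = 0.5 ** (1.0 / halflife)  # weight halves every `halflife` observations
    weight_sum = value_sum = square_sum = 0.0
    clipped = []
    for x in values:
        # Exponentially weighted running sums of 1, x and x^2 (raw data, not clipped).
        weight_sum = decay * weight_sum + 1.0
        value_sum = decay * value_sum + x
        square_sum = decay * square_sum + x * x
        ewm_mean = value_sum / weight_sum
        # Guard against tiny negative values caused by floating-point cancellation.
        ewm_var = max(square_sum / weight_sum - ewm_mean ** 2, 0.0)
        band = n_std * math.sqrt(ewm_var)
        clipped.append(min(max(x, ewm_mean - band), ewm_mean + band))
    return clipped
```

Because the current observation is included in its own statistics, very little clipping can occur during the first few dozen observations, which acts as an implicit warm-up period.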

### HYPERPARAMETER OPTIMIZATION

Hyperparameter optimization was applied using 50 iterations of random search, with the full search grid documented in Exhibit A1. The models were fully recalibrated every five years using all available data up to that point. For LSTM-based models, time series were subdivided into trajectories of 63 time steps (≈ three months), with the LSTM unrolled across the length of the trajectory during backpropagation.
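The two mechanical steps above can be sketched as follows. The sketch assumes non-overlapping trajectories with any incomplete tail dropped, and a user-supplied `evaluate` function that returns a validation loss for a configuration. Both are assumptions, and the search grid must be supplied by the caller because the values in Exhibit A1 are not reproduced here.

```python
import random


def split_into_trajectories(series, length=63):
    """Non-overlapping windows of `length` steps; an incomplete tail is dropped."""
    n_full = len(series) // length
    return [series[k * length:(k + 1) * length] for k in range(n_full)]


def random_search(grid, evaluate, iterations=50, seed=0):
    """Sample configurations from `grid` and return the one with the lowest loss.

    grid     : dict mapping hyperparameter name -> list of candidate values
    evaluate : hypothetical callable, configuration dict -> validation loss
    """
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(iterations):
        config = {name: rng.choice(options) for name, options in grid.items()}
        loss = evaluate(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss
```

Under the recalibration scheme described above, a search of this kind would be rerun at each five-year boundary on the expanded training window.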

## ENDNOTES


^{1}Pinnacle Data Corp. CLC Database: https://pinnacledata2.com/clc.html.

- © 2019 Pageant Media Ltd