Advances in Financial Machine Learning

Marcos Lopez de Prado

2018

Part I - Data Analysis


Chapter 2 - Financial Data Structures

2.2 Essential Types of Financial Data

  • Fundamental Data: be mindful of release and amendment dates
  • Market Data: difficult to process because of huge volume (and cost)
  • Analytics: the upside is that it provides a ready-made signal extraction from a raw source; the downside is that this extraction might be biased
  • Alternative Data: produced by individuals, business processes, or sensors. Primary and early information source, but also hard-to-process. "Data that is hard to store, manipulate, and operate is always the most promising."

2.3 Bars

  • Time Bars
  • Tick Bars
  • Volume Bars
  • Dollar Bars
  • other: Information-Driven Bars
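
As a rough illustration of the bar-construction logic, here is a minimal dollar-bar sketch in pandas; the column names (`price`, `volume`) and the simple cumulative-sum bucketing rule are assumptions for illustration, not the book's exact implementation:

```python
import pandas as pd

def dollar_bars(trades: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Group ticks into bars, each containing roughly `threshold` dollars traded."""
    dollars = (trades["price"] * trades["volume"]).cumsum()
    bar_id = (dollars // threshold).astype(int)  # new bar every `threshold` dollars
    grouped = trades.groupby(bar_id)
    return pd.DataFrame({
        "open":   grouped["price"].first(),
        "high":   grouped["price"].max(),
        "low":    grouped["price"].min(),
        "close":  grouped["price"].last(),
        "volume": grouped["volume"].sum(),
    })

# Toy tick stream: four trades of ~$1,000 each, bucketed into ~$2,000 bars
ticks = pd.DataFrame({"price": [100.0, 101.0, 99.0, 100.0],
                      "volume": [10, 10, 10, 10]})
bars = dollar_bars(ticks, threshold=2_000.0)
```

Tick bars and volume bars follow the same pattern, replacing the dollar cumulative sum with a tick count or a cumulative volume.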

2.4 Dealing with multi-product series

2.5 Sampling Features


Chapter 3 - Labelling

3.2 The Fixed-Time Horizon method

3.3 Computing dynamic thresholds

3.4 The Triple-Barrier method

3.5 Learning side and size

3.6 Meta-Labeling

3.7 How to use Meta-Labeling

3.8 The Quantamental Way


Chapter 4 - Sample Weights

4.2 Overlapping outcomes

4.3 Number of concurrent labels

4.4 Average uniqueness of a label

4.5 Bagging classifiers and uniqueness

4.6 Return attribution

4.7 Time decay

4.8 Class weights


Chapter 5 - Fractionally Differentiated Features

5.2 The stationary vs. memory dilemma

5.3 Literature review

5.4 The method

5.5 Implementation

5.6 Stationarity with maximum memory preservation



Part II - Modelling


Chapter 6 - Ensemble Methods

6.2 The three sources of errors

  • Bias (underfit)
  • Variance (overfit)
  • Noise (irreducible error)
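
The three terms can be estimated numerically. Below is a minimal Monte Carlo sketch: refit a low-degree polynomial to many noisy samples of a sin target and decompose the error at one test point (the target function, noise level, and degree are all illustrative choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)

def simulate(n_datasets=500, n_points=30, noise_std=0.3, x0=1.0, degree=1):
    """Monte Carlo bias^2 / variance / noise decomposition of a polynomial fit at x0."""
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        x = rng.uniform(0.0, np.pi, n_points)
        y = true_f(x) + rng.normal(0.0, noise_std, n_points)
        coefs = np.polyfit(x, y, degree)        # refit on a fresh noisy sample
        preds[i] = np.polyval(coefs, x0)
    bias2 = (preds.mean() - true_f(x0)) ** 2    # systematic error of the model class
    variance = preds.var()                      # sensitivity to the training sample
    noise = noise_std ** 2                      # irreducible error
    return bias2, variance, noise

bias2, variance, noise = simulate()
print(f"bias^2={bias2:.4f}  variance={variance:.4f}  noise={noise:.4f}")
```

Raising `degree` shrinks the bias term and inflates the variance term, which is the underfit/overfit trade-off the bullets describe.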

6.3 Bootstrap aggregation

6.4 Random forest

6.5 Boosting

6.6 Bagging vs. Boosting in finance

Boosting reduces both variance and bias, but it is more prone to overfitting, whereas bagging helps prevent overfitting.

Bagging is generally preferable to boosting in financial applications, since overfitting is often a greater concern than underfitting. Furthermore, bagging can be parallelized, while boosting generally must run sequentially.
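
The parallelization point can be seen directly in scikit-learn: because each bootstrap fit is independent of the others, `BaggingClassifier` accepts an `n_jobs` argument, whereas each boosting round depends on the previous one. The synthetic dataset below is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each bootstrap fit is independent, so bagging parallelizes trivially across
# cores via n_jobs; boosting cannot, since every learner depends on the
# residuals/weights left by the previous one.
bag = BaggingClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
print(f"in-sample accuracy: {bag.score(X, y):.2f}")
```

The default base learner here is a decision tree; any estimator whose fits are mutually independent would parallelize the same way.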

6.7 Bagging for scalability


Chapter 7 - Cross-Validation in Finance

7.2 The role of cross-validation (CV)

7.3 Why k-fold CV fails in finance

7.4 A solution: purged k-fold CV


Chapter 8 - Feature Importance

Repeating a test over and over on the same data will likely lead to a false discovery.
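
That warning is easy to demonstrate with a toy simulation (all numbers here are illustrative): among enough backtests of pure-noise strategies, the best one will look like a genuine discovery.

```python
import numpy as np

rng = np.random.default_rng(42)

# 1,000 "strategies" that are pure noise: daily returns with zero true mean
n_strategies, n_days = 1_000, 252
returns = rng.normal(0.0, 0.01, size=(n_strategies, n_days))

# Annualized Sharpe ratio of each random strategy
sharpe = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(252)

# The maximum over many trials looks impressive even though every strategy
# has a true Sharpe of zero -- a false discovery by construction
print(f"best Sharpe among {n_strategies} noise strategies: {sharpe.max():.2f}")
```

The more tests run on the same data, the higher the best in-sample statistic drifts, with no out-of-sample meaning at all.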

8.2 The importance of feature importance

"Backtesting is not a research tool. Feature importance is."

8.3 Feature importance with substitution effects

8.4 Feature importance without substitution effects

"ML algorithms with always find a pattern, even if that pattern is a statistical fluke. [...] Your PCA analysis has determined that some feature are more "principal" than others, without any knowledge of the labels (unsupervised learning). [...] When your [feature importance analysis] selects as most important (using label information) the same features that PCA chose as principal (ignoring label information), this constitutes confirmatory evidence that the pattern identified by the ML algorithm is not entirely overfit."

8.5 Parallelized vs. stacked feature importance


Chapter 9 - Hyper-Parameter Tuning with Cross-Validation

9.2 Grid search cross-validation

9.3 Randomized search cross-validation

9.4 Scoring and hyper-parameter tuning



Part III - Backtesting


Chapter 10 - Bet Sizing

10.2 Strategy-independent bet sizing approaches

10.3 Bet sizing from predicted probabilities

10.4 Averaging active bets

10.5 Size discretization

10.6 Dynamic bet sizes and limit prices


Chapter 11 - The Dangers of Backtesting

11.2 Mission impossible: the flawless backtest

2014 - Luo et al. - Seven Sins of Quantitative Investing

  • Survivorship bias
  • Look-ahead bias
  • Storytelling
  • Data mining and data snooping
  • Transaction costs
  • Outliers
  • Shorting

11.3 Even if your backtest is flawless, it is probably wrong

"Professionals may produce flawless backtests, and will still fall for multiple testing, selection bias, or backtest overfitting."

11.4 Backtesting is not a research tool

"Never backtest until your model has been fully specified. If the backtest fails, start all over."

11.5 A few general recommendations

How to reduce overfitting

  • Build models for asset classes or investment universes rather than for specific securities: investors do not make a given mistake on one security alone
  • Apply bagging to prevent overfitting and reduce variance
  • Do not backtest until research is complete
  • Record every backtest conducted to be able to estimate backtest overfitting in the end
  • Simulate several scenarios rather than a single history
  • If the backtest fails, start from scratch again

"Backtesting while researching is like drinking and driving. Do not research under the influence of a backtest."

11.6 Strategy selection


Chapter 12 - Backtesting through Cross-Validation

12.2 The Walk Forward (WF) Method

Two advantages

  • WF has a clear historical interpretation
  • history is a filtration: using trailing data guarantees that the testing set is out-of-sample (no leakage as long as proper purging was implemented)

Three disadvantages

  • a single scenario is tested, which can easily be overfit
  • not necessarily representative of future performance
  • XXXXXXXXXXXXX


Chapter 13 - Backtesting on Synthetic Data

13.2 Trading rules

13.3 The problem

13.4 Our framework

13.5 Numerical Determination of optimal trading rules

13.6 Experimental results

13.7 Conclusion


Chapter 14 - Backtest Statistics

14.2 Types of backtest statistics

14.3 General characteristics

14.4 Performance

14.5 Runs

14.6 Implementation shortfall

14.7 Efficiency

14.8 Classification scores

14.9 Attribution


Chapter 15 - Understanding Strategy Risk

15.2 Symmetric payouts

15.3 Asymmetric payouts

15.4 The probability of strategy failure


Chapter 16 - Machine Learning Asset Allocation

16.2 The problem with convex portfolio optimization

16.3 Markowitz’s curse

16.4 From geometric to hierarchical relationships

16.5 A numerical example

16.6 Out-of-sample Monte Carlo simulations

16.7 Further research

16.8 Conclusion



Part IV - Useful Financial Features


Chapter 17 - Structural Breaks

17.2 Types of structural break tests

17.3 CUSUM tests

17.4 Explosiveness tests


Chapter 18 - Entropy Features

18.2 Shannon’s entropy

18.3 The plug-in (or maximum likelihood) estimator

18.4 Lempel-Ziv estimators

18.5 Encoding schemes

18.6 Entropy of a Gaussian process

18.7 Entropy and the generalized mean

18.8 A few financial applications of entropy


Chapter 19 - Microstructural Features

19.2 Review of the literature

19.3 First generation: price sequences

19.4 Second generation: strategic trade models

19.5 Third generation: sequential trade models

19.6 Additional features from microstructural datasets

19.7 What is microstructural information?



Part V - High-Performance Computing Recipes


Chapter 20 - Multiprocessing and Vectorization

20.2 Vectorization Example

20.3 Single-thread vs. multithreading vs. multiprocessing

20.4 Atoms and molecules

20.5 Multiprocessing engines

20.6 Multiprocessing examples


Chapter 21 - Brute Force and Quantum Computers

21.2 Combinatorial optimization

21.3 The objective function

21.4 The problem

21.5 An integer optimization approach

21.6 A numerical example


Chapter 22 - High-Performance Computational Intelligence and Forecasting Technologies

22.2 Regulatory response to the flash crash of 2010

22.3 Background

22.4 HPC Hardware

22.5 HPC Software

22.6 Use Cases

22.7 Summary and call for participation