Advances in Financial Machine Learning
Marcos López de Prado
2018
Part I - Data Analysis
Chapter 2 - Financial Data Structures
2.2 Essential Types of Financial Data
- Fundamental Data: be mindful of respecting release and amending dates
- Market Data: difficult to process because of huge volume (and cost)
- Analytics: the advantage is that the signal extraction from a raw source has already been done for you (a ready-made signal); the drawback is that the extraction methodology may be biased or opaque
- Alternative Data: produced by individuals, business processes, or sensors. A primary and early information source, but also hard to process. "Data that is hard to store, manipulate, and operate is always the most promising."
2.3 Bars
- Time Bars
- Tick Bars
- Volume Bars
- Dollar Bars (see the sketch after this list)
- other: Information-Driven Bars
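A minimal sketch of dollar-bar sampling, assuming tick data in a pandas DataFrame with 'price' and 'volume' columns (function and column names are illustrative, not from the book):

```python
import pandas as pd

def dollar_bars(ticks: pd.DataFrame, bar_size: float) -> pd.DataFrame:
    """Group ticks into bars of roughly equal traded dollar value.

    ticks:    tick data indexed by timestamp, with 'price' and 'volume' columns
    bar_size: dollar value exchanged per bar (e.g. 1e6)
    """
    dollar_value = (ticks['price'] * ticks['volume']).cumsum()
    bar_id = (dollar_value // bar_size).astype(int)  # bar index for each tick
    grouped = ticks.groupby(bar_id)
    return pd.DataFrame({
        'open':   grouped['price'].first(),
        'high':   grouped['price'].max(),
        'low':    grouped['price'].min(),
        'close':  grouped['price'].last(),
        'volume': grouped['volume'].sum(),
    })
```

Tick bars and volume bars follow the same pattern, with the cumulative counter being the number of ticks or the traded volume instead of the dollar value exchanged.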
2.4 Dealing with multi-product series
2.5 Sampling Features
Chapter 3 - Labelling
3.2 The Fixed-Time Horizon method
3.3 Computing dynamic thresholds
3.4 The Triple-Barrier method
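A deliberately simplified per-event sketch of the idea (the book's implementation is vectorized, supports a position side, and returns the touch times; names here are illustrative):

```python
import pandas as pd

def triple_barrier_label(close: pd.Series, t0, pt: float, sl: float,
                         max_hold: int) -> int:
    """Label one event by the first barrier touched.

    close:    price series indexed by timestamp (sorted)
    t0:       event start time (must be in close.index)
    pt, sl:   profit-take and stop-loss widths, as returns (e.g. 0.02)
    max_hold: vertical barrier, in number of bars
    Returns +1 (upper barrier), -1 (lower barrier), 0 (vertical barrier).
    """
    path = close.loc[t0:].iloc[:max_hold + 1]
    ret = path / close.loc[t0] - 1.0                  # path returns from entry
    hit_up = ret[ret >= pt].index.min()               # first profit-take touch
    hit_dn = ret[ret <= -sl].index.min()              # first stop-loss touch
    first = min([t for t in (hit_up, hit_dn) if pd.notna(t)], default=pd.NaT)
    if pd.isna(first):
        return 0                                      # vertical barrier first
    return 1 if first == hit_up else -1
```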
3.5 Learning side and size
3.6 Meta-Labeling
3.7 How to use Meta-Labeling
3.8 The Quantamental Way
Chapter 4 - Sample Weights
4.2 Overlapping outcomes
4.3 Number of concurrent labels
4.4 Average uniqueness of a label
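A compact sketch of the concurrency and average-uniqueness computation (the book's version is split into chunks for multiprocessing; names here are illustrative):

```python
import pandas as pd

def average_uniqueness(label_end: pd.Series, bars: pd.DatetimeIndex) -> pd.Series:
    """Average uniqueness of each label from overlapping label lifespans.

    label_end: Series mapping each label's start time (index) to its end time
    bars:      the bar timestamps over which concurrency is counted
    """
    # 1) number of concurrent labels at each bar
    concurrency = pd.Series(0, index=bars)
    for start, end in label_end.items():
        concurrency.loc[start:end] += 1
    # 2) a label's uniqueness at time t is 1 / concurrency_t;
    #    average it over the label's lifespan
    avg_u = pd.Series(index=label_end.index, dtype=float)
    for start, end in label_end.items():
        avg_u.loc[start] = (1.0 / concurrency.loc[start:end]).mean()
    return avg_u
```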
4.5 Bagging classifiers and uniqueness
4.6 Return attribution
4.7 Time decay
4.8 Class weights
Chapter 5 - Fractionally Differentiated Features
5.2 The stationary vs. memory dilemma
5.3 Literature review
5.4 The method
5.5 Implementation
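The weights of the fractional difference operator (1 - B)^d follow a binomial series; a short sketch of the recursion:

```python
import numpy as np

def frac_diff_weights(d: float, size: int) -> np.ndarray:
    """Weights of the fractional difference operator (1 - B)^d,
    from the binomial series: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

# d = 1 recovers ordinary differencing (weights 1, -1, 0, 0, ...);
# d = 0.5 lets the weights decay slowly, preserving long memory
print(frac_diff_weights(0.5, 6))  # [1. -0.5 -0.125 -0.0625 ...]
```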
5.6 Stationarity with maximum memory preservation
Part II - Modelling
Chapter 6 - Ensemble Methods
6.2 The three sources of errors
- Bias (underfit)
- Variance (overfit)
- Noise (irreducible error)
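The standard decomposition behind these three terms, for squared-error loss with y = f(x) + ε, E[ε] = 0, Var[ε] = σ²:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```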
6.3 Bootstrap aggregation
6.4 Random forest
6.5 Boosting
6.6 Bagging vs. Boosting in finance
Boosting reduces both bias and variance, but it is more prone to overfitting, whereas bagging mainly reduces variance and thus helps prevent overfitting.
Bagging is generally preferable to boosting in financial applications, since overfitting is usually a greater concern than underfitting. Furthermore, bagging can be parallelized, while boosting generally must run sequentially.
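A minimal scikit-learn illustration of the parallelization point (parameters are illustrative, not from the book):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: independent trees fit on bootstrap samples; n_jobs=-1
# parallelizes the fits across cores because they do not depend on each other
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                        n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)

# Boosting: each tree is fit on a sample reweighted toward the previous
# trees' errors, so the fits are inherently sequential
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                           n_estimators=100, random_state=0).fit(X, y)
```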
6.7 Bagging for scalability
Chapter 7 - Cross-Validation in Finance
7.2 The role of cross-validation (CV)
7.3 Why k-fold CV fails in finance
7.4 A solution: purged k-fold CV
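A condensed sketch of the purging logic (the book's PurgedKFold class additionally applies an embargo after each test set; names here are illustrative):

```python
import numpy as np
import pandas as pd

def purged_kfold_indices(label_end: pd.Series, n_splits: int = 5):
    """Yield (train, test) positional indices, purging from the training
    set any observation whose label interval overlaps the test window.

    label_end: Series mapping each observation's start time (the index,
               assumed sorted) to the time its label is determined.
    """
    starts = label_end.index
    n = len(label_end)
    for test in np.array_split(np.arange(n), n_splits):
        t0, t1 = starts[test[0]], label_end.iloc[test].max()  # test window
        test_set = set(test.tolist())
        train = np.array([
            i for i in range(n)
            if i not in test_set
            # keep only observations whose label interval [start, end]
            # lies entirely before or entirely after the test window
            and (label_end.iloc[i] < t0 or starts[i] > t1)
        ])
        yield train, test
```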
Chapter 8 - Feature Importance
Repeating a test over and over on the same data will likely lead to a false discovery.
8.2 The importance of feature importance
"Backtesting is not a research tool. Feature importance is."
8.3 Feature importance with substitution effects
8.4 Feature importance without substitution effects
"ML algorithms with always find a pattern, even if that pattern is a statistical fluke. [...] Your PCA analysis has determined that some feature are more "principal" than others, without any knowledge of the labels (unsupervised learning). [...] When your [feature importance analysis] selects as most important (using label information) the same features that PCA chose as principal (ignoring label information), this constitutes confirmatory evidence that the pattern identified by the ML algorithm is not entirely overfit."
8.5 Parallelized vs. stacked feature importance
Chapter 9 - Hyper-Parameter Tuning with Cross-Validation
9.2 Grid search cross-validation
9.3 Randomized search cross-validation
9.4 Scoring and hyper-parameter tuning
Part III - Backtesting
Chapter 10 - Bet Sizing
10.2 Strategy-independent bet sizing approaches
10.3 Bet sizing from predicted probabilities
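For binary labels, the chapter turns the predicted probability into a bet size through a z-statistic against the null p = 1/2; a sketch of that mapping:

```python
from scipy.stats import norm

def bet_size(p: float) -> float:
    """Map a predicted probability p = P[label = 1] (0 < p < 1)
    to a bet size in [-1, 1]."""
    z = (p - 0.5) / (p * (1 - p)) ** 0.5  # z-statistic against the null p = 1/2
    return 2 * norm.cdf(z) - 1            # p = 0.5 -> no bet; p -> 1 -> full size

print(bet_size(0.55), bet_size(0.80))  # a timid bet vs. a large one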
10.4 Averaging active bets
10.5 Size discretization
10.6 Dynamic bet sizes and limit prices
Chapter 11 - The Dangers of Backtesting
11.2 Mission impossible: the flawless backtest
2014 - Luo et al. - Seven Sins of Quantitative Investing
- Survivorship bias
- Look-ahead bias
- Storytelling
- Data mining and data snooping
- Transaction costs
- Outliers
- Shorting
11.3 Even if your backtest is flawless, it is probably wrong
"Professionals may produce flawless backtests, and will still fall for multiple testing, selection bias, or backtest overfitting."
11.4 Backtesting is not a research tool
"Never backtest until your model has been fully specified. If the backtest fails, start all over."
11.5 A few general recommendations
How to reduce overfitting
- Develop models for entire asset classes or investment universes rather than for specific securities: investors diversify, so they do not make a given mistake on one particular security only
- Apply bagging to prevent overfitting and reduce variance
- Do not backtest until research is complete
- Record every backtest conducted, so that the probability of backtest overfitting can be estimated at the end
- Simulate several scenarios rather than a single history
- If the backtest fails, start from scratch again
"Backtesting while researching is like drinking and driving. Do not research under the influence of a backtest."
11.6 Strategy selection
Chapter 12 - Backtesting through Cross-Validation
12.2 The Walk Forward (WF) Method
Two advantages
- WF has a clear historical interpretation
- history is a filtration: using trailing data guarantees that the testing set is out-of-sample (no leakage as long as proper purging was implemented)
Three disadvantages
- a single scenario (the one historical path) is tested, which can easily be overfit
- not necessarily representative of future performance, as results depend on the particular sequence of datapoints
- the initial decisions are made on a smaller portion of the total sample, so early forecasts rest on little information
Chapter 13 - Backtesting on Synthetic Data
13.2 Trading rules
13.3 The problem
13.4 Our framework
13.5 Numerical determination of optimal trading rules
13.6 Experimental results
13.7 Conclusion
Chapter 14 - Backtest Statistics
14.2 Types of backtest statistics
14.3 General characteristics
14.4 Performance
14.5 Runs
14.6 Implementation shortfall
14.7 Efficiency
14.8 Classification scores
14.9 Attribution
Chapter 15 - Understanding Strategy Risk
15.2 Symmetric payouts
15.3 Asymmetric payouts
15.4 The probability of strategy failure
Chapter 16 - Machine Learning Asset Allocation
16.2 The problem with convex portfolio optimization
16.3 Markowitz’s curse
16.4 From geometric to hierarchical relationships
16.5 A numerical example
16.6 Out-of-sample Monte Carlo simulations
16.7 Further research
16.8 Conclusion
Part IV - Useful Financial Features
Chapter 17 - Structural Breaks
17.2 Types of structural break tests
17.3 CUSUM tests
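A generic illustration of the CUSUM idea: accumulate deviations and flag whenever a running sum crosses a threshold (the chapter's tests standardize the statistic and supply critical values; this sketch is the simple symmetric filter):

```python
def cusum_events(diffs, h):
    """Symmetric CUSUM recursion: accumulate positive and negative drifts
    and flag an event whenever either running sum crosses the threshold h.

    diffs: iterable of increments (e.g. demeaned returns)
    h:     threshold for flagging a break/event
    """
    events, s_pos, s_neg = [], 0.0, 0.0
    for i, dy in enumerate(diffs):
        s_pos = max(0.0, s_pos + dy)
        s_neg = min(0.0, s_neg + dy)
        if s_pos > h:
            events.append(i)
            s_pos = 0.0
        elif s_neg < -h:
            events.append(i)
            s_neg = 0.0
    return events
```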
17.4 Explosiveness tests
Chapter 18 - Entropy Features
18.2 Shannon’s entropy
18.3 The plug-in (or maximum likelihood) estimator
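A sketch of the plug-in estimator: empirical word frequencies plugged into Shannon's formula, normalized to bits per symbol (function name illustrative):

```python
from collections import Counter
import math

def plugin_entropy(msg: str, w: int = 1) -> float:
    """Plug-in (maximum likelihood) entropy estimate in bits per symbol,
    from the empirical frequencies of words of length w."""
    words = [msg[i:i + w] for i in range(len(msg) - w + 1)]
    n = len(words)
    h = -sum((c / n) * math.log2(c / n) for c in Counter(words).values())
    return h / w  # normalize from bits per word to bits per symbol

# On a binary-encoded return series, values near 1 bit/symbol suggest a
# near-unpredictable (informationally efficient) sequence
print(plugin_entropy("10100111", w=2))
```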
18.4 Lempel-Ziv estimators
18.5 Encoding schemes
18.6 Entropy of a Gaussian process
18.7 Entropy and the generalized mean
18.8 A few financial applications of entropy
Chapter 19 - Microstructural Features
19.2 Review of the literature
19.3 First generation: price sequences
19.4 Second generation: strategic trade models
19.5 Third generation: sequential trade models
19.6 Additional features from microstructural datasets
19.7 What is microstructural information?
Part V - High-Performance Computing Recipes
Chapter 20 - Multiprocessing and Vectorization
20.2 Vectorization Example
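A toy before/after of the loop-to-array rewrite (illustrative data):

```python
import numpy as np

prices = np.random.default_rng(0).lognormal(0, 0.01, 100_000).cumprod()

# Loop version: one Python-level iteration per element
rets_loop = [prices[i] / prices[i - 1] - 1 for i in range(1, len(prices))]

# Vectorized version: a single array operation, orders of magnitude faster
rets_vec = prices[1:] / prices[:-1] - 1

assert np.allclose(rets_loop, rets_vec)
```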
20.3 Single-thread vs. multithreading vs. multiprocessing
20.4 Atoms and molecules
20.5 Multiprocessing engines
20.6 Multiprocessing examples
Chapter 21 - Brute Force and Quantum Computers
21.2 Combinatorial optimization
21.3 The objective function
21.4 The problem
21.5 An integer optimization approach
21.6 A numerical example
Chapter 22 - High-Performance Computational Intelligence and Forecasting Technologies
22.2 Regulatory response to the flash crash of 2010
22.3 Background
22.4 HPC Hardware
22.5 HPC Software
22.6 Use Cases
22.7 Summary and call for participation