ml-finance-python
python scripts for finance machine learning
git clone https://9o.is/git/ml-finance-python.git
# Chapter 06: Machine Learning

In this introductory chapter, we will start to illustrate how you can use a broad range of supervised and unsupervised machine learning (ML) models for algorithmic trading.

We cover aspects that apply across model categories so that we can focus on model-specific usage in the following chapters. These aspects include the goal of learning a functional relationship from data by optimizing an objective or loss function. They also include the closely related methods of measuring model performance.

We distinguish between unsupervised and supervised learning, and supervised regression and classification problems. We also outline use cases for algorithmic trading.

## Learning from Data

## The Machine Learning Workflow

Developing an ML solution requires a systematic approach to maximize the chances of success while proceeding efficiently. It is also important to make the process transparent and replicable to facilitate collaboration, maintenance, and subsequent refinements.

The process is iterative throughout, and the effort at different stages will vary according to the project. Nonetheless, this process should generally include the following steps:

1. Frame the problem, identify a target metric, and define success
2. Source, clean, and validate the data
3. Understand your data and generate informative features
4. Pick one or more machine learning algorithms suitable for your data
5. Train, test, and tune your models
6. Use your model to solve the original problem

### Basic Walkthrough: K-nearest neighbors

The notebook [machine_learning_workflow](01_machine_learning_workflow.ipynb) contains several examples that illustrate the machine learning workflow using a simple dataset of house prices.

- sklearn [Documentation](http://scikit-learn.org/stable/documentation.html)

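As a rough illustration of the workflow steps, the following sketch fits a K-nearest neighbors regressor with scikit-learn. It does not reproduce the notebook: the data is synthetic, and the feature names (`sqft`, `bedrooms`) and the price formula are assumptions made up for this example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a house-price dataset (hypothetical features and formula).
rng = np.random.default_rng(42)
n_obs = 500
sqft = rng.uniform(500, 3500, n_obs)
bedrooms = rng.integers(1, 6, n_obs)
price = 50_000 + 150 * sqft + 10_000 * bedrooms + rng.normal(0, 20_000, n_obs)

X = np.column_stack([sqft, bedrooms])
y = price

# Hold out a test set before fitting anything.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# KNN is distance-based, so scale the features (fit the scaler on train only).
scaler = StandardScaler().fit(X_train)
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)

# Target metric: RMSE, compared against a naive predict-the-mean baseline.
y_pred = knn.predict(scaler.transform(X_test))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print(f"KNN RMSE: {rmse:,.0f} | mean-baseline RMSE: {baseline_rmse:,.0f}")
```

Comparing against a naive baseline is one simple way to "define success" before tuning anything.
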
### Frame the problem: goals & metrics

The starting point for any machine learning exercise is the ultimate use case it aims to address. Sometimes, this goal will be statistical inference in order to identify an association between variables or even a causal relationship. Most frequently, however, the goal will be the direct prediction of an outcome to yield a trading signal.

### Collect & prepare the data

We addressed the sourcing of market and fundamental data in [Chapter 2](../02_market_and_fundamental_data), and for alternative data in [Chapter 3](../03_alternative_data). We will continue to work with various examples of these sources as we illustrate the application of the various models in later chapters.

### How to explore, extract and engineer features

Understanding the distribution of individual variables and the relationships among outcomes and features is the basis for picking a suitable algorithm. This typically starts with visualizations such as scatter plots, as illustrated in the companion notebook, but also includes numerical evaluations ranging from linear metrics, such as the (Pearson) correlation coefficient, to nonlinear statistics, such as the Spearman rank correlation coefficient that we encountered when we introduced the information coefficient. It also includes information-theoretic measures, such as mutual information.

#### Code Example: Mutual Information

The notebook [mutual_information](02_mutual_information.ipynb) applies information theory to the financial data we created in the notebook [feature_engineering](../04_alpha_factor_research/00_data/feature_engineering.ipynb), in the chapter [Alpha Factors – Research and Evaluation](../04_alpha_factor_research).

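The difference between these measures can be sketched on synthetic data (not the financial data used in the notebook). The example assumes three made-up features: one linearly related to the target, one related through a symmetric quadratic term, and one that is pure noise. It compares the Pearson correlation, the Spearman rank correlation, and scikit-learn's `mutual_info_regression` estimate.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n_obs = 2_000

# Three hypothetical features: linear signal, symmetric nonlinear signal, noise.
x_linear = rng.normal(size=n_obs)
x_nonlinear = rng.uniform(-3, 3, n_obs)
x_noise = rng.normal(size=n_obs)
y = x_linear + x_nonlinear ** 2 + rng.normal(scale=0.5, size=n_obs)

X = np.column_stack([x_linear, x_nonlinear, x_noise])
names = ["linear", "nonlinear", "noise"]

# Nearest-neighbor based estimate of mutual information (in nats).
mi = mutual_info_regression(X, y, random_state=0)

results = {}
for i, name in enumerate(names):
    pearson, _ = pearsonr(X[:, i], y)
    spearman, _ = spearmanr(X[:, i], y)
    results[name] = {"pearson": pearson, "spearman": spearman, "mi": mi[i]}
    print(f"{name:>9}: pearson={pearson:+.2f} spearman={spearman:+.2f} mi={mi[i]:.2f}")
```

Because the quadratic relationship is symmetric around zero, both correlation coefficients for the `nonlinear` feature should be close to zero, while its mutual information estimate should clearly exceed that of the noise feature.
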
### Select an ML algorithm

The remainder of this book will introduce several model families, ranging from linear models, which make fairly strong assumptions about the nature of the functional relationship between input and output variables, to deep neural networks, which make very few assumptions.

### Design and tune the model

The ML process includes steps to diagnose and manage model complexity based on estimates of the model's generalization error. An unbiased estimate requires a statistically sound and efficient procedure, as well as error metrics that align with the output variable type, which also determines whether we are dealing with a regression, classification, or ranking problem.

#### Bias-Variance Trade-Off

The errors that an ML model makes when predicting outcomes for new input data can be broken down into reducible and irreducible parts. The irreducible part is due to random variation (noise) in the data that is not measured, such as relevant but missing variables or natural variation.

The notebook [bias_variance](03_bias_variance.ipynb) demonstrates overfitting by approximating a cosine function using increasingly complex polynomials and measuring the in-sample error. It draws 10 random samples of n = 30 observations each, with some added noise, to learn a polynomial of varying complexity. Each time, the model predicts new data points and we capture the mean-squared error for these predictions, as well as the standard deviation of these errors.

It goes on to illustrate the impact of overfitting versus underfitting by trying to learn a ninth-degree Taylor series approximation of the cosine function with some added noise. To do so, it draws random samples of the true function and fits polynomials that underfit, overfit, and provide an approximately correct degree of flexibility.

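The idea can be sketched with NumPy alone. This is not the notebook's code: the target function `cos(1.5 * pi * x)`, the noise level, and the polynomial degrees (1, 5, 15) are assumptions chosen for illustration. Only the 10 samples of 30 observations follow the description above.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_fn(x):
    """Assumed target function for this sketch."""
    return np.cos(1.5 * np.pi * x)


rng = np.random.default_rng(1)
n_samples, n_obs, noise = 10, 30, 0.1
degrees = [1, 5, 15]
x_new = np.linspace(0, 1, 200)  # new data points for out-of-sample error

train_mse = {d: [] for d in degrees}
test_mse = {d: [] for d in degrees}
for _ in range(n_samples):
    # Draw one noisy sample of the true function.
    x = np.sort(rng.uniform(0, 1, n_obs))
    y = true_fn(x) + rng.normal(scale=noise, size=n_obs)
    for d in degrees:
        # Least-squares polynomial fit on a rescaled domain for stability.
        poly = Polynomial.fit(x, y, deg=d)
        train_mse[d].append(np.mean((poly(x) - y) ** 2))
        test_mse[d].append(np.mean((poly(x_new) - true_fn(x_new)) ** 2))

for d in degrees:
    print(
        f"degree {d:>2}: "
        f"train MSE {np.mean(train_mse[d]):.4f} | "
        f"test MSE {np.mean(test_mse[d]):.4f} (std {np.std(test_mse[d]):.4f})"
    )
```

In-sample error can only fall as the degree increases, since each polynomial family contains the previous one. Out-of-sample error typically falls from degree 1 to degree 5 and then becomes larger and more variable for the most flexible fit.
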
### How to use cross-validation for model selection

When several candidate models (that is, algorithms) are available for your use case, the act of choosing one of them is called the model selection problem. Model selection aims to identify the model that will produce the lowest prediction error given new data.

#### How to implement cross-validation in Python

The script [cross_validation](04_cross_validation.py) illustrates various options for splitting data into training and test sets by showing how the indices of a mock dataset with ten observations are assigned to the train and test set.

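A minimal sketch along the same lines (not the script itself) prints the train and test indices that a few scikit-learn splitters assign for ten mock observations. The choice of splitters and fold counts here is an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

data = np.arange(10)  # mock dataset with ten observations

splitters = {
    "KFold (5 folds, ordered)": KFold(n_splits=5),
    "KFold (5 folds, shuffled)": KFold(n_splits=5, shuffle=True, random_state=42),
    "TimeSeriesSplit (4 splits)": TimeSeriesSplit(n_splits=4),
}

assignments = {}
for name, cv in splitters.items():
    print(name)
    assignments[name] = []
    for train_idx, test_idx in cv.split(data):
        assignments[name].append((train_idx.tolist(), test_idx.tolist()))
        print(f"  train: {train_idx.tolist()} | test: {test_idx.tolist()}")
```

`KFold` uses every observation exactly once for testing, whereas `TimeSeriesSplit` only ever tests on observations that come after the training window.
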
### Parameter tuning with scikit-learn

Model selection typically involves repeated cross-validation of the out-of-sample performance of models using different algorithms (such as linear regression and random forest) or different configurations. Different configurations may involve changes to hyperparameters or the inclusion or exclusion of different variables.

#### Learning and Validation curves with Yellowbrick

The notebook [machine_learning_workflow](01_machine_learning_workflow.ipynb) demonstrates the use of learning and validation curves and illustrates various model selection techniques.

- Yellowbrick: Machine Learning Visualization [docs](http://www.scikit-yb.org/en/latest/)

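The numbers behind a validation curve can also be computed without plotting. The sketch below uses scikit-learn's `validation_curve` function rather than Yellowbrick, on a synthetic regression problem. The dataset, the KNN estimator, and the range of `n_neighbors` values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for a real feature matrix and target.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

k_values = np.array([1, 3, 5, 10, 25, 50])
train_scores, val_scores = validation_curve(
    KNeighborsRegressor(),
    X,
    y,
    param_name="n_neighbors",
    param_range=k_values,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores have shape (n_param_values, n_folds); flip the sign to get MSE.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
for k, tr, va in zip(k_values, train_mse, val_mse):
    print(f"k={k:>2}: train MSE {tr:10.1f} | validation MSE {va:10.1f}")
```

With `k=1` the model memorizes the training data, so the training error is zero while the validation error is not. This gap is what a validation curve makes visible.
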
#### Parameter tuning using GridSearchCV and pipeline

Since hyperparameter tuning is a key ingredient of the machine learning workflow, there are tools to automate this process. The sklearn library includes a GridSearchCV interface that cross-validates all combinations of parameters (it can run them in parallel), captures the results, and automatically retrains the model on the full dataset using the parameter setting that performed best during cross-validation.

In practice, the training and validation sets often require some processing prior to cross-validation. Scikit-learn offers the Pipeline to also automate any requisite feature-processing steps in the automated hyperparameter tuning facilitated by GridSearchCV.

The notebook [machine_learning_workflow](01_machine_learning_workflow.ipynb) shows these tools in action.

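A minimal sketch of combining the two, assuming a synthetic regression dataset and a small, arbitrary parameter grid for a KNN regressor (the notebook's own setup may differ):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is refit inside every CV fold, so no test-fold statistics leak in.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "knn__n_neighbors": [3, 5, 10, 20],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=1,  # set to -1 to evaluate parameter combinations in parallel
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best CV MSE:", -search.best_score_)

# best_estimator_ was refit on all of X_train with the best parameters.
test_r2 = search.best_estimator_.score(X_test, y_test)
print(f"held-out R^2: {test_r2:.2f}")
```

Putting the scaler inside the pipeline, rather than scaling once up front, keeps each validation fold unseen during preprocessing.
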
### Challenges with cross-validation in finance

A key assumption for the cross-validation methods discussed so far is that the samples available for training are independent and identically distributed (iid). For financial data, this is often not the case: serial correlation and time-varying standard deviation, also known as heteroskedasticity, mean that financial data is neither independently nor identically distributed.

#### Purging, embargoing, and combinatorial CV

For financial data, labels are often derived from overlapping data points as returns are computed from prices in multiple periods. In the context of trading strategies, the results of a model's prediction, which may imply taking a position in an asset, may only be known later, when this decision is evaluated (for example, when a position is closed out).

The resulting risks include the leaking of information from the test into the training set, which is likely to artificially inflate performance. This needs to be addressed by ensuring that all data is point-in-time, that is, truly available and known at the time it is used as the input for a model. Marcos Lopez de Prado has proposed several methods in [Advances in Financial Machine Learning](https://www.amazon.com/Advances-Financial-Machine-Learning-Marcos/dp/1119482089) to address these challenges of financial data for cross-validation:

- Purging: Eliminate training data points where the evaluation occurs after the prediction of a point-in-time data point in the validation set to avoid look-ahead bias.
- Embargoing: Further eliminate training samples that follow a test period.
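
The two ideas can be sketched on integer bar indices. This is a simplified illustration, not the implementation from the book: the helper `purged_embargoed_train_idx` and its parameters are hypothetical, and it assumes that the label of sample `i` depends on bars `i` through `i + horizon` and that the test set is one contiguous block.

```python
import numpy as np


def purged_embargoed_train_idx(n_obs, test_start, test_end, horizon, embargo):
    """Training indices for a contiguous test block [test_start, test_end).

    Assumes (for this sketch) that the label of sample i uses bars i..i+horizon.
    """
    idx = np.arange(n_obs)
    label_end = idx + horizon
    # Last bar that any test label depends on.
    test_label_end = test_end - 1 + horizon

    # Purge: drop samples whose label window [i, i + horizon] overlaps the
    # span covered by the test labels (this includes the test samples).
    purged = (idx <= test_label_end) & (label_end >= test_start)
    # Embargo: also drop the next `embargo` samples after that span.
    embargoed = (idx > test_label_end) & (idx <= test_label_end + embargo)
    return idx[~(purged | embargoed)]


train_idx = purged_embargoed_train_idx(
    n_obs=20, test_start=8, test_end=12, horizon=2, embargo=2
)
print(train_idx.tolist())
# [0, 1, 2, 3, 4, 5, 16, 17, 18, 19]
```

Here samples 6 and 7 are purged because their labels reach into the test period, samples 12 and 13 because they overlap the test labels, and samples 14 and 15 are embargoed.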