# Chapter 06: Linear Models for Regression & Classification

The family of linear models represents one of the most useful hypothesis classes. Many learning algorithms that are widely applied in algorithmic trading rely on linear predictors because they can be efficiently trained in many cases, they are relatively robust to noisy financial data, and they have strong links to the theory of finance. Linear predictors are also intuitive, easy to interpret, and often fit the data reasonably well or at least provide a good baseline.

Linear regression has been known for over 200 years, since Legendre and Gauss applied it to astronomy and began to analyze its statistical properties. Numerous extensions have since adapted the linear regression model and the baseline ordinary least squares (OLS) method used to learn its parameters:

- **Generalized linear models** (GLMs) expand the scope of applications by allowing for response variables that imply an error distribution other than the normal distribution. GLMs include the probit and logistic models for categorical response variables that appear in classification problems.
- More **robust estimation methods** enable statistical inference where the data violate baseline assumptions due to, for example, correlation over time or across observations. This is often the case with panel data that contain repeated observations on the same units, such as historical returns on a universe of assets.
- **Shrinkage methods** aim to improve the predictive performance of linear models. They use a complexity penalty that biases the coefficients learned by the model with the goal of reducing the model's variance and improving out-of-sample predictive performance.

In practice, linear models are applied to regression and classification problems with the goals of inference and prediction. Numerous asset pricing models that have been developed by academic and industry researchers leverage linear regression. Applications include the identification of significant factors that drive asset returns, for example, as a basis for risk management, as well as the prediction of returns over various time horizons. Classification problems, on the other hand, include directional price forecasts.

In this chapter, we will cover the following topics:
- How linear regression works and which assumptions it makes
- How to train and diagnose linear regression models
- How to use linear regression to predict future returns
- How to use regularization to improve the predictive performance
- How logistic regression works
- How to convert a regression into a classification problem
- How to design a trading algorithm based on price predictions generated by an ML model

## Linear regression for inference and prediction

This section introduces the baseline cross-section and panel techniques for linear models and important enhancements that produce accurate estimates when key assumptions are violated. It then illustrates these methods by estimating factor models, which are ubiquitous in the development of algorithmic trading strategies. Lastly, it focuses on regularization methods.

- [Introductory Econometrics](http://economics.ut.ac.ir/documents/3030266/14100645/Jeffrey_M._Wooldridge_Introductory_Econometrics_A_Modern_Approach__2012.pdf), Wooldridge, 2012

## The multiple linear regression model

This section introduces the model's specification and objective function, methods to learn its parameters, the statistical assumptions that allow for inference and the diagnostics for these assumptions, as well as extensions that adapt the model to situations where these assumptions fail. Content includes:

- How to formulate the model
- How to train the model
- The Gauss-Markov Theorem
- How to conduct statistical inference
- How to diagnose and remedy problems
- How to run linear regression in practice

The notebook [linear_regression_intro](01_linear_regression_intro.ipynb) demonstrates the simple and multiple linear regression model, the latter using both OLS and gradient descent based on `statsmodels` and `scikit-learn`.

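As a rough illustration of the two training approaches mentioned above, the following sketch fits the same multiple regression once by least squares and once by batch gradient descent. It uses only NumPy rather than the `statsmodels` or `scikit-learn` code in the notebook, and the synthetic data, coefficient values, and learning rate are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with known coefficients (illustrative values only)
n_obs = 500
X = rng.normal(size=(n_obs, 2))
true_beta = np.array([1.0, 2.0, -0.5])            # intercept, slope 1, slope 2
X_design = np.column_stack([np.ones(n_obs), X])   # constant column for the intercept
y = X_design @ true_beta + rng.normal(scale=0.1, size=n_obs)

# OLS: minimize the residual sum of squares directly
beta_ols, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Batch gradient descent on the mean squared error
beta_gd = np.zeros(3)
learning_rate = 0.1
for _ in range(2000):
    residuals = X_design @ beta_gd - y
    gradient = 2 * X_design.T @ residuals / n_obs
    beta_gd -= learning_rate * gradient

print("OLS:             ", beta_ols.round(3))
print("Gradient descent:", beta_gd.round(3))
```

Both approaches should arrive at essentially the same coefficients here because the squared-error objective is convex; gradient descent mainly becomes attractive when the data are too large for a direct solution.
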
## How to build a factor model using linear regression

Algorithmic trading strategies use linear factor models to quantify the relationship between the return of an asset and the sources of risk that represent the main drivers of these returns. Each factor risk carries a premium, and the total asset return can be expected to correspond to a weighted average of these risk premia.

### From the CAPM to the Fama-French five-factor model

Risk factors have been a key ingredient of quantitative models since the Capital Asset Pricing Model (CAPM) explained the expected returns of all assets using their respective exposure to a single factor, the expected excess return of the overall market over the risk-free rate.

This differs from classic fundamental analysis à la Graham and Dodd, where returns depend on firm characteristics. The rationale for the CAPM is that, in the aggregate, investors cannot eliminate this so-called systematic risk through diversification. Hence, in equilibrium, they require compensation for holding an asset commensurate with its systematic risk. The model implies that, given efficient markets where prices immediately reflect all public information, there should be no superior risk-adjusted returns.

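The single-factor exposure described above is commonly estimated with a time-series regression of an asset's excess returns on the market's excess returns. The following is a minimal sketch on simulated returns; the return distributions and the values of `true_alpha` and `true_beta` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated excess returns over the risk-free rate (hypothetical parameters)
n_periods = 1000
true_alpha, true_beta = 0.0, 1.2
market_excess = rng.normal(loc=0.005, scale=0.04, size=n_periods)
idiosyncratic = rng.normal(scale=0.02, size=n_periods)
asset_excess = true_alpha + true_beta * market_excess + idiosyncratic

# Time-series regression: asset excess return on a constant and the market excess return
X = np.column_stack([np.ones(n_periods), market_excess])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, asset_excess, rcond=None)

# The slope equals the covariance with the market divided by the market variance
beta_cov = np.cov(asset_excess, market_excess)[0, 1] / np.var(market_excess, ddof=1)

print(f"alpha: {alpha_hat:.4f}, beta: {beta_hat:.3f}, cov/var beta: {beta_cov:.3f}")
```

Under the CAPM, the intercept (alpha) should be indistinguishable from zero, which is what the absence of superior risk-adjusted returns amounts to in this regression.
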
### Obtaining the risk factors

The [Fama-French risk factors](http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) are computed as the return difference on diversified portfolios with high or low values according to metrics that reflect a given risk factor. These returns are obtained by sorting stocks according to these metrics and then going long stocks above a certain percentile while shorting stocks below a certain percentile. The metrics associated with the risk factors are defined as follows:

- Size: Market Equity (ME)
- Value: Book Value of Equity (BE) divided by ME
- Operating Profitability (OP): (Revenue minus cost of goods sold) divided by assets
- Investment: Investment divided by assets

Fama and French make updated risk factor and research portfolio data available through their [website](http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html), and you can use the [pandas_datareader](https://pandas-datareader.readthedocs.io/en/latest/) library to obtain the data.

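The long-short construction described in this section can be sketched as follows. This is a simplified, equal-weighted single sort on simulated data, not the procedure behind the published factors; the 30th and 70th percentile cutoffs, the metric, and the returns are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n_stocks = 200
# Hypothetical sorting metric, e.g. book-to-market for a value factor
metric = rng.lognormal(size=n_stocks)
# Hypothetical one-period returns that loosely increase with the metric's rank
ranks = metric.argsort().argsort() / (n_stocks - 1)
returns = 0.01 * (ranks - 0.5) + rng.normal(scale=0.05, size=n_stocks)

# Percentile breakpoints: long the top 30% and short the bottom 30% (assumed cutoffs)
low_cut, high_cut = np.percentile(metric, [30, 70])
long_leg = returns[metric >= high_cut].mean()
short_leg = returns[metric <= low_cut].mean()

# The factor return for this period is the return difference between the two legs
factor_return = long_leg - short_leg
print(f"long: {long_leg:.4f}, short: {short_leg:.4f}, factor: {factor_return:.4f}")
```

Repeating this calculation every period yields a time series of factor returns that can serve as a regressor in a factor model.
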
### Fama-MacBeth regression

To address the inference problem caused by the correlation of the residuals, Fama and MacBeth proposed a two-step methodology for a cross-sectional regression of returns on factors. The two-stage Fama-MacBeth regression is designed to estimate the premium that the market pays for exposure to a particular risk factor. The two stages consist of:
- **First stage**: N time-series regressions, one for each asset or portfolio, of its excess returns on the factors to estimate the factor loadings.
- **Second stage**: T cross-sectional regressions, one for each time period, to estimate the risk premium.

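The two stages above can be sketched with plain NumPy on simulated data. The number of assets, the factor premia, and the noise levels are assumptions for this example. The final step, averaging the period-by-period estimates and using their time-series variation for standard errors, is the usual way to summarize the second stage, though it is not spelled out above.

```python
import numpy as np

rng = np.random.default_rng(7)

n_periods, n_assets, n_factors = 600, 25, 2
true_premia = np.array([0.5, 0.2])   # hypothetical premia, in percent per period

# Simulated factor realizations, factor loadings, and excess returns
factors = true_premia + rng.normal(scale=2.0, size=(n_periods, n_factors))
true_betas = rng.uniform(0.5, 1.5, size=(n_assets, n_factors))
excess_returns = factors @ true_betas.T + rng.normal(size=(n_periods, n_assets))

# First stage: N time-series regressions (solved jointly) to estimate the loadings
F = np.column_stack([np.ones(n_periods), factors])
first_stage, *_ = np.linalg.lstsq(F, excess_returns, rcond=None)   # shape (1 + K, N)
betas = first_stage[1:].T                                          # shape (N, K)

# Second stage: T cross-sectional regressions of returns on the estimated loadings
B = np.column_stack([np.ones(n_assets), betas])
second_stage, *_ = np.linalg.lstsq(B, excess_returns.T, rcond=None)  # shape (1 + K, T)
lambdas = second_stage[1:].T                                         # shape (T, K)

# Risk premia: time-series mean of the per-period estimates, with simple standard errors
premia = lambdas.mean(axis=0)
std_errors = lambdas.std(axis=0, ddof=1) / np.sqrt(n_periods)
print("premia:", premia.round(3), "t-stats:", (premia / std_errors).round(2))
```
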
#### Code Examples

The notebook [fama_macbeth](02_fama_macbeth.ipynb) illustrates how to run a Fama-MacBeth regression, including using the [LinearModels](https://bashtage.github.io/linearmodels/doc/) library.

## Linear models for prediction – shrinkage methods

When a linear regression model contains many correlated variables, their coefficients will be poorly determined because the effect of a large positive coefficient on the RSS can be canceled by a similarly large negative coefficient on a correlated variable. Hence, the model will tend to have high variance because this wiggle room in the coefficients increases the risk that the model overfits the sample.

### Hedging against overfitting – regularization in linear models

One popular technique to control overfitting is regularization, which adds a penalty term to the error function to discourage the coefficients from reaching large values. In other words, constraining the size of the coefficients can alleviate the potentially negative impact of overfitting on out-of-sample predictions. We will encounter regularization methods for all models since overfitting is such a pervasive problem.

In this section, we will introduce shrinkage methods that address two motivations to improve on the approaches to linear models discussed so far:

- Prediction accuracy: The low bias but high variance of least squares estimates suggests that the generalization error could be reduced by shrinking some coefficients or setting them to zero, thereby trading off a slightly higher bias for a reduction in the variance of the model.
- Interpretation: A large number of predictors may complicate the interpretation or communication of the big picture of the results. It may be preferable to sacrifice some detail to limit the model to a smaller subset of parameters with the strongest effects.

### Ridge regression

Ridge regression shrinks the regression coefficients by adding a penalty to the objective function that equals the sum of the squared coefficients, which in turn corresponds to the squared L2 norm of the coefficient vector.

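For illustration, the ridge penalty leads to a closed-form solution, sketched below with NumPy on two highly correlated predictors. The data and the penalty value of 10 are assumptions for this example; the predictors are standardized and the response is centered so that the intercept can be dropped.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two highly correlated predictors (synthetic, for illustration)
n_obs = 100
x1 = rng.normal(size=n_obs)
x2 = x1 + rng.normal(scale=0.05, size=n_obs)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n_obs)

# Standardize the predictors and center the response
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()


def ridge(X, y, penalty):
    """Closed-form ridge solution: solve (X'X + penalty * I) beta = X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)


beta_ols = ridge(X, y, penalty=0.0)     # a zero penalty reproduces OLS
beta_ridge = ridge(X, y, penalty=10.0)  # hypothetical penalty value

print("OLS:  ", beta_ols.round(3))
print("Ridge:", beta_ridge.round(3))
```

With a positive penalty, the coefficient vector has a smaller norm than the OLS solution, and the coefficients on the two correlated predictors are pulled toward each other.
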
### Lasso regression

The lasso, known as basis pursuit in signal processing, also shrinks the coefficients by adding a penalty to the sum of squares of the residuals, but the lasso penalty has a different effect: it can set some coefficients exactly to zero. The lasso penalty is the sum of the absolute values of the coefficients, which corresponds to the L1 norm of the coefficient vector.

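The lasso has no closed-form solution in general. One common way to fit it is coordinate descent with soft-thresholding, sketched below. The objective scaling (half the mean squared error plus the penalty times the L1 norm), the synthetic data, and the penalty value are assumptions for this example, and the helper names are hypothetical.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return zero if it lies within the threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimize (1 / 2n) * ||y - X beta||^2 + penalty * ||beta||_1 one coordinate at a time."""
    n_obs, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with the contribution of feature j added back
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n_obs
            scale = X[:, j] @ X[:, j] / n_obs
            beta[j] = soft_threshold(rho, penalty) / scale
    return beta


rng = np.random.default_rng(5)
n_obs = 200
X = rng.normal(size=(n_obs, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
# Only the first two features matter in this synthetic example
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_obs)
y = y - y.mean()

beta_lasso = lasso_coordinate_descent(X, y, penalty=0.2)
print(beta_lasso.round(3))
```

The coefficients on the three irrelevant features come out as exact zeros, while the two relevant ones are shrunk toward zero; this is the variable-selection behavior that distinguishes the L1 penalty from the L2 penalty.
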
#### References
- [An Introduction to Statistical Learning](https://www-bcf.usc.edu/~gareth/ISL/), James, Witten, Hastie and Tibshirani, 2013, chapter 6
- [Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/), Hastie, Tibshirani and Friedman, 2009, chapter 3.4

## How to predict stock prices using linear regression

The notebook [linear_regression](03_linear_regression.ipynb) contains examples for the prediction of stock prices using OLS with statsmodels and sklearn, as well as ridge and lasso models.

It is designed to run as a notebook on the [Quantopian](https://www.quantopian.com/help) research platform and relies on the [factor_library](../04_alpha_factor_research/02_alpha_factor_library) introduced in Chapter 4, Research and Evaluation of Alpha Factors.

## Linear classification

There are many different classification techniques to predict a qualitative response. In this section, we will introduce the widely used logistic regression, which is closely related to linear regression. The following chapters address more complex methods: generalized additive models, which include decision trees and random forests, as well as gradient boosting machines and neural networks.

### The logistic regression model

The logistic regression model arises from the desire to model the probabilities of the output classes using a function that is linear in x, just like the linear regression model, while at the same time ensuring that they sum to one and remain in the interval [0, 1], as we would expect from probabilities.

In this section, we introduce the objective and functional form of the logistic regression model and describe the training method. We then illustrate how to use logistic regression for statistical inference with macro data using statsmodels, and how to predict price movements using the regularized logistic regression implemented by sklearn.

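As a minimal sketch of the functional form and one possible training method, the example below passes a linear function of x through the logistic (sigmoid) function and fits the coefficients by gradient ascent on the average log-likelihood. The simulated data, coefficient values, and learning rate are assumptions; statsmodels and sklearn use their own optimizers.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(11)

# Simulated binary outcomes whose log-odds are linear in x (hypothetical coefficients)
n_obs = 2000
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, 2))])
true_beta = np.array([-0.5, 1.5, -1.0])
y = (rng.uniform(size=n_obs) < sigmoid(X @ true_beta)).astype(float)

# Gradient ascent on the average log-likelihood
beta = np.zeros(3)
learning_rate = 0.5
for _ in range(5000):
    probabilities = sigmoid(X @ beta)
    gradient = X.T @ (y - probabilities) / n_obs
    beta += learning_rate * gradient

print("estimated coefficients:", beta.round(3))
```

In the binary case, the model outputs the probability of the positive class, and the probability of the other class is one minus that value, so the two sum to one by construction.
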
### How to conduct inference with statsmodels

The notebook [logistic_regression_macro_data](05_logistic_regression_macro_data.ipynb) illustrates how to run a logistic regression on macro data and conduct statistical inference using [statsmodels](https://www.statsmodels.org/stable/index.html).

### How to use logistic regression for prediction

The lasso L1 penalty and the ridge L2 penalty can both be used with logistic regression. They have the same shrinkage effect as we have just discussed, and the lasso can again be used for variable selection, just as with linear regression.

Just as with linear regression, it is important to standardize the input variables because the regularized models are scale sensitive. The regularization hyperparameter also requires tuning using cross-validation, as in the linear regression case.

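The following sketch strings these steps together on simulated data: it turns a return series into binary up/down labels, standardizes the features using training-set statistics only, and compares a few L2 penalty values. For brevity it scores the penalties on a single chronological hold-out set rather than with full cross-validation. The data, the penalty grid, and the helper `fit_logistic_l2` are assumptions for this example, not the sklearn implementation.

```python
import numpy as np

rng = np.random.default_rng(21)

# Hypothetical features on very different scales and a noisy forward return
n_obs = 1000
features = rng.normal(size=(n_obs, 2)) * np.array([1.0, 100.0])
forward_returns = (0.01 * features[:, 0] + 0.0001 * features[:, 1]
                   + rng.normal(scale=0.02, size=n_obs))

# Regression target -> binary label: 1 if the return is positive, 0 otherwise
labels = (forward_returns > 0).astype(float)

# Chronological split; standardize with training-set statistics only
split = int(0.8 * n_obs)
mean, std = features[:split].mean(axis=0), features[:split].std(axis=0)
train_Z, valid_Z = (features[:split] - mean) / std, (features[split:] - mean) / std
train_y, valid_y = labels[:split], labels[split:]


def fit_logistic_l2(Z, y, penalty, n_iter=2000, learning_rate=0.5):
    """Gradient descent on the average log-loss plus penalty * sum(coef ** 2)."""
    intercept, coef = 0.0, np.zeros(Z.shape[1])
    for _ in range(n_iter):
        error = 1.0 / (1.0 + np.exp(-(intercept + Z @ coef))) - y
        intercept -= learning_rate * error.mean()   # the intercept is not penalized
        coef -= learning_rate * (Z.T @ error / len(y) + 2 * penalty * coef)
    return intercept, coef


# Score each candidate penalty by its accuracy on the hold-out period
accuracy = {}
for penalty in [0.001, 0.01, 0.1, 1.0]:
    intercept, coef = fit_logistic_l2(train_Z, train_y, penalty)
    predictions = (intercept + valid_Z @ coef > 0).astype(float)
    accuracy[penalty] = (predictions == valid_y).mean()

best_penalty = max(accuracy, key=accuracy.get)
print(accuracy, "best penalty:", best_penalty)
```

Computing the mean and standard deviation on the training period alone avoids leaking information from the hold-out period into the model.
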
The notebook [logistic_regression](04_logistic_regression.ipynb) demonstrates how to use logistic regression for stock price movement prediction on Quantopian.

## How to design a trading algorithm based on price predictions generated by an ML model

The notebook []()

## References

- [Risk, Return, and Equilibrium: Empirical Tests](https://www.jstor.org/stable/1831028), Eugene F. Fama and James D. MacBeth, Journal of Political Economy, 81 (1973), pp. 607–636
- [Asset Pricing](http://faculty.chicagobooth.edu/john.cochrane/teaching/asset_pricing.htm), John Cochrane, 2001