ml-finance-python
python scripts for finance machine learning
git clone https://9o.is/git/ml-finance-python.git
notebook.ipynb
(15957B)
1 {
2 "cells": [
3 {
4 "cell_type": "markdown",
5 "metadata": {
6 "collapsed": true
7 },
8 "source": [
9 "# Exercises: Linear Regression \n",
10 "By Christopher van Hoecke, Max Margenot, and Delaney Mackenzie\n",
11 "\n",
12 "## Lecture Link : \n",
13 "https://www.quantopian.com/lectures/linear-regression\n",
14 "\n",
15 "### IMPORTANT NOTE: \n",
16 "This lecture corresponds to the Linear Regression lecture, which is part of the Quantopian lecture series. This homework expects you to rely heavily on the code presented in the corresponding lecture. Please copy and paste regularly from that lecture when starting to work on the problems, as trying to do them from scratch will likely be too difficult.\n",
17 "\n",
18 "Part of the Quantopian Lecture Series:\n",
19 "\n",
20 "* [www.quantopian.com/lectures](https://www.quantopian.com/lectures)\n",
21 "* [github.com/quantopian/research_public](https://github.com/quantopian/research_public)\n",
22 "\n",
23 "----"
24 ]
25 },
26 {
27 "cell_type": "markdown",
28 "metadata": {},
29 "source": [
30 "## Key concepts"
31 ]
32 },
33 {
34 "cell_type": "code",
35 "execution_count": null,
36 "metadata": {
37 "collapsed": true
38 },
39 "outputs": [],
40 "source": [
41 "# Useful Functions\n",
42 "def linreg(X,Y):\n",
43 "    # Running the linear regression\n",
44 "    X = sm.add_constant(X)\n",
45 "    model = regression.linear_model.OLS(Y, X).fit()\n",
46 "    a = model.params[0]\n",
47 "    b = model.params[1]\n",
48 "    X = X[:, 1]\n",
49 "\n",
50 "    # Return summary of the regression and plot results\n",
51 "    X2 = np.linspace(X.min(), X.max(), 100)\n",
52 "    Y_hat = X2 * b + a\n",
53 "    plt.scatter(X, Y, alpha=0.3) # Plot the raw data\n",
54 "    plt.plot(X2, Y_hat, 'r', alpha=0.9); # Add the regression line, colored in red\n",
55 "    plt.xlabel('X Value')\n",
56 "    plt.ylabel('Y Value')\n",
57 "    return model.summary()"
58 ]
59 },
60 {
61 "cell_type": "code",
62 "execution_count": null,
63 "metadata": {},
64 "outputs": [],
65 "source": [
66 "# Useful Libraries\n",
67 "import math\n",
68 "import numpy as np\n",
69 "import matplotlib.pyplot as plt\n",
70 "\n",
71 "from statsmodels import regression\n",
72 "from statsmodels.stats import diagnostic\n",
73 "import statsmodels.regression as smr\n",
74 "import statsmodels.api as sm\n",
75 "from statsmodels.stats.diagnostic import het_breuschpagan\n",
76 "\n",
77 "import scipy as sp\n",
78 "import scipy.stats\n",
79 "import seaborn\n"
80 ]
81 },
82 {
83 "cell_type": "markdown",
84 "metadata": {},
85 "source": [
86 "----"
87 ]
88 },
89 {
90 "cell_type": "markdown",
91 "metadata": {
92 "collapsed": true
93 },
94 "source": [
95 "# Exercise 1: Temperatures\n",
96 "Given this set of Fahrenheit and Celsius values, find a model that expresses the relationship between the two temperature scales."
97 ]
98 },
99 {
100 "cell_type": "code",
101 "execution_count": null,
102 "metadata": {},
103 "outputs": [],
104 "source": [
105 "fahrenheit = [-868, -778, -688, -598, -508, -418, -328, -238, -144, -58, 32, 122, 212, 302, 392, 482, \n",
106 " 572, 662, 752, 842, 932]\n",
107 "celsius = [-500, -450, -400, -350, -300, -250, -200, -150, -100, -50, 0, 50, 100, 150, 200, 250, \n",
108 " 300, 350, 400, 450, 500]\n",
109 "\n",
110 "## Your code goes here"
111 ]
112 },
113 {
114 "cell_type": "markdown",
115 "metadata": {},
116 "source": [
117 "----"
118 ]
119 },
120 {
121 "cell_type": "markdown",
122 "metadata": {},
123 "source": [
124 "# Exercise 2 : Confidence Intervals\n",
125 "## a. Visualizing Confidence Intervals \n",
126 "Using the lecture series and the seaborn library, plot the regression line between the two return series together with its $95\\%$ confidence interval."
127 ]
128 },
129 {
130 "cell_type": "code",
131 "execution_count": null,
132 "metadata": {
133 "scrolled": false
134 },
135 "outputs": [],
136 "source": [
137 "start = '2014-01-01'\n",
138 "end = '2015-01-01'\n",
139 "asset = get_pricing('KO', fields='price', start_date=start, end_date=end)\n",
140 "benchmark = get_pricing('PEP', fields='price', start_date=start, end_date=end)\n",
141 "\n",
142 "returns1 = asset.pct_change()[1:]\n",
143 "returns2 = benchmark.pct_change()[1:]\n",
144 "\n",
145 "## Your code goes here"
146 ]
147 },
148 {
149 "cell_type": "markdown",
150 "metadata": {},
151 "source": [
152 "## b. Calculating Confidence Levels of Parameters. \n",
153 "Let's directly calculate the $95\\%$ confidence intervals of our parameters. The formula for a given parameter $\\beta_i$ is:\n",
154 "\n",
155 "$$ CI = \\left(\\beta_i - z \\cdot SE_{i,i}, \\beta_i + z \\cdot SE_{i,i}\\right) $$\n",
156 "\n",
157 "where $\\beta_i$ is the coefficient, $z$ is the critical value *(the t-statistic required to obtain a probability less than the alpha significance level)*, and $SE_{i,i}$ is the $i$-th diagonal entry of the standard error matrix. "
158 ]
159 },
160 {
161 "cell_type": "code",
162 "execution_count": null,
163 "metadata": {},
164 "outputs": [],
165 "source": [
166 "start = '2014-01-01'\n",
167 "end = '2015-01-01'\n",
168 "asset = get_pricing('KO', fields='price', start_date=start, end_date=end)\n",
169 "benchmark = get_pricing('PEP', fields='price', start_date=start, end_date=end)\n",
170 "\n",
171 "X = asset.pct_change()[1:]\n",
172 "Y = benchmark.pct_change()[1:]\n",
173 "\n",
174 "result = sm.OLS(Y,X).fit()\n",
175 "\n",
176 "# Convert X to a matrix (adding a row of ones)\n",
177 "X = np.vstack((X, np.ones( X.size ) ))\n",
178 "X = np.matrix( X )\n",
179 "\n",
180 "# Matrix Multiplication and inverse calculation\n",
181 "C = np.linalg.inv( X * X.T )\n",
182 "C *= result.mse_resid\n",
183 "SE = np.sqrt(C) # Calculation of standard error. \n",
184 "\n",
185 "# Critical Values of the t-statistic\n",
186 "N = result.nobs\n",
187 "P = result.df_model\n",
188 "dof = N - P - 1\n",
189 "z = scipy.stats.t(dof).ppf(0.975)\n",
190 "\n",
191 "i = 0\n",
192 "## Your code goes here\n",
193 "\n",
194 "# Fetch values of Beta and parameters of SE from the matrix\n",
195 "beta = ## Your code goes here\n",
196 "c = ## Your code goes here\n",
197 "\n",
198 "print ## Your code goes here"
199 ]
200 },
201 {
202 "cell_type": "markdown",
203 "metadata": {},
204 "source": [
205 "----"
206 ]
207 },
208 {
209 "cell_type": "markdown",
210 "metadata": {},
211 "source": [
212 "# Exercise 3 : $R^2$ Value\n",
213 "\n",
214 "$R^2$ is a measure of how close your data points are to the regression line, and is defined as $$ R^2 = 1 - \\frac{\\Sigma (y_{predicted} - y_{actual})^2}{\\Sigma \\left( y_{predicted} - \\frac{\\Sigma y_{actual}}{len(y_{actual})} \\right)^2} $$ \n",
215 "Given the information from exercise 1, calculate the value of $R^2$ manually.\n",
216 "You can start by expressing f as a function of c from the data obtained from Exercise 1 (these are the predicted values of y). "
217 ]
218 },
219 {
220 "cell_type": "code",
221 "execution_count": null,
222 "metadata": {},
223 "outputs": [],
224 "source": [
225 "# Create an empty numpy array (float values).\n",
226 "# Find the predicted value of f for every c in celsius (given by f = 32 + 1.8c)\n",
227 "fpred = np.array([])\n",
228 "f = [#________# \n",
229 " for a in celsius] ## Your code goes here (fill in the values of Beta, and X1)\n",
230 "ypredicted = np.append(f, fpred)"
231 ]
232 },
233 {
234 "cell_type": "markdown",
235 "metadata": {},
236 "source": [
237 "Using the values of $y_{predicted}$ and $y_{actual}$, calculate the squared element by element difference of the two lists, and sum them."
238 ]
239 },
240 {
241 "cell_type": "code",
242 "execution_count": null,
243 "metadata": {
244 "collapsed": true
245 },
246 "outputs": [],
247 "source": [
248 "# Calculate the difference between the predicted values of y and the actual values of y, \n",
249 "# Find the square of the difference\n",
250 "# Sum the Squares\n",
251 "\n",
252 "ypred_yact = [#______#\n",
253 " for a, b in zip(ypredicted, fahrenheit)] ## your code goes here (a - b)\n",
254 "diff1squared = [#_______# \n",
255 " for a in ypred_yact] ## Your code goes here\n",
256 "sumsquares1 = sum(diff1squared) ## Your code goes here "
257 ]
258 },
259 {
260 "cell_type": "markdown",
261 "metadata": {},
262 "source": [
263 "Calculate the mean of the predicted values, then the element-by-element difference $y_{predicted} - mean$. Square the values in the resulting list and sum them. "
264 ]
265 },
266 {
267 "cell_type": "code",
268 "execution_count": null,
269 "metadata": {
270 "collapsed": true
271 },
272 "outputs": [],
273 "source": [
274 "# Calculate the difference between the predicted values of y and mean of y. \n",
275 "# Find the square of the difference\n",
276 "# Sum the Squares\n",
277 "\n",
278 "mean = ## Your code goes here\n",
279 "ypred_mean = ## Your code goes here\n",
280 "ypred_meansquared = ## Your code goes here \n",
281 "sumsquares2 = ## Your code goes here\n"
282 ]
283 },
284 {
285 "cell_type": "markdown",
286 "metadata": {},
287 "source": [
288 "We can now calculate R-squared by subtracting the ratio of the two sums from one. "
289 ]
290 },
291 {
292 "cell_type": "code",
293 "execution_count": null,
294 "metadata": {},
295 "outputs": [],
296 "source": [
297 "r = ## Your code goes here\n",
298 "print('R-squared = ', r)"
299 ]
300 },
301 {
302 "cell_type": "markdown",
303 "metadata": {},
304 "source": [
305 "----"
306 ]
307 },
308 {
309 "cell_type": "markdown",
310 "metadata": {},
311 "source": [
312 "# Exercise 4 : Residuals\n",
313 "**Definition: In statistics, the residuals are the differences between the actual values and the predicted values**: \n",
314 "\n",
315 "$$e = y - \\hat{y}$$\n",
316 "\n",
317 "## a. Residual Analysis I\n",
318 "- Model the data given below as a linear regression. \n",
319 "- Calculate and plot the residual of the data sets *(remember to use the coefficient and the value of x1 to find the predicted values of y)*\n",
320 "- Print the sum of the residuals. \n",
321 "- Discuss the choice of regression model. "
322 ]
323 },
324 {
325 "cell_type": "code",
326 "execution_count": null,
327 "metadata": {
328 "collapsed": true
329 },
330 "outputs": [],
331 "source": [
332 "asset1 = get_pricing('SPY', \n",
333 " fields='price', \n",
334 " start_date='2005-01-01', \n",
335 " end_date='2010-01-01')\n",
336 "asset2 = get_pricing('GS', \n",
337 " fields='price', \n",
338 " start_date='2005-01-01', \n",
339 " end_date='2010-01-01')\n",
340 "\n",
341 "returns1 = asset1.pct_change()[1:]\n",
342 "returns2 = asset2.pct_change()[1:]\n",
343 "\n",
344 "## Your code goes here"
345 ]
346 },
347 {
348 "cell_type": "markdown",
349 "metadata": {},
350 "source": [
351 "Run the Breusch-Pagan test to check for heteroskedasticity in the residuals. The residuals of the model should have constant variance; the presence of heteroskedasticity would indicate that our choice of model is not optimal. "
352 ]
353 },
354 {
355 "cell_type": "code",
356 "execution_count": null,
357 "metadata": {},
358 "outputs": [],
359 "source": [
360 "lm, p_lm, fv, p_fv = ## Your code goes here\n",
361 "print('p-value for f-statistic of the Breusch-Pagan test:', ) ## Your code goes here\n",
362 "print('====')\n",
363 "print(\"Since the p-value obtained is ______ than alpha (0.05), \\\n",
364 "we ______ reject the null hypothesis of the Breusch-Pagan test\")"
365 ]
366 },
367 {
368 "cell_type": "code",
369 "execution_count": null,
370 "metadata": {},
371 "outputs": [],
372 "source": [
373 "# Predicted values of asset1\n",
374 "y = ## Your code goes here\n",
375 "\n",
376 "plt.scatter(y, results.resid)\n",
377 "plt.title('Scatter plot of Residuals to predicted model')\n",
378 "plt.xlabel('Predicted Model')\n",
379 "plt.ylabel('Residuals');"
380 ]
381 },
382 {
383 "cell_type": "markdown",
384 "metadata": {},
385 "source": [
386 "## b. Residual Analysis II\n",
387 "- Run the linear regression function for x and y\n",
388 "- Find and plot the residuals of the two data sets. \n",
389 "- Discuss the choice of model. "
390 ]
391 },
392 {
393 "cell_type": "code",
394 "execution_count": null,
395 "metadata": {},
396 "outputs": [],
397 "source": [
398 "p1 = get_pricing('SPY', start_date = '2005-01-01', \n",
399 " end_date = '2010-01-01', \n",
400 " fields = 'price').pct_change()[1:]\n",
401 "p2 = get_pricing('XLF', start_date = '2005-01-01', \n",
402 " end_date = '2010-01-01', \n",
403 " fields = 'price').pct_change()[1:]\n",
404 "\n",
405 "## Your code goes here\n",
406 "results2 = \n",
407 "y = "
408 ]
409 },
410 {
411 "cell_type": "code",
412 "execution_count": null,
413 "metadata": {},
414 "outputs": [],
415 "source": [
416 "plt.scatter(y, results2.resid)\n",
417 "plt.title('Scatter plot of Residuals to predicted model')\n",
418 "plt.xlabel('Predicted Model')\n",
419 "plt.ylabel('Residuals')\n",
420 "\n",
421 "lm, p_lm, fv, p_fv = ## Your code goes here\n",
422 "print('p-value for f-statistic of the Breusch-Pagan test:', ) ## Your code goes here\n",
423 "print('====')\n",
424 "print(\"Since the p-value obtained is ____ than alpha (0.05), \\\n",
425 "we ______ the null hypothesis of the Breusch-Pagan test\")"
426 ]
427 },
428 {
429 "cell_type": "markdown",
430 "metadata": {},
431 "source": [
432 "While examining the residuals is a good way of checking the accuracy of our model choice, we must also check for heteroscedasticity (that is, whether there are sub-populations with different variabilities from others). An assumption of the linear regression model is that there is no heteroscedasticity; if this assumption is broken, OLS estimators are no longer the Best Linear Unbiased Estimators. \n",
433 "Read more about heteroscedasticity here https://en.wikipedia.org/wiki/Heteroscedasticity#Consequences"
434 ]
435 },
436 {
437 "cell_type": "markdown",
438 "metadata": {},
439 "source": [
440 "---"
441 ]
442 },
443 {
444 "cell_type": "markdown",
445 "metadata": {},
446 "source": [
447 "Congratulations on completing the Linear Regression exercise!\n",
448 "\n",
449 "As you learn more about writing trading algorithms and the Quantopian platform, be sure to check out the [Quantopian Daily Contest](https://www.quantopian.com/contest), in which you can compete for a cash prize every day.\n",
450 "\n",
451 "Start by going through the [Writing a Contest Algorithm](https://www.quantopian.com/tutorials/contest) Tutorial."
452 ]
453 },
454 {
455 "cell_type": "markdown",
456 "metadata": {},
457 "source": [
458 "*This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. (\"Quantopian\"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Quantopian, Inc. has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Quantopian, Inc. at the time of publication. Quantopian makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.*"
459 ]
460 }
461 ],
462 "metadata": {
463 "kernelspec": {
464 "display_name": "Python 3.5",
465 "language": "python",
466 "name": "py35"
467 },
468 "language_info": {
469 "codemirror_mode": {
470 "name": "ipython",
471 "version": 3
472 },
473 "file_extension": ".py",
474 "mimetype": "text/x-python",
475 "name": "python",
476 "nbconvert_exporter": "python",
477 "pygments_lexer": "ipython3",
478 "version": "3.6.5"
479 }
480 },
481 "nbformat": 4,
482 "nbformat_minor": 2
483 }
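
The workflow that Exercises 1, 3 and 4 walk through (fit a line, compute predicted values, inspect the residuals, compute $R^2$) can be sketched outside the Quantopian environment as follows. This is only an illustrative sketch, not the lecture's solution: it assumes NumPy alone and uses `np.polyfit` in place of the notebook's `linreg` helper, it copies the temperature lists from Exercise 1 unchanged, and it assumes the conventional definition of $R^2$, in which the total sum of squares is taken around the mean of the actual values.

```python
import numpy as np

# Temperature data copied from Exercise 1 (kept exactly as listed there).
fahrenheit = np.array([-868, -778, -688, -598, -508, -418, -328, -238, -144, -58, 32,
                       122, 212, 302, 392, 482, 572, 662, 752, 842, 932], dtype=float)
celsius = np.array([-500, -450, -400, -350, -300, -250, -200, -150, -100, -50, 0,
                    50, 100, 150, 200, 250, 300, 350, 400, 450, 500], dtype=float)

# Fit f = intercept + slope * c by ordinary least squares.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(celsius, fahrenheit, 1)

# Predicted values and residuals (actual minus predicted).
predicted = intercept + slope * celsius
residuals = fahrenheit - predicted

# Conventional R^2 (an assumption of this sketch):
# one minus the residual sum of squares over the total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((fahrenheit - fahrenheit.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print('slope = %.4f, intercept = %.4f' % (slope, intercept))
print('sum of residuals = %.2e' % residuals.sum())
print('R-squared = %.6f' % r_squared)
```

Because the fit includes an intercept, the residuals should sum to approximately zero, which is a quick sanity check before moving on to the residual plots and the Breusch-Pagan test.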