ml-finance-python
python scripts for finance machine learning
git clone https://9o.is/git/ml-finance-python.git
create_datasets.ipynb
(17910B)
1 {
2 "cells": [
3 {
4 "cell_type": "markdown",
5 "metadata": {},
6 "source": [
7 "# Download and manipulate data"
8 ]
9 },
10 {
11 "cell_type": "markdown",
12 "metadata": {},
13 "source": [
14 "This notebook contains information on downloading the Quandl Wiki stock prices and a few other sources that we use throughout the book. "
15 ]
16 },
17 {
18 "cell_type": "markdown",
19 "metadata": {},
20 "source": [
21 "## Imports & Settings"
22 ]
23 },
24 {
25 "cell_type": "code",
26 "execution_count": 3,
27 "metadata": {},
28 "outputs": [],
29 "source": [
30 "from pathlib import Path\n",
31 "import numpy as np\n",
32 "import pandas as pd\n",
33 "import pandas_datareader.data as web\n",
34 "\n",
35 "pd.set_option('display.expand_frame_repr', False)"
36 ]
37 },
38 {
39 "cell_type": "markdown",
40 "metadata": {},
41 "source": [
42 "## Set Data Store path"
43 ]
44 },
45 {
46 "cell_type": "markdown",
47 "metadata": {},
48 "source": [
49 "Modify path if you would like to store the data elsewhere and change the notebooks accordingly"
50 ]
51 },
52 {
53 "cell_type": "code",
54 "execution_count": 4,
55 "metadata": {},
56 "outputs": [],
57 "source": [
58 "DATA_STORE = Path('assets.h5')"
59 ]
60 },
61 {
62 "cell_type": "markdown",
63 "metadata": {},
64 "source": [
65 "## Quandl Wiki Prices"
66 ]
67 },
68 {
69 "cell_type": "markdown",
70 "metadata": {},
71 "source": [
72 "[Quandl](https://www.quandl.com/) makes available a [dataset](https://www.quandl.com/databases/WIKIP/documentation) with stock prices, dividends and splits for 3000 US publicly-traded companies. Quandl decided to discontinue support in favor of its commercial offerings but the historical data are still useful to demonstrate the application of the machine learning solutions in the book, just ensure you implement your own algorithms on current data.\n",
73 "\n",
74 "> As of April 11, 2018 this data feed is no longer actively supported by the Quandl community. We will continue to host this data feed on Quandl, but we do not recommend using it for investment or analysis."
75 ]
76 },
77 {
78 "cell_type": "markdown",
79 "metadata": {},
80 "source": [
81 "1. Follow the instructions to create a free [Quandl]([Quandl](https://www.quandl.com/)) account\n",
82 "2. [Download](https://www.quandl.com/databases/WIKIP/usage/export) the entire WIKI/PRICES data\n",
83 "3. Extract the .zip file,\n",
84 "4. Move to this directory and rename to wiki_prices.csv\n",
85 "5. Run the below code to store in fast HDF format (see [Chapter 02 on Market & Fundamental Data](../02_market_and_fundamental_data) for details)."
86 ]
87 },
88 {
89 "cell_type": "code",
90 "execution_count": null,
91 "metadata": {},
92 "outputs": [],
93 "source": [
94 "df = (pd.read_csv('wiki_prices.csv',\n",
95 " parse_dates=['date'],\n",
96 " index_col=['date', 'ticker'],\n",
97 " infer_datetime_format=True)\n",
98 " .sort_index())\n",
99 "\n",
100 "print(df.info(null_counts=True))\n",
101 "with pd.HDFStore(DATA_STORE) as store:\n",
102 " store.put('quandl/wiki/prices', df)"
103 ]
104 },
105 {
106 "cell_type": "markdown",
107 "metadata": {},
108 "source": [
109 "### Wiki Prices Metadata"
110 ]
111 },
112 {
113 "cell_type": "markdown",
114 "metadata": {},
115 "source": [
116 "1. Follow the instructions to create a free [Quandl]([Quandl](https://www.quandl.com/)) account if you haven't done so yet\n",
117 "2. Find link to download wiki metadata under Companies](https://www.quandl.com/databases/WIKIP/documentation) or use the download link with your API_KEY: https://www.quandl.com/api/v3/databases/WIKI/metadata?api_key=<API_KEY>\n",
118 "3. Extract the .zip file,\n",
119 "4. Move to this directory and rename to wiki_stocks.csv\n",
120 "5. Run the following code to store in fast HDF format"
121 ]
122 },
123 {
124 "cell_type": "code",
125 "execution_count": null,
126 "metadata": {},
127 "outputs": [],
128 "source": [
129 "df = pd.read_csv('wiki_stocks.csv')\n",
130 "df = pd.concat([df.loc[:, 'code'].str.strip(),\n",
131 " df.loc[:, 'name'].str.split('(', expand=True)[0].str.strip().to_frame('name')], axis=1)\n",
132 "\n",
133 "print(df.info(null_counts=True))\n",
134 "with pd.HDFStore(DATA_STORE) as store:\n",
135 " store.put('quandl/wiki/stocks', df)"
136 ]
137 },
138 {
139 "cell_type": "markdown",
140 "metadata": {},
141 "source": [
142 "## S&P 500 Prices"
143 ]
144 },
145 {
146 "cell_type": "markdown",
147 "metadata": {},
148 "source": [
149 "The following code downloads historical S&P 500 prices from FRED (only last 10 years of daily data is freely available)"
150 ]
151 },
152 {
153 "cell_type": "code",
154 "execution_count": null,
155 "metadata": {},
156 "outputs": [],
157 "source": [
158 "df = web.DataReader(name='SP500', data_source='fred', start=2008)\n",
159 "print(df.info())\n",
160 "with pd.HDFStore(DATA_STORE) as store:\n",
161 " store.put('sp500/prices', df)"
162 ]
163 },
164 {
165 "cell_type": "markdown",
166 "metadata": {},
167 "source": [
168 "### S&P 500 Constituents"
169 ]
170 },
171 {
172 "cell_type": "markdown",
173 "metadata": {},
174 "source": [
175 "The following code downloads the current S&P 500 constituents from [Wikipedia](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies)."
176 ]
177 },
178 {
179 "cell_type": "code",
180 "execution_count": 5,
181 "metadata": {},
182 "outputs": [
183 {
184 "name": "stdout",
185 "output_type": "stream",
186 "text": [
187 "<class 'pandas.core.frame.DataFrame'>\n",
188 "Index: 505 entries, MMM to ZTS\n",
189 "Data columns (total 7 columns):\n",
190 "name 505 non-null object\n",
191 "gics_sector 505 non-null object\n",
192 "gics_sub_industry 505 non-null object\n",
193 "location 505 non-null object\n",
194 "first_added 402 non-null object\n",
195 "cik 505 non-null int64\n",
196 "founded 172 non-null object\n",
197 "dtypes: int64(1), object(6)\n",
198 "memory usage: 31.6+ KB\n",
199 "None\n"
200 ]
201 },
202 {
203 "name": "stderr",
204 "output_type": "stream",
205 "text": [
206 "/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3214: PerformanceWarning: \n",
207 "your performance may suffer as PyTables will pickle object types that it cannot\n",
208 "map directly to c-types [inferred_type->mixed,key->block0_values] [items->['name', 'gics_sector', 'gics_sub_industry', 'location', 'first_added', 'founded']]\n",
209 "\n",
210 " if (yield from self.run_code(code, result)):\n"
211 ]
212 }
213 ],
214 "source": [
215 "url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'\n",
216 "df = pd.read_html(url, header=0)[0]\n",
217 "df.columns = ['name', 'ticker', 'sec_filings', 'gics_sector', 'gics_sub_industry',\n",
218 " 'location', 'first_added', 'cik', 'founded']\n",
219 "df = df.drop('sec_filings', axis=1).set_index('ticker')\n",
220 "print(df.info())\n",
221 "with pd.HDFStore(DATA_STORE) as store:\n",
222 " store.put('sp500/stocks', df)"
223 ]
224 },
225 {
226 "cell_type": "markdown",
227 "metadata": {},
228 "source": [
229 "## Metadata on US-traded companies"
230 ]
231 },
232 {
233 "cell_type": "markdown",
234 "metadata": {},
235 "source": [
236 "The following downloads several attributes for [companies](https://www.nasdaq.com/screening/companies-by-name.aspx) traded on NASDAQ, AMEX and NYSE"
237 ]
238 },
239 {
240 "cell_type": "code",
241 "execution_count": 44,
242 "metadata": {},
243 "outputs": [
244 {
245 "name": "stdout",
246 "output_type": "stream",
247 "text": [
248 "<class 'pandas.core.frame.DataFrame'>\n",
249 "Index: 6852 entries, YI to ZYME\n",
250 "Data columns (total 6 columns):\n",
251 "name 6852 non-null object\n",
252 "lastsale 6742 non-null float64\n",
253 "marketcap 5835 non-null object\n",
254 "ipoyear 3113 non-null float64\n",
255 "sector 5292 non-null object\n",
256 "industry 5292 non-null object\n",
257 "dtypes: float64(2), object(4)\n",
258 "memory usage: 374.7+ KB\n",
259 "None\n"
260 ]
261 }
262 ],
263 "source": [
264 "url = 'https://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange={}&render=download'\n",
265 "exchanges = ['NASDAQ', 'AMEX', 'NYSE']\n",
266 "df = pd.concat([pd.read_csv(url.format(ex)) for ex in exchanges]).dropna(how='all', axis=1)\n",
267 "df = df.rename(columns=str.lower).set_index('symbol').drop('summary quote', axis=1)\n",
268 "df = df[~df.index.duplicated()]\n",
269 "print(df.info()) "
270 ]
271 },
272 {
273 "cell_type": "code",
274 "execution_count": 45,
275 "metadata": {},
276 "outputs": [
277 {
278 "data": {
279 "text/html": [
280 "<div>\n",
281 "<style scoped>\n",
282 " .dataframe tbody tr th:only-of-type {\n",
283 " vertical-align: middle;\n",
284 " }\n",
285 "\n",
286 " .dataframe tbody tr th {\n",
287 " vertical-align: top;\n",
288 " }\n",
289 "\n",
290 " .dataframe thead th {\n",
291 " text-align: right;\n",
292 " }\n",
293 "</style>\n",
294 "<table border=\"1\" class=\"dataframe\">\n",
295 " <thead>\n",
296 " <tr style=\"text-align: right;\">\n",
297 " <th></th>\n",
298 " <th>name</th>\n",
299 " <th>lastsale</th>\n",
300 " <th>marketcap</th>\n",
301 " <th>ipoyear</th>\n",
302 " <th>sector</th>\n",
303 " <th>industry</th>\n",
304 " </tr>\n",
305 " <tr>\n",
306 " <th>symbol</th>\n",
307 " <th></th>\n",
308 " <th></th>\n",
309 " <th></th>\n",
310 " <th></th>\n",
311 " <th></th>\n",
312 " <th></th>\n",
313 " </tr>\n",
314 " </thead>\n",
315 " <tbody>\n",
316 " <tr>\n",
317 " <th>YI</th>\n",
318 " <td>111, Inc.</td>\n",
319 " <td>7.7000</td>\n",
320 " <td>$627.89M</td>\n",
321 " <td>2018.0</td>\n",
322 " <td>Health Care</td>\n",
323 " <td>Medical/Nursing Services</td>\n",
324 " </tr>\n",
325 " <tr>\n",
326 " <th>PIH</th>\n",
327 " <td>1347 Property Insurance Holdings, Inc.</td>\n",
328 " <td>5.1700</td>\n",
329 " <td>$31.09M</td>\n",
330 " <td>2014.0</td>\n",
331 " <td>Finance</td>\n",
332 " <td>Property-Casualty Insurers</td>\n",
333 " </tr>\n",
334 " <tr>\n",
335 " <th>PIHPP</th>\n",
336 " <td>1347 Property Insurance Holdings, Inc.</td>\n",
337 " <td>24.7628</td>\n",
338 " <td>$17.33M</td>\n",
339 " <td>NaN</td>\n",
340 " <td>Finance</td>\n",
341 " <td>Property-Casualty Insurers</td>\n",
342 " </tr>\n",
343 " <tr>\n",
344 " <th>TURN</th>\n",
345 " <td>180 Degree Capital Corp.</td>\n",
346 " <td>1.8922</td>\n",
347 " <td>$58.89M</td>\n",
348 " <td>NaN</td>\n",
349 " <td>Finance</td>\n",
350 " <td>Finance/Investors Services</td>\n",
351 " </tr>\n",
352 " <tr>\n",
353 " <th>FLWS</th>\n",
354 " <td>1-800 FLOWERS.COM, Inc.</td>\n",
355 " <td>18.8200</td>\n",
356 " <td>$1.21B</td>\n",
357 " <td>1999.0</td>\n",
358 " <td>Consumer Services</td>\n",
359 " <td>Other Specialty Stores</td>\n",
360 " </tr>\n",
361 " </tbody>\n",
362 "</table>\n",
363 "</div>"
364 ],
365 "text/plain": [
366 " name lastsale marketcap ipoyear sector industry\n",
367 "symbol \n",
368 "YI 111, Inc. 7.7000 $627.89M 2018.0 Health Care Medical/Nursing Services\n",
369 "PIH 1347 Property Insurance Holdings, Inc. 5.1700 $31.09M 2014.0 Finance Property-Casualty Insurers\n",
370 "PIHPP 1347 Property Insurance Holdings, Inc. 24.7628 $17.33M NaN Finance Property-Casualty Insurers\n",
371 "TURN 180 Degree Capital Corp. 1.8922 $58.89M NaN Finance Finance/Investors Services\n",
372 "FLWS 1-800 FLOWERS.COM, Inc. 18.8200 $1.21B 1999.0 Consumer Services Other Specialty Stores"
373 ]
374 },
375 "execution_count": 45,
376 "metadata": {},
377 "output_type": "execute_result"
378 }
379 ],
380 "source": [
381 "df.head()"
382 ]
383 },
384 {
385 "cell_type": "markdown",
386 "metadata": {},
387 "source": [
388 "### Convert market cap information to numerical format"
389 ]
390 },
391 {
392 "cell_type": "markdown",
393 "metadata": {},
394 "source": [
395 "Market cap is provided as strings so we need to convert it to numerical format."
396 ]
397 },
398 {
399 "cell_type": "code",
400 "execution_count": 46,
401 "metadata": {},
402 "outputs": [
403 {
404 "data": {
405 "text/plain": [
406 "M 3388\n",
407 "B 2441\n",
408 "0 6\n",
409 "Name: suffix, dtype: int64"
410 ]
411 },
412 "execution_count": 46,
413 "metadata": {},
414 "output_type": "execute_result"
415 }
416 ],
417 "source": [
418 "mcap = df[['marketcap']].dropna()\n",
419 "mcap['suffix'] = mcap.marketcap.str[-1]\n",
420 "mcap.suffix.value_counts()"
421 ]
422 },
423 {
424 "cell_type": "markdown",
425 "metadata": {},
426 "source": [
427 "Keep only values with value units:"
428 ]
429 },
430 {
431 "cell_type": "code",
432 "execution_count": 47,
433 "metadata": {},
434 "outputs": [
435 {
436 "name": "stdout",
437 "output_type": "stream",
438 "text": [
439 "<class 'pandas.core.frame.DataFrame'>\n",
440 "Index: 5829 entries, YI to ZYME\n",
441 "Data columns (total 2 columns):\n",
442 "marketcap 5829 non-null float64\n",
443 "suffix 5829 non-null object\n",
444 "dtypes: float64(1), object(1)\n",
445 "memory usage: 296.6+ KB\n"
446 ]
447 }
448 ],
449 "source": [
450 "mcap = mcap[mcap.suffix.str.endswith(('B', 'M'))]\n",
451 "mcap.marketcap = pd.to_numeric(mcap.marketcap.str[1:-1])\n",
452 "mcaps = {'M': 1e6, 'B': 1e9}\n",
453 "for symbol, factor in mcaps.items():\n",
454 " mcap.loc[mcap.suffix == symbol, 'marketcap'] *= factor\n",
455 "mcap.info()"
456 ]
457 },
458 {
459 "cell_type": "code",
460 "execution_count": 49,
461 "metadata": {
462 "scrolled": false
463 },
464 "outputs": [
465 {
466 "data": {
467 "text/plain": [
468 "count 5,829\n",
469 "mean 7,495,050,037\n",
470 "std 36,240,773,357\n",
471 "min 1,410,000\n",
472 "10% 35,188,000\n",
473 "20% 92,888,000\n",
474 "30% 196,769,999\n",
475 "40% 345,400,000\n",
476 "50% 619,210,000\n",
477 "60% 1,117,999,999\n",
478 "70% 2,200,000,000\n",
479 "80% 4,724,000,000\n",
480 "90% 13,713,999,999\n",
481 "max 961,260,000,000\n",
482 "Name: marketcap, dtype: object"
483 ]
484 },
485 "execution_count": 49,
486 "metadata": {},
487 "output_type": "execute_result"
488 }
489 ],
490 "source": [
491 "df['marketcap'] = mcap.marketcap\n",
492 "df.marketcap.describe(percentiles=np.arange(.1, 1, .1).round(1)).apply(lambda x: f'{int(x):,d}')"
493 ]
494 },
495 {
496 "cell_type": "markdown",
497 "metadata": {},
498 "source": [
499 "### Store result"
500 ]
501 },
502 {
503 "cell_type": "code",
504 "execution_count": 28,
505 "metadata": {},
506 "outputs": [
507 {
508 "name": "stderr",
509 "output_type": "stream",
510 "text": [
511 "/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3214: PerformanceWarning: \n",
512 "your performance may suffer as PyTables will pickle object types that it cannot\n",
513 "map directly to c-types [inferred_type->mixed,key->block1_values] [items->['name', 'sector', 'industry']]\n",
514 "\n",
515 " if (yield from self.run_code(code, result)):\n"
516 ]
517 }
518 ],
519 "source": [
520 "with pd.HDFStore(DATA_STORE) as store:\n",
521 " store.put('us_equities/stocks', df)"
522 ]
523 },
524 {
525 "cell_type": "markdown",
526 "metadata": {},
527 "source": [
528 "## Bond Price Indexes"
529 ]
530 },
531 {
532 "cell_type": "markdown",
533 "metadata": {},
534 "source": [
535 "The following code downloads several bond indexes from the Federal Reserve Economic Data service ([FRED](https://fred.stlouisfed.org/))"
536 ]
537 },
538 {
539 "cell_type": "code",
540 "execution_count": null,
541 "metadata": {},
542 "outputs": [],
543 "source": [
544 "securities = {'BAMLCC0A0CMTRIV' : 'US Corp Master TRI',\n",
545 " 'BAMLHYH0A0HYM2TRIV': 'US High Yield TRI',\n",
546 " 'BAMLEMCBPITRIV' : 'Emerging Markets Corporate Plus TRI',\n",
547 " 'GOLDAMGBD228NLBM' : 'Gold (London, USD)',\n",
548 " 'DGS10' : '10-Year Treasury CMR',\n",
549 " }\n",
550 "\n",
551 "df = web.DataReader(name=list(securities.keys()), data_source='fred', start=2000)\n",
552 "df = df.rename(columns=securities).dropna(how='all').resample('B').mean()\n",
553 "\n",
554 "with pd.HDFStore(DATA_STORE) as store:\n",
555 " store.put('fred/assets', df)"
556 ]
557 }
558 ],
559 "metadata": {
560 "kernelspec": {
561 "display_name": "Python 3",
562 "language": "python",
563 "name": "python3"
564 },
565 "language_info": {
566 "codemirror_mode": {
567 "name": "ipython",
568 "version": 3
569 },
570 "file_extension": ".py",
571 "mimetype": "text/x-python",
572 "name": "python",
573 "nbconvert_exporter": "python",
574 "pygments_lexer": "ipython3",
575 "version": "3.6.8"
576 },
577 "toc": {
578 "base_numbering": 1,
579 "nav_menu": {},
580 "number_sections": true,
581 "sideBar": true,
582 "skip_h1_title": true,
583 "title_cell": "Table of Contents",
584 "title_sidebar": "Contents",
585 "toc_cell": false,
586 "toc_position": {},
587 "toc_section_display": true,
588 "toc_window_display": false
589 }
590 },
591 "nbformat": 4,
592 "nbformat_minor": 2
593 }