ml-finance-python

python scripts for finance machine learning

git clone https://9o.is/git/ml-finance-python.git

create_datasets.ipynb

(17910B)


      1 {
      2  "cells": [
      3   {
      4    "cell_type": "markdown",
      5    "metadata": {},
      6    "source": [
      7     "# Download and manipulate data"
      8    ]
      9   },
     10   {
     11    "cell_type": "markdown",
     12    "metadata": {},
     13    "source": [
     14     "This notebook contains information on downloading the Quandl Wiki stock prices and a few other sources that we use throughout the book. "
     15    ]
     16   },
     17   {
     18    "cell_type": "markdown",
     19    "metadata": {},
     20    "source": [
     21     "## Imports & Settings"
     22    ]
     23   },
     24   {
     25    "cell_type": "code",
     26    "execution_count": 3,
     27    "metadata": {},
     28    "outputs": [],
     29    "source": [
     30     "from pathlib import Path\n",
     31     "import numpy as np\n",
     32     "import pandas as pd\n",
     33     "import pandas_datareader.data as web\n",
     34     "\n",
     35     "pd.set_option('display.expand_frame_repr', False)"
     36    ]
     37   },
     38   {
     39    "cell_type": "markdown",
     40    "metadata": {},
     41    "source": [
     42     "## Set Data Store path"
     43    ]
     44   },
     45   {
     46    "cell_type": "markdown",
     47    "metadata": {},
     48    "source": [
     49     "Modify path if you would like to store the data elsewhere and change the notebooks accordingly"
     50    ]
     51   },
     52   {
     53    "cell_type": "code",
     54    "execution_count": 4,
     55    "metadata": {},
     56    "outputs": [],
     57    "source": [
     58     "DATA_STORE = Path('assets.h5')"
     59    ]
     60   },
     61   {
     62    "cell_type": "markdown",
     63    "metadata": {},
     64    "source": [
     65     "## Quandl Wiki Prices"
     66    ]
     67   },
     68   {
     69    "cell_type": "markdown",
     70    "metadata": {},
     71    "source": [
     72     "[Quandl](https://www.quandl.com/) makes available a [dataset](https://www.quandl.com/databases/WIKIP/documentation) with stock prices, dividends and splits for 3000 US publicly-traded companies. Quandl decided to discontinue support in favor of its commercial offerings but the historical data are still useful to demonstrate the application of the machine learning solutions in the book, just ensure you implement your own algorithms on current data.\n",
     73     "\n",
     74     "> As of April 11, 2018 this data feed is no longer actively supported by the Quandl community. We will continue to host this data feed on Quandl, but we do not recommend using it for investment or analysis."
     75    ]
     76   },
     77   {
     78    "cell_type": "markdown",
     79    "metadata": {},
     80    "source": [
     81     "1. Follow the instructions to create a free [Quandl]([Quandl](https://www.quandl.com/)) account\n",
     82     "2. [Download](https://www.quandl.com/databases/WIKIP/usage/export) the entire WIKI/PRICES data\n",
     83     "3. Extract the .zip file,\n",
     84     "4. Move to this directory and rename to wiki_prices.csv\n",
     85     "5. Run the below code to store in fast HDF format (see [Chapter 02 on Market & Fundamental Data](../02_market_and_fundamental_data) for details)."
     86    ]
     87   },
     88   {
     89    "cell_type": "code",
     90    "execution_count": null,
     91    "metadata": {},
     92    "outputs": [],
     93    "source": [
     94     "df = (pd.read_csv('wiki_prices.csv',\n",
     95     "                 parse_dates=['date'],\n",
     96     "                 index_col=['date', 'ticker'],\n",
     97     "                 infer_datetime_format=True)\n",
     98     "     .sort_index())\n",
     99     "\n",
    100     "print(df.info(null_counts=True))\n",
    101     "with pd.HDFStore(DATA_STORE) as store:\n",
    102     "    store.put('quandl/wiki/prices', df)"
    103    ]
    104   },
    105   {
    106    "cell_type": "markdown",
    107    "metadata": {},
    108    "source": [
    109     "### Wiki Prices Metadata"
    110    ]
    111   },
    112   {
    113    "cell_type": "markdown",
    114    "metadata": {},
    115    "source": [
    116     "1. Follow the instructions to create a free [Quandl]([Quandl](https://www.quandl.com/)) account if you haven't done so yet\n",
    117     "2. Find link to download wiki metadata under Companies](https://www.quandl.com/databases/WIKIP/documentation) or use the download link with your API_KEY: https://www.quandl.com/api/v3/databases/WIKI/metadata?api_key=<API_KEY>\n",
    118     "3. Extract the .zip file,\n",
    119     "4. Move to this directory and rename to wiki_stocks.csv\n",
    120     "5. Run the following code to store in fast HDF format"
    121    ]
    122   },
    123   {
    124    "cell_type": "code",
    125    "execution_count": null,
    126    "metadata": {},
    127    "outputs": [],
    128    "source": [
    129     "df = pd.read_csv('wiki_stocks.csv')\n",
    130     "df = pd.concat([df.loc[:, 'code'].str.strip(),\n",
    131     "                df.loc[:, 'name'].str.split('(', expand=True)[0].str.strip().to_frame('name')], axis=1)\n",
    132     "\n",
    133     "print(df.info(null_counts=True))\n",
    134     "with pd.HDFStore(DATA_STORE) as store:\n",
    135     "    store.put('quandl/wiki/stocks', df)"
    136    ]
    137   },
    138   {
    139    "cell_type": "markdown",
    140    "metadata": {},
    141    "source": [
    142     "## S&P 500 Prices"
    143    ]
    144   },
    145   {
    146    "cell_type": "markdown",
    147    "metadata": {},
    148    "source": [
    149     "The following code downloads historical S&P 500 prices from FRED (only last 10 years of daily data is freely available)"
    150    ]
    151   },
    152   {
    153    "cell_type": "code",
    154    "execution_count": null,
    155    "metadata": {},
    156    "outputs": [],
    157    "source": [
    158     "df = web.DataReader(name='SP500', data_source='fred', start=2008)\n",
    159     "print(df.info())\n",
    160     "with pd.HDFStore(DATA_STORE) as store:\n",
    161     "    store.put('sp500/prices', df)"
    162    ]
    163   },
    164   {
    165    "cell_type": "markdown",
    166    "metadata": {},
    167    "source": [
    168     "### S&P 500 Constituents"
    169    ]
    170   },
    171   {
    172    "cell_type": "markdown",
    173    "metadata": {},
    174    "source": [
    175     "The following code downloads the current S&P 500 constituents from [Wikipedia](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies)."
    176    ]
    177   },
    178   {
    179    "cell_type": "code",
    180    "execution_count": 5,
    181    "metadata": {},
    182    "outputs": [
    183     {
    184      "name": "stdout",
    185      "output_type": "stream",
    186      "text": [
    187       "<class 'pandas.core.frame.DataFrame'>\n",
    188       "Index: 505 entries, MMM to ZTS\n",
    189       "Data columns (total 7 columns):\n",
    190       "name                 505 non-null object\n",
    191       "gics_sector          505 non-null object\n",
    192       "gics_sub_industry    505 non-null object\n",
    193       "location             505 non-null object\n",
    194       "first_added          402 non-null object\n",
    195       "cik                  505 non-null int64\n",
    196       "founded              172 non-null object\n",
    197       "dtypes: int64(1), object(6)\n",
    198       "memory usage: 31.6+ KB\n",
    199       "None\n"
    200      ]
    201     },
    202     {
    203      "name": "stderr",
    204      "output_type": "stream",
    205      "text": [
    206       "/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3214: PerformanceWarning: \n",
    207       "your performance may suffer as PyTables will pickle object types that it cannot\n",
    208       "map directly to c-types [inferred_type->mixed,key->block0_values] [items->['name', 'gics_sector', 'gics_sub_industry', 'location', 'first_added', 'founded']]\n",
    209       "\n",
    210       "  if (yield from self.run_code(code, result)):\n"
    211      ]
    212     }
    213    ],
    214    "source": [
    215     "url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'\n",
    216     "df = pd.read_html(url, header=0)[0]\n",
    217     "df.columns = ['name', 'ticker', 'sec_filings', 'gics_sector', 'gics_sub_industry',\n",
    218     "              'location', 'first_added', 'cik', 'founded']\n",
    219     "df = df.drop('sec_filings', axis=1).set_index('ticker')\n",
    220     "print(df.info())\n",
    221     "with pd.HDFStore(DATA_STORE) as store:\n",
    222     "    store.put('sp500/stocks', df)"
    223    ]
    224   },
    225   {
    226    "cell_type": "markdown",
    227    "metadata": {},
    228    "source": [
    229     "## Metadata on US-traded companies"
    230    ]
    231   },
    232   {
    233    "cell_type": "markdown",
    234    "metadata": {},
    235    "source": [
    236     "The following downloads several attributes for [companies](https://www.nasdaq.com/screening/companies-by-name.aspx) traded on NASDAQ, AMEX and NYSE"
    237    ]
    238   },
    239   {
    240    "cell_type": "code",
    241    "execution_count": 44,
    242    "metadata": {},
    243    "outputs": [
    244     {
    245      "name": "stdout",
    246      "output_type": "stream",
    247      "text": [
    248       "<class 'pandas.core.frame.DataFrame'>\n",
    249       "Index: 6852 entries, YI to ZYME\n",
    250       "Data columns (total 6 columns):\n",
    251       "name         6852 non-null object\n",
    252       "lastsale     6742 non-null float64\n",
    253       "marketcap    5835 non-null object\n",
    254       "ipoyear      3113 non-null float64\n",
    255       "sector       5292 non-null object\n",
    256       "industry     5292 non-null object\n",
    257       "dtypes: float64(2), object(4)\n",
    258       "memory usage: 374.7+ KB\n",
    259       "None\n"
    260      ]
    261     }
    262    ],
    263    "source": [
    264     "url = 'https://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange={}&render=download'\n",
    265     "exchanges = ['NASDAQ', 'AMEX', 'NYSE']\n",
    266     "df = pd.concat([pd.read_csv(url.format(ex)) for ex in exchanges]).dropna(how='all', axis=1)\n",
    267     "df = df.rename(columns=str.lower).set_index('symbol').drop('summary quote', axis=1)\n",
    268     "df = df[~df.index.duplicated()]\n",
    269     "print(df.info()) "
    270    ]
    271   },
    272   {
    273    "cell_type": "code",
    274    "execution_count": 45,
    275    "metadata": {},
    276    "outputs": [
    277     {
    278      "data": {
    279       "text/html": [
    280        "<div>\n",
    281        "<style scoped>\n",
    282        "    .dataframe tbody tr th:only-of-type {\n",
    283        "        vertical-align: middle;\n",
    284        "    }\n",
    285        "\n",
    286        "    .dataframe tbody tr th {\n",
    287        "        vertical-align: top;\n",
    288        "    }\n",
    289        "\n",
    290        "    .dataframe thead th {\n",
    291        "        text-align: right;\n",
    292        "    }\n",
    293        "</style>\n",
    294        "<table border=\"1\" class=\"dataframe\">\n",
    295        "  <thead>\n",
    296        "    <tr style=\"text-align: right;\">\n",
    297        "      <th></th>\n",
    298        "      <th>name</th>\n",
    299        "      <th>lastsale</th>\n",
    300        "      <th>marketcap</th>\n",
    301        "      <th>ipoyear</th>\n",
    302        "      <th>sector</th>\n",
    303        "      <th>industry</th>\n",
    304        "    </tr>\n",
    305        "    <tr>\n",
    306        "      <th>symbol</th>\n",
    307        "      <th></th>\n",
    308        "      <th></th>\n",
    309        "      <th></th>\n",
    310        "      <th></th>\n",
    311        "      <th></th>\n",
    312        "      <th></th>\n",
    313        "    </tr>\n",
    314        "  </thead>\n",
    315        "  <tbody>\n",
    316        "    <tr>\n",
    317        "      <th>YI</th>\n",
    318        "      <td>111, Inc.</td>\n",
    319        "      <td>7.7000</td>\n",
    320        "      <td>$627.89M</td>\n",
    321        "      <td>2018.0</td>\n",
    322        "      <td>Health Care</td>\n",
    323        "      <td>Medical/Nursing Services</td>\n",
    324        "    </tr>\n",
    325        "    <tr>\n",
    326        "      <th>PIH</th>\n",
    327        "      <td>1347 Property Insurance Holdings, Inc.</td>\n",
    328        "      <td>5.1700</td>\n",
    329        "      <td>$31.09M</td>\n",
    330        "      <td>2014.0</td>\n",
    331        "      <td>Finance</td>\n",
    332        "      <td>Property-Casualty Insurers</td>\n",
    333        "    </tr>\n",
    334        "    <tr>\n",
    335        "      <th>PIHPP</th>\n",
    336        "      <td>1347 Property Insurance Holdings, Inc.</td>\n",
    337        "      <td>24.7628</td>\n",
    338        "      <td>$17.33M</td>\n",
    339        "      <td>NaN</td>\n",
    340        "      <td>Finance</td>\n",
    341        "      <td>Property-Casualty Insurers</td>\n",
    342        "    </tr>\n",
    343        "    <tr>\n",
    344        "      <th>TURN</th>\n",
    345        "      <td>180 Degree Capital Corp.</td>\n",
    346        "      <td>1.8922</td>\n",
    347        "      <td>$58.89M</td>\n",
    348        "      <td>NaN</td>\n",
    349        "      <td>Finance</td>\n",
    350        "      <td>Finance/Investors Services</td>\n",
    351        "    </tr>\n",
    352        "    <tr>\n",
    353        "      <th>FLWS</th>\n",
    354        "      <td>1-800 FLOWERS.COM, Inc.</td>\n",
    355        "      <td>18.8200</td>\n",
    356        "      <td>$1.21B</td>\n",
    357        "      <td>1999.0</td>\n",
    358        "      <td>Consumer Services</td>\n",
    359        "      <td>Other Specialty Stores</td>\n",
    360        "    </tr>\n",
    361        "  </tbody>\n",
    362        "</table>\n",
    363        "</div>"
    364       ],
    365       "text/plain": [
    366        "                                          name  lastsale marketcap  ipoyear             sector                    industry\n",
    367        "symbol                                                                                                                    \n",
    368        "YI                                   111, Inc.    7.7000  $627.89M   2018.0        Health Care    Medical/Nursing Services\n",
    369        "PIH     1347 Property Insurance Holdings, Inc.    5.1700   $31.09M   2014.0            Finance  Property-Casualty Insurers\n",
    370        "PIHPP   1347 Property Insurance Holdings, Inc.   24.7628   $17.33M      NaN            Finance  Property-Casualty Insurers\n",
    371        "TURN                  180 Degree Capital Corp.    1.8922   $58.89M      NaN            Finance  Finance/Investors Services\n",
    372        "FLWS                   1-800 FLOWERS.COM, Inc.   18.8200    $1.21B   1999.0  Consumer Services      Other Specialty Stores"
    373       ]
    374      },
    375      "execution_count": 45,
    376      "metadata": {},
    377      "output_type": "execute_result"
    378     }
    379    ],
    380    "source": [
    381     "df.head()"
    382    ]
    383   },
    384   {
    385    "cell_type": "markdown",
    386    "metadata": {},
    387    "source": [
    388     "### Convert market cap information to numerical format"
    389    ]
    390   },
    391   {
    392    "cell_type": "markdown",
    393    "metadata": {},
    394    "source": [
    395     "Market cap is provided as strings so we need to convert it to numerical format."
    396    ]
    397   },
    398   {
    399    "cell_type": "code",
    400    "execution_count": 46,
    401    "metadata": {},
    402    "outputs": [
    403     {
    404      "data": {
    405       "text/plain": [
    406        "M    3388\n",
    407        "B    2441\n",
    408        "0       6\n",
    409        "Name: suffix, dtype: int64"
    410       ]
    411      },
    412      "execution_count": 46,
    413      "metadata": {},
    414      "output_type": "execute_result"
    415     }
    416    ],
    417    "source": [
    418     "mcap = df[['marketcap']].dropna()\n",
    419     "mcap['suffix'] = mcap.marketcap.str[-1]\n",
    420     "mcap.suffix.value_counts()"
    421    ]
    422   },
    423   {
    424    "cell_type": "markdown",
    425    "metadata": {},
    426    "source": [
    427     "Keep only values with value units:"
    428    ]
    429   },
    430   {
    431    "cell_type": "code",
    432    "execution_count": 47,
    433    "metadata": {},
    434    "outputs": [
    435     {
    436      "name": "stdout",
    437      "output_type": "stream",
    438      "text": [
    439       "<class 'pandas.core.frame.DataFrame'>\n",
    440       "Index: 5829 entries, YI to ZYME\n",
    441       "Data columns (total 2 columns):\n",
    442       "marketcap    5829 non-null float64\n",
    443       "suffix       5829 non-null object\n",
    444       "dtypes: float64(1), object(1)\n",
    445       "memory usage: 296.6+ KB\n"
    446      ]
    447     }
    448    ],
    449    "source": [
    450     "mcap = mcap[mcap.suffix.str.endswith(('B', 'M'))]\n",
    451     "mcap.marketcap = pd.to_numeric(mcap.marketcap.str[1:-1])\n",
    452     "mcaps = {'M': 1e6, 'B': 1e9}\n",
    453     "for symbol, factor in mcaps.items():\n",
    454     "    mcap.loc[mcap.suffix == symbol, 'marketcap'] *= factor\n",
    455     "mcap.info()"
    456    ]
    457   },
    458   {
    459    "cell_type": "code",
    460    "execution_count": 49,
    461    "metadata": {
    462     "scrolled": false
    463    },
    464    "outputs": [
    465     {
    466      "data": {
    467       "text/plain": [
    468        "count              5,829\n",
    469        "mean       7,495,050,037\n",
    470        "std       36,240,773,357\n",
    471        "min            1,410,000\n",
    472        "10%           35,188,000\n",
    473        "20%           92,888,000\n",
    474        "30%          196,769,999\n",
    475        "40%          345,400,000\n",
    476        "50%          619,210,000\n",
    477        "60%        1,117,999,999\n",
    478        "70%        2,200,000,000\n",
    479        "80%        4,724,000,000\n",
    480        "90%       13,713,999,999\n",
    481        "max      961,260,000,000\n",
    482        "Name: marketcap, dtype: object"
    483       ]
    484      },
    485      "execution_count": 49,
    486      "metadata": {},
    487      "output_type": "execute_result"
    488     }
    489    ],
    490    "source": [
    491     "df['marketcap'] = mcap.marketcap\n",
    492     "df.marketcap.describe(percentiles=np.arange(.1, 1, .1).round(1)).apply(lambda x: f'{int(x):,d}')"
    493    ]
    494   },
    495   {
    496    "cell_type": "markdown",
    497    "metadata": {},
    498    "source": [
    499     "### Store result"
    500    ]
    501   },
    502   {
    503    "cell_type": "code",
    504    "execution_count": 28,
    505    "metadata": {},
    506    "outputs": [
    507     {
    508      "name": "stderr",
    509      "output_type": "stream",
    510      "text": [
    511       "/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3214: PerformanceWarning: \n",
    512       "your performance may suffer as PyTables will pickle object types that it cannot\n",
    513       "map directly to c-types [inferred_type->mixed,key->block1_values] [items->['name', 'sector', 'industry']]\n",
    514       "\n",
    515       "  if (yield from self.run_code(code, result)):\n"
    516      ]
    517     }
    518    ],
    519    "source": [
    520     "with pd.HDFStore(DATA_STORE) as store:\n",
    521     "    store.put('us_equities/stocks', df)"
    522    ]
    523   },
    524   {
    525    "cell_type": "markdown",
    526    "metadata": {},
    527    "source": [
    528     "## Bond Price Indexes"
    529    ]
    530   },
    531   {
    532    "cell_type": "markdown",
    533    "metadata": {},
    534    "source": [
    535     "The following code downloads several bond indexes from the Federal Reserve Economic Data service ([FRED](https://fred.stlouisfed.org/))"
    536    ]
    537   },
    538   {
    539    "cell_type": "code",
    540    "execution_count": null,
    541    "metadata": {},
    542    "outputs": [],
    543    "source": [
    544     "securities = {'BAMLCC0A0CMTRIV'   : 'US Corp Master TRI',\n",
    545     "              'BAMLHYH0A0HYM2TRIV': 'US High Yield TRI',\n",
    546     "              'BAMLEMCBPITRIV'    : 'Emerging Markets Corporate Plus TRI',\n",
    547     "              'GOLDAMGBD228NLBM'  : 'Gold (London, USD)',\n",
    548     "              'DGS10'             : '10-Year Treasury CMR',\n",
    549     "              }\n",
    550     "\n",
    551     "df = web.DataReader(name=list(securities.keys()), data_source='fred', start=2000)\n",
    552     "df = df.rename(columns=securities).dropna(how='all').resample('B').mean()\n",
    553     "\n",
    554     "with pd.HDFStore(DATA_STORE) as store:\n",
    555     "    store.put('fred/assets', df)"
    556    ]
    557   }
    558  ],
    559  "metadata": {
    560   "kernelspec": {
    561    "display_name": "Python 3",
    562    "language": "python",
    563    "name": "python3"
    564   },
    565   "language_info": {
    566    "codemirror_mode": {
    567     "name": "ipython",
    568     "version": 3
    569    },
    570    "file_extension": ".py",
    571    "mimetype": "text/x-python",
    572    "name": "python",
    573    "nbconvert_exporter": "python",
    574    "pygments_lexer": "ipython3",
    575    "version": "3.6.8"
    576   },
    577   "toc": {
    578    "base_numbering": 1,
    579    "nav_menu": {},
    580    "number_sections": true,
    581    "sideBar": true,
    582    "skip_h1_title": true,
    583    "title_cell": "Table of Contents",
    584    "title_sidebar": "Contents",
    585    "toc_cell": false,
    586    "toc_position": {},
    587    "toc_section_display": true,
    588    "toc_window_display": false
    589   }
    590  },
    591  "nbformat": 4,
    592  "nbformat_minor": 2
    593 }