ml-finance-python

python scripts for finance machine learning

git clone https://9o.is/git/ml-finance-python.git

notebook.ipynb


      1 {
      2  "cells": [
      3   {
      4    "cell_type": "markdown",
      5    "metadata": {
      6     "collapsed": true
      7    },
      8    "source": [
      9     "# EventVestor: Legal and Regulatory\n",
     10     "\n",
     11     "In this notebook, we'll take a look at EventVestor's *Legal and Regulatory* dataset, available on the [Quantopian Store](https://www.quantopian.com/store). This dataset spans January 01, 2007 through the current day, and documents major legal and regulatory events affecting publicly traded companies.\n",
     12     "\n",
     13     "### Blaze\n",
     14     "Before we dig into the data, a note on how you generally access Quantopian Store datasets. These datasets are available through an API service known as [Blaze](http://blaze.pydata.org). Blaze provides the Quantopian user with a convenient interface to access very large datasets.\n",
     15     "\n",
     16     "Blaze serves an important function here. Some of these datasets contain many millions of records, so bringing that data directly into Quantopian Research is not viable. Blaze lets us provide a simple querying interface and shift the burden over to the server side.\n",
     17     "\n",
     18     "It is common to use Blaze to reduce your dataset in size, convert it over to Pandas and then to use Pandas for further computation, manipulation and visualization.\n",
     19     "\n",
     20     "Helpful links:\n",
     21     "* [Query building for Blaze](http://blaze.pydata.org/en/latest/queries.html)\n",
     22     "* [Pandas-to-Blaze dictionary](http://blaze.pydata.org/en/latest/rosetta-pandas.html)\n",
     23     "* [SQL-to-Blaze dictionary](http://blaze.pydata.org/en/latest/rosetta-sql.html).\n",
     24     "\n",
     25     "Once you've limited the size of your Blaze object, you can convert it to a Pandas DataFrame using:\n",
     26     "> `from odo import odo`  \n",
     27     "> `odo(expr, pandas.DataFrame)`\n",
     28     "\n",
     29     "### Free samples and limits\n",
     30     "One other key caveat: we limit the number of results returned from any given expression to 10,000 to protect against runaway memory usage. To be clear, you have access to all the data server side. We are limiting the size of the responses back from Blaze.\n",
     31     "\n",
     32     "There is a *free* version of this dataset as well as a paid one. The free one includes about three years of historical data, though not up to the current day.\n",
     33     "\n",
     34     "With the preamble in place, let's get started:"
     35    ]
     36   },
     37   {
     38    "cell_type": "code",
     39    "execution_count": 2,
     40    "metadata": {
     41     "collapsed": false
     42    },
     43    "outputs": [],
     44    "source": [
     45     "# import the dataset\n",
     46     "from quantopian.interactive.data.eventvestor import legal_and_regulatory\n",
     47     "# or if you want to import the free dataset, use:\n",
     48     "# from quantopian.interactive.data.eventvestor import legal_and_regulatory_free\n",
     49     "\n",
     50     "# import data operations\n",
     51     "from odo import odo\n",
     52     "# import other libraries we will use\n",
     53     "import pandas as pd"
     54    ]
     55   },
     56   {
     57    "cell_type": "code",
     58    "execution_count": 3,
     59    "metadata": {
     60     "collapsed": false
     61    },
     62    "outputs": [
     63     {
     64      "data": {
     65       "text/plain": [
     66        "dshape(\"\"\"var * {\n",
     67        "  event_id: ?float64,\n",
     68        "  asof_date: datetime,\n",
     69        "  trade_date: ?datetime,\n",
     70        "  symbol: ?string,\n",
     71        "  event_type: ?string,\n",
     72        "  event_headline: ?string,\n",
     73        "  legal_amount: ?float64,\n",
     74        "  legal_units: ?string,\n",
     75        "  legal_entity: ?string,\n",
     76        "  event_rating: ?float64,\n",
     77        "  timestamp: datetime,\n",
     78        "  sid: ?int64\n",
     79        "  }\"\"\")"
     80       ]
     81      },
     82      "execution_count": 3,
     83      "metadata": {},
     84      "output_type": "execute_result"
     85     }
     86    ],
     87    "source": [
     88     "# Let's use Blaze's dshape attribute to understand the structure of the data\n",
     89     "legal_and_regulatory.dshape"
     90    ]
     91   },
     92   {
     93    "cell_type": "code",
     94    "execution_count": 4,
     95    "metadata": {
     96     "collapsed": false
     97    },
     98    "outputs": [
     99     {
    100      "data": {
    101       "text/html": [
    102        "8180"
    103       ],
    104       "text/plain": [
    105        "8180"
    106       ]
    107      },
    108      "execution_count": 4,
    109      "metadata": {},
    110      "output_type": "execute_result"
    111     }
    112    ],
    113    "source": [
    114     "# And how many rows are there?\n",
    115     "# N.B. we're using a Blaze function to do this, not len()\n",
    116     "legal_and_regulatory.count()"
    117    ]
    118   },
    119   {
    120    "cell_type": "code",
    121    "execution_count": 5,
    122    "metadata": {
    123     "collapsed": false
    124    },
    125    "outputs": [
    126     {
    127      "data": {
    128       "text/html": [
    129        "<table border=\"1\" class=\"dataframe\">\n",
    130        "  <thead>\n",
    131        "    <tr style=\"text-align: right;\">\n",
    132        "      <th></th>\n",
    133        "      <th>event_id</th>\n",
    134        "      <th>asof_date</th>\n",
    135        "      <th>trade_date</th>\n",
    136        "      <th>symbol</th>\n",
    137        "      <th>event_type</th>\n",
    138        "      <th>event_headline</th>\n",
    139        "      <th>legal_amount</th>\n",
    140        "      <th>legal_units</th>\n",
    141        "      <th>legal_entity</th>\n",
    142        "      <th>event_rating</th>\n",
    143        "      <th>timestamp</th>\n",
    144        "      <th>sid</th>\n",
    145        "    </tr>\n",
    146        "  </thead>\n",
    147        "  <tbody>\n",
    148        "    <tr>\n",
    149        "      <th>0</th>\n",
    150        "      <td>77848</td>\n",
    151        "      <td>2007-01-05</td>\n",
    152        "      <td>2007-01-08</td>\n",
    153        "      <td>AMAT</td>\n",
    154        "      <td>Legal/Regulatory</td>\n",
    155        "      <td>Applied Materials' Review under the HSR Antitr...</td>\n",
    156        "      <td>0.0</td>\n",
    157        "      <td>NaN</td>\n",
    158        "      <td>NaN</td>\n",
    159        "      <td>1</td>\n",
    160        "      <td>2007-01-06</td>\n",
    161        "      <td>337</td>\n",
    162        "    </tr>\n",
    163        "    <tr>\n",
    164        "      <th>1</th>\n",
    165        "      <td>148666</td>\n",
    166        "      <td>2007-01-05</td>\n",
    167        "      <td>2007-01-05</td>\n",
    168        "      <td>FCS</td>\n",
    169        "      <td>Legal/Regulatory</td>\n",
    170        "      <td>Fairchild Semiconductor Appeals Ruling in ZTE ...</td>\n",
    171        "      <td>8.4</td>\n",
    172        "      <td>$M</td>\n",
    173        "      <td>Zhongxing Telecom Ltd</td>\n",
    174        "      <td>1</td>\n",
    175        "      <td>2007-01-06</td>\n",
    176        "      <td>20486</td>\n",
    177        "    </tr>\n",
    178        "    <tr>\n",
    179        "      <th>2</th>\n",
    180        "      <td>77994</td>\n",
    181        "      <td>2007-01-09</td>\n",
    182        "      <td>2007-01-09</td>\n",
    183        "      <td>XLNX</td>\n",
    184        "      <td>Legal/Regulatory</td>\n",
    185        "      <td>Xilinx announces dismissal of shareholder deri...</td>\n",
    186        "      <td>0.0</td>\n",
    187        "      <td>NaN</td>\n",
    188        "      <td>NaN</td>\n",
    189        "      <td>1</td>\n",
    190        "      <td>2007-01-10</td>\n",
    191        "      <td>8344</td>\n",
    192        "    </tr>\n",
    193        "  </tbody>\n",
    194        "</table>"
    195       ],
    196       "text/plain": [
    197        "   event_id  asof_date trade_date symbol        event_type  \\\n",
    198        "0     77848 2007-01-05 2007-01-08   AMAT  Legal/Regulatory   \n",
    199        "1    148666 2007-01-05 2007-01-05    FCS  Legal/Regulatory   \n",
    200        "2     77994 2007-01-09 2007-01-09   XLNX  Legal/Regulatory   \n",
    201        "\n",
    202        "                                      event_headline  legal_amount  \\\n",
    203        "0  Applied Materials' Review under the HSR Antitr...           0.0   \n",
    204        "1  Fairchild Semiconductor Appeals Ruling in ZTE ...           8.4   \n",
    205        "2  Xilinx announces dismissal of shareholder deri...           0.0   \n",
    206        "\n",
    207        "  legal_units           legal_entity  event_rating  timestamp    sid  \n",
    208        "0         NaN                    NaN             1 2007-01-06    337  \n",
    209        "1          $M  Zhongxing Telecom Ltd             1 2007-01-06  20486  \n",
    210        "2         NaN                    NaN             1 2007-01-10   8344  "
    211       ]
    212      },
    213      "execution_count": 5,
    214      "metadata": {},
    215      "output_type": "execute_result"
    216     }
    217    ],
    218    "source": [
    219     "# Let's see what the data looks like. We'll grab the first three rows.\n",
    220     "legal_and_regulatory[:3]"
    221    ]
    222   },
    223   {
    224    "cell_type": "markdown",
    225    "metadata": {},
    226    "source": [
    227     "Let's go over the columns:\n",
    228     "- **event_id**: the unique identifier for this event.\n",
    229     "- **asof_date**: EventVestor's timestamp of event capture.\n",
    230     "- **trade_date**: for event announcements made before trading ends, trade_date is the same as event_date. For announcements issued after market close, trade_date is the next market open day.\n",
    231     "- **symbol**: stock ticker symbol of the affected company.\n",
    232     "- **event_type**: this should always be *Legal/Regulatory*.\n",
    233     "- **event_headline**: a brief description of the event.\n",
    234     "- **legal_amount**: amount mentioned in the case, if any.\n",
    235     "- **legal_units**: units of the legal_amount, most commonly millions of dollars.\n",
    236     "- **legal_entity**: the related entity in the legal case.\n",
    237     "- **event_rating**: this is always 1. Its meaning is uncertain.\n",
    238     "- **timestamp**: our timestamp of when we registered the data.\n",
    239     "- **sid**: the equity's unique identifier. Use this instead of the symbol."
    240    ]
    241   },
    242   {
    243    "cell_type": "markdown",
    244    "metadata": {},
    245    "source": [
    246     "We've done much of the data processing for you. Fields like `timestamp` and `sid` are standardized across all our Store Datasets, so the datasets are easy to combine. We have standardized the `sid` across all our equity databases.\n",
    247     "\n",
    248     "We can select columns and rows with ease. Below, we'll fetch all 2014 events involving General Motors."
    249    ]
    250   },
    251   {
    252    "cell_type": "code",
    253    "execution_count": 6,
    254    "metadata": {
    255     "collapsed": false,
    256     "scrolled": true
    257    },
    258    "outputs": [
    259     {
    260      "data": {
    261       "text/html": [
    262        "<table border=\"1\" class=\"dataframe\">\n",
    263        "  <thead>\n",
    264        "    <tr style=\"text-align: right;\">\n",
    265        "      <th></th>\n",
    266        "      <th>event_id</th>\n",
    267        "      <th>asof_date</th>\n",
    268        "      <th>trade_date</th>\n",
    269        "      <th>symbol</th>\n",
    270        "      <th>event_type</th>\n",
    271        "      <th>event_headline</th>\n",
    272        "      <th>legal_amount</th>\n",
    273        "      <th>legal_units</th>\n",
    274        "      <th>legal_entity</th>\n",
    275        "      <th>event_rating</th>\n",
    276        "      <th>timestamp</th>\n",
    277        "      <th>sid</th>\n",
    278        "    </tr>\n",
    279        "  </thead>\n",
    280        "  <tbody>\n",
    281        "    <tr>\n",
    282        "      <th>0</th>\n",
    283        "      <td>1695334</td>\n",
    284        "      <td>2014-03-19</td>\n",
    285        "      <td>2014-03-20</td>\n",
    286        "      <td>GM</td>\n",
    287        "      <td>Legal/Regulatory</td>\n",
    288        "      <td>General Motors Co. Faces Class Action Lawsuit ...</td>\n",
    289        "      <td>0</td>\n",
    290        "      <td>NaN</td>\n",
    291        "      <td>Hagens Berman</td>\n",
    292        "      <td>1</td>\n",
    293        "      <td>2014-03-20</td>\n",
    294        "      <td>40430</td>\n",
    295        "    </tr>\n",
    296        "    <tr>\n",
    297        "      <th>1</th>\n",
    298        "      <td>1696156</td>\n",
    299        "      <td>2014-03-21</td>\n",
    300        "      <td>2014-03-21</td>\n",
    301        "      <td>GM</td>\n",
    302        "      <td>Legal/Regulatory</td>\n",
    303        "      <td>General Motors Co. Faces Class Action Lawsuit</td>\n",
    304        "      <td>0</td>\n",
    305        "      <td>NaN</td>\n",
    306        "      <td>Pomerantz LLP</td>\n",
    307        "      <td>1</td>\n",
    308        "      <td>2014-03-22</td>\n",
    309        "      <td>40430</td>\n",
    310        "    </tr>\n",
    311        "    <tr>\n",
    312        "      <th>2</th>\n",
    313        "      <td>1703401</td>\n",
    314        "      <td>2014-04-22</td>\n",
    315        "      <td>2014-04-23</td>\n",
    316        "      <td>GM</td>\n",
    317        "      <td>Legal/Regulatory</td>\n",
    318        "      <td>General Motors Company Faces Class Action Laws...</td>\n",
    319        "      <td>0</td>\n",
    320        "      <td>NaN</td>\n",
    321        "      <td>Law Offices of Howard G. Smith; Glancy Binkow ...</td>\n",
    322        "      <td>1</td>\n",
    323        "      <td>2014-04-23</td>\n",
    324        "      <td>40430</td>\n",
    325        "    </tr>\n",
    326        "    <tr>\n",
    327        "      <th>3</th>\n",
    328        "      <td>1725393</td>\n",
    329        "      <td>2014-05-16</td>\n",
    330        "      <td>2014-05-19</td>\n",
    331        "      <td>GM</td>\n",
    332        "      <td>Legal/Regulatory</td>\n",
    333        "      <td>General Motors Co. Faces Class Action Lawsuits</td>\n",
    334        "      <td>0</td>\n",
    335        "      <td>NaN</td>\n",
    336        "      <td>Pomerantz LLP</td>\n",
    337        "      <td>1</td>\n",
    338        "      <td>2014-05-17</td>\n",
    339        "      <td>40430</td>\n",
    340        "    </tr>\n",
    341        "    <tr>\n",
    342        "      <th>4</th>\n",
    343        "      <td>1725455</td>\n",
    344        "      <td>2014-05-17</td>\n",
    345        "      <td>2014-05-19</td>\n",
    346        "      <td>GM</td>\n",
    347        "      <td>Legal/Regulatory</td>\n",
    348        "      <td>General Motors to Pay $35M Fine Over Delay to ...</td>\n",
    349        "      <td>35</td>\n",
    350        "      <td>$M</td>\n",
    351        "      <td>NaN</td>\n",
    352        "      <td>1</td>\n",
    353        "      <td>2014-05-18</td>\n",
    354        "      <td>40430</td>\n",
    355        "    </tr>\n",
    356        "    <tr>\n",
    357        "      <th>5</th>\n",
    358        "      <td>1736421</td>\n",
    359        "      <td>2014-06-18</td>\n",
    360        "      <td>2014-06-18</td>\n",
    361        "      <td>GM</td>\n",
    362        "      <td>Legal/Regulatory</td>\n",
    363        "      <td>General Motors Co. Faces Class Action Lawsuit</td>\n",
    364        "      <td>0</td>\n",
    365        "      <td>NaN</td>\n",
    366        "      <td>Hagens Berman Sobol Shapiro</td>\n",
    367        "      <td>1</td>\n",
    368        "      <td>2014-06-19</td>\n",
    369        "      <td>40430</td>\n",
    370        "    </tr>\n",
    371        "    <tr>\n",
    372        "      <th>6</th>\n",
    373        "      <td>1768189</td>\n",
    374        "      <td>2014-08-09</td>\n",
    375        "      <td>2014-08-09</td>\n",
    376        "      <td>GM</td>\n",
    377        "      <td>Legal/Regulatory</td>\n",
    378        "      <td>Judge Rejects General Motors Co. Motion to Dis...</td>\n",
    379        "      <td>0</td>\n",
    380        "      <td>NaN</td>\n",
    381        "      <td>NaN</td>\n",
    382        "      <td>1</td>\n",
    383        "      <td>2014-08-10</td>\n",
    384        "      <td>40430</td>\n",
    385        "    </tr>\n",
    386        "  </tbody>\n",
    387        "</table>"
    388       ],
    389       "text/plain": [
    390        "   event_id  asof_date trade_date symbol        event_type  \\\n",
    391        "0   1695334 2014-03-19 2014-03-20     GM  Legal/Regulatory   \n",
    392        "1   1696156 2014-03-21 2014-03-21     GM  Legal/Regulatory   \n",
    393        "2   1703401 2014-04-22 2014-04-23     GM  Legal/Regulatory   \n",
    394        "3   1725393 2014-05-16 2014-05-19     GM  Legal/Regulatory   \n",
    395        "4   1725455 2014-05-17 2014-05-19     GM  Legal/Regulatory   \n",
    396        "5   1736421 2014-06-18 2014-06-18     GM  Legal/Regulatory   \n",
    397        "6   1768189 2014-08-09 2014-08-09     GM  Legal/Regulatory   \n",
    398        "\n",
    399        "                                      event_headline  legal_amount  \\\n",
    400        "0  General Motors Co. Faces Class Action Lawsuit ...             0   \n",
    401        "1     General Motors Co. Faces Class Action Lawsuit              0   \n",
    402        "2  General Motors Company Faces Class Action Laws...             0   \n",
    403        "3     General Motors Co. Faces Class Action Lawsuits             0   \n",
    404        "4  General Motors to Pay $35M Fine Over Delay to ...            35   \n",
    405        "5      General Motors Co. Faces Class Action Lawsuit             0   \n",
    406        "6  Judge Rejects General Motors Co. Motion to Dis...             0   \n",
    407        "\n",
    408        "  legal_units                                       legal_entity  \\\n",
    409        "0         NaN                                      Hagens Berman   \n",
    410        "1         NaN                                      Pomerantz LLP   \n",
    411        "2         NaN  Law Offices of Howard G. Smith; Glancy Binkow ...   \n",
    412        "3         NaN                                      Pomerantz LLP   \n",
    413        "4          $M                                                NaN   \n",
    414        "5         NaN                        Hagens Berman Sobol Shapiro   \n",
    415        "6         NaN                                                NaN   \n",
    416        "\n",
    417        "   event_rating  timestamp    sid  \n",
    418        "0             1 2014-03-20  40430  \n",
    419        "1             1 2014-03-22  40430  \n",
    420        "2             1 2014-04-23  40430  \n",
    421        "3             1 2014-05-17  40430  \n",
    422        "4             1 2014-05-18  40430  \n",
    423        "5             1 2014-06-19  40430  \n",
    424        "6             1 2014-08-10  40430  "
    425       ]
    426      },
    427      "execution_count": 6,
    428      "metadata": {},
    429      "output_type": "execute_result"
    430     }
    431    ],
    432    "source": [
    433     "# get GM's sid first\n",
    434     "gm_sid = symbols('GM').sid\n",
    435     "cases = legal_and_regulatory[('2013-12-31' < legal_and_regulatory['asof_date']) & \n",
    436     "                        (legal_and_regulatory['asof_date'] <'2015-01-01') & \n",
    437     "                        (legal_and_regulatory.sid == gm_sid)]\n",
    438     "# When displaying a Blaze Data Object, the printout is automatically truncated to ten rows.\n",
    439     "cases.sort('asof_date')"
    440    ]
    441   },
    442   {
    443    "cell_type": "markdown",
    444    "metadata": {},
    445    "source": [
    446     "Now suppose we want a DataFrame of the Blaze Data Object above, but only want entries with a non-zero `legal_amount`. Further, we want to drop the `event_rating` and `event_type` columns."
    447    ]
    448   },
    449   {
    450    "cell_type": "code",
    451    "execution_count": 7,
    452    "metadata": {
    453     "collapsed": false
    454    },
    455    "outputs": [
    456     {
    457      "data": {
    458       "text/html": [
    459        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
    460        "<table border=\"1\" class=\"dataframe\">\n",
    461        "  <thead>\n",
    462        "    <tr style=\"text-align: right;\">\n",
    463        "      <th></th>\n",
    464        "      <th>event_id</th>\n",
    465        "      <th>asof_date</th>\n",
    466        "      <th>trade_date</th>\n",
    467        "      <th>symbol</th>\n",
    468        "      <th>event_headline</th>\n",
    469        "      <th>legal_amount</th>\n",
    470        "      <th>legal_units</th>\n",
    471        "      <th>legal_entity</th>\n",
    472        "      <th>timestamp</th>\n",
    473        "      <th>sid</th>\n",
    474        "    </tr>\n",
    475        "  </thead>\n",
    476        "  <tbody>\n",
    477        "    <tr>\n",
    478        "      <th>4</th>\n",
    479        "      <td>1725455</td>\n",
    480        "      <td>2014-05-17</td>\n",
    481        "      <td>2014-05-19</td>\n",
    482        "      <td>GM</td>\n",
    483        "      <td>General Motors to Pay $35M Fine Over Delay to ...</td>\n",
    484        "      <td>35</td>\n",
    485        "      <td>$M</td>\n",
    486        "      <td>NaN</td>\n",
    487        "      <td>2014-05-18</td>\n",
    488        "      <td>40430</td>\n",
    489        "    </tr>\n",
    490        "  </tbody>\n",
    491        "</table>\n",
    492        "</div>"
    493       ],
    494       "text/plain": [
    495        "   event_id  asof_date trade_date symbol  \\\n",
    496        "4   1725455 2014-05-17 2014-05-19     GM   \n",
    497        "\n",
    498        "                                      event_headline  legal_amount  \\\n",
    499        "4  General Motors to Pay $35M Fine Over Delay to ...            35   \n",
    500        "\n",
    501        "  legal_units legal_entity  timestamp    sid  \n",
    502        "4          $M          NaN 2014-05-18  40430  "
    503       ]
    504      },
    505      "execution_count": 7,
    506      "metadata": {},
    507      "output_type": "execute_result"
    508     }
    509    ],
    510    "source": [
    511     "df = odo(cases, pd.DataFrame)\n",
    512     "df = df[df.legal_amount > 0]\n",
    513     "df = df.drop(['event_type', 'event_rating'], axis=1)\n",
    514     "df"
    515    ]
    516   }
    517  ],
    518  "metadata": {
    519   "kernelspec": {
    520    "display_name": "Python 2",
    521    "language": "python",
    522    "name": "python2"
    523   },
    524   "language_info": {
    525    "codemirror_mode": {
    526     "name": "ipython",
    527     "version": 2
    528    },
    529    "file_extension": ".py",
    530    "mimetype": "text/x-python",
    531    "name": "python",
    532    "nbconvert_exporter": "python",
    533    "pygments_lexer": "ipython2",
    534    "version": "2.7.10"
    535   }
    536  },
    537  "nbformat": 4,
    538  "nbformat_minor": 0
    539 }