ml-finance-python

python scripts for finance machine learning

git clone https://9o.is/git/ml-finance-python.git

preprocessing.ipynb

(25693B)


      1 {
      2  "cells": [
      3   {
      4    "cell_type": "markdown",
      5    "metadata": {},
      6    "source": [
      7     "# Word vectors from SEC filings using gensim"
      8    ]
      9   },
     10   {
     11    "cell_type": "markdown",
     12    "metadata": {},
     13    "source": [
     14     "In this section, we will learn word and phrase vectors from annual SEC filings using gensim to illustrate the potential value of word embeddings for algorithmic trading. In the following sections, we will combine these vectors as features with price returns to train neural networks to predict equity prices from the content of security filings.\n",
     15     "\n",
    "In particular, we use a dataset containing over 22,000 10-K annual reports from the period 2013-2016 that are filed by listed companies and contain both financial information and management commentary (see chapter 3 on Alternative Data). For about half of these filings (roughly 11K), we have stock prices for the filing companies, which we use to label the data for predictive modeling."
     17    ]
     18   },
     19   {
     20    "cell_type": "markdown",
     21    "metadata": {},
     22    "source": [
     23     "## Imports & Settings"
     24    ]
     25   },
     26   {
     27    "cell_type": "code",
     28    "execution_count": 2,
     29    "metadata": {
     30     "ExecuteTime": {
     31      "end_time": "2018-12-08T22:35:17.862176Z",
     32      "start_time": "2018-12-08T22:35:17.757049Z"
     33     }
     34    },
     35    "outputs": [],
     36    "source": [
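    "from pathlib import Path\n",
    "from time import time\n",
    "from collections import Counter\n",
    "import logging\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import spacy  # used below for tokenization and sentence splitting\n",
    "from dateutil.relativedelta import relativedelta  # used below for date offsets\n",
    "from gensim.models import Word2Vec\n",
    "from gensim.models.phrases import Phrases, Phraser  # used below for ngram detection\n",
    "from gensim.models.word2vec import LineSentence"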
     45    ]
     46   },
     47   {
     48    "cell_type": "code",
     49    "execution_count": 3,
     50    "metadata": {
     51     "ExecuteTime": {
     52      "end_time": "2018-12-08T22:26:08.716608Z",
     53      "start_time": "2018-12-08T22:26:08.713845Z"
     54     }
     55    },
     56    "outputs": [],
     57    "source": [
     58     "pd.set_option('display.expand_frame_repr', False)\n",
     59     "np.random.seed(42)"
     60    ]
     61   },
     62   {
     63    "cell_type": "code",
     64    "execution_count": null,
     65    "metadata": {},
     66    "outputs": [],
     67    "source": [
     68     "def format_time(t):\n",
     69     "    m, s = divmod(t, 60)\n",
     70     "    h, m = divmod(m, 60)\n",
     71     "    return '{:02.0f}:{:02.0f}:{:02.0f}'.format(h, m, s)"
     72    ]
     73   },
     74   {
     75    "cell_type": "markdown",
     76    "metadata": {},
     77    "source": [
     78     "### Logging Setup"
     79    ]
     80   },
     81   {
     82    "cell_type": "code",
     83    "execution_count": 4,
     84    "metadata": {
     85     "ExecuteTime": {
     86      "end_time": "2018-12-08T22:26:09.622852Z",
     87      "start_time": "2018-12-08T22:26:09.618313Z"
     88     }
     89    },
     90    "outputs": [],
     91    "source": [
     92     "logging.basicConfig(\n",
     93     "        filename='preprocessing.log',\n",
     94     "        level=logging.DEBUG,\n",
     95     "        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',\n",
     96     "        datefmt='%H:%M:%S')"
     97    ]
     98   },
     99   {
    100    "cell_type": "markdown",
    101    "metadata": {},
    102    "source": [
    103     "### Paths"
    104    ]
    105   },
    106   {
    107    "cell_type": "markdown",
    108    "metadata": {},
    109    "source": [
    110     "Each filing is a separate text file and a master index contains filing metadata. We extract the most informative sections, namely\n",
    111     "- Item 1 and 1A: Business and Risk Factors\n",
    112     "- Item 7 and 7A: Management's Discussion and Disclosures about Market Risks\n",
    113     "\n",
    "This notebook shows how to parse and tokenize the text using spaCy, similar to the approach in chapter 14. We do not lemmatize the tokens to preserve nuances of word usage.\n",
    115     "\n",
    116     "We use gensim to detect phrases. The Phrases module scores the tokens and the Phraser class transforms the text data accordingly. The notebook shows how to repeat the process to create longer phrases."
    117    ]
    118   },
    119   {
    120    "cell_type": "code",
    121    "execution_count": 80,
    122    "metadata": {
    123     "ExecuteTime": {
    124      "end_time": "2018-12-08T22:05:44.659946Z",
    125      "start_time": "2018-12-08T22:05:44.650955Z"
    126     }
    127    },
    128    "outputs": [],
    129    "source": [
    130     "filing_path = Path('data/filings')"
    131    ]
    132   },
    133   {
    134    "cell_type": "code",
    135    "execution_count": null,
    136    "metadata": {},
    137    "outputs": [],
    138    "source": [
    139     "sections_path = Path('data/sections')\n",
    140     "if not sections_path.exists():\n",
    141     "    sections_path.mkdir(exist_ok=True)"
    142    ]
    143   },
    144   {
    145    "cell_type": "markdown",
    146    "metadata": {},
    147    "source": [
    148     "## Identify Sections"
    149    ]
    150   },
    151   {
    152    "cell_type": "code",
    153    "execution_count": null,
    154    "metadata": {},
    155    "outputs": [],
    156    "source": [
    "for i, filing in enumerate(filing_path.glob('*.txt')):\n",
    "    if i % 500 == 0:\n",
    "        print(i, end=' ', flush=True)\n",
    "    filing_id = int(filing.stem)\n",
    "    items = {}\n",
    "    # split the filing on the '°' delimiter and keep parts that start with 'item '\n",
    "    for section in filing.read_text().lower().split('°'):\n",
    "        if section.startswith('item '):\n",
    "            if len(section.split()) > 1:\n",
    "                # normalize the item label, e.g. '1a.' -> '1a'\n",
    "                item = section.split()[1].replace('.', '').replace(':', '').replace(',', '')\n",
    "                text = ' '.join(section.split()[2:])\n",
    "                # keep the longest text seen for each item label\n",
    "                if items.get(item) is None or len(items.get(item)) < len(text):\n",
    "                    items[item] = text\n",
    "\n",
    "    txt = pd.Series(items).reset_index()\n",
    "    txt.columns = ['item', 'text']\n",
    "    txt.to_csv(sections_path / (filing.stem + '.csv'), index=False)"
    173    ]
    174   },
    175   {
    176    "cell_type": "markdown",
    177    "metadata": {},
    178    "source": [
    179     "## Parse Sections"
    180    ]
    181   },
    182   {
    183    "cell_type": "markdown",
    184    "metadata": {},
    185    "source": [
    186     "Select the following sections:"
    187    ]
    188   },
    189   {
    190    "cell_type": "code",
    191    "execution_count": 81,
    192    "metadata": {
    193     "ExecuteTime": {
    194      "end_time": "2018-12-08T22:15:15.102683Z",
    195      "start_time": "2018-12-08T22:15:15.100109Z"
    196     }
    197    },
    198    "outputs": [],
    199    "source": [
    200     "sections = ['1', '1a', '7', '7a']"
    201    ]
    202   },
    203   {
    204    "cell_type": "code",
    205    "execution_count": null,
    206    "metadata": {},
    207    "outputs": [],
    208    "source": [
    209     "clean_path = Path('data/selected_sections')\n",
    210     "if not clean_path.exists():\n",
    211     "    clean_path.mkdir(exist_ok=True)"
    212    ]
    213   },
    214   {
    215    "cell_type": "code",
    216    "execution_count": null,
    217    "metadata": {},
    218    "outputs": [],
    219    "source": [
    220     "nlp = spacy.load('en', disable=['ner'])\n",
    221     "nlp.max_length = 6000000"
    222    ]
    223   },
    224   {
    225    "cell_type": "code",
    226    "execution_count": null,
    227    "metadata": {},
    228    "outputs": [],
    229    "source": [
    "vocab = Counter()\n",
    "total_tokens = 0\n",
    "stats = []\n",
    "\n",
    "# assumption: to_do is the number of section files; it only feeds the time estimate\n",
    "to_do = len(list(sections_path.glob('*.csv')))\n",
    "start = time()\n",
    "done = 1\n",
    "for text_file in sections_path.glob('*.csv'):\n",
    "    file_id = int(text_file.stem)\n",
    "    clean_file = clean_path / f'{file_id}.csv'\n",
    "    # skip filings that were already processed in an earlier run\n",
    "    if clean_file.exists():\n",
    "        continue\n",
    "    items = pd.read_csv(text_file).dropna()\n",
    "    items.item = items.item.astype(str)\n",
    "    items = items[items.item.isin(sections)]\n",
    "    if done % 100 == 0:\n",
    "        duration = time() - start\n",
    "        to_go = (to_do - done) * duration / done\n",
    "        print(f'{done:>5}\\t{format_time(duration)}\\t{total_tokens / duration:,.0f}\\t{format_time(to_go)}')\n",
    "\n",
    "    clean_doc = []\n",
    "    for _, (item, text) in items.iterrows():\n",
    "        doc = nlp(text)\n",
    "        for s, sentence in enumerate(doc.sents):\n",
    "            clean_sentence = []\n",
    "            if sentence is not None:\n",
    "                for token in sentence:\n",
    "                    # keep alphabetic tokens that are not stop words, digits, or punctuation\n",
    "                    if not any([token.is_stop,\n",
    "                                token.is_digit,\n",
    "                                not token.is_alpha,\n",
    "                                token.is_punct,\n",
    "                                token.is_space,\n",
    "                                token.lemma_ == '-PRON-',\n",
    "                                token.pos_ in ['PUNCT', 'SYM', 'X']]):\n",
    "                        clean_sentence.append(token.text.lower())\n",
    "                # count all tokens in the sentence, including the filtered ones\n",
    "                total_tokens += len(sentence)\n",
    "                if len(clean_sentence) > 0:\n",
    "                    clean_doc.append([item, s, ' '.join(clean_sentence)])\n",
    "    (pd.DataFrame(clean_doc,\n",
    "                  columns=['item', 'sentence', 'text'])\n",
    "     .dropna()\n",
    "     .to_csv(clean_file, index=False))\n",
    "    done += 1"
    272    ]
    273   },
    274   {
    275    "cell_type": "markdown",
    276    "metadata": {},
    277    "source": [
    278     "## Create ngrams"
    279    ]
    280   },
    281   {
    282    "cell_type": "code",
    283    "execution_count": 4,
    284    "metadata": {
    285     "ExecuteTime": {
    286      "end_time": "2018-12-08T22:36:42.347622Z",
    287      "start_time": "2018-12-08T22:36:42.343529Z"
    288     }
    289    },
    290    "outputs": [],
    291    "source": [
    292     "ngram_path = Path('data', 'ngrams')\n",
    293     "stats_path = Path('corpus_stats')"
    294    ]
    295   },
    296   {
    297    "cell_type": "code",
    298    "execution_count": 5,
    299    "metadata": {
    300     "ExecuteTime": {
    301      "end_time": "2018-12-08T22:36:57.526969Z",
    302      "start_time": "2018-12-08T22:36:57.522768Z"
    303     }
    304    },
    305    "outputs": [],
    306    "source": [
    "def create_unigrams(min_length=3):\n",
    "    \"\"\"Collect the cleaned sentences of the selected sections into one file\"\"\"\n",
    "    texts = []\n",
    "    sentence_counter = Counter()\n",
    "    # make sure the output directories exist before writing\n",
    "    ngram_path.mkdir(parents=True, exist_ok=True)\n",
    "    stats_path.mkdir(parents=True, exist_ok=True)\n",
    "    unigrams = ngram_path / 'ngrams_1.txt'\n",
    "    vocab = Counter()\n",
    "    # read the cleaned filings written to clean_path in the previous step\n",
    "    for f in clean_path.glob('*.csv'):\n",
    "        df = pd.read_csv(f)\n",
    "        df.item = df.item.astype(str)\n",
    "        df = df[df.item.isin(sections)]\n",
    "        sentence_counter.update(df.groupby('item').size().to_dict())\n",
    "        for sentence in df.text.str.split().tolist():\n",
    "            # drop very short sentences\n",
    "            if len(sentence) >= min_length:\n",
    "                vocab.update(sentence)\n",
    "                texts.append(' '.join(sentence))\n",
    "    (pd.DataFrame(sentence_counter.most_common(),\n",
    "                  columns=['item', 'sentences'])\n",
    "     .to_csv(stats_path / 'selected_sentences.csv', index=False))\n",
    "    (pd.DataFrame(vocab.most_common(), columns=['token', 'n'])\n",
    "     .to_csv(stats_path / 'sections_vocab.csv', index=False))\n",
    "    unigrams.write_text('\\n'.join(texts))\n",
    "    return [l.split() for l in texts]"
    328    ]
    329   },
    330   {
    331    "cell_type": "code",
    332    "execution_count": null,
    333    "metadata": {},
    334    "outputs": [],
    335    "source": [
    "start = time()\n",
    "# same file that create_unigrams() writes\n",
    "unigrams = ngram_path / 'ngrams_1.txt'\n",
    "if not unigrams.exists():\n",
    "    texts = create_unigrams()\n",
    "else:\n",
    "    texts = [l.split() for l in unigrams.open()]\n",
    "print('Reading: ', format_time(time() - start))"
    342    ]
    343   },
    344   {
    345    "cell_type": "code",
    346    "execution_count": null,
    347    "metadata": {},
    348    "outputs": [],
    349    "source": [
    350     "def create_ngrams(max_length=3):\n",
    351     "    \"\"\"Using gensim to create ngrams\"\"\"\n",
    352     "\n",
    353     "    n_grams = pd.DataFrame()\n",
    354     "    start = time()\n",
    355     "    for n in range(2, max_length + 1):\n",
    356     "        print(n, end=' ', flush=True)\n",
    357     "\n",
    "        # read the previous round's output from the ngram directory\n",
    "        sentences = LineSentence(str(ngram_path / f'ngrams_{n - 1}.txt'))\n",
    "        phrases = Phrases(sentences=sentences,\n",
    "                          min_count=25,  # ignore terms with a lower count\n",
    "                          threshold=0.5,  # accept phrases with higher score\n",
    "                          max_vocab_size=40000000,  # prune less common words to limit memory use\n",
    "                          delimiter=b'_',  # how to join ngram tokens (bytes in gensim 3.x)\n",
    "                          progress_per=50000,  # log progress every 50,000 sentences\n",
    "                          scoring='npmi')\n",
    "\n",
    "        # gensim 3.x API: export_phrases yields (bytes phrase, score) pairs\n",
    "        s = pd.DataFrame([[k.decode('utf-8'), v]\n",
    "                          for k, v in phrases.export_phrases(sentences)],\n",
    "                         columns=['phrase', 'score']).assign(length=n)\n",
    "\n",
    "        n_grams = pd.concat([n_grams, s])\n",
    "        grams = Phraser(phrases)\n",
    "        sentences = grams[sentences]\n",
    "        # write the transformed text next to the unigram file for the next round\n",
    "        (ngram_path / f'ngrams_{n}.txt').write_text('\\n'.join([' '.join(s) for s in sentences]))\n",
    375     "\n",
    376     "    n_grams = n_grams.sort_values('score', ascending=False)\n",
    377     "    n_grams.phrase = n_grams.phrase.str.replace('_', ' ')\n",
    378     "    n_grams['ngram'] = n_grams.phrase.str.replace(' ', '_')\n",
    379     "\n",
    380     "    with pd.HDFStore('vocab.h5') as store:\n",
    381     "        store.put('ngrams', n_grams)\n",
    382     "\n",
    383     "    print('\\n\\tDuration: ', format_time(time() - start))\n",
    384     "    print('\\tngrams: {:,d}\\n'.format(len(n_grams)))\n",
    385     "    print(n_grams.groupby('length').size())"
    386    ]
    387   },
    388   {
    389    "cell_type": "code",
    390    "execution_count": null,
    391    "metadata": {},
    392    "outputs": [],
    393    "source": [
    394     "create_ngrams()"
    395    ]
    396   },
    397   {
    398    "cell_type": "markdown",
    399    "metadata": {},
    400    "source": [
    401     "## Inspect Corpus"
    402    ]
    403   },
    404   {
    405    "cell_type": "code",
    406    "execution_count": 40,
    407    "metadata": {
    408     "ExecuteTime": {
    409      "end_time": "2018-12-08T23:46:12.167011Z",
    410      "start_time": "2018-12-08T23:46:12.054686Z"
    411     }
    412    },
    413    "outputs": [],
    414    "source": [
    415     "ngrams = pd.read_parquet('corpus_stats/ngrams.parquet')"
    416    ]
    417   },
    418   {
    419    "cell_type": "code",
    420    "execution_count": 41,
    421    "metadata": {
    422     "ExecuteTime": {
    423      "end_time": "2018-12-08T23:46:12.428814Z",
    424      "start_time": "2018-12-08T23:46:12.358566Z"
    425     }
    426    },
    427    "outputs": [
    428     {
    429      "name": "stdout",
    430      "output_type": "stream",
    431      "text": [
    432       "<class 'pandas.core.frame.DataFrame'>\n",
    433       "Int64Index: 721562 entries, 10742145 to 4887103\n",
    434       "Data columns (total 3 columns):\n",
    435       "phrase    721562 non-null object\n",
    436       "score     721562 non-null float64\n",
    437       "length    721562 non-null int64\n",
    438       "dtypes: float64(1), int64(1), object(1)\n",
    439       "memory usage: 22.0+ MB\n"
    440      ]
    441     }
    442    ],
    443    "source": [
    444     "ngrams.info()"
    445    ]
    446   },
    447   {
    448    "cell_type": "code",
    449    "execution_count": 46,
    450    "metadata": {
    451     "ExecuteTime": {
    452      "end_time": "2018-12-08T23:47:20.650064Z",
    453      "start_time": "2018-12-08T23:47:20.551220Z"
    454     }
    455    },
    456    "outputs": [
    457     {
    458      "data": {
    459       "text/plain": [
    460        "count    721562.000000\n",
    461        "mean          0.631225\n",
    462        "std           0.125067\n",
    463        "min           0.500000\n",
    464        "10%           0.512507\n",
    465        "20%           0.526746\n",
    466        "30%           0.543690\n",
    467        "40%           0.564299\n",
    468        "50%           0.589516\n",
    469        "60%           0.621228\n",
    470        "70%           0.663055\n",
    471        "80%           0.722132\n",
    472        "90%           0.824150\n",
    473        "max           1.000000\n",
    474        "Name: score, dtype: float64"
    475       ]
    476      },
    477      "execution_count": 46,
    478      "metadata": {},
    479      "output_type": "execute_result"
    480     }
    481    ],
    482    "source": [
    483     "percentiles=np.arange(.1, 1, .1).round(2)\n",
    484     "ngrams.score.describe(percentiles=percentiles)"
    485    ]
    486   },
    487   {
    488    "cell_type": "code",
    489    "execution_count": 72,
    490    "metadata": {
    491     "ExecuteTime": {
    492      "end_time": "2018-12-10T07:56:42.135744Z",
    493      "start_time": "2018-12-10T07:56:42.086001Z"
    494     }
    495    },
    496    "outputs": [
    497     {
    498      "data": {
    499       "text/html": [
    500        "<div>\n",
    501        "<style scoped>\n",
    502        "    .dataframe tbody tr th:only-of-type {\n",
    503        "        vertical-align: middle;\n",
    504        "    }\n",
    505        "\n",
    506        "    .dataframe tbody tr th {\n",
    507        "        vertical-align: top;\n",
    508        "    }\n",
    509        "\n",
    510        "    .dataframe thead th {\n",
    511        "        text-align: right;\n",
    512        "    }\n",
    513        "</style>\n",
    514        "<table border=\"1\" class=\"dataframe\">\n",
    515        "  <thead>\n",
    516        "    <tr style=\"text-align: right;\">\n",
    517        "      <th></th>\n",
    518        "      <th>phrase</th>\n",
    519        "      <th>score</th>\n",
    520        "      <th>length</th>\n",
    521        "    </tr>\n",
    522        "  </thead>\n",
    523        "  <tbody>\n",
    524        "    <tr>\n",
    525        "      <th>13138522</th>\n",
    526        "      <td>topsoe uop</td>\n",
    527        "      <td>0.700002</td>\n",
    528        "      <td>2</td>\n",
    529        "    </tr>\n",
    530        "    <tr>\n",
    531        "      <th>22155584</th>\n",
    532        "      <td>aastra prairiefyre</td>\n",
    533        "      <td>0.700009</td>\n",
    534        "      <td>2</td>\n",
    535        "    </tr>\n",
    536        "    <tr>\n",
    537        "      <th>21581977</th>\n",
    538        "      <td>sre tre</td>\n",
    539        "      <td>0.700009</td>\n",
    540        "      <td>2</td>\n",
    541        "    </tr>\n",
    542        "    <tr>\n",
    543        "      <th>9717859</th>\n",
    544        "      <td>twp nng</td>\n",
    545        "      <td>0.700017</td>\n",
    546        "      <td>2</td>\n",
    547        "    </tr>\n",
    548        "    <tr>\n",
    549        "      <th>1507180</th>\n",
    550        "      <td>ecomobile telkonet</td>\n",
    551        "      <td>0.700017</td>\n",
    552        "      <td>2</td>\n",
    553        "    </tr>\n",
    554        "    <tr>\n",
    555        "      <th>26474295</th>\n",
    556        "      <td>knsd kxas</td>\n",
    557        "      <td>0.700017</td>\n",
    558        "      <td>2</td>\n",
    559        "    </tr>\n",
    560        "    <tr>\n",
    561        "      <th>17960106</th>\n",
    562        "      <td>oxalate ssri</td>\n",
    563        "      <td>0.700017</td>\n",
    564        "      <td>2</td>\n",
    565        "    </tr>\n",
    566        "    <tr>\n",
    567        "      <th>6936430</th>\n",
    568        "      <td>swirl estimote</td>\n",
    569        "      <td>0.700017</td>\n",
    570        "      <td>2</td>\n",
    571        "    </tr>\n",
    572        "    <tr>\n",
    573        "      <th>25398447</th>\n",
    574        "      <td>gdtna gdte</td>\n",
    575        "      <td>0.700024</td>\n",
    576        "      <td>2</td>\n",
    577        "    </tr>\n",
    578        "    <tr>\n",
    579        "      <th>14638108</th>\n",
    580        "      <td>chun guang</td>\n",
    581        "      <td>0.700024</td>\n",
    582        "      <td>2</td>\n",
    583        "    </tr>\n",
    584        "  </tbody>\n",
    585        "</table>\n",
    586        "</div>"
    587       ],
    588       "text/plain": [
    589        "                      phrase     score  length\n",
    590        "13138522          topsoe uop  0.700002       2\n",
    591        "22155584  aastra prairiefyre  0.700009       2\n",
    592        "21581977             sre tre  0.700009       2\n",
    593        "9717859              twp nng  0.700017       2\n",
    594        "1507180   ecomobile telkonet  0.700017       2\n",
    595        "26474295           knsd kxas  0.700017       2\n",
    596        "17960106        oxalate ssri  0.700017       2\n",
    597        "6936430       swirl estimote  0.700017       2\n",
    598        "25398447          gdtna gdte  0.700024       2\n",
    599        "14638108          chun guang  0.700024       2"
    600       ]
    601      },
    602      "execution_count": 72,
    603      "metadata": {},
    604      "output_type": "execute_result"
    605     }
    606    ],
    607    "source": [
    608     "ngrams[ngrams.score>.7].sort_values(['length', 'score']).head(10)"
    609    ]
    610   },
    611   {
    612    "cell_type": "code",
    613    "execution_count": 51,
    614    "metadata": {
    615     "ExecuteTime": {
    616      "end_time": "2018-12-08T23:49:20.481793Z",
    617      "start_time": "2018-12-08T23:49:20.399896Z"
    618     }
    619    },
    620    "outputs": [],
    621    "source": [
    622     "vocab = pd.read_csv('corpus_stats/sections_vocab.csv').dropna()"
    623    ]
    624   },
    625   {
    626    "cell_type": "code",
    627    "execution_count": 52,
    628    "metadata": {
    629     "ExecuteTime": {
    630      "end_time": "2018-12-08T23:49:21.447127Z",
    631      "start_time": "2018-12-08T23:49:21.429999Z"
    632     }
    633    },
    634    "outputs": [
    635     {
    636      "name": "stdout",
    637      "output_type": "stream",
    638      "text": [
    639       "<class 'pandas.core.frame.DataFrame'>\n",
    640       "Int64Index: 201443 entries, 0 to 201444\n",
    641       "Data columns (total 2 columns):\n",
    642       "token    201443 non-null object\n",
    643       "n        201443 non-null int64\n",
    644       "dtypes: int64(1), object(1)\n",
    645       "memory usage: 4.6+ MB\n"
    646      ]
    647     }
    648    ],
    649    "source": [
    650     "vocab.info()"
    651    ]
    652   },
    653   {
    654    "cell_type": "code",
    655    "execution_count": 53,
    656    "metadata": {
    657     "ExecuteTime": {
    658      "end_time": "2018-12-08T23:49:26.121094Z",
    659      "start_time": "2018-12-08T23:49:26.087771Z"
    660     }
    661    },
    662    "outputs": [
    663     {
    664      "data": {
    665       "text/plain": [
    666        "count     201443\n",
    667        "mean        1440\n",
    668        "std        22366\n",
    669        "min            1\n",
    670        "10%            1\n",
    671        "20%            2\n",
    672        "30%            3\n",
    673        "40%            4\n",
    674        "50%            7\n",
    675        "60%           12\n",
    676        "70%           24\n",
    677        "80%           61\n",
    678        "90%          260\n",
    679        "max      2576751\n",
    680        "Name: n, dtype: int64"
    681       ]
    682      },
    683      "execution_count": 53,
    684      "metadata": {},
    685      "output_type": "execute_result"
    686     }
    687    ],
    688    "source": [
    689     "vocab.n.describe(percentiles).astype(int)"
    690    ]
    691   },
    692   {
    693    "cell_type": "code",
    694    "execution_count": 57,
    695    "metadata": {
    696     "ExecuteTime": {
    697      "end_time": "2018-12-08T23:52:32.605872Z",
    698      "start_time": "2018-12-08T23:51:25.921419Z"
    699     }
    700    },
    701    "outputs": [],
    702    "source": [
    703     "tokens = Counter()\n",
    704     "for l in Path('data', 'ngrams', 'ngrams_3.txt').open():\n",
    705     "    tokens.update(l.split())"
    706    ]
    707   },
    708   {
    709    "cell_type": "code",
    710    "execution_count": 58,
    711    "metadata": {
    712     "ExecuteTime": {
    713      "end_time": "2018-12-08T23:52:33.446549Z",
    714      "start_time": "2018-12-08T23:52:33.151560Z"
    715     }
    716    },
    717    "outputs": [],
    718    "source": [
    719     "tokens = pd.DataFrame(tokens.most_common(),\n",
    720     "                     columns=['token', 'count'])"
    721    ]
    722   },
    723   {
    724    "cell_type": "code",
    725    "execution_count": 59,
    726    "metadata": {
    727     "ExecuteTime": {
    728      "end_time": "2018-12-08T23:52:33.550537Z",
    729      "start_time": "2018-12-08T23:52:33.489729Z"
    730     }
    731    },
    732    "outputs": [
    733     {
    734      "name": "stdout",
    735      "output_type": "stream",
    736      "text": [
    737       "<class 'pandas.core.frame.DataFrame'>\n",
    738       "RangeIndex: 664963 entries, 0 to 664962\n",
    739       "Data columns (total 2 columns):\n",
    740       "token    664963 non-null object\n",
    741       "count    664963 non-null int64\n",
    742       "dtypes: int64(1), object(1)\n",
    743       "memory usage: 10.1+ MB\n"
    744      ]
    745     }
    746    ],
    747    "source": [
    748     "tokens.info()"
    749    ]
    750   },
    751   {
    752    "cell_type": "code",
    753    "execution_count": 60,
    754    "metadata": {
    755     "ExecuteTime": {
    756      "end_time": "2018-12-08T23:52:41.859378Z",
    757      "start_time": "2018-12-08T23:52:41.542641Z"
    758     }
    759    },
    760    "outputs": [
    761     {
    762      "data": {
    763       "text/plain": [
    764        "count    546779\n",
    765        "mean         56\n",
    766        "std        1947\n",
    767        "min           1\n",
    768        "10%           1\n",
    769        "20%           1\n",
    770        "30%           2\n",
    771        "40%           2\n",
    772        "50%           3\n",
    773        "60%           3\n",
    774        "70%           4\n",
    775        "80%           6\n",
    776        "90%          13\n",
    777        "max      513694\n",
    778        "Name: count, dtype: int64"
    779       ]
    780      },
    781      "execution_count": 60,
    782      "metadata": {},
    783      "output_type": "execute_result"
    784     }
    785    ],
    786    "source": [
    787     "tokens.loc[tokens.token.str.contains('_'), 'count'].describe(percentiles).astype(int)"
    788    ]
    789   },
    790   {
    791    "cell_type": "code",
    792    "execution_count": 74,
    793    "metadata": {
    794     "ExecuteTime": {
    795      "end_time": "2018-12-10T07:57:44.279871Z",
    796      "start_time": "2018-12-10T07:57:43.976999Z"
    797     }
    798    },
    799    "outputs": [],
    800    "source": [
    801     "tokens[tokens.token.str.contains('_')].head(20).to_csv('ngram_examples.csv', index=False)"
    802    ]
    803   },
    804   {
    805    "cell_type": "markdown",
    806    "metadata": {},
    807    "source": [
    808     "## Get returns"
    809    ]
    810   },
    811   {
    812    "cell_type": "code",
    813    "execution_count": null,
    814    "metadata": {},
    815    "outputs": [],
    816    "source": [
    817     "with pd.HDFStore('../data/assets.h5') as store:\n",
    818     "    stocks = store['quandl/wiki/stocks']\n",
    819     "    prices = store['quandl/wiki/prices'].adj_close"
    820    ]
    821   },
    822   {
    823    "cell_type": "code",
    824    "execution_count": null,
    825    "metadata": {},
    826    "outputs": [],
    827    "source": [
    828     "sec = pd.read_csv('data/report_index.csv').rename(columns=str.lower)\n",
    829     "sec.date_filed = pd.to_datetime(sec.date_filed)"
    830    ]
    831   },
    832   {
    833    "cell_type": "code",
    834    "execution_count": null,
    835    "metadata": {},
    836    "outputs": [],
    837    "source": [
    838     "idx = pd.IndexSlice"
    839    ]
    840   },
    841   {
    842    "cell_type": "code",
    843    "execution_count": null,
    844    "metadata": {},
    845    "outputs": [],
    846    "source": [
    847     "first = sec.date_filed.min() + relativedelta(months=-1)\n",
    848     "last = sec.date_filed.max() + relativedelta(months=1)\n",
    849     "prices = (prices\n",
    850     "          .loc[idx[first:last, :]]\n",
    851     "          .unstack().resample('D')\n",
    852     "          .ffill()\n",
    853     "          .dropna(how='all', axis=1)\n",
    854     "          .filter(sec.ticker.unique()))"
    855    ]
    856   },
    857   {
    858    "cell_type": "code",
    859    "execution_count": null,
    860    "metadata": {},
    861    "outputs": [],
    862    "source": [
    863     "sec = sec.loc[sec.ticker.isin(prices.columns), ['ticker', 'date_filed']]\n",
    864     "\n",
    865     "price_data = []\n",
    866     "for ticker, date in sec.values.tolist():\n",
    867     "    target = date + relativedelta(months=1)\n",
    868     "    s = prices.loc[date: target, ticker]\n",
    869     "    price_data.append(s.iloc[-1] / s.iloc[0] - 1)\n",
    870     "\n",
    871     "df = pd.DataFrame(price_data,\n",
    872     "                  columns=['returns'],\n",
    873     "                  index=sec.index)\n",
    874     "\n",
    875     "print(df.returns.describe())\n",
    876     "sec['returns'] = price_data\n",
    877     "print(sec.info())\n",
    878     "sec.dropna().to_csv('data/sec_returns.csv', index=False)"
    879    ]
    880   }
    881  ],
    882  "metadata": {
    883   "kernelspec": {
    884    "display_name": "Python 3",
    885    "language": "python",
    886    "name": "python3"
    887   },
    888   "language_info": {
    889    "codemirror_mode": {
    890     "name": "ipython",
    891     "version": 3
    892    },
    893    "file_extension": ".py",
    894    "mimetype": "text/x-python",
    895    "name": "python",
    896    "nbconvert_exporter": "python",
    897    "pygments_lexer": "ipython3",
    898    "version": "3.6.8"
    899   },
    900   "toc": {
    901    "base_numbering": 1,
    902    "nav_menu": {},
    903    "number_sections": true,
    904    "sideBar": true,
    905    "skip_h1_title": false,
    906    "title_cell": "Table of Contents",
    907    "title_sidebar": "Contents",
    908    "toc_cell": false,
    909    "toc_position": {},
    910    "toc_section_display": true,
    911    "toc_window_display": true
    912   }
    913  },
    914  "nbformat": 4,
    915  "nbformat_minor": 2
    916 }
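
The "Create ngrams" step in the notebook above relies on gensim's Phrases with scoring='npmi' and a threshold of 0.5. As a rough illustration of what such a score measures, the following standalone sketch computes normalized pointwise mutual information (NPMI) for adjacent token pairs using only the standard library, then joins qualifying pairs with an underscore. The function names npmi_bigrams and merge_bigrams and the toy sentences are hypothetical; this is not gensim's implementation and it omits details such as vocabulary pruning.

```python
import math
from collections import Counter


def npmi_bigrams(sentences, min_count=1, threshold=0.5):
    """Return {(a, b): npmi} for adjacent token pairs scoring above threshold.

    Hypothetical helper for illustration; probabilities are estimated as
    counts divided by the total number of tokens in the corpus.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    total = sum(unigram_counts.values())
    scores = {}
    for (a, b), n_ab in bigram_counts.items():
        if n_ab < min_count:
            continue  # ignore rare pairs
        p_a = unigram_counts[a] / total
        p_b = unigram_counts[b] / total
        p_ab = n_ab / total
        # NPMI = ln(p_ab / (p_a * p_b)) / -ln(p_ab); at most 1 for these estimates
        score = math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
        if score > threshold:
            scores[(a, b)] = score
    return scores


def merge_bigrams(tokens, phrases, delimiter='_'):
    """Greedily join adjacent tokens that form a detected phrase."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# toy corpus (made up for illustration, not taken from the filings)
sentences = [
    ['risk', 'factors', 'may', 'affect', 'results'],
    ['risk', 'factors', 'include', 'market', 'risk'],
    ['market', 'conditions', 'may', 'change'],
]
phrases = npmi_bigrams(sentences, min_count=2, threshold=0.5)
merged = merge_bigrams(sentences[0], phrases)
print(phrases)
print(merged)
```

Running the merge a second time on the merged output would allow trigrams to form, which mirrors the repeated passes in create_ngrams.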