ml-finance-python
python scripts for finance machine learning
git clone https://9o.is/git/ml-finance-python.git
preprocessing.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word vectors from SEC filings using gensim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we will learn word and phrase vectors from annual SEC filings using gensim to illustrate the potential value of word embeddings for algorithmic trading. In the following sections, we will combine these vectors as features with price returns to train neural networks to predict equity prices from the content of security filings.\n",
    "\n",
    "In particular, we use a dataset containing over 22,000 10-K annual reports from the period 2013-2016 that are filed by listed companies and contain both financial information and management commentary (see chapter 3 on Alternative Data). For about half of these filings (roughly 11K), we have stock prices for the filing company that let us label the data for predictive modeling."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports & Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:35:17.862176Z",
     "start_time": "2018-12-08T22:35:17.757049Z"
    }
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from time import time\n",
    "from collections import Counter\n",
    "import logging\n",
    "import spacy\n",
    "from dateutil.relativedelta import relativedelta\n",
    "from gensim.models import Word2Vec\n",
    "from gensim.models.phrases import Phrases, Phraser\n",
    "from gensim.models.word2vec import LineSentence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:26:08.716608Z",
     "start_time": "2018-12-08T22:26:08.713845Z"
    }
   },
   "outputs": [],
   "source": [
    "pd.set_option('display.expand_frame_repr', False)\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_time(t):\n",
    "    m, s = divmod(t, 60)\n",
    "    h, m = divmod(m, 60)\n",
    "    return '{:02.0f}:{:02.0f}:{:02.0f}'.format(h, m, s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logging Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:26:09.622852Z",
     "start_time": "2018-12-08T22:26:09.618313Z"
    }
   },
   "outputs": [],
   "source": [
    "logging.basicConfig(\n",
    "    filename='preprocessing.log',\n",
    "    level=logging.DEBUG,\n",
    "    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',\n",
    "    datefmt='%H:%M:%S')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Paths"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each filing is a separate text file, and a master index contains the filing metadata. We extract the most informative sections, namely\n",
    "- Item 1 and 1A: Business and Risk Factors\n",
    "- Item 7 and 7A: Management's Discussion and Disclosures about Market Risks\n",
    "\n",
    "This notebook shows how to parse and tokenize the text using spaCy, similar to the approach in chapter 14. We do not lemmatize the tokens, in order to preserve nuances of word usage.\n",
    "\n",
    "We use gensim to detect phrases. The Phrases module scores the tokens and the Phraser class transforms the text data accordingly. Repeating the process on its own output creates longer phrases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:05:44.659946Z",
     "start_time": "2018-12-08T22:05:44.650955Z"
    }
   },
   "outputs": [],
   "source": [
    "filing_path = Path('data/filings')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sections_path = Path('data/sections')\n",
    "if not sections_path.exists():\n",
    "    sections_path.mkdir(exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Identify Sections"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i, filing in enumerate(filing_path.glob('*.txt')):\n",
    "    if i % 500 == 0:\n",
    "        print(i, end=' ', flush=True)\n",
    "    filing_id = int(filing.stem)\n",
    "    items = {}\n",
    "    # Sections are delimited by '°'; keep those that start with 'item <id>'\n",
    "    for section in filing.read_text().lower().split('°'):\n",
    "        if section.startswith('item '):\n",
    "            if len(section.split()) > 1:\n",
    "                # Second token is the item id, e.g. '1a.' -> '1a'\n",
    "                item = section.split()[1].replace('.', '').replace(':', '').replace(',', '')\n",
    "                text = ' '.join(section.split()[2:])\n",
    "                # Keep the longest text seen for each item\n",
    "                if items.get(item) is None or len(items.get(item)) < len(text):\n",
    "                    items[item] = text\n",
    "\n",
    "    txt = pd.Series(items).reset_index()\n",
    "    txt.columns = ['item', 'text']\n",
    "    txt.to_csv(sections_path / (filing.stem + '.csv'), index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parse Sections"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select the following sections:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:15:15.102683Z",
     "start_time": "2018-12-08T22:15:15.100109Z"
    }
   },
   "outputs": [],
   "source": [
    "sections = ['1', '1a', '7', '7a']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_path = Path('data/selected_sections')\n",
    "if not clean_path.exists():\n",
    "    clean_path.mkdir(exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nlp = spacy.load('en', disable=['ner'])\n",
    "nlp.max_length = 6000000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab = Counter()\n",
    "t = total_tokens = 0\n",
    "stats = []\n",
    "\n",
    "# Number of files to process; needed for the time-remaining estimate below\n",
    "to_do = len(list(sections_path.glob('*.csv')))\n",
    "start = time()\n",
    "done = 1\n",
    "for text_file in sections_path.glob('*.csv'):\n",
    "    file_id = int(text_file.stem)\n",
    "    clean_file = clean_path / f'{file_id}.csv'\n",
    "    # Skip filings that were already processed\n",
    "    if clean_file.exists():\n",
    "        continue\n",
    "    items = pd.read_csv(text_file).dropna()\n",
    "    items.item = items.item.astype(str)\n",
    "    items = items[items.item.isin(sections)]\n",
    "    if done % 100 == 0:\n",
    "        duration = time() - start\n",
    "        to_go = (to_do - done) * duration / done\n",
    "        print(f'{done:>5}\\t{format_time(duration)}\\t{total_tokens / duration:,.0f}\\t{format_time(to_go)}')\n",
    "\n",
    "    clean_doc = []\n",
    "    for _, (item, text) in items.iterrows():\n",
    "        doc = nlp(text)\n",
    "        for s, sentence in enumerate(doc.sents):\n",
    "            clean_sentence = []\n",
    "            if sentence is not None:\n",
    "                for t, token in enumerate(sentence, 1):\n",
    "                    # Keep alphabetic, non-stopword tokens only\n",
    "                    if not any([token.is_stop,\n",
    "                                token.is_digit,\n",
    "                                not token.is_alpha,\n",
    "                                token.is_punct,\n",
    "                                token.is_space,\n",
    "                                token.lemma_ == '-PRON-',\n",
    "                                token.pos_ in ['PUNCT', 'SYM', 'X']]):\n",
    "                        clean_sentence.append(token.text.lower())\n",
    "                total_tokens += t\n",
    "                if len(clean_sentence) > 0:\n",
    "                    clean_doc.append([item, s, ' '.join(clean_sentence)])\n",
    "    (pd.DataFrame(clean_doc,\n",
    "                  columns=['item', 'sentence', 'text'])\n",
    "     .dropna()\n",
    "     .to_csv(clean_file, index=False))\n",
    "    done += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create ngrams"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:36:42.347622Z",
     "start_time": "2018-12-08T22:36:42.343529Z"
    }
   },
   "outputs": [],
   "source": [
    "ngram_path = Path('data', 'ngrams')\n",
    "stats_path = Path('corpus_stats')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T22:36:57.526969Z",
     "start_time": "2018-12-08T22:36:57.522768Z"
    }
   },
   "outputs": [],
   "source": [
    "def create_unigrams(min_length=3):\n",
    "    texts = []\n",
    "    sentence_counter = Counter()\n",
    "    unigrams = ngram_path / 'ngrams_1.txt'\n",
    "    vocab = Counter()\n",
    "    # Read the cleaned sentences written to clean_path by the parsing step\n",
    "    for f in clean_path.glob('*.csv'):\n",
    "        df = pd.read_csv(f)\n",
    "        df.item = df.item.astype(str)\n",
    "        df = df[df.item.isin(sections)]\n",
    "        sentence_counter.update(df.groupby('item').size().to_dict())\n",
    "        for sentence in df.text.str.split().tolist():\n",
    "            # Drop sentences with fewer than min_length tokens\n",
    "            if len(sentence) >= min_length:\n",
    "                vocab.update(sentence)\n",
    "                texts.append(' '.join(sentence))\n",
    "    (pd.DataFrame(sentence_counter.most_common(),\n",
    "                  columns=['item', 'sentences'])\n",
    "     .to_csv(stats_path / 'selected_sentences.csv', index=False))\n",
    "    (pd.DataFrame(vocab.most_common(), columns=['token', 'n'])\n",
    "     .to_csv(stats_path / 'sections_vocab.csv', index=False))\n",
    "    unigrams.write_text('\\n'.join(texts))\n",
    "    return [l.split() for l in texts]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time()\n",
    "# Same file that create_unigrams() writes\n",
    "unigrams = ngram_path / 'ngrams_1.txt'\n",
    "if not unigrams.exists():\n",
    "    texts = create_unigrams()\n",
    "else:\n",
    "    texts = [l.split() for l in unigrams.open()]\n",
    "print('Reading: ', format_time(time() - start))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_ngrams(max_length=3):\n",
    "    \"\"\"Using gensim to create ngrams\"\"\"\n",
    "\n",
    "    n_grams = pd.DataFrame()\n",
    "    start = time()\n",
    "    for n in range(2, max_length + 1):\n",
    "        print(n, end=' ', flush=True)\n",
    "\n",
    "        # Read from ngram_path, where create_unigrams() stores ngrams_1.txt\n",
    "        sentences = LineSentence(str(ngram_path / f'ngrams_{n - 1}.txt'))\n",
    "        phrases = Phrases(sentences=sentences,\n",
    "                          min_count=25,  # ignore terms with a lower count\n",
    "                          threshold=0.5,  # accept phrases with higher score\n",
    "                          max_vocab_size=40000000,  # prune less common words to limit memory use\n",
    "                          delimiter=b'_',  # how to join ngram tokens\n",
    "                          progress_per=50000,  # log progress every 50,000 sentences\n",
    "                          scoring='npmi')\n",
    "\n",
    "        # gensim 3.x: export_phrases yields (phrase as bytes, score) pairs\n",
    "        s = pd.DataFrame([[k.decode('utf-8'), v]\n",
    "                          for k, v in phrases.export_phrases(sentences)],\n",
    "                         columns=['phrase', 'score']).assign(length=n)\n",
    "\n",
    "        n_grams = pd.concat([n_grams, s])\n",
    "        grams = Phraser(phrases)\n",
    "        sentences = grams[sentences]\n",
    "        (ngram_path / f'ngrams_{n}.txt').write_text('\\n'.join([' '.join(s) for s in sentences]))\n",
    "\n",
    "    n_grams = n_grams.sort_values('score', ascending=False)\n",
    "    n_grams.phrase = n_grams.phrase.str.replace('_', ' ')\n",
    "    n_grams['ngram'] = n_grams.phrase.str.replace(' ', '_')\n",
    "\n",
    "    with pd.HDFStore('vocab.h5') as store:\n",
    "        store.put('ngrams', n_grams)\n",
    "\n",
    "    print('\\n\\tDuration: ', format_time(time() - start))\n",
    "    print('\\tngrams: {:,d}\\n'.format(len(n_grams)))\n",
    "    print(n_grams.groupby('length').size())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_ngrams()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inspect Corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:46:12.167011Z",
     "start_time": "2018-12-08T23:46:12.054686Z"
    }
   },
   "outputs": [],
   "source": [
    "ngrams = pd.read_parquet('corpus_stats/ngrams.parquet')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:46:12.428814Z",
     "start_time": "2018-12-08T23:46:12.358566Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 721562 entries, 10742145 to 4887103\n",
      "Data columns (total 3 columns):\n",
      "phrase    721562 non-null object\n",
      "score     721562 non-null float64\n",
      "length    721562 non-null int64\n",
      "dtypes: float64(1), int64(1), object(1)\n",
      "memory usage: 22.0+ MB\n"
     ]
    }
   ],
   "source": [
    "ngrams.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:47:20.650064Z",
     "start_time": "2018-12-08T23:47:20.551220Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    721562.000000\n",
       "mean          0.631225\n",
       "std           0.125067\n",
       "min           0.500000\n",
       "10%           0.512507\n",
       "20%           0.526746\n",
       "30%           0.543690\n",
       "40%           0.564299\n",
       "50%           0.589516\n",
       "60%           0.621228\n",
       "70%           0.663055\n",
       "80%           0.722132\n",
       "90%           0.824150\n",
       "max           1.000000\n",
       "Name: score, dtype: float64"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "percentiles = np.arange(.1, 1, .1).round(2)\n",
    "ngrams.score.describe(percentiles=percentiles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-10T07:56:42.135744Z",
     "start_time": "2018-12-10T07:56:42.086001Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>phrase</th>\n",
       "      <th>score</th>\n",
       "      <th>length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>13138522</th>\n",
       "      <td>topsoe uop</td>\n",
       "      <td>0.700002</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22155584</th>\n",
       "      <td>aastra prairiefyre</td>\n",
       "      <td>0.700009</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21581977</th>\n",
       "      <td>sre tre</td>\n",
       "      <td>0.700009</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9717859</th>\n",
       "      <td>twp nng</td>\n",
       "      <td>0.700017</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1507180</th>\n",
       "      <td>ecomobile telkonet</td>\n",
       "      <td>0.700017</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26474295</th>\n",
       "      <td>knsd kxas</td>\n",
       "      <td>0.700017</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17960106</th>\n",
       "      <td>oxalate ssri</td>\n",
       "      <td>0.700017</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6936430</th>\n",
       "      <td>swirl estimote</td>\n",
       "      <td>0.700017</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25398447</th>\n",
       "      <td>gdtna gdte</td>\n",
       "      <td>0.700024</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14638108</th>\n",
       "      <td>chun guang</td>\n",
       "      <td>0.700024</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      phrase     score  length\n",
       "13138522          topsoe uop  0.700002       2\n",
       "22155584  aastra prairiefyre  0.700009       2\n",
       "21581977             sre tre  0.700009       2\n",
       "9717859              twp nng  0.700017       2\n",
       "1507180   ecomobile telkonet  0.700017       2\n",
       "26474295           knsd kxas  0.700017       2\n",
       "17960106        oxalate ssri  0.700017       2\n",
       "6936430       swirl estimote  0.700017       2\n",
       "25398447          gdtna gdte  0.700024       2\n",
       "14638108          chun guang  0.700024       2"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ngrams[ngrams.score > .7].sort_values(['length', 'score']).head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:49:20.481793Z",
     "start_time": "2018-12-08T23:49:20.399896Z"
    }
   },
   "outputs": [],
   "source": [
    "vocab = pd.read_csv('corpus_stats/sections_vocab.csv').dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:49:21.447127Z",
     "start_time": "2018-12-08T23:49:21.429999Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 201443 entries, 0 to 201444\n",
      "Data columns (total 2 columns):\n",
      "token    201443 non-null object\n",
      "n        201443 non-null int64\n",
      "dtypes: int64(1), object(1)\n",
      "memory usage: 4.6+ MB\n"
     ]
    }
   ],
   "source": [
    "vocab.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-12-08T23:49:26.121094Z",
     "start_time": "2018-12-08T23:49:26.087771Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count     201443\n",
       "mean        1440\n",
       "std        22366\n",
       "min            1\n",
       "10%            1\n",
       "20%            2\n",
       "30%            3\n",
       "40%            4\n",
       "50%            7\n",
       "60%           12\n",
       "70%           24\n",
       "80%           61\n",
       "90%          260\n",
       "max      2576751\n",
       "Name: n, dtype: int64"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocab.n.describe(percentiles).astype(int)"
   ]
  },
693 "cell_type": "code",
694 "execution_count": 57,
695 "metadata": {
696 "ExecuteTime": {
697 "end_time": "2018-12-08T23:52:32.605872Z",
698 "start_time": "2018-12-08T23:51:25.921419Z"
699 }
700 },
701 "outputs": [],
702 "source": [
703 "tokens = Counter()\n",
704 "for l in Path('data', 'ngrams', 'ngrams_3.txt').open():\n",
705 " tokens.update(l.split())"
706 ]
707 },
708 {
709 "cell_type": "code",
710 "execution_count": 58,
711 "metadata": {
712 "ExecuteTime": {
713 "end_time": "2018-12-08T23:52:33.446549Z",
714 "start_time": "2018-12-08T23:52:33.151560Z"
715 }
716 },
717 "outputs": [],
718 "source": [
719 "tokens = pd.DataFrame(tokens.most_common(),\n",
720 " columns=['token', 'count'])"
721 ]
722 },
723 {
724 "cell_type": "code",
725 "execution_count": 59,
726 "metadata": {
727 "ExecuteTime": {
728 "end_time": "2018-12-08T23:52:33.550537Z",
729 "start_time": "2018-12-08T23:52:33.489729Z"
730 }
731 },
732 "outputs": [
733 {
734 "name": "stdout",
735 "output_type": "stream",
736 "text": [
737 "<class 'pandas.core.frame.DataFrame'>\n",
738 "RangeIndex: 664963 entries, 0 to 664962\n",
739 "Data columns (total 2 columns):\n",
740 "token 664963 non-null object\n",
741 "count 664963 non-null int64\n",
742 "dtypes: int64(1), object(1)\n",
743 "memory usage: 10.1+ MB\n"
744 ]
745 }
746 ],
747 "source": [
748 "tokens.info()"
749 ]
750 },
751 {
752 "cell_type": "code",
753 "execution_count": 60,
754 "metadata": {
755 "ExecuteTime": {
756 "end_time": "2018-12-08T23:52:41.859378Z",
757 "start_time": "2018-12-08T23:52:41.542641Z"
758 }
759 },
760 "outputs": [
761 {
762 "data": {
763 "text/plain": [
764 "count 546779\n",
765 "mean 56\n",
766 "std 1947\n",
767 "min 1\n",
768 "10% 1\n",
769 "20% 1\n",
770 "30% 2\n",
771 "40% 2\n",
772 "50% 3\n",
773 "60% 3\n",
774 "70% 4\n",
775 "80% 6\n",
776 "90% 13\n",
777 "max 513694\n",
778 "Name: count, dtype: int64"
779 ]
780 },
781 "execution_count": 60,
782 "metadata": {},
783 "output_type": "execute_result"
784 }
785 ],
786 "source": [
787 "tokens.loc[tokens.token.str.contains('_'), 'count'].describe(percentiles).astype(int)"
788 ]
789 },
790 {
791 "cell_type": "code",
792 "execution_count": 74,
793 "metadata": {
794 "ExecuteTime": {
795 "end_time": "2018-12-10T07:57:44.279871Z",
796 "start_time": "2018-12-10T07:57:43.976999Z"
797 }
798 },
799 "outputs": [],
800 "source": [
801 "tokens[tokens.token.str.contains('_')].head(20).to_csv('ngram_examples.csv', index=False)"
802 ]
803 },
804 {
805 "cell_type": "markdown",
806 "metadata": {},
807 "source": [
808 "## Get returns"
809 ]
810 },
811 {
812 "cell_type": "code",
813 "execution_count": null,
814 "metadata": {},
815 "outputs": [],
816 "source": [
817 "with pd.HDFStore('../data/assets.h5') as store:\n",
818 " stocks = store['quandl/wiki/stocks']\n",
819 " prices = store['quandl/wiki/prices'].adj_close"
820 ]
821 },
822 {
823 "cell_type": "code",
824 "execution_count": null,
825 "metadata": {},
826 "outputs": [],
827 "source": [
828 "sec = pd.read_csv('data/report_index.csv').rename(columns=str.lower)\n",
829 "sec.date_filed = pd.to_datetime(sec.date_filed)"
830 ]
831 },
832 {
833 "cell_type": "code",
834 "execution_count": null,
835 "metadata": {},
836 "outputs": [],
837 "source": [
838 "idx = pd.IndexSlice"
839 ]
840 },
841 {
842 "cell_type": "code",
843 "execution_count": null,
844 "metadata": {},
845 "outputs": [],
846 "source": [
847 "first = sec.date_filed.min() + relativedelta(months=-1)\n",
848 "last = sec.date_filed.max() + relativedelta(months=1)\n",
849 "prices = (prices\n",
850 " .loc[idx[first:last, :]]\n",
851 " .unstack().resample('D')\n",
852 " .ffill()\n",
853 " .dropna(how='all', axis=1)\n",
854 " .filter(sec.ticker.unique()))"
855 ]
856 },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy the selection so the returns column can be added without a pandas warning\n",
    "sec = sec.loc[sec.ticker.isin(prices.columns), ['ticker', 'date_filed']].copy()\n",
    "\n",
    "price_data = []\n",
    "for ticker, date in sec.values.tolist():\n",
    "    # One-month forward return, measured from the filing date\n",
    "    target = date + relativedelta(months=1)\n",
    "    s = prices.loc[date: target, ticker]\n",
    "    price_data.append(s.iloc[-1] / s.iloc[0] - 1)\n",
    "\n",
    "df = pd.DataFrame(price_data,\n",
    "                  columns=['returns'],\n",
    "                  index=sec.index)\n",
    "\n",
    "print(df.returns.describe())\n",
    "sec['returns'] = price_data\n",
    "sec.info()\n",
    "sec.dropna().to_csv('data/sec_returns.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
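
The notebook's `create_ngrams` function asks gensim's `Phrases` to score candidate phrases with `scoring='npmi'` and keeps those above `threshold=0.5`. As a rough illustration of what that score measures, here is a minimal from-scratch sketch of normalized pointwise mutual information for adjacent token pairs. It is not gensim's implementation; the helper names `npmi_bigram_scores` and `merge_phrases`, the greedy left-to-right merge, and the toy sentences are assumptions made for this example only.

```python
from collections import Counter
from math import log


def npmi_bigram_scores(sentences, min_count=1):
    """Score adjacent token pairs with normalized pointwise mutual information.

    Hypothetical helper for illustration; probabilities are estimated as
    count / total number of tokens in the corpus.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    total = sum(unigram_counts.values())
    scores = {}
    for (a, b), n_ab in bigram_counts.items():
        if n_ab < min_count:
            continue  # ignore rare pairs, analogous in spirit to min_count
        p_a = unigram_counts[a] / total
        p_b = unigram_counts[b] / total
        p_ab = n_ab / total
        # NPMI lies in [-1, 1]; 1 means the two tokens only occur together
        scores[(a, b)] = log(p_ab / (p_a * p_b)) / -log(p_ab)
    return scores


def merge_phrases(tokens, scores, threshold=0.5, delimiter='_'):
    """Greedily join adjacent tokens whose score exceeds the threshold."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, -1.0) > threshold:
            merged.append(delimiter.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy corpus (made up for this sketch)
sentences = [['new', 'york', 'office'],
             ['office', 'in', 'new', 'york'],
             ['office', 'lease']]
scores = npmi_bigram_scores(sentences)
print(merge_phrases(['new', 'york', 'office'], scores, threshold=0.7))
# ['new_york', 'office']
```

Running the merge again on its own output would let an already joined token such as `new_york` pair with a neighbor, which is the idea behind the notebook's loop over `n` that builds longer phrases from the previous pass.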