ml-finance-python
python scripts for finance machine learning
git clone https://9o.is/git/ml-finance-python.git
02_nlp_with_textblob.ipynb
1 {
2 "cells": [
3 {
4 "cell_type": "markdown",
5 "metadata": {},
6 "source": [
7 "# NLP with TextBlob"
8 ]
9 },
10 {
11 "cell_type": "markdown",
12 "metadata": {},
13 "source": [
14 "TextBlob is a Python library that provides a simple API for common NLP tasks. It builds on the Natural Language Toolkit (nltk) and the Pattern web mining library, and supports part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and other tasks."
15 ]
16 },
17 {
18 "cell_type": "markdown",
19 "metadata": {},
20 "source": [
21 "## Imports & Settings"
22 ]
23 },
24 {
25 "cell_type": "code",
26 "execution_count": 1,
27 "metadata": {
28 "ExecuteTime": {
29 "end_time": "2018-11-26T02:54:26.696873Z",
30 "start_time": "2018-11-26T02:54:26.461634Z"
31 }
32 },
33 "outputs": [],
34 "source": [
35 "%matplotlib inline\n",
36 "import warnings\n",
37 "from pathlib import Path\n",
38 "\n",
39 "import numpy as np\n",
40 "import pandas as pd\n",
41 "\n",
42 "# Visualization\n",
43 "import matplotlib.pyplot as plt\n",
44 "import seaborn as sns\n",
45 "\n",
46 "# textblob and nltk for language processing\n",
47 "from textblob import TextBlob, Word\n",
48 "from nltk.stem.snowball import SnowballStemmer\n",
49 "\n",
50 "# sklearn for feature extraction & modeling\n",
51 "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer\n",
52 "from sklearn.model_selection import train_test_split\n",
53 "from sklearn.naive_bayes import MultinomialNB # Naive Bayes\n",
54 "from sklearn.linear_model import LogisticRegression\n",
55 "from sklearn import metrics\n",
56 "import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23"
57 ]
58 },
59 {
60 "cell_type": "code",
61 "execution_count": 2,
62 "metadata": {
63 "ExecuteTime": {
64 "end_time": "2018-11-26T02:54:26.700048Z",
65 "start_time": "2018-11-26T02:54:26.698100Z"
66 }
67 },
68 "outputs": [],
69 "source": [
70 "np.random.seed(42)\n",
71 "pd.set_option('float_format', '{:,.2f}'.format)"
72 ]
73 },
74 {
75 "cell_type": "markdown",
76 "metadata": {},
77 "source": [
78 "## Load BBC Data"
79 ]
80 },
81 {
82 "cell_type": "markdown",
83 "metadata": {},
84 "source": [
85 "To illustrate the use of TextBlob, we sample a BBC sports article with the headline ‘Robinson ready for difficult task’. Similar to spaCy and other libraries, the first step is to pass the document through a pipeline represented by the TextBlob object to assign annotations required for various tasks."
86 ]
87 },
88 {
89 "cell_type": "code",
90 "execution_count": 3,
91 "metadata": {
92 "ExecuteTime": {
93 "end_time": "2018-11-26T02:54:28.476829Z",
94 "start_time": "2018-11-26T02:54:28.400043Z"
95 }
96 },
97 "outputs": [],
98 "source": [
99 "path = Path('data', 'bbc')\n",
100 "files = path.glob('**/*.txt')\n",
101 "doc_list = []\n",
102 "for file in files:\n",
103 "    topic = file.parts[-2]\n",
104 "    article = file.read_text(encoding='latin1').split('\\n')\n",
105 "    heading = article[0].strip()\n",
106 "    body = ' '.join([l.strip() for l in article[1:]]).strip()\n",
107 "    doc_list.append([topic, heading, body])"
108 ]
109 },
110 {
111 "cell_type": "code",
112 "execution_count": 4,
113 "metadata": {
114 "ExecuteTime": {
115 "end_time": "2018-11-26T02:54:28.805397Z",
116 "start_time": "2018-11-26T02:54:28.786497Z"
117 }
118 },
119 "outputs": [
120 {
121 "name": "stdout",
122 "output_type": "stream",
123 "text": [
124 "<class 'pandas.core.frame.DataFrame'>\n",
125 "RangeIndex: 2225 entries, 0 to 2224\n",
126 "Data columns (total 3 columns):\n",
127 "topic 2225 non-null object\n",
128 "heading 2225 non-null object\n",
129 "body 2225 non-null object\n",
130 "dtypes: object(3)\n",
131 "memory usage: 52.2+ KB\n"
132 ]
133 }
134 ],
135 "source": [
136 "docs = pd.DataFrame(doc_list, columns=['topic', 'heading', 'body'])\n",
137 "docs.info()"
138 ]
139 },
140 {
141 "cell_type": "markdown",
142 "metadata": {
143 "ExecuteTime": {
144 "end_time": "2018-11-21T15:03:08.577908Z",
145 "start_time": "2018-11-21T15:03:08.572433Z"
146 }
147 },
148 "source": [
149 "## Introduction to TextBlob\n",
150 "\n",
151 "You should already have installed TextBlob, a Python library for common NLP tasks."
152 ]
153 },
154 {
155 "cell_type": "markdown",
156 "metadata": {},
157 "source": [
158 "### Select random article"
159 ]
160 },
161 {
162 "cell_type": "code",
163 "execution_count": 5,
164 "metadata": {
165 "ExecuteTime": {
166 "end_time": "2018-11-26T02:54:32.080548Z",
167 "start_time": "2018-11-26T02:54:32.072173Z"
168 }
169 },
170 "outputs": [],
171 "source": [
172 "article = docs.sample(1).squeeze()"
173 ]
174 },
175 {
176 "cell_type": "code",
177 "execution_count": 6,
178 "metadata": {
179 "ExecuteTime": {
180 "end_time": "2018-11-26T02:54:32.490561Z",
181 "start_time": "2018-11-26T02:54:32.482638Z"
182 }
183 },
184 "outputs": [
185 {
186 "name": "stdout",
187 "output_type": "stream",
188 "text": [
189 "Topic:\tSport\n",
190 "\n",
191 "Robinson ready for difficult task\n",
192 "\n",
193 "England coach Andy Robinson faces the first major test of his tenure as he tries to get back to winning ways after the Six Nations defeat by Wales. Robinson is likely to make changes in the back row and centre after the 11-9 loss as he contemplates Sunday's set-to with France at Twickenham. Lewis Moody and Martin Corry could both return after missing the game with hamstring and shoulder problems. And the midfield pairing of Mathew Tait and Jamie Noon is also under threat. Olly Barkley immediately allowed England to generate better field position with his kicking game after replacing debutant Tait just before the hour. The Bath fly-half-cum-centre is likely to start against France, with either Tait or Noon dropping out. Tait, given little opportunity to shine in attack, received praise from Robinson afterwards, even if the coach admitted Cardiff was an \"unforgiving place\" for the teenage prodigy. Robinson now has a tricky decision over whether to withdraw from the firing line, after just one outing, a player he regards as central to England's future. Tait himself, at least outwardly, appeared unaffected by the punishing treatment dished out to him by Gavin Henson in particular. \"I want more of that definitely,\" he said. \"Hopefully I can train hard this week and get selected for next week but we'll have to look at the video and wait and see. \"We were playing on our own 22 for a lot of the first half so it was quite difficult. I thought we defended reasonably well but we've just got to pick it up for France.\" His Newcastle team-mate Noon hardly covered himself in glory in his first major Test. He missed a tackle on Michael Owen in the build-up to Wales' try, conceded a penalty at the breakdown, was turned over in another tackle and fumbled Gavin Henson's cross-kick into touch, all inside the first quarter. His contribution improved in the second half, but England clearly need more of a playmaker in the inside centre role. Up front, the line-out remains fallible, despite a superb performance from Chris Jones, whose athleticism came to the fore after stepping into the side for Moody. It is more likely the Leicester flanker will return on the open side for the more physical challenge posed by the French forwards, with Andy Hazell likely to make way. Lock Ben Kay also justified his recall with an impressive all-round display on his return to the side, but elsewhere England positives were thin on the ground.\n"
194 ]
195 }
196 ],
197 "source": [
198 "print(f'Topic:\\t{article.topic.capitalize()}\\n\\n{article.heading}\\n')\n",
199 "print(article.body.strip())"
200 ]
201 },
202 {
203 "cell_type": "code",
204 "execution_count": 7,
205 "metadata": {
206 "ExecuteTime": {
207 "end_time": "2018-11-26T02:54:37.512003Z",
208 "start_time": "2018-11-26T02:54:37.510065Z"
209 }
210 },
211 "outputs": [],
212 "source": [
213 "parsed_body = TextBlob(article.body)"
214 ]
215 },
216 {
217 "cell_type": "markdown",
218 "metadata": {},
219 "source": [
220 "### Tokenization"
221 ]
222 },
223 {
224 "cell_type": "code",
225 "execution_count": 8,
226 "metadata": {
227 "ExecuteTime": {
228 "end_time": "2018-11-26T02:54:38.178003Z",
229 "start_time": "2018-11-26T02:54:38.131744Z"
230 }
231 },
232 "outputs": [
233 {
234 "data": {
235 "text/plain": [
236 "WordList(['England', 'coach', 'Andy', 'Robinson', 'faces', 'the', 'first', 'major', 'test', 'of', 'his', 'tenure', 'as', 'he', 'tries', 'to', 'get', 'back', 'to', 'winning', 'ways', 'after', 'the', 'Six', 'Nations', 'defeat', 'by', 'Wales', 'Robinson', 'is', 'likely', 'to', 'make', 'changes', 'in', 'the', 'back', 'row', 'and', 'centre', 'after', 'the', '11-9', 'loss', 'as', 'he', 'contemplates', 'Sunday', \"'s\", 'set-to', 'with', 'France', 'at', 'Twickenham', 'Lewis', 'Moody', 'and', 'Martin', 'Corry', 'could', 'both', 'return', 'after', 'missing', 'the', 'game', 'with', 'hamstring', 'and', 'shoulder', 'problems', 'And', 'the', 'midfield', 'pairing', 'of', 'Mathew', 'Tait', 'and', 'Jamie', 'Noon', 'is', 'also', 'under', 'threat', 'Olly', 'Barkley', 'immediately', 'allowed', 'England', 'to', 'generate', 'better', 'field', 'position', 'with', 'his', 'kicking', 'game', 'after', 'replacing', 'debutant', 'Tait', 'just', 'before', 'the', 'hour', 'The', 'Bath', 'fly-half-cum-centre', 'is', 'likely', 'to', 'start', 'against', 'France', 'with', 'either', 'Tait', 'or', 'Noon', 'dropping', 'out', 'Tait', 'given', 'little', 'opportunity', 'to', 'shine', 'in', 'attack', 'received', 'praise', 'from', 'Robinson', 'afterwards', 'even', 'if', 'the', 'coach', 'admitted', 'Cardiff', 'was', 'an', 'unforgiving', 'place', 'for', 'the', 'teenage', 'prodigy', 'Robinson', 'now', 'has', 'a', 'tricky', 'decision', 'over', 'whether', 'to', 'withdraw', 'from', 'the', 'firing', 'line', 'after', 'just', 'one', 'outing', 'a', 'player', 'he', 'regards', 'as', 'central', 'to', 'England', \"'s\", 'future', 'Tait', 'himself', 'at', 'least', 'outwardly', 'appeared', 'unaffected', 'by', 'the', 'punishing', 'treatment', 'dished', 'out', 'to', 'him', 'by', 'Gavin', 'Henson', 'in', 'particular', 'I', 'want', 'more', 'of', 'that', 'definitely', 'he', 'said', 'Hopefully', 'I', 'can', 'train', 'hard', 'this', 'week', 'and', 'get', 'selected', 'for', 'next', 'week', 'but', 'we', \"'ll\", 'have', 'to', 'look', 'at', 'the', 'video', 'and', 'wait', 'and', 'see', 'We', 'were', 'playing', 'on', 'our', 'own', '22', 'for', 'a', 'lot', 'of', 'the', 'first', 'half', 'so', 'it', 'was', 'quite', 'difficult', 'I', 'thought', 'we', 'defended', 'reasonably', 'well', 'but', 'we', \"'ve\", 'just', 'got', 'to', 'pick', 'it', 'up', 'for', 'France', 'His', 'Newcastle', 'team-mate', 'Noon', 'hardly', 'covered', 'himself', 'in', 'glory', 'in', 'his', 'first', 'major', 'Test', 'He', 'missed', 'a', 'tackle', 'on', 'Michael', 'Owen', 'in', 'the', 'build-up', 'to', 'Wales', 'try', 'conceded', 'a', 'penalty', 'at', 'the', 'breakdown', 'was', 'turned', 'over', 'in', 'another', 'tackle', 'and', 'fumbled', 'Gavin', 'Henson', \"'s\", 'cross-kick', 'into', 'touch', 'all', 'inside', 'the', 'first', 'quarter', 'His', 'contribution', 'improved', 'in', 'the', 'second', 'half', 'but', 'England', 'clearly', 'need', 'more', 'of', 'a', 'playmaker', 'in', 'the', 'inside', 'centre', 'role', 'Up', 'front', 'the', 'line-out', 'remains', 'fallible', 'despite', 'a', 'superb', 'performance', 'from', 'Chris', 'Jones', 'whose', 'athleticism', 'came', 'to', 'the', 'fore', 'after', 'stepping', 'into', 'the', 'side', 'for', 'Moody', 'It', 'is', 'more', 'likely', 'the', 'Leicester', 'flanker', 'will', 'return', 'on', 'the', 'open', 'side', 'for', 'the', 'more', 'physical', 'challenge', 'posed', 'by', 'the', 'French', 'forwards', 'with', 'Andy', 'Hazell', 'likely', 'to', 'make', 'way', 'Lock', 'Ben', 'Kay', 'also', 'justified', 'his', 'recall', 'with', 'an', 'impressive', 'all-round', 'display', 'on', 'his', 'return', 'to', 'the', 'side', 'but', 'elsewhere', 'England', 'positives', 'were', 'thin', 'on', 'the', 'ground'])"
237 ]
238 },
239 "execution_count": 8,
240 "metadata": {},
241 "output_type": "execute_result"
242 }
243 ],
244 "source": [
245 "parsed_body.words"
246 ]
247 },
248 {
249 "cell_type": "markdown",
250 "metadata": {},
251 "source": [
252 "### Sentence boundary detection"
253 ]
254 },
255 {
256 "cell_type": "code",
257 "execution_count": 9,
258 "metadata": {
259 "ExecuteTime": {
260 "end_time": "2018-11-26T02:54:38.943758Z",
261 "start_time": "2018-11-26T02:54:38.940200Z"
262 }
263 },
264 "outputs": [
265 {
266 "data": {
267 "text/plain": [
268 "[Sentence(\"England coach Andy Robinson faces the first major test of his tenure as he tries to get back to winning ways after the Six Nations defeat by Wales.\"),\n",
269 " Sentence(\"Robinson is likely to make changes in the back row and centre after the 11-9 loss as he contemplates Sunday's set-to with France at Twickenham.\"),\n",
270 " Sentence(\"Lewis Moody and Martin Corry could both return after missing the game with hamstring and shoulder problems.\"),\n",
271 " Sentence(\"And the midfield pairing of Mathew Tait and Jamie Noon is also under threat.\"),\n",
272 " Sentence(\"Olly Barkley immediately allowed England to generate better field position with his kicking game after replacing debutant Tait just before the hour.\"),\n",
273 " Sentence(\"The Bath fly-half-cum-centre is likely to start against France, with either Tait or Noon dropping out.\"),\n",
274 " Sentence(\"Tait, given little opportunity to shine in attack, received praise from Robinson afterwards, even if the coach admitted Cardiff was an \"unforgiving place\" for the teenage prodigy.\"),\n",
275 " Sentence(\"Robinson now has a tricky decision over whether to withdraw from the firing line, after just one outing, a player he regards as central to England's future.\"),\n",
276 " Sentence(\"Tait himself, at least outwardly, appeared unaffected by the punishing treatment dished out to him by Gavin Henson in particular.\"),\n",
277 " Sentence(\"\"I want more of that definitely,\" he said.\"),\n",
278 " Sentence(\"\"Hopefully I can train hard this week and get selected for next week but we'll have to look at the video and wait and see.\"),\n",
279 " Sentence(\"\"We were playing on our own 22 for a lot of the first half so it was quite difficult.\"),\n",
280 " Sentence(\"I thought we defended reasonably well but we've just got to pick it up for France.\"\"),\n",
281 " Sentence(\"His Newcastle team-mate Noon hardly covered himself in glory in his first major Test.\"),\n",
282 " Sentence(\"He missed a tackle on Michael Owen in the build-up to Wales' try, conceded a penalty at the breakdown, was turned over in another tackle and fumbled Gavin Henson's cross-kick into touch, all inside the first quarter.\"),\n",
283 " Sentence(\"His contribution improved in the second half, but England clearly need more of a playmaker in the inside centre role.\"),\n",
284 " Sentence(\"Up front, the line-out remains fallible, despite a superb performance from Chris Jones, whose athleticism came to the fore after stepping into the side for Moody.\"),\n",
285 " Sentence(\"It is more likely the Leicester flanker will return on the open side for the more physical challenge posed by the French forwards, with Andy Hazell likely to make way.\"),\n",
286 " Sentence(\"Lock Ben Kay also justified his recall with an impressive all-round display on his return to the side, but elsewhere England positives were thin on the ground.\")]"
287 ]
288 },
289 "execution_count": 9,
290 "metadata": {},
291 "output_type": "execute_result"
292 }
293 ],
294 "source": [
295 "parsed_body.sentences"
296 ]
297 },
298 {
299 "cell_type": "markdown",
300 "metadata": {},
301 "source": [
302 "### Stemming"
303 ]
304 },
305 {
306 "cell_type": "markdown",
307 "metadata": {},
308 "source": [
309 "To perform stemming, we instantiate the SnowballStemmer from the nltk library, call its .stem() method on each token and display tokens that were modified as a result:"
310 ]
311 },
312 {
313 "cell_type": "code",
314 "execution_count": 10,
315 "metadata": {
316 "ExecuteTime": {
317 "end_time": "2018-11-26T02:54:39.684948Z",
318 "start_time": "2018-11-26T02:54:39.645914Z"
319 }
320 },
321 "outputs": [
322 {
323 "data": {
324 "text/plain": [
325 "[('Andy', 'andi'),\n",
326 " ('faces', 'face'),\n",
327 " ('tenure', 'tenur'),\n",
328 " ('tries', 'tri'),\n",
329 " ('winning', 'win'),\n",
330 " ('ways', 'way'),\n",
331 " ('Nations', 'nation'),\n",
332 " ('Wales', 'wale'),\n",
333 " ('likely', 'like'),\n",
334 " ('changes', 'chang'),\n",
335 " ('centre', 'centr'),\n",
336 " ('contemplates', 'contempl'),\n",
337 " ('France', 'franc'),\n",
338 " ('Lewis', 'lewi'),\n",
339 " ('Moody', 'moodi'),\n",
340 " ('Corry', 'corri'),\n",
341 " ('missing', 'miss'),\n",
342 " ('hamstring', 'hamstr'),\n",
343 " ('problems', 'problem'),\n",
344 " ('pairing', 'pair'),\n",
345 " ('Jamie', 'jami'),\n",
346 " ('Olly', 'olli'),\n",
347 " ('immediately', 'immedi'),\n",
348 " ('allowed', 'allow'),\n",
349 " ('generate', 'generat'),\n",
350 " ('position', 'posit'),\n",
351 " ('kicking', 'kick'),\n",
352 " ('replacing', 'replac'),\n",
353 " ('debutant', 'debut'),\n",
354 " ('before', 'befor'),\n",
355 " ('fly-half-cum-centre', 'fly-half-cum-centr'),\n",
356 " ('likely', 'like'),\n",
357 " ('France', 'franc'),\n",
358 " ('dropping', 'drop'),\n",
359 " ('little', 'littl'),\n",
360 " ('opportunity', 'opportun'),\n",
361 " ('received', 'receiv'),\n",
362 " ('praise', 'prais'),\n",
363 " ('afterwards', 'afterward'),\n",
364 " ('admitted', 'admit'),\n",
365 " ('unforgiving', 'unforgiv'),\n",
366 " ('teenage', 'teenag'),\n",
367 " ('prodigy', 'prodigi'),\n",
368 " ('tricky', 'tricki'),\n",
369 " ('decision', 'decis'),\n",
370 " ('firing', 'fire'),\n",
371 " ('regards', 'regard'),\n",
372 " ('future', 'futur'),\n",
373 " ('outwardly', 'outward'),\n",
374 " ('appeared', 'appear'),\n",
375 " ('unaffected', 'unaffect'),\n",
376 " ('punishing', 'punish'),\n",
377 " ('dished', 'dish'),\n",
378 " ('definitely', 'definit'),\n",
379 " ('Hopefully', 'hope'),\n",
380 " ('selected', 'select'),\n",
381 " (\"'ll\", 'll'),\n",
382 " ('playing', 'play'),\n",
383 " ('quite', 'quit'),\n",
384 " ('defended', 'defend'),\n",
385 " ('reasonably', 'reason'),\n",
386 " (\"'ve\", 've'),\n",
387 " ('France', 'franc'),\n",
388 " ('Newcastle', 'newcastl'),\n",
389 " ('team-mate', 'team-mat'),\n",
390 " ('hardly', 'hard'),\n",
391 " ('covered', 'cover'),\n",
392 " ('glory', 'glori'),\n",
393 " ('missed', 'miss'),\n",
394 " ('tackle', 'tackl'),\n",
395 " ('Wales', 'wale'),\n",
396 " ('try', 'tri'),\n",
397 " ('conceded', 'conced'),\n",
398 " ('penalty', 'penalti'),\n",
399 " ('turned', 'turn'),\n",
400 " ('another', 'anoth'),\n",
401 " ('tackle', 'tackl'),\n",
402 " ('fumbled', 'fumbl'),\n",
403 " ('inside', 'insid'),\n",
404 " ('contribution', 'contribut'),\n",
405 " ('improved', 'improv'),\n",
406 " ('clearly', 'clear'),\n",
407 " ('playmaker', 'playmak'),\n",
408 " ('inside', 'insid'),\n",
409 " ('centre', 'centr'),\n",
410 " ('remains', 'remain'),\n",
411 " ('fallible', 'fallibl'),\n",
412 " ('despite', 'despit'),\n",
413 " ('performance', 'perform'),\n",
414 " ('Jones', 'jone'),\n",
415 " ('athleticism', 'athletic'),\n",
416 " ('stepping', 'step'),\n",
417 " ('Moody', 'moodi'),\n",
418 " ('likely', 'like'),\n",
419 " ('Leicester', 'leicest'),\n",
420 " ('physical', 'physic'),\n",
421 " ('challenge', 'challeng'),\n",
422 " ('posed', 'pose'),\n",
423 " ('forwards', 'forward'),\n",
424 " ('Andy', 'andi'),\n",
425 " ('Hazell', 'hazel'),\n",
426 " ('likely', 'like'),\n",
427 " ('justified', 'justifi'),\n",
428 " ('recall', 'recal'),\n",
429 " ('impressive', 'impress'),\n",
430 " ('elsewhere', 'elsewher'),\n",
431 " ('positives', 'posit')]"
432 ]
433 },
434 "execution_count": 10,
435 "metadata": {},
436 "output_type": "execute_result"
437 }
438 ],
439 "source": [
440 "# Initialize stemmer.\n",
441 "stemmer = SnowballStemmer('english')\n",
442 "\n",
443 "# Stem each word.\n",
444 "[(word, stemmer.stem(word)) for word in parsed_body.words\n",
445 " if word.lower() != stemmer.stem(word)]"
446 ]
447 },
448 {
449 "cell_type": "markdown",
450 "metadata": {},
451 "source": [
452 "### Lemmatization"
453 ]
454 },
455 {
456 "cell_type": "code",
457 "execution_count": 11,
458 "metadata": {
459 "ExecuteTime": {
460 "end_time": "2018-11-26T02:54:47.978888Z",
461 "start_time": "2018-11-26T02:54:46.897896Z"
462 }
463 },
464 "outputs": [
465 {
466 "data": {
467 "text/plain": [
468 "[('faces', 'face'),\n",
469 " ('as', 'a'),\n",
470 " ('tries', 'try'),\n",
471 " ('ways', 'way'),\n",
472 " ('changes', 'change'),\n",
473 " ('as', 'a'),\n",
474 " ('problems', 'problem'),\n",
475 " ('was', 'wa'),\n",
476 " ('has', 'ha'),\n",
477 " ('regards', 'regard'),\n",
478 " ('as', 'a'),\n",
479 " ('was', 'wa'),\n",
480 " ('was', 'wa'),\n",
481 " ('forwards', 'forward'),\n",
482 " ('positives', 'positive')]"
483 ]
484 },
485 "execution_count": 11,
486 "metadata": {},
487 "output_type": "execute_result"
488 }
489 ],
490 "source": [
491 "[(word, word.lemmatize()) for word in parsed_body.words\n",
492 " if word != word.lemmatize()]"
493 ]
494 },
495 {
496 "cell_type": "markdown",
497 "metadata": {},
498 "source": [
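499 "Lemmatization relies on part-of-speech (POS) tagging. Whereas `spaCy` performs POS tagging as part of its pipeline, here we have to assume a POS for each token. By default, `lemmatize()` treats each token as a noun, which explains results such as 'was' → 'wa' above; below we instead assume that each token is a verb."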
500 ]
501 },
502 {
503 "cell_type": "code",
504 "execution_count": 12,
505 "metadata": {
506 "ExecuteTime": {
507 "end_time": "2018-11-26T02:54:47.987180Z",
508 "start_time": "2018-11-26T02:54:47.980209Z"
509 }
510 },
511 "outputs": [
512 {
513 "data": {
514 "text/plain": [
515 "[('faces', 'face'),\n",
516 " ('tries', 'try'),\n",
517 " ('winning', 'win'),\n",
518 " ('is', 'be'),\n",
519 " ('changes', 'change'),\n",
520 " ('contemplates', 'contemplate'),\n",
521 " ('missing', 'miss'),\n",
522 " ('pairing', 'pair'),\n",
523 " ('is', 'be'),\n",
524 " ('allowed', 'allow'),\n",
525 " ('kicking', 'kick'),\n",
526 " ('replacing', 'replace'),\n",
527 " ('is', 'be'),\n",
528 " ('dropping', 'drop'),\n",
529 " ('given', 'give'),\n",
530 " ('received', 'receive'),\n",
531 " ('admitted', 'admit'),\n",
532 " ('was', 'be'),\n",
533 " ('has', 'have'),\n",
534 " ('firing', 'fire'),\n",
535 " ('outing', 'out'),\n",
536 " ('regards', 'regard'),\n",
537 " ('appeared', 'appear'),\n",
538 " ('punishing', 'punish'),\n",
539 " ('dished', 'dish'),\n",
540 " ('said', 'say'),\n",
541 " ('selected', 'select'),\n",
542 " ('were', 'be'),\n",
543 " ('playing', 'play'),\n",
544 " ('was', 'be'),\n",
545 " ('thought', 'think'),\n",
546 " ('defended', 'defend'),\n",
547 " ('got', 'get'),\n",
548 " ('covered', 'cover'),\n",
549 " ('missed', 'miss'),\n",
550 " ('conceded', 'concede'),\n",
551 " ('was', 'be'),\n",
552 " ('turned', 'turn'),\n",
553 " ('fumbled', 'fumble'),\n",
554 " ('improved', 'improve'),\n",
555 " ('remains', 'remain'),\n",
556 " ('came', 'come'),\n",
557 " ('stepping', 'step'),\n",
558 " ('is', 'be'),\n",
559 " ('posed', 'pose'),\n",
560 " ('forwards', 'forward'),\n",
561 " ('justified', 'justify'),\n",
562 " ('were', 'be'),\n",
563 " ('ground', 'grind')]"
564 ]
565 },
566 "execution_count": 12,
567 "metadata": {},
568 "output_type": "execute_result"
569 }
570 ],
571 "source": [
572 "[(word, word.lemmatize(pos='v')) for word in parsed_body.words\n",
573 " if word != word.lemmatize(pos='v')]"
574 ]
575 },
576 {
577 "cell_type": "markdown",
578 "metadata": {},
579 "source": [
580 "### Sentiment & Polarity"
581 ]
582 },
583 {
584 "cell_type": "markdown",
585 "metadata": {},
586 "source": [
587 "TextBlob provides polarity and subjectivity estimates for parsed documents using lexicons provided by the Pattern library. These lexicons map adjectives frequently found in product reviews to sentiment polarity scores ranging from -1 to +1 (negative ↔ positive) and to subjectivity scores ranging from 0 to 1 (objective ↔ subjective).\n",
588 "\n",
589 "The `.sentiment` attribute provides the average of each score over the relevant tokens, whereas the `.sentiment_assessments` attribute lists the underlying values for each token."
590 ]
591 },
592 {
593 "cell_type": "code",
594 "execution_count": 15,
595 "metadata": {
596 "ExecuteTime": {
597 "end_time": "2018-11-26T02:57:50.363858Z",
598 "start_time": "2018-11-26T02:57:50.359319Z"
599 }
600 },
601 "outputs": [
602 {
603 "data": {
604 "text/plain": [
605 "Sentiment(polarity=0.088031914893617, subjectivity=0.46456433637284694)"
606 ]
607 },
608 "execution_count": 15,
609 "metadata": {},
610 "output_type": "execute_result"
611 }
612 ],
613 "source": [
614 "parsed_body.sentiment"
615 ]
616 },
617 {
618 "cell_type": "code",
619 "execution_count": 14,
620 "metadata": {
621 "ExecuteTime": {
622 "end_time": "2018-11-26T02:57:25.305753Z",
623 "start_time": "2018-11-26T02:57:25.301670Z"
624 }
625 },
626 "outputs": [
627 {
628 "data": {
629 "text/plain": [
630 "Sentiment(polarity=0.088031914893617, subjectivity=0.46456433637284694, assessments=[(['first'], 0.25, 0.3333333333333333, None), (['major'], 0.0625, 0.5, None), (['tries'], -0.1, 0.4, None), (['back'], 0.0, 0.0, None), (['winning'], 0.5, 0.75, None), (['likely'], 0.0, 1.0, None), (['back'], 0.0, 0.0, None), (['missing'], -0.2, 0.05, None), (['game'], -0.4, 0.4, None), (['better'], 0.5, 0.5, None), (['game'], -0.4, 0.4, None), (['likely'], 0.0, 1.0, None), (['little'], -0.1875, 0.5, None), (['teenage'], 0.0, 0.0, None), (['central'], 0.0, 0.25, None), (['future'], 0.0, 0.125, None), (['least'], -0.3, 0.4, None), (['unaffected'], -0.05, 0.1, None), (['particular'], 0.16666666666666666, 0.3333333333333333, None), (['more'], 0.5, 0.5, None), (['definitely'], 0.0, 0.5, None), (['hard'], -0.2916666666666667, 0.5416666666666666, None), (['next'], 0.0, 0.0, None), (['own'], 0.6, 1.0, None), (['first'], 0.25, 0.3333333333333333, None), (['half'], -0.16666666666666666, 0.16666666666666666, None), (['difficult'], -0.5, 1.0, None), (['reasonably'], 0.2, 0.6, None), (['hardly'], -0.2916666666666667, 0.5416666666666666, None), (['first'], 0.25, 0.3333333333333333, None), (['major'], 0.0625, 0.5, None), (['first'], 0.25, 0.3333333333333333, None), (['second'], 0.0, 0.0, None), (['half'], -0.16666666666666666, 0.16666666666666666, None), (['clearly'], 0.10000000000000002, 0.3833333333333333, None), (['more'], 0.5, 0.5, None), (['superb'], 1.0, 1.0, None), (['more'], 0.5, 0.5, None), (['likely'], 0.0, 1.0, None), (['open'], 0.0, 0.5, None), (['more'], 0.5, 0.5, None), (['physical'], 0.0, 0.14285714285714285, None), (['french'], 0.0, 0.0, None), (['likely'], 0.0, 1.0, None), (['justified'], 0.4, 0.9, None), (['impressive'], 1.0, 1.0, None), (['thin'], -0.4, 0.8500000000000001, None)])"
631 ]
632 },
633 "execution_count": 14,
634 "metadata": {},
635 "output_type": "execute_result"
636 }
637 ],
638 "source": [
639 "parsed_body.sentiment_assessments"
640 ]
641 },
642 {
643 "cell_type": "markdown",
644 "metadata": {},
645 "source": [
646 "### Combine TextBlob Lemmatization with `CountVectorizer`"
647 ]
648 },
649 {
650 "cell_type": "code",
651 "execution_count": 13,
652 "metadata": {
653 "ExecuteTime": {
654 "end_time": "2018-11-22T18:26:04.152859Z",
655 "start_time": "2018-11-22T18:26:04.150466Z"
656 }
657 },
658 "outputs": [],
659 "source": [
660 "def lemmatizer(text):\n",
661 "    words = TextBlob(text.lower()).words\n",
662 "    return [word.lemmatize() for word in words]"
663 ]
664 },
665 {
666 "cell_type": "code",
667 "execution_count": 14,
668 "metadata": {
669 "ExecuteTime": {
670 "end_time": "2018-11-22T18:26:04.160781Z",
671 "start_time": "2018-11-22T18:26:04.154322Z"
672 }
673 },
674 "outputs": [],
675 "source": [
676 "vectorizer = CountVectorizer(analyzer=lemmatizer, decode_error='replace')"
677 ]
678 }
679 ],
680 "metadata": {
681 "kernelspec": {
682 "display_name": "Python 3",
683 "language": "python",
684 "name": "python3"
685 },
686 "language_info": {
687 "codemirror_mode": {
688 "name": "ipython",
689 "version": 3
690 },
691 "file_extension": ".py",
692 "mimetype": "text/x-python",
693 "name": "python",
694 "nbconvert_exporter": "python",
695 "pygments_lexer": "ipython3",
696 "version": "3.7.0"
697 },
698 "toc": {
699 "base_numbering": 1,
700 "nav_menu": {},
701 "number_sections": true,
702 "sideBar": true,
703 "skip_h1_title": true,
704 "title_cell": "Table of Contents",
705 "title_sidebar": "Contents",
706 "toc_cell": false,
707 "toc_position": {},
708 "toc_section_display": true,
709 "toc_window_display": true
710 }
711 },
712 "nbformat": 4,
713 "nbformat_minor": 2
714 }