ml-finance-python
python scripts for finance machine learning
git clone https://9o.is/git/ml-finance-python.git
01_nlp_pipeline with_spaCy.ipynb
(123973B)
1 {
2 "cells": [
3 {
4 "cell_type": "markdown",
5 "metadata": {},
6 "source": [
7 "# NLP Pipeline with spaCy"
8 ]
9 },
10 {
11 "cell_type": "markdown",
12 "metadata": {},
13 "source": [
14 "[spaCy](https://spacy.io/) is a widely used Python library with a comprehensive feature set for fast text processing in multiple languages. \n",
15 "\n",
16 "Using the tokenization and annotation engines requires installing language models. The features we use in this chapter require only the small models; the larger models also include word vectors, which we will cover in chapter 15."
17 ]
18 },
19 {
20 "cell_type": "markdown",
21 "metadata": {
22 "slideshow": {
23 "slide_type": "slide"
24 }
25 },
26 "source": [
27 ""
28 ]
29 },
30 {
31 "cell_type": "markdown",
32 "metadata": {
33 "slideshow": {
34 "slide_type": "slide"
35 }
36 },
37 "source": [
38 "## Setup"
39 ]
40 },
41 {
42 "cell_type": "markdown",
43 "metadata": {
44 "slideshow": {
45 "slide_type": "slide"
46 }
47 },
48 "source": [
49 "### Imports"
50 ]
51 },
52 {
53 "cell_type": "code",
54 "execution_count": 2,
55 "metadata": {
56 "ExecuteTime": {
57 "end_time": "2018-11-22T18:23:28.422236Z",
58 "start_time": "2018-11-22T18:23:27.574043Z"
59 },
60 "slideshow": {
61 "slide_type": "fragment"
62 }
63 },
64 "outputs": [],
65 "source": [
66 "%matplotlib inline\n",
67 "import sys\n",
68 "import tarfile\n",
69 "from pathlib import Path\n",
70 "\n",
71 "import numpy as np\n",
72 "import pandas as pd\n",
73 "import matplotlib.pyplot as plt\n",
74 "\n",
75 "import spacy\n",
76 "from spacy import displacy\n",
77 "import textacy\n",
78 "\n",
79 "from IPython.display import SVG, display"
80 ]
81 },
82 {
83 "cell_type": "code",
84 "execution_count": 2,
85 "metadata": {
86 "ExecuteTime": {
87 "end_time": "2018-11-22T18:23:28.425592Z",
88 "start_time": "2018-11-22T18:23:28.423600Z"
89 },
90 "slideshow": {
91 "slide_type": "fragment"
92 }
93 },
94 "outputs": [],
95 "source": [
96 "pd.set_option('float_format', '{:,.2f}'.format)"
97 ]
98 },
99 {
100 "cell_type": "markdown",
101 "metadata": {
102 "slideshow": {
103 "slide_type": "slide"
104 }
105 },
106 "source": [
107 "### spaCy Language Model Installation\n",
108 "\n",
109 "In addition to the `spaCy` library, we need [language models](https://spacy.io/usage/models)."
110 ]
111 },
112 {
113 "cell_type": "markdown",
114 "metadata": {
115 "slideshow": {
116 "slide_type": "fragment"
117 }
118 },
119 "source": [
120 "#### English"
121 ]
122 },
123 {
124 "cell_type": "code",
125 "execution_count": 3,
126 "metadata": {
127 "ExecuteTime": {
128 "end_time": "2018-11-22T18:23:29.715775Z",
129 "start_time": "2018-11-22T18:23:28.426957Z"
130 },
131 "slideshow": {
132 "slide_type": "fragment"
133 }
134 },
135 "outputs": [
136 {
137 "name": "stdout",
138 "output_type": "stream",
139 "text": [
140 "Requirement already satisfied: en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages (2.0.0)\n",
141 "\n",
142 "\u001b[93m Linking successful\u001b[0m\n",
143 " /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/en_core_web_sm\n",
144 " -->\n",
145 " /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/spacy/data/en_core_web_sm\n",
146 "\n",
147 " You can now load the model via spacy.load('en_core_web_sm')\n",
148 "\n"
149 ]
150 }
151 ],
152 "source": [
153 "%%bash\n",
154 "python -m spacy download en_core_web_sm\n",
155 "\n",
156 "# more comprehensive models:\n",
157 "# python -m spacy download en_core_web_md\n",
158 "# python -m spacy download en_core_web_lg"
159 ]
160 },
161 {
162 "cell_type": "markdown",
163 "metadata": {
164 "slideshow": {
165 "slide_type": "slide"
166 }
167 },
168 "source": [
169 "#### Spanish"
170 ]
171 },
172 {
173 "cell_type": "markdown",
174 "metadata": {
175 "slideshow": {
176 "slide_type": "fragment"
177 }
178 },
179 "source": [
180 "The [Spanish language models](https://spacy.io/models/es#es_core_news_sm) are trained on the [AnCora Corpus](http://clic.ub.edu/corpus/) and [WikiNER](http://schwa.org/projects/resources/wiki/Wikiner)."
181 ]
182 },
183 {
184 "cell_type": "code",
185 "execution_count": 4,
186 "metadata": {
187 "ExecuteTime": {
188 "end_time": "2018-11-22T18:23:30.703639Z",
189 "start_time": "2018-11-22T18:23:29.717730Z"
190 },
191 "slideshow": {
192 "slide_type": "fragment"
193 }
194 },
195 "outputs": [
196 {
197 "name": "stdout",
198 "output_type": "stream",
199 "text": [
200 "Requirement already satisfied: es_core_news_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-2.0.0/es_core_news_sm-2.0.0.tar.gz#egg=es_core_news_sm==2.0.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages (2.0.0)\n",
201 "\n",
202 "\u001b[93m Linking successful\u001b[0m\n",
203 " /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/es_core_news_sm\n",
204 " -->\n",
205 " /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/spacy/data/es_core_news_sm\n",
206 "\n",
207 " You can now load the model via spacy.load('es_core_news_sm')\n",
208 "\n"
209 ]
210 }
211 ],
212 "source": [
213 "%%bash\n",
214 "python -m spacy download es_core_news_sm\n",
215 "\n",
216 "# more comprehensive model:\n",
217 "# python -m spacy download es_core_news_md"
218 ]
219 },
220 {
221 "cell_type": "markdown",
222 "metadata": {},
223 "source": [
224 "Create shortcut names"
225 ]
226 },
227 {
228 "cell_type": "code",
229 "execution_count": 6,
230 "metadata": {
231 "ExecuteTime": {
232 "end_time": "2018-11-22T18:23:31.456755Z",
233 "start_time": "2018-11-22T18:23:30.705604Z"
234 },
235 "slideshow": {
236 "slide_type": "fragment"
237 }
238 },
239 "outputs": [
240 {
241 "name": "stdout",
242 "output_type": "stream",
243 "text": [
244 "\n",
245 "\u001b[93m Linking successful\u001b[0m\n",
246 " /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/en_core_web_sm\n",
247 " -->\n",
248 " /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/spacy/data/en\n",
249 "\n",
250 " You can now load the model via spacy.load('en')\n",
251 "\n",
252 "\n",
253 "\u001b[93m Linking successful\u001b[0m\n",
254 " /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/es_core_news_sm\n",
255 " -->\n",
256 " /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/spacy/data/es\n",
257 "\n",
258 " You can now load the model via spacy.load('es')\n",
259 "\n"
260 ]
261 }
262 ],
263 "source": [
264 "%%bash\n",
265 "python -m spacy link en_core_web_sm en --force;\n",
266 "python -m spacy link es_core_news_sm es --force;"
267 ]
268 },
269 {
270 "cell_type": "markdown",
271 "metadata": {
272 "slideshow": {
273 "slide_type": "slide"
274 }
275 },
276 "source": [
277 "#### Validate Installation"
278 ]
279 },
280 {
281 "cell_type": "code",
282 "execution_count": 7,
283 "metadata": {
284 "ExecuteTime": {
285 "end_time": "2018-11-22T18:23:32.225454Z",
286 "start_time": "2018-11-22T18:23:31.458167Z"
287 },
288 "scrolled": true,
289 "slideshow": {
290 "slide_type": "fragment"
291 }
292 },
293 "outputs": [
294 {
295 "name": "stdout",
296 "output_type": "stream",
297 "text": [
298 "\r\n",
299 "\u001b[93m Installed models (spaCy v2.0.16)\u001b[0m\r\n",
300 " /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/spacy\r\n",
301 "\r\n",
302 " TYPE NAME MODEL VERSION \r\n",
303 " package es-core-news-sm es_core_news_sm \u001b[38;5;2m2.0.0\u001b[0m \u001b[38;5;2m✔\u001b[0m \r\n",
304 " package en-core-web-sm en_core_web_sm \u001b[38;5;2m2.0.0\u001b[0m \u001b[38;5;2m✔\u001b[0m \r\n",
305 " link en_core_web_sm en_core_web_sm \u001b[38;5;2m2.0.0\u001b[0m \u001b[38;5;2m✔\u001b[0m \r\n",
306 " link en en_core_web_sm \u001b[38;5;2m2.0.0\u001b[0m \u001b[38;5;2m✔\u001b[0m \r\n",
307 " link es es_core_news_sm \u001b[38;5;2m2.0.0\u001b[0m \u001b[38;5;2m✔\u001b[0m \r\n",
308 " link es_core_news_sm es_core_news_sm \u001b[38;5;2m2.0.0\u001b[0m \u001b[38;5;2m✔\u001b[0m \r\n"
309 ]
310 }
311 ],
312 "source": [
313 "# validate installation\n",
314 "!{sys.executable} -m spacy validate"
315 ]
316 },
317 {
318 "cell_type": "markdown",
319 "metadata": {
320 "slideshow": {
321 "slide_type": "slide"
322 }
323 },
324 "source": [
325 "## Get Data"
326 ]
327 },
328 {
329 "cell_type": "markdown",
330 "metadata": {},
331 "source": [
332 "Download and unzip into the folder `bbc` in your main data directory."
333 ]
334 },
335 {
336 "cell_type": "code",
337 "execution_count": 3,
338 "metadata": {},
339 "outputs": [],
340 "source": [
341 "DATA_DIR = Path('../data')"
342 ]
343 },
344 {
345 "cell_type": "markdown",
346 "metadata": {
347 "ExecuteTime": {
348 "end_time": "2018-11-22T18:22:17.894231Z",
349 "start_time": "2018-11-22T18:22:17.889858Z"
350 }
351 },
352 "source": [
353 "- [BBC Articles](http://mlg.ucd.ie/datasets/bbc.html), use the raw text files (download [link](http://mlg.ucd.ie/files/datasets/bbcsport-fulltext.zip))\n",
354 "- [TED2013](http://opus.nlpl.eu/TED2013.php), a parallel corpus of TED talk subtitles in 15 languages (sample provided)"
355 ]
356 },
357 {
358 "cell_type": "markdown",
359 "metadata": {
360 "slideshow": {
361 "slide_type": "slide"
362 }
363 },
364 "source": [
365 "## spaCy Pipeline & Architecture"
366 ]
367 },
368 {
369 "cell_type": "markdown",
370 "metadata": {
371 "slideshow": {
372 "slide_type": "fragment"
373 }
374 },
375 "source": [
376 "### The Processing Pipeline\n",
377 "\n",
378 "When you call a spaCy model on a text, spaCy \n",
379 "\n",
380 "1) tokenizes the text to produce a `Doc` object. \n",
381 "\n",
382 "2) passes the `Doc` object through the processing pipeline, which can be customized and, for the default models, consists of\n",
383 "- a tagger, \n",
384 "- a parser and \n",
385 "- an entity recognizer. \n",
386 "\n",
387 "Each pipeline component returns the processed `Doc`, which is then passed on to the next component.\n",
388 "\n",
389 ""
390 ]
391 },
392 {
393 "cell_type": "markdown",
394 "metadata": {
395 "slideshow": {
396 "slide_type": "slide"
397 }
398 },
399 "source": [
400 "### Key Data Structures\n",
401 "\n",
402 "The central data structures in spaCy are the **Doc** and the **Vocab**. Text annotations are also designed to allow a single source of truth:\n",
403 "\n",
404 "- The **`Doc`** object owns the sequence of tokens and all their annotations. `Span` and `Token` are views that point into it. It is constructed by the `Tokenizer`, and then modified in place by the components of the pipeline. \n",
405 "- The **`Vocab`** object owns a set of look-up tables that make common information available across documents. \n",
406 "- The **`Language`** object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.\n",
407 "\n",
408 ""
409 ]
410 },
411 {
412 "cell_type": "markdown",
413 "metadata": {
414 "slideshow": {
415 "slide_type": "slide"
416 }
417 },
418 "source": [
419 "## spaCy in Action"
420 ]
421 },
422 {
423 "cell_type": "markdown",
424 "metadata": {
425 "slideshow": {
426 "slide_type": "fragment"
427 }
428 },
429 "source": [
430 "### Create & Explore the Language Object"
431 ]
432 },
433 {
434 "cell_type": "markdown",
435 "metadata": {},
436 "source": [
437 "Once the model is installed and linked, we can instantiate a spaCy language model and call it on a document. spaCy then tokenizes the text to produce a `Doc` object and processes it with configurable pipeline components that, by default, consist of a tagger, a parser, and a named-entity recognizer."
438 ]
439 },
440 {
441 "cell_type": "code",
442 "execution_count": 8,
443 "metadata": {
444 "ExecuteTime": {
445 "end_time": "2018-11-22T18:23:32.685811Z",
446 "start_time": "2018-11-22T18:23:32.231489Z"
447 },
448 "slideshow": {
449 "slide_type": "fragment"
450 }
451 },
452 "outputs": [],
453 "source": [
454 "nlp = spacy.load('en') "
455 ]
456 },
457 {
458 "cell_type": "code",
459 "execution_count": 9,
460 "metadata": {
461 "ExecuteTime": {
462 "end_time": "2018-11-22T18:23:32.690366Z",
463 "start_time": "2018-11-22T18:23:32.686964Z"
464 },
465 "slideshow": {
466 "slide_type": "fragment"
467 }
468 },
469 "outputs": [
470 {
471 "data": {
472 "text/plain": [
473 "spacy.lang.en.English"
474 ]
475 },
476 "execution_count": 9,
477 "metadata": {},
478 "output_type": "execute_result"
479 }
480 ],
481 "source": [
482 "type(nlp)"
483 ]
484 },
485 {
486 "cell_type": "code",
487 "execution_count": 10,
488 "metadata": {
489 "ExecuteTime": {
490 "end_time": "2018-11-22T18:23:32.698471Z",
491 "start_time": "2018-11-22T18:23:32.691679Z"
492 },
493 "slideshow": {
494 "slide_type": "fragment"
495 }
496 },
497 "outputs": [
498 {
499 "data": {
500 "text/plain": [
501 "'en'"
502 ]
503 },
504 "execution_count": 10,
505 "metadata": {},
506 "output_type": "execute_result"
507 }
508 ],
509 "source": [
510 "nlp.lang"
511 ]
512 },
513 {
514 "cell_type": "code",
515 "execution_count": 11,
516 "metadata": {
517 "ExecuteTime": {
518 "end_time": "2018-11-22T18:23:32.710213Z",
519 "start_time": "2018-11-22T18:23:32.699643Z"
520 },
521 "slideshow": {
522 "slide_type": "slide"
523 }
524 },
525 "outputs": [
526 {
527 "name": "stdout",
528 "output_type": "stream",
529 "text": [
530 "\n",
531 " \u001b[93mInfo about model en\u001b[0m\n",
532 "\n",
533 " lang en \n",
534 " pipeline ['tagger', 'parser', 'ner']\n",
535 " accuracy {'token_acc': 99.8698372794, 'ents_p': 84.9664503965, 'ents_r': 85.6312524451, 'uas': 91.7237657538, 'tags_acc': 97.0403350292, 'ents_f': 85.2975560875, 'las': 89.800872413}\n",
536 " name core_web_sm \n",
537 " license CC BY-SA 3.0 \n",
538 " author Explosion AI \n",
539 " url https://explosion.ai\n",
540 " vectors {'keys': 0, 'width': 0, 'vectors': 0}\n",
541 " sources ['OntoNotes 5', 'Common Crawl']\n",
542 " version 2.0.0 \n",
543 " spacy_version >=2.0.0a18 \n",
544 " parent_package spacy \n",
545 " speed {'gpu': None, 'nwords': 291344, 'cpu': 5122.3040471407}\n",
546 " email contact@explosion.ai\n",
547 " description English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.\n",
548 " link /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/spacy/data/en\n",
549 " source /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/en_core_web_sm\n",
550 "\n"
551 ]
552 },
553 {
554 "data": {
555 "text/plain": [
556 "{'lang': 'en',\n",
557 " 'pipeline': ['tagger', 'parser', 'ner'],\n",
558 " 'accuracy': {'token_acc': 99.8698372794,\n",
559 " 'ents_p': 84.9664503965,\n",
560 " 'ents_r': 85.6312524451,\n",
561 " 'uas': 91.7237657538,\n",
562 " 'tags_acc': 97.0403350292,\n",
563 " 'ents_f': 85.2975560875,\n",
564 " 'las': 89.800872413},\n",
565 " 'name': 'core_web_sm',\n",
566 " 'license': 'CC BY-SA 3.0',\n",
567 " 'author': 'Explosion AI',\n",
568 " 'url': 'https://explosion.ai',\n",
569 " 'vectors': {'keys': 0, 'width': 0, 'vectors': 0},\n",
570 " 'sources': ['OntoNotes 5', 'Common Crawl'],\n",
571 " 'version': '2.0.0',\n",
572 " 'spacy_version': '>=2.0.0a18',\n",
573 " 'parent_package': 'spacy',\n",
574 " 'speed': {'gpu': None, 'nwords': 291344, 'cpu': 5122.3040471407},\n",
575 " 'email': 'contact@explosion.ai',\n",
576 " 'description': 'English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.',\n",
577 " 'link': '/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/spacy/data/en',\n",
578 " 'source': '/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.6/site-packages/en_core_web_sm'}"
579 ]
580 },
581 "execution_count": 11,
582 "metadata": {},
583 "output_type": "execute_result"
584 }
585 ],
586 "source": [
587 "spacy.info('en')"
588 ]
589 },
590 {
591 "cell_type": "code",
592 "execution_count": 12,
593 "metadata": {
594 "ExecuteTime": {
595 "end_time": "2018-11-22T18:23:32.715948Z",
596 "start_time": "2018-11-22T18:23:32.711601Z"
597 },
598 "slideshow": {
599 "slide_type": "slide"
600 }
601 },
602 "outputs": [],
603 "source": [
604 "def get_attributes(f):\n",
605 "    print([a for a in dir(f) if not a.startswith('_')], end=' ')"
606 ]
607 },
608 {
609 "cell_type": "code",
610 "execution_count": 13,
611 "metadata": {
612 "ExecuteTime": {
613 "end_time": "2018-11-22T18:23:32.724335Z",
614 "start_time": "2018-11-22T18:23:32.717007Z"
615 },
616 "slideshow": {
617 "slide_type": "fragment"
618 }
619 },
620 "outputs": [
621 {
622 "name": "stdout",
623 "output_type": "stream",
624 "text": [
625 "['Defaults', 'add_pipe', 'begin_training', 'create_pipe', 'disable_pipes', 'entity', 'evaluate', 'factories', 'from_bytes', 'from_disk', 'get_pipe', 'has_pipe', 'lang', 'make_doc', 'matcher', 'max_length', 'meta', 'parser', 'path', 'pipe', 'pipe_names', 'pipeline', 'preprocess_gold', 'remove_pipe', 'rename_pipe', 'replace_pipe', 'tagger', 'tensorizer', 'to_bytes', 'to_disk', 'tokenizer', 'update', 'use_params', 'vocab'] "
626 ]
627 }
628 ],
629 "source": [
630 "get_attributes(nlp)"
631 ]
632 },
633 {
634 "cell_type": "markdown",
635 "metadata": {
636 "slideshow": {
637 "slide_type": "slide"
638 }
639 },
640 "source": [
641 "### Explore the Pipeline"
642 ]
643 },
644 {
645 "cell_type": "markdown",
646 "metadata": {},
647 "source": [
648 "Let’s illustrate the pipeline using a simple sentence:"
649 ]
650 },
651 {
652 "cell_type": "code",
653 "execution_count": 14,
654 "metadata": {
655 "ExecuteTime": {
656 "end_time": "2018-11-22T18:23:32.756807Z",
657 "start_time": "2018-11-22T18:23:32.725598Z"
658 },
659 "slideshow": {
660 "slide_type": "fragment"
661 }
662 },
663 "outputs": [],
664 "source": [
665 "sample_text = 'Apple is looking at buying U.K. startup for $1 billion'\n",
666 "doc = nlp(sample_text)"
667 ]
668 },
669 {
670 "cell_type": "code",
671 "execution_count": 15,
672 "metadata": {
673 "ExecuteTime": {
674 "end_time": "2018-11-22T18:23:32.761010Z",
675 "start_time": "2018-11-22T18:23:32.758505Z"
676 },
677 "slideshow": {
678 "slide_type": "fragment"
679 }
680 },
681 "outputs": [
682 {
683 "name": "stdout",
684 "output_type": "stream",
685 "text": [
686 "['cats', 'char_span', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_disk', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'is_parsed', 'is_sentenced', 'is_tagged', 'mem', 'merge', 'noun_chunks', 'noun_chunks_iterator', 'print_tree', 'remove_extension', 'retokenize', 'sentiment', 'sents', 'set_extension', 'similarity', 'tensor', 'text', 'text_with_ws', 'to_array', 'to_bytes', 'to_disk', 'user_data', 'user_hooks', 'user_span_hooks', 'user_token_hooks', 'vector', 'vector_norm', 'vocab'] "
687 ]
688 }
689 ],
690 "source": [
691 "get_attributes(doc)"
692 ]
693 },
694 {
695 "cell_type": "code",
696 "execution_count": 16,
697 "metadata": {
698 "ExecuteTime": {
699 "end_time": "2018-11-22T18:23:32.774134Z",
700 "start_time": "2018-11-22T18:23:32.767812Z"
701 },
702 "slideshow": {
703 "slide_type": "slide"
704 }
705 },
706 "outputs": [
707 {
708 "data": {
709 "text/plain": [
710 "True"
711 ]
712 },
713 "execution_count": 16,
714 "metadata": {},
715 "output_type": "execute_result"
716 }
717 ],
718 "source": [
719 "doc.is_parsed"
720 ]
721 },
722 {
723 "cell_type": "code",
724 "execution_count": 17,
725 "metadata": {
726 "ExecuteTime": {
727 "end_time": "2018-11-22T18:23:32.779002Z",
728 "start_time": "2018-11-22T18:23:32.775876Z"
729 },
730 "slideshow": {
731 "slide_type": "fragment"
732 }
733 },
734 "outputs": [
735 {
736 "data": {
737 "text/plain": [
738 "True"
739 ]
740 },
741 "execution_count": 17,
742 "metadata": {},
743 "output_type": "execute_result"
744 }
745 ],
746 "source": [
747 "doc.is_sentenced"
748 ]
749 },
750 {
751 "cell_type": "code",
752 "execution_count": 18,
753 "metadata": {
754 "ExecuteTime": {
755 "end_time": "2018-11-22T18:23:32.788762Z",
756 "start_time": "2018-11-22T18:23:32.780469Z"
757 },
758 "slideshow": {
759 "slide_type": "fragment"
760 }
761 },
762 "outputs": [
763 {
764 "data": {
765 "text/plain": [
766 "True"
767 ]
768 },
769 "execution_count": 18,
770 "metadata": {},
771 "output_type": "execute_result"
772 }
773 ],
774 "source": [
775 "doc.is_tagged"
776 ]
777 },
778 {
779 "cell_type": "code",
780 "execution_count": 19,
781 "metadata": {
782 "ExecuteTime": {
783 "end_time": "2018-11-22T18:23:32.802336Z",
784 "start_time": "2018-11-22T18:23:32.794045Z"
785 },
786 "slideshow": {
787 "slide_type": "fragment"
788 }
789 },
790 "outputs": [
791 {
792 "data": {
793 "text/plain": [
794 "'Apple is looking at buying U.K. startup for $1 billion'"
795 ]
796 },
797 "execution_count": 19,
798 "metadata": {},
799 "output_type": "execute_result"
800 }
801 ],
802 "source": [
803 "doc.text"
804 ]
805 },
806 {
807 "cell_type": "code",
808 "execution_count": 20,
809 "metadata": {
810 "ExecuteTime": {
811 "end_time": "2018-11-22T18:23:32.807584Z",
812 "start_time": "2018-11-22T18:23:32.804542Z"
813 },
814 "slideshow": {
815 "slide_type": "slide"
816 }
817 },
818 "outputs": [
819 {
820 "name": "stdout",
821 "output_type": "stream",
822 "text": [
823 "['add_flag', 'cfg', 'data_dir', 'from_bytes', 'from_disk', 'get_vector', 'has_vector', 'lang', 'length', 'lex_attr_getters', 'lexemes_from_bytes', 'lexemes_to_bytes', 'morphology', 'prune_vectors', 'reset_vectors', 'set_vector', 'strings', 'to_bytes', 'to_disk', 'vectors', 'vectors_length'] "
824 ]
825 }
826 ],
827 "source": [
828 "get_attributes(doc.vocab)"
829 ]
830 },
831 {
832 "cell_type": "code",
833 "execution_count": 21,
834 "metadata": {
835 "ExecuteTime": {
836 "end_time": "2018-11-22T18:23:32.818729Z",
837 "start_time": "2018-11-22T18:23:32.809437Z"
838 },
839 "slideshow": {
840 "slide_type": "fragment"
841 }
842 },
843 "outputs": [
844 {
845 "data": {
846 "text/plain": [
847 "57852"
848 ]
849 },
850 "execution_count": 21,
851 "metadata": {},
852 "output_type": "execute_result"
853 }
854 ],
855 "source": [
856 "doc.vocab.length"
857 ]
858 },
859 {
860 "cell_type": "markdown",
861 "metadata": {
862 "slideshow": {
863 "slide_type": "slide"
864 }
865 },
866 "source": [
867 "#### Explore `Token` annotations"
868 ]
869 },
870 {
871 "cell_type": "markdown",
872 "metadata": {},
873 "source": [
874 "The parsed document is iterable, and each element is a `Token` with numerous attributes produced by the processing pipeline. The samples below show how to access the token text, lemma, part-of-speech tags, dependency label, shape, and the `is_alpha` and `is_stop` flags:"
875 ]
876 },
877 {
878 "cell_type": "code",
879 "execution_count": 22,
880 "metadata": {
881 "ExecuteTime": {
882 "end_time": "2018-11-22T18:23:32.827692Z",
883 "start_time": "2018-11-22T18:23:32.820215Z"
884 },
885 "slideshow": {
886 "slide_type": "fragment"
887 }
888 },
889 "outputs": [
890 {
891 "data": {
892 "text/plain": [
893 "0 Apple\n",
894 "1 is\n",
895 "2 looking\n",
896 "3 at\n",
897 "4 buying\n",
898 "5 U.K.\n",
899 "6 startup\n",
900 "7 for\n",
901 "8 $\n",
902 "9 1\n",
903 "10 billion\n",
904 "dtype: object"
905 ]
906 },
907 "execution_count": 22,
908 "metadata": {},
909 "output_type": "execute_result"
910 }
911 ],
912 "source": [
913 "pd.Series([token.text for token in doc])"
914 ]
915 },
916 {
917 "cell_type": "code",
918 "execution_count": 23,
919 "metadata": {
920 "ExecuteTime": {
921 "end_time": "2018-11-22T18:24:11.570806Z",
922 "start_time": "2018-11-22T18:24:11.549908Z"
923 },
924 "slideshow": {
925 "slide_type": "slide"
926 }
927 },
928 "outputs": [
929 {
930 "data": {
931 "text/html": [
932 "<div>\n",
933 "<style scoped>\n",
934 " .dataframe tbody tr th:only-of-type {\n",
935 " vertical-align: middle;\n",
936 " }\n",
937 "\n",
938 " .dataframe tbody tr th {\n",
939 " vertical-align: top;\n",
940 " }\n",
941 "\n",
942 " .dataframe thead th {\n",
943 " text-align: right;\n",
944 " }\n",
945 "</style>\n",
946 "<table border=\"1\" class=\"dataframe\">\n",
947 " <thead>\n",
948 " <tr style=\"text-align: right;\">\n",
949 " <th></th>\n",
950 " <th>text</th>\n",
951 " <th>lemma</th>\n",
952 " <th>pos</th>\n",
953 " <th>tag</th>\n",
954 " <th>dep</th>\n",
955 " <th>shape</th>\n",
956 " <th>is_alpha</th>\n",
957 " <th>is_stop</th>\n",
958 " </tr>\n",
959 " </thead>\n",
960 " <tbody>\n",
961 " <tr>\n",
962 " <th>0</th>\n",
963 " <td>Apple</td>\n",
964 " <td>apple</td>\n",
965 " <td>PROPN</td>\n",
966 " <td>NNP</td>\n",
967 " <td>nsubj</td>\n",
968 " <td>Xxxxx</td>\n",
969 " <td>True</td>\n",
970 " <td>False</td>\n",
971 " </tr>\n",
972 " <tr>\n",
973 " <th>1</th>\n",
974 " <td>is</td>\n",
975 " <td>be</td>\n",
976 " <td>VERB</td>\n",
977 " <td>VBZ</td>\n",
978 " <td>aux</td>\n",
979 " <td>xx</td>\n",
980 " <td>True</td>\n",
981 " <td>True</td>\n",
982 " </tr>\n",
983 " <tr>\n",
984 " <th>2</th>\n",
985 " <td>looking</td>\n",
986 " <td>look</td>\n",
987 " <td>VERB</td>\n",
988 " <td>VBG</td>\n",
989 " <td>ROOT</td>\n",
990 " <td>xxxx</td>\n",
991 " <td>True</td>\n",
992 " <td>False</td>\n",
993 " </tr>\n",
994 " <tr>\n",
995 " <th>3</th>\n",
996 " <td>at</td>\n",
997 " <td>at</td>\n",
998 " <td>ADP</td>\n",
999 " <td>IN</td>\n",
1000 " <td>prep</td>\n",
1001 " <td>xx</td>\n",
1002 " <td>True</td>\n",
1003 " <td>True</td>\n",
1004 " </tr>\n",
1005 " <tr>\n",
1006 " <th>4</th>\n",
1007 " <td>buying</td>\n",
1008 " <td>buy</td>\n",
1009 " <td>VERB</td>\n",
1010 " <td>VBG</td>\n",
1011 " <td>pcomp</td>\n",
1012 " <td>xxxx</td>\n",
1013 " <td>True</td>\n",
1014 " <td>False</td>\n",
1015 " </tr>\n",
1016 " <tr>\n",
1017 " <th>5</th>\n",
1018 " <td>U.K.</td>\n",
1019 " <td>u.k.</td>\n",
1020 " <td>PROPN</td>\n",
1021 " <td>NNP</td>\n",
1022 " <td>compound</td>\n",
1023 " <td>X.X.</td>\n",
1024 " <td>False</td>\n",
1025 " <td>False</td>\n",
1026 " </tr>\n",
1027 " <tr>\n",
1028 " <th>6</th>\n",
1029 " <td>startup</td>\n",
1030 " <td>startup</td>\n",
1031 " <td>NOUN</td>\n",
1032 " <td>NN</td>\n",
1033 " <td>dobj</td>\n",
1034 " <td>xxxx</td>\n",
1035 " <td>True</td>\n",
1036 " <td>False</td>\n",
1037 " </tr>\n",
1038 " <tr>\n",
1039 " <th>7</th>\n",
1040 " <td>for</td>\n",
1041 " <td>for</td>\n",
1042 " <td>ADP</td>\n",
1043 " <td>IN</td>\n",
1044 " <td>prep</td>\n",
1045 " <td>xxx</td>\n",
1046 " <td>True</td>\n",
1047 " <td>True</td>\n",
1048 " </tr>\n",
1049 " <tr>\n",
1050 " <th>8</th>\n",
1051 " <td>$</td>\n",
1052 " <td>$</td>\n",
1053 " <td>SYM</td>\n",
1054 " <td>$</td>\n",
1055 " <td>quantmod</td>\n",
1056 " <td>$</td>\n",
1057 " <td>False</td>\n",
1058 " <td>False</td>\n",
1059 " </tr>\n",
1060 " <tr>\n",
1061 " <th>9</th>\n",
1062 " <td>1</td>\n",
1063 " <td>1</td>\n",
1064 " <td>NUM</td>\n",
1065 " <td>CD</td>\n",
1066 " <td>compound</td>\n",
1067 " <td>d</td>\n",
1068 " <td>False</td>\n",
1069 " <td>False</td>\n",
1070 " </tr>\n",
1071 " <tr>\n",
1072 " <th>10</th>\n",
1073 " <td>billion</td>\n",
1074 " <td>billion</td>\n",
1075 " <td>NUM</td>\n",
1076 " <td>CD</td>\n",
1077 " <td>pobj</td>\n",
1078 " <td>xxxx</td>\n",
1079 " <td>True</td>\n",
1080 " <td>False</td>\n",
1081 " </tr>\n",
1082 " </tbody>\n",
1083 "</table>\n",
1084 "</div>"
1085 ],
1086 "text/plain": [
1087 " text lemma pos tag dep shape is_alpha is_stop\n",
1088 "0 Apple apple PROPN NNP nsubj Xxxxx True False\n",
1089 "1 is be VERB VBZ aux xx True True\n",
1090 "2 looking look VERB VBG ROOT xxxx True False\n",
1091 "3 at at ADP IN prep xx True True\n",
1092 "4 buying buy VERB VBG pcomp xxxx True False\n",
1093 "5 U.K. u.k. PROPN NNP compound X.X. False False\n",
1094 "6 startup startup NOUN NN dobj xxxx True False\n",
1095 "7 for for ADP IN prep xxx True True\n",
1096 "8 $ $ SYM $ quantmod $ False False\n",
1097 "9 1 1 NUM CD compound d False False\n",
1098 "10 billion billion NUM CD pobj xxxx True False"
1099 ]
1100 },
1101 "execution_count": 23,
1102 "metadata": {},
1103 "output_type": "execute_result"
1104 }
1105 ],
1106 "source": [
1107 "pd.DataFrame([[t.text, t.lemma_, t.pos_, t.tag_, t.dep_, t.shape_, t.is_alpha, t.is_stop]\n",
1108 " for t in doc],\n",
1109 " columns=['text', 'lemma', 'pos', 'tag', 'dep', 'shape', 'is_alpha', 'is_stop'])"
1110 ]
1111 },
1112 {
1113 "cell_type": "markdown",
1114 "metadata": {
1115 "slideshow": {
1116 "slide_type": "slide"
1117 }
1118 },
1119 "source": [
1120 "#### Visualize POS Dependencies"
1121 ]
1122 },
1123 {
1124 "cell_type": "markdown",
1125 "metadata": {},
1126 "source": [
1127 "We can visualize the syntactic dependencies in a browser or a notebook:"
1128 ]
1129 },
1130 {
1131 "cell_type": "code",
1132 "execution_count": 25,
1133 "metadata": {},
1134 "outputs": [],
1135 "source": [
1136 "options = {'compact': True, 'bg': '#09a3d5',\n",
1137 " 'color': 'white', 'font': 'Source Sans Pro', 'notebook': True}"
1138 ]
1139 },
1140 {
1141 "cell_type": "code",
1142 "execution_count": 28,
1143 "metadata": {
1144 "ExecuteTime": {
1145 "end_time": "2018-11-22T18:24:23.879627Z",
1146 "start_time": "2018-11-22T18:24:11.572586Z"
1147 },
1148 "scrolled": true,
1149 "slideshow": {
1150 "slide_type": "fragment"
1151 }
1152 },
1153 "outputs": [
1154 {
1155 "name": "stdout",
1156 "output_type": "stream",
1157 "text": [
1158 "\n",
1159 "\u001b[93m Serving on port 5000...\u001b[0m\n",
1160 " Using the 'dep' visualizer\n",
1161 "\n"
1162 ]
1163 },
1164 {
1165 "name": "stderr",
1166 "output_type": "stream",
1167 "text": [
1168 "127.0.0.1 - - [20/Apr/2019 18:24:26] \"GET / HTTP/1.1\" 200 8357\n",
1169 "127.0.0.1 - - [20/Apr/2019 18:24:26] \"GET /favicon.ico HTTP/1.1\" 200 8357\n"
1170 ]
1171 },
1172 {
1173 "name": "stdout",
1174 "output_type": "stream",
1175 "text": [
1176 "\n",
1177 " Shutting down server on port 5000.\n",
1178 "\n"
1179 ]
1180 }
1181 ],
1182 "source": [
1183 "displacy.serve(doc, style='dep', options=options)"
1184 ]
1185 },
1186 {
1187 "cell_type": "code",
1188 "execution_count": 26,
1189 "metadata": {
1190 "ExecuteTime": {
1191 "end_time": "2018-11-22T18:24:23.883977Z",
1192 "start_time": "2018-11-22T18:24:23.880703Z"
1193 },
1194 "slideshow": {
1195 "slide_type": "slide"
1196 }
1197 },
1198 "outputs": [
1199 {
1200 "data": {
1201 "text/html": [
1202 "<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" id=\"0\" class=\"displacy\" width=\"1700\" height=\"362.0\" style=\"max-width: none; height: 362.0px; color: white; background: #09a3d5; font-family: Source Sans Pro\">\n",
1203 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
1204 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Apple</tspan>\n",
1205 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">PROPN</tspan>\n",
1206 "</text>\n",
1207 "\n",
1208 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
1209 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"200\">is</tspan>\n",
1210 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"200\">VERB</tspan>\n",
1211 "</text>\n",
1212 "\n",
1213 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
1214 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"350\">looking</tspan>\n",
1215 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"350\">VERB</tspan>\n",
1216 "</text>\n",
1217 "\n",
1218 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
1219 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"500\">at</tspan>\n",
1220 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"500\">ADP</tspan>\n",
1221 "</text>\n",
1222 "\n",
1223 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
1224 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"650\">buying</tspan>\n",
1225 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"650\">VERB</tspan>\n",
1226 "</text>\n",
1227 "\n",
1228 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
1229 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"800\">U.K.</tspan>\n",
1230 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"800\">PROPN</tspan>\n",
1231 "</text>\n",
1232 "\n",
1233 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
1234 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"950\">startup</tspan>\n",
1235 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"950\">NOUN</tspan>\n",
1236 "</text>\n",
1237 "\n",
1238 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
1239 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1100\">for</tspan>\n",
1240 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1100\">ADP</tspan>\n",
1241 "</text>\n",
1242 "\n",
1243 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
1244 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1250\">$</tspan>\n",
1245 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1250\">SYM</tspan>\n",
1246 "</text>\n",
1247 "\n",
1248 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
1249 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1400\">1</tspan>\n",
1250 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1400\">NUM</tspan>\n",
1251 "</text>\n",
1252 "\n",
1253 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
1254 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1550\">billion</tspan>\n",
1255 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1550\">NUM</tspan>\n",
1256 "</text>\n",
1257 "\n",
1258 "<g class=\"displacy-arrow\">\n",
1259 " <path class=\"displacy-arc\" id=\"arrow-0-0\" stroke-width=\"2px\" d=\"M62,227.0 62,177.0 347.0,177.0 347.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1260 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1261 " <textPath xlink:href=\"#arrow-0-0\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
1262 " </text>\n",
1263 " <path class=\"displacy-arrowhead\" d=\"M62,229.0 L58,221.0 66,221.0\" fill=\"currentColor\"/>\n",
1264 "</g>\n",
1265 "\n",
1266 "<g class=\"displacy-arrow\">\n",
1267 " <path class=\"displacy-arc\" id=\"arrow-0-1\" stroke-width=\"2px\" d=\"M212,227.0 212,202.0 344.0,202.0 344.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1268 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1269 " <textPath xlink:href=\"#arrow-0-1\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">aux</textPath>\n",
1270 " </text>\n",
1271 " <path class=\"displacy-arrowhead\" d=\"M212,229.0 L208,221.0 216,221.0\" fill=\"currentColor\"/>\n",
1272 "</g>\n",
1273 "\n",
1274 "<g class=\"displacy-arrow\">\n",
1275 " <path class=\"displacy-arc\" id=\"arrow-0-2\" stroke-width=\"2px\" d=\"M362,227.0 362,202.0 494.0,202.0 494.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1276 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1277 " <textPath xlink:href=\"#arrow-0-2\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
1278 " </text>\n",
1279 " <path class=\"displacy-arrowhead\" d=\"M494.0,229.0 L498.0,221.0 490.0,221.0\" fill=\"currentColor\"/>\n",
1280 "</g>\n",
1281 "\n",
1282 "<g class=\"displacy-arrow\">\n",
1283 " <path class=\"displacy-arc\" id=\"arrow-0-3\" stroke-width=\"2px\" d=\"M512,227.0 512,202.0 644.0,202.0 644.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1284 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1285 " <textPath xlink:href=\"#arrow-0-3\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">pcomp</textPath>\n",
1286 " </text>\n",
1287 " <path class=\"displacy-arrowhead\" d=\"M644.0,229.0 L648.0,221.0 640.0,221.0\" fill=\"currentColor\"/>\n",
1288 "</g>\n",
1289 "\n",
1290 "<g class=\"displacy-arrow\">\n",
1291 " <path class=\"displacy-arc\" id=\"arrow-0-4\" stroke-width=\"2px\" d=\"M812,227.0 812,202.0 944.0,202.0 944.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1292 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1293 " <textPath xlink:href=\"#arrow-0-4\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
1294 " </text>\n",
1295 " <path class=\"displacy-arrowhead\" d=\"M812,229.0 L808,221.0 816,221.0\" fill=\"currentColor\"/>\n",
1296 "</g>\n",
1297 "\n",
1298 "<g class=\"displacy-arrow\">\n",
1299 " <path class=\"displacy-arc\" id=\"arrow-0-5\" stroke-width=\"2px\" d=\"M662,227.0 662,177.0 947.0,177.0 947.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1300 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1301 " <textPath xlink:href=\"#arrow-0-5\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">dobj</textPath>\n",
1302 " </text>\n",
1303 " <path class=\"displacy-arrowhead\" d=\"M947.0,229.0 L951.0,221.0 943.0,221.0\" fill=\"currentColor\"/>\n",
1304 "</g>\n",
1305 "\n",
1306 "<g class=\"displacy-arrow\">\n",
1307 " <path class=\"displacy-arc\" id=\"arrow-0-6\" stroke-width=\"2px\" d=\"M662,227.0 662,152.0 1100.0,152.0 1100.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1308 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1309 " <textPath xlink:href=\"#arrow-0-6\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
1310 " </text>\n",
1311 " <path class=\"displacy-arrowhead\" d=\"M1100.0,229.0 L1104.0,221.0 1096.0,221.0\" fill=\"currentColor\"/>\n",
1312 "</g>\n",
1313 "\n",
1314 "<g class=\"displacy-arrow\">\n",
1315 " <path class=\"displacy-arc\" id=\"arrow-0-7\" stroke-width=\"2px\" d=\"M1262,227.0 1262,177.0 1547.0,177.0 1547.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1316 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1317 " <textPath xlink:href=\"#arrow-0-7\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">quantmod</textPath>\n",
1318 " </text>\n",
1319 " <path class=\"displacy-arrowhead\" d=\"M1262,229.0 L1258,221.0 1266,221.0\" fill=\"currentColor\"/>\n",
1320 "</g>\n",
1321 "\n",
1322 "<g class=\"displacy-arrow\">\n",
1323 " <path class=\"displacy-arc\" id=\"arrow-0-8\" stroke-width=\"2px\" d=\"M1412,227.0 1412,202.0 1544.0,202.0 1544.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1324 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1325 " <textPath xlink:href=\"#arrow-0-8\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
1326 " </text>\n",
1327 " <path class=\"displacy-arrowhead\" d=\"M1412,229.0 L1408,221.0 1416,221.0\" fill=\"currentColor\"/>\n",
1328 "</g>\n",
1329 "\n",
1330 "<g class=\"displacy-arrow\">\n",
1331 " <path class=\"displacy-arc\" id=\"arrow-0-9\" stroke-width=\"2px\" d=\"M1112,227.0 1112,152.0 1550.0,152.0 1550.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1332 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1333 " <textPath xlink:href=\"#arrow-0-9\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
1334 " </text>\n",
1335 " <path class=\"displacy-arrowhead\" d=\"M1550.0,229.0 L1554.0,221.0 1546.0,221.0\" fill=\"currentColor\"/>\n",
1336 "</g>\n",
1337 "</svg>"
1338 ],
1339 "text/plain": [
1340 "<IPython.core.display.HTML object>"
1341 ]
1342 },
1343 "metadata": {},
1344 "output_type": "display_data"
1345 }
1346 ],
1347 "source": [
1348 "displacy.render(doc, style='dep', options=options, jupyter=True)"
1349 ]
1350 },
1351 {
1352 "cell_type": "markdown",
1353 "metadata": {
1354 "slideshow": {
1355 "slide_type": "slide"
1356 }
1357 },
1358 "source": [
1359 "#### Visualize Named Entities"
1360 ]
1361 },
1362 {
1363 "cell_type": "code",
1364 "execution_count": 27,
1365 "metadata": {
1366 "ExecuteTime": {
1367 "end_time": "2018-11-22T18:24:23.892155Z",
1368 "start_time": "2018-11-22T18:24:23.885944Z"
1369 },
1370 "slideshow": {
1371 "slide_type": "slide"
1372 }
1373 },
1374 "outputs": [
1375 {
1376 "data": {
1377 "text/html": [
1378 "<div class=\"entities\" style=\"line-height: 2.5\">\n",
1379 "<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
1380 " Apple\n",
1381 " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
1382 "</mark>\n",
1383 " is looking at buying \n",
1384 "<mark class=\"entity\" style=\"background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
1385 " U.K.\n",
1386 " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">GPE</span>\n",
1387 "</mark>\n",
1388 " startup for \n",
1389 "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
1390 " $1 billion\n",
1391 " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">MONEY</span>\n",
1392 "</mark>\n",
1393 "</div>"
1394 ],
1395 "text/plain": [
1396 "<IPython.core.display.HTML object>"
1397 ]
1398 },
1399 "metadata": {},
1400 "output_type": "display_data"
1401 }
1402 ],
1403 "source": [
1404 "displacy.render(doc, style='ent', jupyter=True)"
1405 ]
1406 },
1407 {
1408 "cell_type": "markdown",
1409 "metadata": {
1410 "slideshow": {
1411 "slide_type": "slide"
1412 }
1413 },
1414 "source": [
1415 "### Read BBC Data"
1416 ]
1417 },
1418 {
1419 "cell_type": "markdown",
1420 "metadata": {},
1421 "source": [
1422 "We will now read a larger set of 2,225 BBC News articles (see GitHub for data source details) that belong to five categories and are stored in individual text files. To load them, we:\n",
1423 "- call the .glob() method of a pathlib Path object with a recursive pattern, which yields the matching file paths lazily,\n",
1424 "- iterate over the resulting paths,\n",
1425 "- read all lines of each news article, excluding the heading in the first line, and\n",
1426 "- append the cleaned result to a list."
1427 ]
1428 },
1429 {
1430 "cell_type": "code",
1431 "execution_count": 7,
1432 "metadata": {
1433 "ExecuteTime": {
1434 "end_time": "2018-11-22T18:24:44.865932Z",
1435 "start_time": "2018-11-22T18:24:44.780346Z"
1436 },
1437 "slideshow": {
1438 "slide_type": "fragment"
1439 }
1440 },
1441 "outputs": [],
1442 "source": [
1443 "files = (DATA_DIR / 'bbc').glob('**/*.txt')\n",
1444 "bbc_articles = []\n",
1445 "for file in files:\n",
1446 "    with file.open(encoding='latin1') as f:\n",
1447 "        lines = f.readlines()\n",
1448 "        body = ' '.join([line.strip() for line in lines[1:]]).strip()\n",
1449 "        bbc_articles.append(body)"
1450 ]
1451 },
1452 {
1453 "cell_type": "code",
1454 "execution_count": 5,
1455 "metadata": {
1456 "ExecuteTime": {
1457 "end_time": "2018-11-22T18:24:44.870224Z",
1458 "start_time": "2018-11-22T18:24:44.867824Z"
1459 },
1460 "slideshow": {
1461 "slide_type": "fragment"
1462 }
1463 },
1464 "outputs": [
1465 {
1466 "data": {
1467 "text/plain": [
1468 "2225"
1469 ]
1470 },
1471 "execution_count": 5,
1472 "metadata": {},
1473 "output_type": "execute_result"
1474 }
1475 ],
1476 "source": [
1477 "len(bbc_articles)"
1478 ]
1479 },
1480 {
1481 "cell_type": "code",
1482 "execution_count": 6,
1483 "metadata": {
1484 "ExecuteTime": {
1485 "end_time": "2018-11-22T18:24:44.877781Z",
1486 "start_time": "2018-11-22T18:24:44.871120Z"
1487 },
1488 "slideshow": {
1489 "slide_type": "slide"
1490 }
1491 },
1492 "outputs": [
1493 {
1494 "data": {
1495 "text/plain": [
1496 "'Voting is under way for the annual Bloggies which recognise the best web blogs - online spaces where people publish their thoughts - of the year. Nominations were announced on Sunday, but traffic to the official site was so heavy that the website was temporarily closed because of too many visitors. Weblogs have been nominated in 30 categories, from the top regional blog, to the best-kept-secret blog. Blogs had a huge year, with a top US dictionary naming \"blog\" word of 2004. Technorati, a blog search engine, tracks about six million blogs and says that more than 12,000 are added daily. A blog is created every 5.8 seconds, according to US research think-tank Pew Internet and American Life, but less than 40% of the total are updated at least once every two months. Nikolai Nolan, who has run the Bloggies for the past five years, told the BBC News website he was not too surprised by the amount of voters who crowded the site. \"The awards always get a lot of traffic; this was just my first year on a server with a bandwidth limit, so I had to guess how much I\\'d need,\" he said. There were many new finalists this year, he added, and a few that had won Bloggies before. Several entries reflected specific news events. \"There are four nominations for the South-East Asia Earthquake and Tsunami Blog, which is a pretty timely one for 2005,\" said Mr Nolan. The big Bloggies battle will be for the ultimate prize of blog of the year. The nominated blogs are wide-ranging covering what is in the news to quirky sites of interest. Fighting it out for the coveted award are Gawker, This Fish Needs a Bicycle, Wonkette, Boing Boing, and Gothamist. In a sign that blogs are playing an increasingly key part in spreading news and current affairs, The South-East Asia Earthquake and Tsunami Blog is also nominated in the best overall category. GreenFairyDotcom, Londonist, Hicksdesign, PlasticBag and London Underground Tube Blog are the nominees in the best British or Irish weblog. 
Included in the other categories is best \"meme\". This is for the top \"replicating idea that spread about weblogs\". Nominations include Flickr, a web photo album which lets people upload, tag, share and publish their images to blogs. Podcasting has also made an appearance in the category. It is an increasingly popular idea that makes use of RSS (really simple syndication) and audio technology to let people easily make their own radio shows, and distribute them automatically onto portable devices. Many are done by those who already have text-based blogs, so they are almost like audio blogs. Three new categories have been added to the list this year, including best food, best entertainment, and best writing of a weblog. One of the categories that was scrapped though was best music blog. The winners of the fifth annual Bloggies are chosen by the public. Public voting closes on 3 February and the winners will be announced sometime between 13 and 15 March.'"
1497 ]
1498 },
1499 "execution_count": 6,
1500 "metadata": {},
1501 "output_type": "execute_result"
1502 }
1503 ],
1504 "source": [
1505 "bbc_articles[0]"
1506 ]
1507 },
1508 {
1509 "cell_type": "markdown",
1510 "metadata": {
1511 "slideshow": {
1512 "slide_type": "slide"
1513 }
1514 },
1515 "source": [
1516 "### Parse first article through Pipeline"
1517 ]
1518 },
1519 {
1520 "cell_type": "code",
1521 "execution_count": 31,
1522 "metadata": {
1523 "ExecuteTime": {
1524 "end_time": "2018-11-22T18:24:44.885852Z",
1525 "start_time": "2018-11-22T18:24:44.879325Z"
1526 },
1527 "slideshow": {
1528 "slide_type": "fragment"
1529 }
1530 },
1531 "outputs": [
1532 {
1533 "data": {
1534 "text/plain": [
1535 "['tagger', 'parser', 'ner']"
1536 ]
1537 },
1538 "execution_count": 31,
1539 "metadata": {},
1540 "output_type": "execute_result"
1541 }
1542 ],
1543 "source": [
1544 "nlp.pipe_names"
1545 ]
1546 },
1547 {
1548 "cell_type": "code",
1549 "execution_count": 32,
1550 "metadata": {
1551 "ExecuteTime": {
1552 "end_time": "2018-11-22T18:24:44.985896Z",
1553 "start_time": "2018-11-22T18:24:44.887735Z"
1554 },
1555 "slideshow": {
1556 "slide_type": "fragment"
1557 }
1558 },
1559 "outputs": [
1560 {
1561 "data": {
1562 "text/plain": [
1563 "spacy.tokens.doc.Doc"
1564 ]
1565 },
1566 "execution_count": 32,
1567 "metadata": {},
1568 "output_type": "execute_result"
1569 }
1570 ],
1571 "source": [
1572 "doc = nlp(bbc_articles[0])\n",
1573 "type(doc)"
1574 ]
1575 },
1576 {
1577 "cell_type": "markdown",
1578 "metadata": {
1579 "slideshow": {
1580 "slide_type": "slide"
1581 }
1582 },
1583 "source": [
1584 "### Detect sentence boundaries\n",
1585 "Sentence boundaries are calculated from the syntactic parse tree, so features such as punctuation and capitalisation play an important but non-decisive role in determining the sentence boundaries. \n",
1586 "\n",
1587 "Usually this means that the sentence boundaries will at least coincide with clause boundaries, even given poorly punctuated text."
1588 ]
1589 },
1590 {
1591 "cell_type": "markdown",
1592 "metadata": {},
1593 "source": [
1596 "We can access the parsed sentences using the .sents attribute:"
1597 ]
1598 },
1599 {
1600 "cell_type": "code",
1601 "execution_count": 37,
1602 "metadata": {
1603 "ExecuteTime": {
1604 "end_time": "2018-11-22T18:24:44.990720Z",
1605 "start_time": "2018-11-22T18:24:44.987118Z"
1606 },
1607 "slideshow": {
1608 "slide_type": "fragment"
1609 }
1610 },
1611 "outputs": [
1612 {
1613 "data": {
1614 "text/plain": [
1615 "[Voting is under way for the annual Bloggies which recognise the best web blogs - online spaces where people publish their thoughts - of the year. ,\n",
1616 " Nominations were announced on Sunday, but traffic to the official site was so heavy that the website was temporarily closed because of too many visitors.,\n",
1617 " Weblogs have been nominated in 30 categories, from the top regional blog, to the best-kept-secret blog.]"
1618 ]
1619 },
1620 "execution_count": 37,
1621 "metadata": {},
1622 "output_type": "execute_result"
1623 }
1624 ],
1625 "source": [
1626 "sentences = list(doc.sents)\n",
1627 "sentences[:3]"
1628 ]
1629 },
1630 {
1631 "cell_type": "code",
1632 "execution_count": 38,
1633 "metadata": {
1634 "ExecuteTime": {
1635 "end_time": "2018-11-22T18:24:45.004845Z",
1636 "start_time": "2018-11-22T18:24:44.992156Z"
1637 },
1638 "slideshow": {
1639 "slide_type": "fragment"
1640 }
1641 },
1642 "outputs": [
1643 {
1644 "name": "stdout",
1645 "output_type": "stream",
1646 "text": [
1647 "['as_doc', 'doc', 'end', 'end_char', 'ent_id', 'ent_id_', 'ents', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'label', 'label_', 'lefts', 'lemma_', 'lower_', 'merge', 'n_lefts', 'n_rights', 'noun_chunks', 'orth_', 'remove_extension', 'rights', 'root', 'sent', 'sentiment', 'set_extension', 'similarity', 'start', 'start_char', 'string', 'subtree', 'text', 'text_with_ws', 'to_array', 'upper_', 'vector', 'vector_norm', 'vocab'] "
1648 ]
1649 }
1650 ],
1651 "source": [
1652 "get_attributes(sentences[0])"
1653 ]
1654 },
1655 {
1656 "cell_type": "code",
1657 "execution_count": 39,
1658 "metadata": {
1659 "ExecuteTime": {
1660 "end_time": "2018-11-22T18:24:45.024224Z",
1661 "start_time": "2018-11-22T18:24:45.006282Z"
1662 },
1663 "slideshow": {
1664 "slide_type": "slide"
1665 }
1666 },
1667 "outputs": [
1668 {
1669 "data": {
1670 "text/html": [
1671 "<div>\n",
1672 "<style scoped>\n",
1673 " .dataframe tbody tr th:only-of-type {\n",
1674 " vertical-align: middle;\n",
1675 " }\n",
1676 "\n",
1677 " .dataframe tbody tr th {\n",
1678 " vertical-align: top;\n",
1679 " }\n",
1680 "\n",
1681 " .dataframe thead th {\n",
1682 " text-align: right;\n",
1683 " }\n",
1684 "</style>\n",
1685 "<table border=\"1\" class=\"dataframe\">\n",
1686 " <thead>\n",
1687 " <tr style=\"text-align: right;\">\n",
1688 " <th></th>\n",
1689 " <th>Token</th>\n",
1690 " <th>POS Tag</th>\n",
1691 " <th>Meaning</th>\n",
1692 " </tr>\n",
1693 " </thead>\n",
1694 " <tbody>\n",
1695 " <tr>\n",
1696 " <th>0</th>\n",
1697 " <td>Voting</td>\n",
1698 " <td>NOUN</td>\n",
1699 " <td>noun</td>\n",
1700 " </tr>\n",
1701 " <tr>\n",
1702 " <th>1</th>\n",
1703 " <td>is</td>\n",
1704 " <td>VERB</td>\n",
1705 " <td>verb</td>\n",
1706 " </tr>\n",
1707 " <tr>\n",
1708 " <th>2</th>\n",
1709 " <td>under</td>\n",
1710 " <td>ADP</td>\n",
1711 " <td>adposition</td>\n",
1712 " </tr>\n",
1713 " <tr>\n",
1714 " <th>3</th>\n",
1715 " <td>way</td>\n",
1716 " <td>NOUN</td>\n",
1717 " <td>noun</td>\n",
1718 " </tr>\n",
1719 " <tr>\n",
1720 " <th>4</th>\n",
1721 " <td>for</td>\n",
1722 " <td>ADP</td>\n",
1723 " <td>adposition</td>\n",
1724 " </tr>\n",
1725 " <tr>\n",
1726 " <th>5</th>\n",
1727 " <td>the</td>\n",
1728 " <td>DET</td>\n",
1729 " <td>determiner</td>\n",
1730 " </tr>\n",
1731 " <tr>\n",
1732 " <th>6</th>\n",
1733 " <td>annual</td>\n",
1734 " <td>ADJ</td>\n",
1735 " <td>adjective</td>\n",
1736 " </tr>\n",
1737 " <tr>\n",
1738 " <th>7</th>\n",
1739 " <td>Bloggies</td>\n",
1740 " <td>PROPN</td>\n",
1741 " <td>proper noun</td>\n",
1742 " </tr>\n",
1743 " <tr>\n",
1744 " <th>8</th>\n",
1745 " <td>which</td>\n",
1746 " <td>ADJ</td>\n",
1747 " <td>adjective</td>\n",
1748 " </tr>\n",
1749 " <tr>\n",
1750 " <th>9</th>\n",
1751 " <td>recognise</td>\n",
1752 " <td>VERB</td>\n",
1753 " <td>verb</td>\n",
1754 " </tr>\n",
1755 " <tr>\n",
1756 " <th>10</th>\n",
1757 " <td>the</td>\n",
1758 " <td>DET</td>\n",
1759 " <td>determiner</td>\n",
1760 " </tr>\n",
1761 " <tr>\n",
1762 " <th>11</th>\n",
1763 " <td>best</td>\n",
1764 " <td>ADJ</td>\n",
1765 " <td>adjective</td>\n",
1766 " </tr>\n",
1767 " <tr>\n",
1768 " <th>12</th>\n",
1769 " <td>web</td>\n",
1770 " <td>NOUN</td>\n",
1771 " <td>noun</td>\n",
1772 " </tr>\n",
1773 " <tr>\n",
1774 " <th>13</th>\n",
1775 " <td>blogs</td>\n",
1776 " <td>NOUN</td>\n",
1777 " <td>noun</td>\n",
1778 " </tr>\n",
1779 " <tr>\n",
1780 " <th>14</th>\n",
1781 " <td>-</td>\n",
1782 " <td>PUNCT</td>\n",
1783 " <td>punctuation</td>\n",
1784 " </tr>\n",
1785 " </tbody>\n",
1786 "</table>\n",
1787 "</div>"
1788 ],
1789 "text/plain": [
1790 " Token POS Tag Meaning\n",
1791 "0 Voting NOUN noun\n",
1792 "1 is VERB verb\n",
1793 "2 under ADP adposition\n",
1794 "3 way NOUN noun\n",
1795 "4 for ADP adposition\n",
1796 "5 the DET determiner\n",
1797 "6 annual ADJ adjective\n",
1798 "7 Bloggies PROPN proper noun\n",
1799 "8 which ADJ adjective\n",
1800 "9 recognise VERB verb\n",
1801 "10 the DET determiner\n",
1802 "11 best ADJ adjective\n",
1803 "12 web NOUN noun\n",
1804 "13 blogs NOUN noun\n",
1805 "14 - PUNCT punctuation"
1806 ]
1807 },
1808 "execution_count": 39,
1809 "metadata": {},
1810 "output_type": "execute_result"
1811 }
1812 ],
1813 "source": [
1814 "pd.DataFrame([[t.text, t.pos_, spacy.explain(t.pos_)] for t in sentences[0]],\n",
1815 "             columns=['Token', 'POS Tag', 'Meaning']).head(15)"
1816 ]
1817 },
1818 {
1819 "cell_type": "code",
1820 "execution_count": 40,
1821 "metadata": {
1822 "ExecuteTime": {
1823 "end_time": "2018-11-22T18:24:45.033202Z",
1824 "start_time": "2018-11-22T18:24:45.025531Z"
1825 },
1826 "slideshow": {
1827 "slide_type": "slide"
1828 }
1829 },
1830 "outputs": [
1831 {
1832 "data": {
1833 "text/html": [
1834 "<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" id=\"0\" class=\"displacy\" width=\"3800\" height=\"662.0\" style=\"max-width: none; height: 662.0px; color: white; background: #09a3d5; font-family: Source Sans Pro\">\n",
1835 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1836 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Voting</tspan>\n",
1837 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">NOUN</tspan>\n",
1838 "</text>\n",
1839 "\n",
1840 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1841 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"200\">is</tspan>\n",
1842 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"200\">VERB</tspan>\n",
1843 "</text>\n",
1844 "\n",
1845 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1846 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"350\">under</tspan>\n",
1847 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"350\">ADP</tspan>\n",
1848 "</text>\n",
1849 "\n",
1850 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1851 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"500\">way</tspan>\n",
1852 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"500\">NOUN</tspan>\n",
1853 "</text>\n",
1854 "\n",
1855 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1856 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"650\">for</tspan>\n",
1857 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"650\">ADP</tspan>\n",
1858 "</text>\n",
1859 "\n",
1860 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1861 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"800\">the</tspan>\n",
1862 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"800\">DET</tspan>\n",
1863 "</text>\n",
1864 "\n",
1865 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1866 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"950\">annual</tspan>\n",
1867 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"950\">ADJ</tspan>\n",
1868 "</text>\n",
1869 "\n",
1870 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1871 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1100\">Bloggies</tspan>\n",
1872 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1100\">PROPN</tspan>\n",
1873 "</text>\n",
1874 "\n",
1875 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1876 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1250\">which</tspan>\n",
1877 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1250\">ADJ</tspan>\n",
1878 "</text>\n",
1879 "\n",
1880 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1881 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1400\">recognise</tspan>\n",
1882 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1400\">VERB</tspan>\n",
1883 "</text>\n",
1884 "\n",
1885 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1886 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1550\">the</tspan>\n",
1887 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1550\">DET</tspan>\n",
1888 "</text>\n",
1889 "\n",
1890 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1891 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1700\">best</tspan>\n",
1892 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1700\">ADJ</tspan>\n",
1893 "</text>\n",
1894 "\n",
1895 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1896 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1850\">web</tspan>\n",
1897 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1850\">NOUN</tspan>\n",
1898 "</text>\n",
1899 "\n",
1900 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1901 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2000\">blogs -</tspan>\n",
1902 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2000\">NOUN</tspan>\n",
1903 "</text>\n",
1904 "\n",
1905 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1906 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2150\">online</tspan>\n",
1907 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2150\">NOUN</tspan>\n",
1908 "</text>\n",
1909 "\n",
1910 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1911 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2300\">spaces</tspan>\n",
1912 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2300\">NOUN</tspan>\n",
1913 "</text>\n",
1914 "\n",
1915 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1916 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2450\">where</tspan>\n",
1917 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2450\">ADV</tspan>\n",
1918 "</text>\n",
1919 "\n",
1920 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1921 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2600\">people</tspan>\n",
1922 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2600\">NOUN</tspan>\n",
1923 "</text>\n",
1924 "\n",
1925 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1926 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2750\">publish</tspan>\n",
1927 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2750\">VERB</tspan>\n",
1928 "</text>\n",
1929 "\n",
1930 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1931 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2900\">their</tspan>\n",
1932 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2900\">ADJ</tspan>\n",
1933 "</text>\n",
1934 "\n",
1935 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1936 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3050\">thoughts -</tspan>\n",
1937 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3050\">NOUN</tspan>\n",
1938 "</text>\n",
1939 "\n",
1940 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1941 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3200\">of</tspan>\n",
1942 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3200\">ADP</tspan>\n",
1943 "</text>\n",
1944 "\n",
1945 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1946 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3350\">the</tspan>\n",
1947 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3350\">DET</tspan>\n",
1948 "</text>\n",
1949 "\n",
1950 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1951 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3500\">year.</tspan>\n",
1952 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3500\">NOUN</tspan>\n",
1953 "</text>\n",
1954 "\n",
1955 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"572.0\">\n",
1956 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3650\"> </tspan>\n",
1957 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3650\"></tspan>\n",
1958 "</text>\n",
1959 "\n",
1960 "<g class=\"displacy-arrow\">\n",
1961 " <path class=\"displacy-arc\" id=\"arrow-0-0\" stroke-width=\"2px\" d=\"M62,527.0 62,502.0 182.0,502.0 182.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1962 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1963 " <textPath xlink:href=\"#arrow-0-0\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
1964 " </text>\n",
1965 " <path class=\"displacy-arrowhead\" d=\"M62,529.0 L58,521.0 66,521.0\" fill=\"currentColor\"/>\n",
1966 "</g>\n",
1967 "\n",
1968 "<g class=\"displacy-arrow\">\n",
1969 " <path class=\"displacy-arc\" id=\"arrow-0-1\" stroke-width=\"2px\" d=\"M212,527.0 212,502.0 332.0,502.0 332.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1970 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1971 " <textPath xlink:href=\"#arrow-0-1\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
1972 " </text>\n",
1973 " <path class=\"displacy-arrowhead\" d=\"M332.0,529.0 L336.0,521.0 328.0,521.0\" fill=\"currentColor\"/>\n",
1974 "</g>\n",
1975 "\n",
1976 "<g class=\"displacy-arrow\">\n",
1977 " <path class=\"displacy-arc\" id=\"arrow-0-2\" stroke-width=\"2px\" d=\"M362,527.0 362,502.0 482.0,502.0 482.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1978 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1979 " <textPath xlink:href=\"#arrow-0-2\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
1980 " </text>\n",
1981 " <path class=\"displacy-arrowhead\" d=\"M482.0,529.0 L486.0,521.0 478.0,521.0\" fill=\"currentColor\"/>\n",
1982 "</g>\n",
1983 "\n",
1984 "<g class=\"displacy-arrow\">\n",
1985 " <path class=\"displacy-arc\" id=\"arrow-0-3\" stroke-width=\"2px\" d=\"M212,527.0 212,452.0 638.0,452.0 638.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1986 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1987 " <textPath xlink:href=\"#arrow-0-3\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
1988 " </text>\n",
1989 " <path class=\"displacy-arrowhead\" d=\"M638.0,529.0 L642.0,521.0 634.0,521.0\" fill=\"currentColor\"/>\n",
1990 "</g>\n",
1991 "\n",
1992 "<g class=\"displacy-arrow\">\n",
1993 " <path class=\"displacy-arc\" id=\"arrow-0-4\" stroke-width=\"2px\" d=\"M812,527.0 812,477.0 1085.0,477.0 1085.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
1994 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
1995 " <textPath xlink:href=\"#arrow-0-4\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
1996 " </text>\n",
1997 " <path class=\"displacy-arrowhead\" d=\"M812,529.0 L808,521.0 816,521.0\" fill=\"currentColor\"/>\n",
1998 "</g>\n",
1999 "\n",
2000 "<g class=\"displacy-arrow\">\n",
2001 " <path class=\"displacy-arc\" id=\"arrow-0-5\" stroke-width=\"2px\" d=\"M962,527.0 962,502.0 1082.0,502.0 1082.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2002 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2003 " <textPath xlink:href=\"#arrow-0-5\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
2004 " </text>\n",
2005 " <path class=\"displacy-arrowhead\" d=\"M962,529.0 L958,521.0 966,521.0\" fill=\"currentColor\"/>\n",
2006 "</g>\n",
2007 "\n",
2008 "<g class=\"displacy-arrow\">\n",
2009 " <path class=\"displacy-arc\" id=\"arrow-0-6\" stroke-width=\"2px\" d=\"M662,527.0 662,452.0 1088.0,452.0 1088.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2010 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2011 " <textPath xlink:href=\"#arrow-0-6\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
2012 " </text>\n",
2013 " <path class=\"displacy-arrowhead\" d=\"M1088.0,529.0 L1092.0,521.0 1084.0,521.0\" fill=\"currentColor\"/>\n",
2014 "</g>\n",
2015 "\n",
2016 "<g class=\"displacy-arrow\">\n",
2017 " <path class=\"displacy-arc\" id=\"arrow-0-7\" stroke-width=\"2px\" d=\"M1262,527.0 1262,502.0 1382.0,502.0 1382.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2018 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2019 " <textPath xlink:href=\"#arrow-0-7\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
2020 " </text>\n",
2021 " <path class=\"displacy-arrowhead\" d=\"M1262,529.0 L1258,521.0 1266,521.0\" fill=\"currentColor\"/>\n",
2022 "</g>\n",
2023 "\n",
2024 "<g class=\"displacy-arrow\">\n",
2025 " <path class=\"displacy-arc\" id=\"arrow-0-8\" stroke-width=\"2px\" d=\"M1112,527.0 1112,477.0 1385.0,477.0 1385.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2026 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2027 " <textPath xlink:href=\"#arrow-0-8\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">relcl</textPath>\n",
2028 " </text>\n",
2029 " <path class=\"displacy-arrowhead\" d=\"M1385.0,529.0 L1389.0,521.0 1381.0,521.0\" fill=\"currentColor\"/>\n",
2030 "</g>\n",
2031 "\n",
2032 "<g class=\"displacy-arrow\">\n",
2033 " <path class=\"displacy-arc\" id=\"arrow-0-9\" stroke-width=\"2px\" d=\"M1562,527.0 1562,402.0 2294.0,402.0 2294.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2034 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2035 " <textPath xlink:href=\"#arrow-0-9\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
2036 " </text>\n",
2037 " <path class=\"displacy-arrowhead\" d=\"M1562,529.0 L1558,521.0 1566,521.0\" fill=\"currentColor\"/>\n",
2038 "</g>\n",
2039 "\n",
2040 "<g class=\"displacy-arrow\">\n",
2041 " <path class=\"displacy-arc\" id=\"arrow-0-10\" stroke-width=\"2px\" d=\"M1712,527.0 1712,427.0 2291.0,427.0 2291.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2042 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2043 " <textPath xlink:href=\"#arrow-0-10\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
2044 " </text>\n",
2045 " <path class=\"displacy-arrowhead\" d=\"M1712,529.0 L1708,521.0 1716,521.0\" fill=\"currentColor\"/>\n",
2046 "</g>\n",
2047 "\n",
2048 "<g class=\"displacy-arrow\">\n",
2049 " <path class=\"displacy-arc\" id=\"arrow-0-11\" stroke-width=\"2px\" d=\"M1862,527.0 1862,502.0 1982.0,502.0 1982.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2050 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2051 " <textPath xlink:href=\"#arrow-0-11\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
2052 " </text>\n",
2053 " <path class=\"displacy-arrowhead\" d=\"M1862,529.0 L1858,521.0 1866,521.0\" fill=\"currentColor\"/>\n",
2054 "</g>\n",
2055 "\n",
2056 "<g class=\"displacy-arrow\">\n",
2057 " <path class=\"displacy-arc\" id=\"arrow-0-12\" stroke-width=\"2px\" d=\"M2012,527.0 2012,502.0 2132.0,502.0 2132.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2058 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2059 " <textPath xlink:href=\"#arrow-0-12\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
2060 " </text>\n",
2061 " <path class=\"displacy-arrowhead\" d=\"M2012,529.0 L2008,521.0 2016,521.0\" fill=\"currentColor\"/>\n",
2062 "</g>\n",
2063 "\n",
2064 "<g class=\"displacy-arrow\">\n",
2065 " <path class=\"displacy-arc\" id=\"arrow-0-13\" stroke-width=\"2px\" d=\"M2162,527.0 2162,502.0 2282.0,502.0 2282.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2066 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2067 " <textPath xlink:href=\"#arrow-0-13\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
2068 " </text>\n",
2069 " <path class=\"displacy-arrowhead\" d=\"M2162,529.0 L2158,521.0 2166,521.0\" fill=\"currentColor\"/>\n",
2070 "</g>\n",
2071 "\n",
2072 "<g class=\"displacy-arrow\">\n",
2073 " <path class=\"displacy-arc\" id=\"arrow-0-14\" stroke-width=\"2px\" d=\"M1412,527.0 1412,377.0 2297.0,377.0 2297.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2074 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2075 " <textPath xlink:href=\"#arrow-0-14\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">dobj</textPath>\n",
2076 " </text>\n",
2077 " <path class=\"displacy-arrowhead\" d=\"M2297.0,529.0 L2301.0,521.0 2293.0,521.0\" fill=\"currentColor\"/>\n",
2078 "</g>\n",
2079 "\n",
2080 "<g class=\"displacy-arrow\">\n",
2081 " <path class=\"displacy-arc\" id=\"arrow-0-15\" stroke-width=\"2px\" d=\"M2462,527.0 2462,477.0 2735.0,477.0 2735.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2082 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2083 " <textPath xlink:href=\"#arrow-0-15\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">advmod</textPath>\n",
2084 " </text>\n",
2085 " <path class=\"displacy-arrowhead\" d=\"M2462,529.0 L2458,521.0 2466,521.0\" fill=\"currentColor\"/>\n",
2086 "</g>\n",
2087 "\n",
2088 "<g class=\"displacy-arrow\">\n",
2089 " <path class=\"displacy-arc\" id=\"arrow-0-16\" stroke-width=\"2px\" d=\"M2612,527.0 2612,502.0 2732.0,502.0 2732.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2090 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2091 " <textPath xlink:href=\"#arrow-0-16\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
2092 " </text>\n",
2093 " <path class=\"displacy-arrowhead\" d=\"M2612,529.0 L2608,521.0 2616,521.0\" fill=\"currentColor\"/>\n",
2094 "</g>\n",
2095 "\n",
2096 "<g class=\"displacy-arrow\">\n",
2097 " <path class=\"displacy-arc\" id=\"arrow-0-17\" stroke-width=\"2px\" d=\"M2312,527.0 2312,452.0 2738.0,452.0 2738.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2098 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2099 " <textPath xlink:href=\"#arrow-0-17\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">relcl</textPath>\n",
2100 " </text>\n",
2101 " <path class=\"displacy-arrowhead\" d=\"M2738.0,529.0 L2742.0,521.0 2734.0,521.0\" fill=\"currentColor\"/>\n",
2102 "</g>\n",
2103 "\n",
2104 "<g class=\"displacy-arrow\">\n",
2105 " <path class=\"displacy-arc\" id=\"arrow-0-18\" stroke-width=\"2px\" d=\"M2912,527.0 2912,502.0 3032.0,502.0 3032.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2106 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2107 " <textPath xlink:href=\"#arrow-0-18\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">poss</textPath>\n",
2108 " </text>\n",
2109 " <path class=\"displacy-arrowhead\" d=\"M2912,529.0 L2908,521.0 2916,521.0\" fill=\"currentColor\"/>\n",
2110 "</g>\n",
2111 "\n",
2112 "<g class=\"displacy-arrow\">\n",
2113 " <path class=\"displacy-arc\" id=\"arrow-0-19\" stroke-width=\"2px\" d=\"M2762,527.0 2762,477.0 3035.0,477.0 3035.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2114 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2115 " <textPath xlink:href=\"#arrow-0-19\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">dobj</textPath>\n",
2116 " </text>\n",
2117 " <path class=\"displacy-arrowhead\" d=\"M3035.0,529.0 L3039.0,521.0 3031.0,521.0\" fill=\"currentColor\"/>\n",
2118 "</g>\n",
2119 "\n",
2120 "<g class=\"displacy-arrow\">\n",
2121 " <path class=\"displacy-arc\" id=\"arrow-0-20\" stroke-width=\"2px\" d=\"M2762,527.0 2762,452.0 3188.0,452.0 3188.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2122 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2123 " <textPath xlink:href=\"#arrow-0-20\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
2124 " </text>\n",
2125 " <path class=\"displacy-arrowhead\" d=\"M3188.0,529.0 L3192.0,521.0 3184.0,521.0\" fill=\"currentColor\"/>\n",
2126 "</g>\n",
2127 "\n",
2128 "<g class=\"displacy-arrow\">\n",
2129 " <path class=\"displacy-arc\" id=\"arrow-0-21\" stroke-width=\"2px\" d=\"M3362,527.0 3362,502.0 3482.0,502.0 3482.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2130 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2131 " <textPath xlink:href=\"#arrow-0-21\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
2132 " </text>\n",
2133 " <path class=\"displacy-arrowhead\" d=\"M3362,529.0 L3358,521.0 3366,521.0\" fill=\"currentColor\"/>\n",
2134 "</g>\n",
2135 "\n",
2136 "<g class=\"displacy-arrow\">\n",
2137 " <path class=\"displacy-arc\" id=\"arrow-0-22\" stroke-width=\"2px\" d=\"M212,527.0 212,352.0 3500.0,352.0 3500.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2138 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2139 " <textPath xlink:href=\"#arrow-0-22\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">punct</textPath>\n",
2140 " </text>\n",
2141 " <path class=\"displacy-arrowhead\" d=\"M3500.0,529.0 L3504.0,521.0 3496.0,521.0\" fill=\"currentColor\"/>\n",
2142 "</g>\n",
2143 "\n",
2144 "<g class=\"displacy-arrow\">\n",
2145 " <path class=\"displacy-arc\" id=\"arrow-0-23\" stroke-width=\"2px\" d=\"M3512,527.0 3512,502.0 3632.0,502.0 3632.0,527.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2146 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2147 " <textPath xlink:href=\"#arrow-0-23\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\"></textPath>\n",
2148 " </text>\n",
2149 " <path class=\"displacy-arrowhead\" d=\"M3632.0,529.0 L3636.0,521.0 3628.0,521.0\" fill=\"currentColor\"/>\n",
2150 "</g>\n",
2151 "</svg>"
2152 ],
2153 "text/plain": [
2154 "<IPython.core.display.HTML object>"
2155 ]
2156 },
2157 "metadata": {},
2158 "output_type": "display_data"
2159 }
2160 ],
2161 "source": [
2162 "options = {'compact': True, 'bg': '#09a3d5',\n",
2163 " 'color': 'white', 'font': 'Source Sans Pro'}\n",
2164 "displacy.render(sentences[0].as_doc(), style='dep', jupyter=True, options=options)"
2165 ]
2166 },
2167 {
2168 "cell_type": "code",
2169 "execution_count": 41,
2170 "metadata": {
2171 "ExecuteTime": {
2172 "end_time": "2018-11-22T18:24:45.040053Z",
2173 "start_time": "2018-11-22T18:24:45.034511Z"
2174 },
2175 "slideshow": {
2176 "slide_type": "slide"
2177 }
2178 },
2179 "outputs": [
2180 {
2181 "name": "stdout",
2182 "output_type": "stream",
2183 "text": [
2184 "annual | DATE | Absolute or relative dates or periods\n",
2185 "the | DATE | Absolute or relative dates or periods\n",
2186 "year | DATE | Absolute or relative dates or periods\n"
2187 ]
2188 }
2189 ],
2190 "source": [
2191 "for t in sentences[0]:\n",
2192 " if t.ent_type_:\n",
2193 " print('{} | {} | {}'.format(t.text, t.ent_type_, spacy.explain(t.ent_type_)))"
2194 ]
2195 },
2196 {
2197 "cell_type": "code",
2198 "execution_count": 42,
2199 "metadata": {
2200 "ExecuteTime": {
2201 "end_time": "2018-11-22T18:24:45.051981Z",
2202 "start_time": "2018-11-22T18:24:45.041838Z"
2203 },
2204 "slideshow": {
2205 "slide_type": "fragment"
2206 }
2207 },
2208 "outputs": [
2209 {
2210 "data": {
2211 "text/html": [
2212 "<div class=\"entities\" style=\"line-height: 2.5\">Voting is under way for the \n",
2213 "<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
2214 " annual\n",
2215 " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DATE</span>\n",
2216 "</mark>\n",
2217 " Bloggies which recognise the best web blogs - online spaces where people publish their thoughts - of \n",
2218 "<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
2219 " the year\n",
2220 " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DATE</span>\n",
2221 "</mark>\n",
2222 ". </div>"
2223 ],
2224 "text/plain": [
2225 "<IPython.core.display.HTML object>"
2226 ]
2227 },
2228 "metadata": {},
2229 "output_type": "display_data"
2230 }
2231 ],
2232 "source": [
2233 "displacy.render(sentences[0].as_doc(), style='ent', jupyter=True)"
2234 ]
2235 },
2236 {
2237 "cell_type": "markdown",
2238 "metadata": {},
2239 "source": [
2240 "### Named Entity Recognition with textacy"
2241 ]
2242 },
2243 {
2244 "cell_type": "markdown",
2245 "metadata": {},
2246 "source": [
2247 "spaCy exposes each token's named-entity label through the `.ent_type_` attribute, as used in the loop over the first sentence above."
2248 ]
2249 },
2250 {
2251 "cell_type": "markdown",
2252 "metadata": {},
2253 "source": [
2254 "textacy makes it easy to access the named entities that appear in the first article:"
2255 ]
2256 },
2257 {
2258 "cell_type": "code",
2259 "execution_count": 43,
2260 "metadata": {
2261 "ExecuteTime": {
2262 "end_time": "2018-11-22T18:24:45.061541Z",
2263 "start_time": "2018-11-22T18:24:45.053653Z"
2264 }
2265 },
2266 "outputs": [
2267 {
2268 "data": {
2269 "text/plain": [
2270 "year 4\n",
2271 "US 2\n",
2272 "annual 2\n",
2273 "Tsunami Blog 2\n",
2274 "South-East Asia Earthquake 2\n",
2275 "dtype: int64"
2276 ]
2277 },
2278 "execution_count": 43,
2279 "metadata": {},
2280 "output_type": "execute_result"
2281 }
2282 ],
2283 "source": [
2284 "from textacy.extract import named_entities\n",
2285 "entities = [e.text for e in named_entities(doc)]\n",
2286 "pd.Series(entities).value_counts().head()"
2287 ]
2288 },
2289 {
2290 "cell_type": "markdown",
2291 "metadata": {},
2292 "source": [
2293 "### N-Grams with textacy"
2294 ]
2295 },
2296 {
2297 "cell_type": "markdown",
2298 "metadata": {},
2299 "source": [
2300 "N-grams combine N consecutive tokens. This can be useful for the bag-of-words model because, depending on the textual context, treating, e.g., ‘data scientist’ as a single token may be more meaningful than the two distinct tokens ‘data’ and ‘scientist’.\n",
2301 "\n",
2302 "textacy makes it easy to view the n-grams of a given length `n` that occur at least `min_freq` times:"
2303 ]
2304 },
2305 {
2306 "cell_type": "code",
2307 "execution_count": 44,
2308 "metadata": {
2309 "ExecuteTime": {
2310 "end_time": "2018-11-22T18:24:45.072572Z",
2311 "start_time": "2018-11-22T18:24:45.062472Z"
2312 }
2313 },
2314 "outputs": [
2315 {
2316 "data": {
2317 "text/plain": [
2318 "Tsunami Blog 2\n",
2319 "annual Bloggies 2\n",
2320 "East Asia 2\n",
2321 "Asia Earthquake 2\n",
2322 "dtype: int64"
2323 ]
2324 },
2325 "execution_count": 44,
2326 "metadata": {},
2327 "output_type": "execute_result"
2328 }
2329 ],
2330 "source": [
2331 "from textacy.extract import ngrams\n",
2332 "pd.Series([n.text for n in ngrams(doc, n=2, min_freq=2)]).value_counts()"
2333 ]
2334 },
2335 {
2336 "cell_type": "markdown",
2337 "metadata": {
2338 "slideshow": {
2339 "slide_type": "slide"
2340 }
2341 },
2342 "source": [
2343 "### The spaCy streaming Pipeline API"
2344 ]
2345 },
2346 {
2347 "cell_type": "markdown",
2348 "metadata": {},
2349 "source": [
2350 "To pass a larger number of documents through the processing pipeline, we can use spaCy’s streaming API as follows:"
2351 ]
2352 },
2353 {
2354 "cell_type": "code",
2355 "execution_count": 45,
2356 "metadata": {
2357 "ExecuteTime": {
2358 "end_time": "2018-11-22T18:26:32.075601Z",
2359 "start_time": "2018-11-22T18:24:45.073615Z"
2360 },
2361 "scrolled": true,
2362 "slideshow": {
2363 "slide_type": "fragment"
2364 }
2365 },
2366 "outputs": [
2367 {
2368 "name": "stdout",
2369 "output_type": "stream",
2370 "text": [
2371 "0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 "
2372 ]
2373 }
2374 ],
2375 "source": [
2376 "iter_texts = (bbc_articles[i] for i in range(len(bbc_articles)))\n",
2377 "for i, doc in enumerate(nlp.pipe(iter_texts, batch_size=50, n_threads=8)):\n",
2378 " if i % 100 == 0:\n",
2379 " print(i, end = ' ')\n",
2380 " assert doc.is_parsed"
2381 ]
2382 },
2383 {
2384 "cell_type": "markdown",
2385 "metadata": {
2386 "slideshow": {
2387 "slide_type": "slide"
2388 }
2389 },
2390 "source": [
2391 "### Multi-language Features"
2392 ]
2393 },
2394 {
2395 "cell_type": "markdown",
2396 "metadata": {},
2397 "source": [
2398 "spaCy includes trained language models for English, German, Spanish, Portuguese, French, Italian and Dutch, as well as a multi-language model for named-entity recognition. Cross-language usage is straightforward since the API does not change.\n",
2399 "\n",
2400 "We will illustrate the Spanish language model using a parallel corpus of TED talk subtitles. For this purpose, we instantiate both the English and the Spanish language models:"
2401 ]
2402 },
2403 {
2404 "cell_type": "markdown",
2405 "metadata": {
2406 "slideshow": {
2407 "slide_type": "fragment"
2408 }
2409 },
2410 "source": [
2411 "#### Create a Spanish Language Object"
2412 ]
2413 },
2414 {
2415 "cell_type": "code",
2416 "execution_count": 46,
2417 "metadata": {
2418 "ExecuteTime": {
2419 "end_time": "2018-11-22T18:26:32.778571Z",
2420 "start_time": "2018-11-22T18:26:32.076935Z"
2421 },
2422 "slideshow": {
2423 "slide_type": "fragment"
2424 }
2425 },
2426 "outputs": [],
2427 "source": [
2428 "model = {}\n",
2429 "for language in ['en', 'es']:\n",
2430 " model[language] = spacy.load(language) "
2431 ]
2432 },
2433 {
2434 "cell_type": "markdown",
2435 "metadata": {
2436 "slideshow": {
2437 "slide_type": "slide"
2438 }
2439 },
2440 "source": [
2441 "#### Read bilingual TED2013 samples"
2442 ]
2443 },
2444 {
2445 "cell_type": "code",
2446 "execution_count": 48,
2447 "metadata": {
2448 "ExecuteTime": {
2449 "end_time": "2018-11-22T18:29:55.500761Z",
2450 "start_time": "2018-11-22T18:29:55.490626Z"
2451 },
2452 "slideshow": {
2453 "slide_type": "fragment"
2454 }
2455 },
2456 "outputs": [],
2457 "source": [
2458 "text = {}\n",
2459 "path = Path('data/TED')\n",
2460 "for language in ['en', 'es']:\n",
2461 " file_name = path / 'TED2013_sample.{}'.format(language)\n",
2462 " text[language] = file_name.read_text(encoding='utf-8')"
2463 ]
2464 },
2465 {
2466 "cell_type": "markdown",
2467 "metadata": {
2468 "slideshow": {
2469 "slide_type": "slide"
2470 }
2471 },
2472 "source": [
2473 "#### Sentence Boundaries English vs Spanish"
2474 ]
2475 },
2476 {
2477 "cell_type": "code",
2478 "execution_count": 49,
2479 "metadata": {
2480 "ExecuteTime": {
2481 "end_time": "2018-11-22T18:29:58.265211Z",
2482 "start_time": "2018-11-22T18:29:58.161340Z"
2483 },
2484 "slideshow": {
2485 "slide_type": "fragment"
2486 }
2487 },
2488 "outputs": [
2489 {
2490 "name": "stdout",
2491 "output_type": "stream",
2492 "text": [
2493 "Sentences: en 19\n",
2494 "Sentences: es 22\n"
2495 ]
2496 }
2497 ],
2498 "source": [
2499 "parsed, sentences = {}, {}\n",
2500 "for language in ['en', 'es']:\n",
2501 " parsed[language] = model[language](text[language])\n",
2502 " sentences[language] = list(parsed[language].sents)\n",
2503 " print('Sentences:', language, len(sentences[language]))"
2504 ]
2505 },
2506 {
2507 "cell_type": "code",
2508 "execution_count": 50,
2509 "metadata": {
2510 "ExecuteTime": {
2511 "end_time": "2018-11-22T18:29:58.321011Z",
2512 "start_time": "2018-11-22T18:29:58.317422Z"
2513 },
2514 "slideshow": {
2515 "slide_type": "slide"
2516 }
2517 },
2518 "outputs": [
2519 {
2520 "name": "stdout",
2521 "output_type": "stream",
2522 "text": [
2523 "\n",
2524 " 1\n",
2525 "English:\t There's a tight and surprising link between the ocean's health and ours, says marine biologist Stephen Palumbi.\n",
2526 "Spanish:\t Existe una estrecha y sorprendente relación entre nuestra salud y la salud del océano, dice el biologo marino Stephen Palumbi.\n",
2527 "\n",
2528 " 2\n",
2529 "English:\t He shows how toxins at the bottom of the ocean food chain find their way into our bodies, with a shocking story of toxic contamination from a Japanese fish market.\n",
2530 "Spanish:\t Nos muestra, através de una impactante historia acerca de la contaminación tóxica en el mercado pesquero japonés, como las toxinas de la cadena alimenticia del fondo oceánico llegan a nuestro cuerpo.\n",
2531 "\n",
2532 " 3\n",
2533 "English:\t His work points a way forward for saving the oceans' health -- and humanity's. fish,health,mission blue,oceans,science 899 Stephen Palumbi: Following the mercury trail It can be a very complicated thing, the ocean.\n",
2534 "Spanish:\t fish,health,mission blue,oceans,science 899 Stephen Palumbi: Siguiendo el camino del mercurio.\n",
2535 "\n",
2536 " 4\n",
2537 "English:\t And it can be a very complicated thing, what human health is.\n",
2538 "Spanish:\t El océano puede ser una cosa muy complicada.\n",
2539 "\n",
2540 " 5\n",
2541 "English:\t And bringing those two together might seem a very daunting task, but what I'm going to try to say is that even in that complexity, there's some simple themes that I think, if we understand, we can really move forward.\n",
2542 "Spanish:\t Y podria ser una cosa muy complicada lo que la salud humana es.\n",
2543 "\n",
2544 " 6\n",
2545 "English:\t And those simple themes aren't really themes about the complex science of what's going on, but things that we all pretty well know.\n",
2546 "Spanish:\t Y unirlas, podría ser una tarea desalentadora.\n"
2547 ]
2548 }
2549 ],
2550 "source": [
2551 "for i, (en, es) in enumerate(zip(sentences['en'], sentences['es']), 1):\n",
2552 " print('\\n', i)\n",
2553 " print('English:\\t', en)\n",
2554 " print('Spanish:\\t', es)\n",
2555 " if i > 5: \n",
2556 " break"
2557 ]
2558 },
2559 {
2560 "cell_type": "markdown",
2561 "metadata": {
2562 "slideshow": {
2563 "slide_type": "slide"
2564 }
2565 },
2566 "source": [
2567 "#### POS Tagging English vs Spanish"
2568 ]
2569 },
2570 {
2571 "cell_type": "code",
2572 "execution_count": 51,
2573 "metadata": {
2574 "ExecuteTime": {
2575 "end_time": "2018-11-22T18:29:58.650120Z",
2576 "start_time": "2018-11-22T18:29:58.639202Z"
2577 },
2578 "slideshow": {
2579 "slide_type": "fragment"
2580 }
2581 },
2582 "outputs": [],
2583 "source": [
2584 "pos = {}\n",
2585 "for language in ['en', 'es']:\n",
2586 " pos[language] = pd.DataFrame([[t.text, t.pos_, spacy.explain(t.pos_)] for t in sentences[language][0]],\n",
2587 " columns=['Token', 'POS Tag', 'Meaning'])"
2588 ]
2589 },
2590 {
2591 "cell_type": "code",
2592 "execution_count": 52,
2593 "metadata": {
2594 "ExecuteTime": {
2595 "end_time": "2018-11-22T18:29:58.822232Z",
2596 "start_time": "2018-11-22T18:29:58.801912Z"
2597 },
2598 "slideshow": {
2599 "slide_type": "fragment"
2600 }
2601 },
2602 "outputs": [
2603 {
2604 "data": {
2605 "text/html": [
2606 "<div>\n",
2607 "<style scoped>\n",
2608 " .dataframe tbody tr th:only-of-type {\n",
2609 " vertical-align: middle;\n",
2610 " }\n",
2611 "\n",
2612 " .dataframe tbody tr th {\n",
2613 " vertical-align: top;\n",
2614 " }\n",
2615 "\n",
2616 " .dataframe thead th {\n",
2617 " text-align: right;\n",
2618 " }\n",
2619 "</style>\n",
2620 "<table border=\"1\" class=\"dataframe\">\n",
2621 " <thead>\n",
2622 " <tr style=\"text-align: right;\">\n",
2623 " <th></th>\n",
2624 " <th>Token</th>\n",
2625 " <th>POS Tag</th>\n",
2626 " <th>Meaning</th>\n",
2627 " <th>Token</th>\n",
2628 " <th>POS Tag</th>\n",
2629 " <th>Meaning</th>\n",
2630 " </tr>\n",
2631 " </thead>\n",
2632 " <tbody>\n",
2633 " <tr>\n",
2634 " <th>0</th>\n",
2635 " <td>There</td>\n",
2636 " <td>ADV</td>\n",
2637 " <td>adverb</td>\n",
2638 " <td>Existe</td>\n",
2639 " <td>VERB</td>\n",
2640 " <td>verb</td>\n",
2641 " </tr>\n",
2642 " <tr>\n",
2643 " <th>1</th>\n",
2644 " <td>'s</td>\n",
2645 " <td>VERB</td>\n",
2646 " <td>verb</td>\n",
2647 " <td>una</td>\n",
2648 " <td>DET</td>\n",
2649 " <td>determiner</td>\n",
2650 " </tr>\n",
2651 " <tr>\n",
2652 " <th>2</th>\n",
2653 " <td>a</td>\n",
2654 " <td>DET</td>\n",
2655 " <td>determiner</td>\n",
2656 " <td>estrecha</td>\n",
2657 " <td>ADJ</td>\n",
2658 " <td>adjective</td>\n",
2659 " </tr>\n",
2660 " <tr>\n",
2661 " <th>3</th>\n",
2662 " <td>tight</td>\n",
2663 " <td>ADJ</td>\n",
2664 " <td>adjective</td>\n",
2665 " <td>y</td>\n",
2666 " <td>CONJ</td>\n",
2667 " <td>conjunction</td>\n",
2668 " </tr>\n",
2669 " <tr>\n",
2670 " <th>4</th>\n",
2671 " <td>and</td>\n",
2672 " <td>CCONJ</td>\n",
2673 " <td>coordinating conjunction</td>\n",
2674 " <td>sorprendente</td>\n",
2675 " <td>ADJ</td>\n",
2676 " <td>adjective</td>\n",
2677 " </tr>\n",
2678 " <tr>\n",
2679 " <th>5</th>\n",
2680 " <td>surprising</td>\n",
2681 " <td>ADJ</td>\n",
2682 " <td>adjective</td>\n",
2683 " <td>relación</td>\n",
2684 " <td>NOUN</td>\n",
2685 " <td>noun</td>\n",
2686 " </tr>\n",
2687 " <tr>\n",
2688 " <th>6</th>\n",
2689 " <td>link</td>\n",
2690 " <td>NOUN</td>\n",
2691 " <td>noun</td>\n",
2692 " <td>entre</td>\n",
2693 " <td>ADP</td>\n",
2694 " <td>adposition</td>\n",
2695 " </tr>\n",
2696 " <tr>\n",
2697 " <th>7</th>\n",
2698 " <td>between</td>\n",
2699 " <td>ADP</td>\n",
2700 " <td>adposition</td>\n",
2701 " <td>nuestra</td>\n",
2702 " <td>DET</td>\n",
2703 " <td>determiner</td>\n",
2704 " </tr>\n",
2705 " <tr>\n",
2706 " <th>8</th>\n",
2707 " <td>the</td>\n",
2708 " <td>DET</td>\n",
2709 " <td>determiner</td>\n",
2710 " <td>salud</td>\n",
2711 " <td>NOUN</td>\n",
2712 " <td>noun</td>\n",
2713 " </tr>\n",
2714 " <tr>\n",
2715 " <th>9</th>\n",
2716 " <td>ocean</td>\n",
2717 " <td>NOUN</td>\n",
2718 " <td>noun</td>\n",
2719 " <td>y</td>\n",
2720 " <td>CONJ</td>\n",
2721 " <td>conjunction</td>\n",
2722 " </tr>\n",
2723 " <tr>\n",
2724 " <th>10</th>\n",
2725 " <td>'s</td>\n",
2726 " <td>PART</td>\n",
2727 " <td>particle</td>\n",
2728 " <td>la</td>\n",
2729 " <td>DET</td>\n",
2730 " <td>determiner</td>\n",
2731 " </tr>\n",
2732 " <tr>\n",
2733 " <th>11</th>\n",
2734 " <td>health</td>\n",
2735 " <td>NOUN</td>\n",
2736 " <td>noun</td>\n",
2737 " <td>salud</td>\n",
2738 " <td>NOUN</td>\n",
2739 " <td>noun</td>\n",
2740 " </tr>\n",
2741 " <tr>\n",
2742 " <th>12</th>\n",
2743 " <td>and</td>\n",
2744 " <td>CCONJ</td>\n",
2745 " <td>coordinating conjunction</td>\n",
2746 " <td>del</td>\n",
2747 " <td>ADP</td>\n",
2748 " <td>adposition</td>\n",
2749 " </tr>\n",
2750 " <tr>\n",
2751 " <th>13</th>\n",
2752 " <td>ours</td>\n",
2753 " <td>NOUN</td>\n",
2754 " <td>noun</td>\n",
2755 " <td>océano</td>\n",
2756 " <td>NOUN</td>\n",
2757 " <td>noun</td>\n",
2758 " </tr>\n",
2759 " <tr>\n",
2760 " <th>14</th>\n",
2761 " <td>,</td>\n",
2762 " <td>PUNCT</td>\n",
2763 " <td>punctuation</td>\n",
2764 " <td>,</td>\n",
2765 " <td>PUNCT</td>\n",
2766 " <td>punctuation</td>\n",
2767 " </tr>\n",
2768 " </tbody>\n",
2769 "</table>\n",
2770 "</div>"
2771 ],
2772 "text/plain": [
2773 " Token POS Tag Meaning Token POS Tag \\\n",
2774 "0 There ADV adverb Existe VERB \n",
2775 "1 's VERB verb una DET \n",
2776 "2 a DET determiner estrecha ADJ \n",
2777 "3 tight ADJ adjective y CONJ \n",
2778 "4 and CCONJ coordinating conjunction sorprendente ADJ \n",
2779 "5 surprising ADJ adjective relación NOUN \n",
2780 "6 link NOUN noun entre ADP \n",
2781 "7 between ADP adposition nuestra DET \n",
2782 "8 the DET determiner salud NOUN \n",
2783 "9 ocean NOUN noun y CONJ \n",
2784 "10 's PART particle la DET \n",
2785 "11 health NOUN noun salud NOUN \n",
2786 "12 and CCONJ coordinating conjunction del ADP \n",
2787 "13 ours NOUN noun océano NOUN \n",
2788 "14 , PUNCT punctuation , PUNCT \n",
2789 "\n",
2790 " Meaning \n",
2791 "0 verb \n",
2792 "1 determiner \n",
2793 "2 adjective \n",
2794 "3 conjunction \n",
2795 "4 adjective \n",
2796 "5 noun \n",
2797 "6 adposition \n",
2798 "7 determiner \n",
2799 "8 noun \n",
2800 "9 conjunction \n",
2801 "10 determiner \n",
2802 "11 noun \n",
2803 "12 adposition \n",
2804 "13 noun \n",
2805 "14 punctuation "
2806 ]
2807 },
2808 "execution_count": 52,
2809 "metadata": {},
2810 "output_type": "execute_result"
2811 }
2812 ],
2813 "source": [
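2814 "bilingual_parsed = pd.concat([pos['en'], pos['es']], axis=1)  # English and Spanish POS tables side by side; rows align by index, not by translation\n",
2815 "bilingual_parsed.head(5).to_csv('data/bilingual.csv', index=False)  # save only the first 5 rows as a sample\n",
2816 "bilingual_parsed.head(15)  # display the first 15 rows"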
2817 ]
2818 },
2819 {
2820 "cell_type": "code",
2821 "execution_count": 53,
2822 "metadata": {
2823 "ExecuteTime": {
2824 "end_time": "2018-11-22T18:29:59.172621Z",
2825 "start_time": "2018-11-22T18:29:59.168637Z"
2826 },
2827 "slideshow": {
2828 "slide_type": "slide"
2829 }
2830 },
2831 "outputs": [
2832 {
2833 "data": {
2834 "text/html": [
2835 "<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" id=\"0\" class=\"displacy\" width=\"3050\" height=\"437.0\" style=\"max-width: none; height: 437.0px; color: white; background: #09a3d5; font-family: Source Sans Pro\">\n",
2836 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2837 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Existe</tspan>\n",
2838 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">VERB</tspan>\n",
2839 "</text>\n",
2840 "\n",
2841 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2842 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"200\">una</tspan>\n",
2843 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"200\">DET</tspan>\n",
2844 "</text>\n",
2845 "\n",
2846 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2847 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"350\">estrecha</tspan>\n",
2848 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"350\">ADJ</tspan>\n",
2849 "</text>\n",
2850 "\n",
2851 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2852 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"500\">y</tspan>\n",
2853 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"500\">CONJ</tspan>\n",
2854 "</text>\n",
2855 "\n",
2856 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2857 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"650\">sorprendente</tspan>\n",
2858 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"650\">ADJ</tspan>\n",
2859 "</text>\n",
2860 "\n",
2861 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2862 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"800\">relación</tspan>\n",
2863 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"800\">NOUN</tspan>\n",
2864 "</text>\n",
2865 "\n",
2866 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2867 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"950\">entre</tspan>\n",
2868 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"950\">ADP</tspan>\n",
2869 "</text>\n",
2870 "\n",
2871 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2872 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1100\">nuestra</tspan>\n",
2873 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1100\">DET</tspan>\n",
2874 "</text>\n",
2875 "\n",
2876 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2877 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1250\">salud</tspan>\n",
2878 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1250\">NOUN</tspan>\n",
2879 "</text>\n",
2880 "\n",
2881 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2882 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1400\">y</tspan>\n",
2883 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1400\">CONJ</tspan>\n",
2884 "</text>\n",
2885 "\n",
2886 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2887 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1550\">la</tspan>\n",
2888 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1550\">DET</tspan>\n",
2889 "</text>\n",
2890 "\n",
2891 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2892 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1700\">salud</tspan>\n",
2893 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1700\">NOUN</tspan>\n",
2894 "</text>\n",
2895 "\n",
2896 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2897 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1850\">del</tspan>\n",
2898 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1850\">ADP</tspan>\n",
2899 "</text>\n",
2900 "\n",
2901 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2902 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2000\">océano,</tspan>\n",
2903 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2000\">NOUN</tspan>\n",
2904 "</text>\n",
2905 "\n",
2906 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2907 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2150\">dice</tspan>\n",
2908 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2150\">VERB</tspan>\n",
2909 "</text>\n",
2910 "\n",
2911 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2912 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2300\">el</tspan>\n",
2913 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2300\">DET</tspan>\n",
2914 "</text>\n",
2915 "\n",
2916 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2917 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2450\">biologo</tspan>\n",
2918 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2450\">NOUN</tspan>\n",
2919 "</text>\n",
2920 "\n",
2921 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2922 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2600\">marino</tspan>\n",
2923 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2600\">NOUN</tspan>\n",
2924 "</text>\n",
2925 "\n",
2926 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2927 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2750\">Stephen</tspan>\n",
2928 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2750\">PROPN</tspan>\n",
2929 "</text>\n",
2930 "\n",
2931 "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"347.0\">\n",
2932 " <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2900\">Palumbi.</tspan>\n",
2933 " <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2900\">PROPN</tspan>\n",
2934 "</text>\n",
2935 "\n",
2936 "<g class=\"displacy-arrow\">\n",
2937 " <path class=\"displacy-arc\" id=\"arrow-0-0\" stroke-width=\"2px\" d=\"M62,302.0 62,202.0 2150.0,202.0 2150.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2938 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2939 " <textPath xlink:href=\"#arrow-0-0\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">ccomp</textPath>\n",
2940 " </text>\n",
2941 " <path class=\"displacy-arrowhead\" d=\"M62,304.0 L58,296.0 66,296.0\" fill=\"currentColor\"/>\n",
2942 "</g>\n",
2943 "\n",
2944 "<g class=\"displacy-arrow\">\n",
2945 " <path class=\"displacy-arc\" id=\"arrow-0-1\" stroke-width=\"2px\" d=\"M212,302.0 212,277.0 341.0,277.0 341.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2946 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2947 " <textPath xlink:href=\"#arrow-0-1\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
2948 " </text>\n",
2949 " <path class=\"displacy-arrowhead\" d=\"M212,304.0 L208,296.0 216,296.0\" fill=\"currentColor\"/>\n",
2950 "</g>\n",
2951 "\n",
2952 "<g class=\"displacy-arrow\">\n",
2953 " <path class=\"displacy-arc\" id=\"arrow-0-2\" stroke-width=\"2px\" d=\"M62,302.0 62,252.0 344.0,252.0 344.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2954 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2955 " <textPath xlink:href=\"#arrow-0-2\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
2956 " </text>\n",
2957 " <path class=\"displacy-arrowhead\" d=\"M344.0,304.0 L348.0,296.0 340.0,296.0\" fill=\"currentColor\"/>\n",
2958 "</g>\n",
2959 "\n",
2960 "<g class=\"displacy-arrow\">\n",
2961 " <path class=\"displacy-arc\" id=\"arrow-0-3\" stroke-width=\"2px\" d=\"M512,302.0 512,252.0 794.0,252.0 794.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2962 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2963 " <textPath xlink:href=\"#arrow-0-3\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">cc</textPath>\n",
2964 " </text>\n",
2965 " <path class=\"displacy-arrowhead\" d=\"M512,304.0 L508,296.0 516,296.0\" fill=\"currentColor\"/>\n",
2966 "</g>\n",
2967 "\n",
2968 "<g class=\"displacy-arrow\">\n",
2969 " <path class=\"displacy-arc\" id=\"arrow-0-4\" stroke-width=\"2px\" d=\"M662,302.0 662,277.0 791.0,277.0 791.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2970 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2971 " <textPath xlink:href=\"#arrow-0-4\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
2972 " </text>\n",
2973 " <path class=\"displacy-arrowhead\" d=\"M662,304.0 L658,296.0 666,296.0\" fill=\"currentColor\"/>\n",
2974 "</g>\n",
2975 "\n",
2976 "<g class=\"displacy-arrow\">\n",
2977 " <path class=\"displacy-arc\" id=\"arrow-0-5\" stroke-width=\"2px\" d=\"M362,302.0 362,227.0 797.0,227.0 797.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2978 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2979 " <textPath xlink:href=\"#arrow-0-5\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">conj</textPath>\n",
2980 " </text>\n",
2981 " <path class=\"displacy-arrowhead\" d=\"M797.0,304.0 L801.0,296.0 793.0,296.0\" fill=\"currentColor\"/>\n",
2982 "</g>\n",
2983 "\n",
2984 "<g class=\"displacy-arrow\">\n",
2985 " <path class=\"displacy-arc\" id=\"arrow-0-6\" stroke-width=\"2px\" d=\"M962,302.0 962,252.0 1244.0,252.0 1244.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2986 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2987 " <textPath xlink:href=\"#arrow-0-6\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">case</textPath>\n",
2988 " </text>\n",
2989 " <path class=\"displacy-arrowhead\" d=\"M962,304.0 L958,296.0 966,296.0\" fill=\"currentColor\"/>\n",
2990 "</g>\n",
2991 "\n",
2992 "<g class=\"displacy-arrow\">\n",
2993 " <path class=\"displacy-arc\" id=\"arrow-0-7\" stroke-width=\"2px\" d=\"M1112,302.0 1112,277.0 1241.0,277.0 1241.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
2994 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
2995 " <textPath xlink:href=\"#arrow-0-7\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
2996 " </text>\n",
2997 " <path class=\"displacy-arrowhead\" d=\"M1112,304.0 L1108,296.0 1116,296.0\" fill=\"currentColor\"/>\n",
2998 "</g>\n",
2999 "\n",
3000 "<g class=\"displacy-arrow\">\n",
3001 " <path class=\"displacy-arc\" id=\"arrow-0-8\" stroke-width=\"2px\" d=\"M812,302.0 812,227.0 1247.0,227.0 1247.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
3002 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
3003 " <textPath xlink:href=\"#arrow-0-8\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">nmod</textPath>\n",
3004 " </text>\n",
3005 " <path class=\"displacy-arrowhead\" d=\"M1247.0,304.0 L1251.0,296.0 1243.0,296.0\" fill=\"currentColor\"/>\n",
3006 "</g>\n",
3007 "\n",
3008 "<g class=\"displacy-arrow\">\n",
3009 " <path class=\"displacy-arc\" id=\"arrow-0-9\" stroke-width=\"2px\" d=\"M1412,302.0 1412,252.0 1694.0,252.0 1694.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
3010 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
3011 " <textPath xlink:href=\"#arrow-0-9\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">cc</textPath>\n",
3012 " </text>\n",
3013 " <path class=\"displacy-arrowhead\" d=\"M1412,304.0 L1408,296.0 1416,296.0\" fill=\"currentColor\"/>\n",
3014 "</g>\n",
3015 "\n",
3016 "<g class=\"displacy-arrow\">\n",
3017 " <path class=\"displacy-arc\" id=\"arrow-0-10\" stroke-width=\"2px\" d=\"M1562,302.0 1562,277.0 1691.0,277.0 1691.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
3018 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
3019 " <textPath xlink:href=\"#arrow-0-10\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
3020 " </text>\n",
3021 " <path class=\"displacy-arrowhead\" d=\"M1562,304.0 L1558,296.0 1566,296.0\" fill=\"currentColor\"/>\n",
3022 "</g>\n",
3023 "\n",
3024 "<g class=\"displacy-arrow\">\n",
3025 " <path class=\"displacy-arc\" id=\"arrow-0-11\" stroke-width=\"2px\" d=\"M1262,302.0 1262,227.0 1697.0,227.0 1697.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
3026 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
3027 " <textPath xlink:href=\"#arrow-0-11\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">conj</textPath>\n",
3028 " </text>\n",
3029 " <path class=\"displacy-arrowhead\" d=\"M1697.0,304.0 L1701.0,296.0 1693.0,296.0\" fill=\"currentColor\"/>\n",
3030 "</g>\n",
3031 "\n",
3032 "<g class=\"displacy-arrow\">\n",
3033 " <path class=\"displacy-arc\" id=\"arrow-0-12\" stroke-width=\"2px\" d=\"M1862,302.0 1862,277.0 1991.0,277.0 1991.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
3034 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
3035 " <textPath xlink:href=\"#arrow-0-12\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">case</textPath>\n",
3036 " </text>\n",
3037 " <path class=\"displacy-arrowhead\" d=\"M1862,304.0 L1858,296.0 1866,296.0\" fill=\"currentColor\"/>\n",
3038 "</g>\n",
3039 "\n",
3040 "<g class=\"displacy-arrow\">\n",
3041 " <path class=\"displacy-arc\" id=\"arrow-0-13\" stroke-width=\"2px\" d=\"M1712,302.0 1712,252.0 1994.0,252.0 1994.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
3042 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
3043 " <textPath xlink:href=\"#arrow-0-13\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">nmod</textPath>\n",
3044 " </text>\n",
3045 " <path class=\"displacy-arrowhead\" d=\"M1994.0,304.0 L1998.0,296.0 1990.0,296.0\" fill=\"currentColor\"/>\n",
3046 "</g>\n",
3047 "\n",
3048 "<g class=\"displacy-arrow\">\n",
3049 " <path class=\"displacy-arc\" id=\"arrow-0-14\" stroke-width=\"2px\" d=\"M2312,302.0 2312,277.0 2441.0,277.0 2441.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
3050 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
3051 " <textPath xlink:href=\"#arrow-0-14\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
3052 " </text>\n",
3053 " <path class=\"displacy-arrowhead\" d=\"M2312,304.0 L2308,296.0 2316,296.0\" fill=\"currentColor\"/>\n",
3054 "</g>\n",
3055 "\n",
3056 "<g class=\"displacy-arrow\">\n",
3057 " <path class=\"displacy-arc\" id=\"arrow-0-15\" stroke-width=\"2px\" d=\"M2162,302.0 2162,252.0 2444.0,252.0 2444.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
3058 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
3059 " <textPath xlink:href=\"#arrow-0-15\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
3060 " </text>\n",
3061 " <path class=\"displacy-arrowhead\" d=\"M2444.0,304.0 L2448.0,296.0 2440.0,296.0\" fill=\"currentColor\"/>\n",
3062 "</g>\n",
3063 "\n",
3064 "<g class=\"displacy-arrow\">\n",
3065 " <path class=\"displacy-arc\" id=\"arrow-0-16\" stroke-width=\"2px\" d=\"M2462,302.0 2462,277.0 2591.0,277.0 2591.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
3066 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
3067 " <textPath xlink:href=\"#arrow-0-16\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
3068 " </text>\n",
3069 " <path class=\"displacy-arrowhead\" d=\"M2591.0,304.0 L2595.0,296.0 2587.0,296.0\" fill=\"currentColor\"/>\n",
3070 "</g>\n",
3071 "\n",
3072 "<g class=\"displacy-arrow\">\n",
3073 " <path class=\"displacy-arc\" id=\"arrow-0-17\" stroke-width=\"2px\" d=\"M2462,302.0 2462,252.0 2744.0,252.0 2744.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
3074 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
3075 " <textPath xlink:href=\"#arrow-0-17\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">appos</textPath>\n",
3076 " </text>\n",
3077 " <path class=\"displacy-arrowhead\" d=\"M2744.0,304.0 L2748.0,296.0 2740.0,296.0\" fill=\"currentColor\"/>\n",
3078 "</g>\n",
3079 "\n",
3080 "<g class=\"displacy-arrow\">\n",
3081 " <path class=\"displacy-arc\" id=\"arrow-0-18\" stroke-width=\"2px\" d=\"M2762,302.0 2762,277.0 2891.0,277.0 2891.0,302.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
3082 " <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
3083 " <textPath xlink:href=\"#arrow-0-18\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">flat</textPath>\n",
3084 " </text>\n",
3085 " <path class=\"displacy-arrowhead\" d=\"M2891.0,304.0 L2895.0,296.0 2887.0,296.0\" fill=\"currentColor\"/>\n",
3086 "</g>\n",
3087 "</svg>"
3088 ],
3089 "text/plain": [
3090 "<IPython.core.display.HTML object>"
3091 ]
3092 },
3093 "metadata": {},
3094 "output_type": "display_data"
3095 }
3096 ],
3097 "source": [
3098 "displacy.render(sentences['es'][0].as_doc(), style='dep', jupyter=True, options=options)  # dependency parse of the first Spanish sentence"
3099 ]
3100 }
3101 ],
3102 "metadata": {
3103 "celltoolbar": "Slideshow",
3104 "kernelspec": {
3105 "display_name": "Python 3",
3106 "language": "python",
3107 "name": "python3"
3108 },
3109 "language_info": {
3110 "codemirror_mode": {
3111 "name": "ipython",
3112 "version": 3
3113 },
3114 "file_extension": ".py",
3115 "mimetype": "text/x-python",
3116 "name": "python",
3117 "nbconvert_exporter": "python",
3118 "pygments_lexer": "ipython3",
3119 "version": "3.6.8"
3120 },
3121 "toc": {
3122 "base_numbering": 1,
3123 "nav_menu": {},
3124 "number_sections": true,
3125 "sideBar": true,
3126 "skip_h1_title": false,
3127 "title_cell": "Table of Contents",
3128 "title_sidebar": "Contents",
3129 "toc_cell": false,
3130 "toc_position": {
3131 "height": "355px",
3132 "left": "1936.95px",
3133 "top": "146.953px",
3134 "width": "305px"
3135 },
3136 "toc_section_display": true,
3137 "toc_window_display": true
3138 }
3139 },
3140 "nbformat": 4,
3141 "nbformat_minor": 2
3142 }