# Text Analysis

Suppose our text data is currently arranged into a single file, where each line of that file contains all of the text in a single document. Here we can use SFrame.read_csv to parse the text data into a one-column SFrame.

```python
import turicreate

# Parse the raw text file into a one-column SFrame, one document per line.
sf = turicreate.SFrame.read_csv('wikipedia_data', header=False)
```
```
Columns:
        X1      str

Rows: 72269

Data:
+--------------------------------+
|               X1               |
+--------------------------------+
| alainconnes alain connes i ... |
| americannationalstandardsi ... |
| alberteinstein near the be ... |
| austriangerman as german i ... |
| arsenic arsenic is a metal ... |
| alps the alps alpen alpi a ... |
| alexiscarrel born in saint ... |
| adelaide adelaide is a coa ... |
| artist an artist is a pers ... |
| abdominalsurgery the three ... |
|              ...               |
+--------------------------------+
[72269 rows x 1 columns]
Note: Only the head of the SFrame is printed.
```
## Text cleaning

We can easily remove all words that do not occur at least twice in each document using SArray.dict_trim_by_values; a sketch of this step is shown below.
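The `docs` SArray used in the rest of this section is not constructed in the snippet above. A minimal sketch, assuming the raw text is in `sf['X1']` and using the bag-of-words representation produced by `turicreate.text_analytics.count_words`, might look like this:

```python
# Convert each document to a bag-of-words dictionary mapping word -> count,
# then drop words that occur fewer than two times within that document.
docs = turicreate.text_analytics.count_words(sf['X1'])
docs = docs.dict_trim_by_values(2)
```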

Turi Create also contains a helper function called stopwords that returns a list of common words. We can use SArray.dict_trim_by_keys to remove these words from the documents as a preprocessing step. NB: Currently only English stop words are available.

```python
docs = docs.dict_trim_by_keys(turicreate.text_analytics.stopwords(), exclude=True)
```

To confirm that we have indeed removed common words such as "and" and "the", we can examine the first document.

```python
print(docs[0])
```

```
{'academy': 5,
 'algebras': 2,
 'connes': 3,
 'differential': 2,
 'early': 2,
 'geometry': 2,
 'including': 2,
 'medal': 2,
 'operator': 2,
 'physics': 2,
 'sciences': 5,
 'theory': 2,
 'work': 2}
```