tokenize with Spacy #131

Open
jbesomi opened this issue Jul 30, 2020 · 2 comments
Labels
discussion To discuss new improvements enhancement New feature or request

Comments

@jbesomi
Owner

jbesomi commented Jul 30, 2020

The current tokenizer is very fast, as it uses a simple regex pattern, but it is also very imprecise.

A better alternative might be to make use of spaCy.

Naively tokenizing a Pandas Series with spaCy is very simple:

import pandas as pd
import spacy

def tokenize_with_spacy(s: pd.Series) -> pd.Series:
    # Disabling unused pipeline components speeds this up considerably, e.g.:
    # nlp = spacy.load("en_core_web_sm", disable=["ner", "tagger", "parser"])
    nlp = spacy.load("en_core_web_sm")

    tokenized = []
    for doc in nlp.pipe(s):
        tokenized.append([token.text for token in doc])

    return pd.Series(tokenized, index=s.index)

This should be reasonably fast, since nlp.pipe streams the documents in batches and can, in theory, parallelize the work.
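A small sketch of that parallelization: since tokenization only needs the tokenizer and not the statistical model, a blank pipeline (spacy.blank) avoids loading en_core_web_sm entirely, and nlp.pipe accepts an n_process argument for multiprocessing. The function name tokenize_with_blank_spacy is mine, not from the codebase:

```python
import pandas as pd
import spacy

def tokenize_with_blank_spacy(s: pd.Series, n_process: int = 1) -> pd.Series:
    """Tokenize with a blank English pipeline (tokenizer only, no model download)."""
    nlp = spacy.blank("en")  # just the tokenizer, much cheaper than en_core_web_sm
    tokenized = [
        [token.text for token in doc]
        # n_process > 1 spawns worker processes (supported by Language.pipe)
        for doc in nlp.pipe(s, n_process=n_process)
    ]
    return pd.Series(tokenized, index=s.index)

s = pd.Series(["Hello world!", "spaCy tokenizes this."])
print(tokenize_with_blank_spacy(s).tolist())
```

Whether the blank tokenizer is "precise enough" compared to the full pipeline would need checking as part of the benchmark.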

The reason we haven't implemented this yet is that we want to make sure this solution is fast enough. We want to provide a simple tool for analyzing a fairly large amount of text data; say, 100k Pandas rows should take no longer than 15-30 seconds to tokenize ... ?

For now, the task consists in:

  1. Compare the "spaCy solution" with the current version and benchmark the function on large datasets (150k rows or so)
    1. This should be done in a single, clean notebook that will be shared here
  2. Evaluate whether we can do better by parallelizing the process even further (we can probably parallelize both by row and by sentence?)
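For step 1, a minimal timing harness could look like the following. The regex pattern here is only a stand-in for illustration, not the library's actual one:

```python
import re
import time

import pandas as pd

# Stand-in for the current regex-based tokenizer; the real pattern in the
# library may differ -- this only illustrates the benchmark harness.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize_with_regex(s: pd.Series) -> pd.Series:
    return s.str.findall(TOKEN_PATTERN)

def benchmark(tokenize, s: pd.Series, repeat: int = 3) -> float:
    """Return the best wall-clock time (seconds) over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        tokenize(s)
        best = min(best, time.perf_counter() - start)
    return best

# 150k short rows, roughly the dataset size suggested above.
s = pd.Series(["The quick brown fox, isn't it?"] * 150_000)
print(f"regex tokenizer: {benchmark(tokenize_with_regex, s):.2f}s")
# A spaCy-based tokenizer would be benchmarked the same way:
# print(f"spaCy tokenizer: {benchmark(tokenize_with_spacy, s):.2f}s")
```

Taking the best of several runs reduces noise from warm-up and other processes.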
@jbesomi
Owner Author

jbesomi commented Aug 14, 2020

Dask vs. spaCy

Is it faster to use pipe from spaCy, or to use Dask (Dask DataFrame) directly?

Dask might be the solution we were looking for ...

@mk2510
Collaborator

mk2510 commented Aug 25, 2020

As described in #162, Dask is not feasible from a UX perspective. Here are our results from experimenting with tokenize. See the attached PDF for a notebook of the results.

Speed Comparison

We now compare:

  1. the current implementation without parallelization
  2. the current implementation with parallelization (see Speed-Up Preprocessing + NLP #162)
  3. tokenize_with_spacy with spaCy's built-in parallelization through n_process
  4. tokenize_with_spacy with our custom parallelization

Results below.

We can see that

  • our current implementation is much faster than spaCy (22 vs 51 seconds with both parallelized)
  • as shown in Speed-Up Preprocessing + NLP #162, our parallelization works better than spaCy's.
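The "custom parallelization" pattern referenced above (item 4) can be sketched as follows: split the Series into chunks, tokenize the chunks concurrently, and reassemble. This is my illustration, not the actual #162 code; it uses a thread pool for portability, whereas a process pool is what you'd typically want for CPU-bound tokenization:

```python
import re
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")  # illustrative pattern only

def _tokenize_chunk(chunk: pd.Series) -> pd.Series:
    return chunk.str.findall(TOKEN_PATTERN)

def tokenize_parallel(s: pd.Series, n_workers: int = 4) -> pd.Series:
    """Chunk the Series, tokenize chunks concurrently, then reassemble."""
    chunks = np.array_split(s, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(_tokenize_chunk, chunks))
    # Concatenating in order preserves the original index.
    return pd.concat(results)

s = pd.Series(["Hello world!"] * 8)
print(tokenize_parallel(s, n_workers=2).iloc[0])
```

Because pool.map returns results in submission order, the concatenated Series keeps the same index as the input.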

Thus, our options:

  1. keep everything as proposed in Speed-Up Preprocessing + NLP #162 (-> multiprocessing applied to current solution)
  2. option 1, but we give users a parameter use_spacy that works like our tokenize_with_spacy_own_parallelization above, and we explain that it might give better results but takes about 3x as long.

We don't really have a preference.
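Option 2's use_spacy parameter could be a simple dispatch between the two backends. A sketch, with hypothetical helper names and an illustrative regex pattern (not the library's actual code):

```python
import re

import pandas as pd

TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")  # illustrative pattern only

def _tokenize_regex(s: pd.Series) -> pd.Series:
    return s.str.findall(TOKEN_PATTERN)

def _tokenize_spacy(s: pd.Series) -> pd.Series:
    import spacy  # imported lazily so regex users don't need spaCy installed
    nlp = spacy.blank("en")
    return pd.Series([[t.text for t in doc] for doc in nlp.pipe(s)], index=s.index)

def tokenize(s: pd.Series, use_spacy: bool = False) -> pd.Series:
    """Fast regex tokenization by default; slower but more accurate with spaCy."""
    return _tokenize_spacy(s) if use_spacy else _tokenize_regex(s)

s = pd.Series(["Don't panic!"])
print(tokenize(s).iloc[0])            # regex path (default)
# tokenize(s, use_spacy=True)         # spaCy path, ~3x slower per the benchmarks
```

The lazy import keeps spaCy an optional dependency for users who stick with the default.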
