tokenize with Spacy #131

Open
jbesomi opened this issue Jul 30, 2020 · 2 comments
Labels
discussion To discuss new improvements enhancement New feature or request

Comments

@jbesomi
Owner

jbesomi commented Jul 30, 2020

The current tokenizer is very fast, as it uses a simple regex pattern, but it is also very imprecise.

A better alternative might be to make use of spaCy.

Naively tokenizing a Pandas Series with spaCy is very simple:

import pandas as pd
import spacy

def tokenize_with_spacy(s: pd.Series) -> pd.Series:
    # Disabling unused pipeline components speeds this up considerably, e.g.:
    # nlp = spacy.load("en_core_web_sm", disable=["ner", "tagger", "parser"])
    nlp = spacy.load("en_core_web_sm")

    tokenized = []
    for doc in nlp.pipe(s):
        tokenized.append([token.text for token in doc])

    return pd.Series(tokenized, index=s.index)

This should be reasonably fast, since nlp.pipe streams the documents in batches and can, in theory, parallelize the work.
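A small sketch of that parallelization: since tokenization only needs the tokenizer and not the statistical model, a blank pipeline (spacy.blank) avoids loading en_core_web_sm entirely, and nlp.pipe accepts an n_process argument for multiprocessing. The function name tokenize_with_blank_spacy is mine, not from the codebase:

```python
import pandas as pd
import spacy

def tokenize_with_blank_spacy(s: pd.Series, n_process: int = 1) -> pd.Series:
    """Tokenize with a blank English pipeline (tokenizer only, no model download)."""
    nlp = spacy.blank("en")  # just the tokenizer, much cheaper than en_core_web_sm
    tokenized = [
        [token.text for token in doc]
        # n_process > 1 spawns worker processes (supported by Language.pipe)
        for doc in nlp.pipe(s, n_process=n_process)
    ]
    return pd.Series(tokenized, index=s.index)

s = pd.Series(["Hello world!", "spaCy tokenizes this."])
print(tokenize_with_blank_spacy(s).tolist())
```

Whether the blank tokenizer is "precise enough" compared to the full pipeline would need checking as part of the benchmark.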

The reason we haven't implemented this yet is that we want to make sure this solution is fast enough. We want to provide a simple tool for analyzing a fairly large amount of text data; say, 100k Pandas rows should take no longer than 15-30 seconds to tokenize ... ?

For now, the task consists in:

  1. Compare the "spaCy solution" with the current version and benchmark the function on large datasets (150k rows or so)
    1. This should be done in a single, clean notebook that will be shared here
  2. Evaluate whether we can do better by parallelizing the process even further (we can probably parallelize both by row and by sentence?)
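For step 1, a minimal timing harness could look like the following. The regex pattern here is only a stand-in for illustration, not the library's actual one:

```python
import re
import time

import pandas as pd

# Stand-in for the current regex-based tokenizer; the real pattern in the
# library may differ -- this only illustrates the benchmark harness.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize_with_regex(s: pd.Series) -> pd.Series:
    return s.str.findall(TOKEN_PATTERN)

def benchmark(tokenize, s: pd.Series, repeat: int = 3) -> float:
    """Return the best wall-clock time (seconds) over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        tokenize(s)
        best = min(best, time.perf_counter() - start)
    return best

# 150k short rows, roughly the dataset size suggested above.
s = pd.Series(["The quick brown fox, isn't it?"] * 150_000)
print(f"regex tokenizer: {benchmark(tokenize_with_regex, s):.2f}s")
# A spaCy-based tokenizer would be benchmarked the same way:
# print(f"spaCy tokenizer: {benchmark(tokenize_with_spacy, s):.2f}s")
```

Taking the best of several runs reduces noise from warm-up and other processes.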
@jbesomi
Owner Author

jbesomi commented Aug 14, 2020

Dask vs. spaCy

Is it faster to use pipe from spaCy, or to use Dask (Dask DataFrame) directly?

Dask might be the solution we were looking for ...

@mk2510
Collaborator

mk2510 commented Aug 25, 2020

As described in #162, Dask is not feasible from a UX perspective. Here are our results from experimenting with tokenize. See the attached PDF for a notebook of the results.

Speed Comparison

We now compare:

  1. the current implementation without parallelization
  2. the current implementation with parallelization (see Speed-Up Preprocessing + NLP #162)
  3. tokenize_with_spacy with spaCy's built-in parallelization through n_process
  4. tokenize_with_spacy with our custom parallelization

Results below.

We can see that

  • our current implementation is much faster than spaCy (22 vs 51 seconds with both parallelized)
  • as shown in Speed-Up Preprocessing + NLP #162, our parallelization works better than spaCy's.
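The "custom parallelization" pattern referenced above (item 4) can be sketched as follows: split the Series into chunks, tokenize the chunks concurrently, and reassemble. This is my illustration, not the actual #162 code; it uses a thread pool for portability, whereas a process pool is what you'd typically want for CPU-bound tokenization:

```python
import re
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")  # illustrative pattern only

def _tokenize_chunk(chunk: pd.Series) -> pd.Series:
    return chunk.str.findall(TOKEN_PATTERN)

def tokenize_parallel(s: pd.Series, n_workers: int = 4) -> pd.Series:
    """Chunk the Series, tokenize chunks concurrently, then reassemble."""
    chunks = np.array_split(s, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(_tokenize_chunk, chunks))
    # Concatenating in order preserves the original index.
    return pd.concat(results)

s = pd.Series(["Hello world!"] * 8)
print(tokenize_parallel(s, n_workers=2).iloc[0])
```

Because pool.map returns results in submission order, the concatenated Series keeps the same index as the input.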

Thus, our options:

  1. keep everything as proposed in Speed-Up Preprocessing + NLP #162 (-> multiprocessing applied to current solution)
  2. option 1, but we give users a parameter use_spacy that works like our tokenize_with_spacy_own_parallelization above, and we explain that it might give better results but takes about 3x as long.

We don't really have a preference.
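Option 2's use_spacy parameter could be a simple dispatch between the two backends. A sketch, with hypothetical helper names and an illustrative regex pattern (not the library's actual code):

```python
import re

import pandas as pd

TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")  # illustrative pattern only

def _tokenize_regex(s: pd.Series) -> pd.Series:
    return s.str.findall(TOKEN_PATTERN)

def _tokenize_spacy(s: pd.Series) -> pd.Series:
    import spacy  # imported lazily so regex users don't need spaCy installed
    nlp = spacy.blank("en")
    return pd.Series([[t.text for t in doc] for doc in nlp.pipe(s)], index=s.index)

def tokenize(s: pd.Series, use_spacy: bool = False) -> pd.Series:
    """Fast regex tokenization by default; slower but more accurate with spaCy."""
    return _tokenize_spacy(s) if use_spacy else _tokenize_regex(s)

s = pd.Series(["Don't panic!"])
print(tokenize(s).iloc[0])            # regex path (default)
# tokenize(s, use_spacy=True)         # spaCy path, ~3x slower per the benchmarks
```

The lazy import keeps spaCy an optional dependency for users who stick with the default.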
