The current tokenizer is very fast, as it uses a simple regex pattern, but it is at the same time very imprecise. A better alternative might be to make use of spaCy. Naively tokenizing a Pandas Series with spaCy is very simple:
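(The snippet from the original issue is not preserved in this excerpt; the following is a minimal sketch of the naive approach, assuming the `en_core_web_sm` model and a helper named `tokenize`.)

```python
import pandas as pd
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def tokenize(s: pd.Series) -> pd.Series:
    # Run the full spaCy pipeline on every row and keep only the token texts.
    return s.apply(lambda text: [token.text for token in nlp(text)])
```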
This should be reasonably fast since, in theory, spaCy makes use of multi-threading.
The reason we haven't implemented this yet is that we want to make sure this solution is fast enough. We want to provide a simple tool to analyze a fairly large amount of text data; say, tokenizing 100k Pandas rows should take no longer than 15-30 seconds.
For now, the task consists of:
- Compare the spaCy solution with the current version and benchmark the function on large datasets (150k rows or so).
- Do this in a single, clean notebook that will be shared here.
- Evaluate whether we can do better by parallelizing the process even further (we can probably parallelize both by row and by sentence?). A rough sketch of the row-level idea follows this list.
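As an illustration of the row-level parallelization idea (a sketch only: the disabled pipeline components, the batch size, and the `n_process` value are assumptions, not the benchmarked notebook code), spaCy's `nlp.pipe` can stream the Series and fan the work out over several processes:

```python
import time

import pandas as pd
import spacy

# Only the tokenizer output is needed, so disable the heavier pipeline components.
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

def tokenize_parallel(s: pd.Series, n_process: int = 4) -> pd.Series:
    # nlp.pipe streams the texts in batches and spreads them across processes.
    docs = nlp.pipe(s.astype(str), batch_size=1000, n_process=n_process)
    return pd.Series([[token.text for token in doc] for doc in docs], index=s.index)

if __name__ == "__main__":
    # Rough benchmark on a synthetic dataset of ~150k rows.
    data = pd.Series(["This is a short example sentence about tokenization."] * 150_000)
    start = time.perf_counter()
    tokens = tokenize_parallel(data)
    print(f"tokenized {len(tokens)} rows in {time.perf_counter() - start:.1f}s")
```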
As described in #162, dask is not feasible from a UX perspective. Here are our results from experimenting with tokenize; see the attached PDF for a notebook of the results.
We suggest option 1, but we give users a parameter use_spacy that works like our tokenize_with_spacy_own_parallelization above, and explain to them that this might give them better results but takes about 3x as long.
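If that option is chosen, the public API could look roughly like the sketch below (hypothetical: the regex fallback, the module-level model loading, and the exact semantics of the use_spacy flag are assumptions based on the comment above, not the actual implementation):

```python
import re

import pandas as pd
import spacy

# Hypothetical sketch: fast regex tokenization by default, opt-in spaCy path.
_WORD_RE = re.compile(r"\w+")
_NLP = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

def tokenize(s: pd.Series, use_spacy: bool = False) -> pd.Series:
    if use_spacy:
        # More precise, but roughly 3x slower: run spaCy's tokenizer on every row.
        docs = _NLP.pipe(s.astype(str))
        return pd.Series([[token.text for token in doc] for doc in docs], index=s.index)
    # Default: simple regex tokenization (very fast but imprecise).
    return s.astype(str).apply(_WORD_RE.findall)
```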