To store the cache file, we compute a hash of the function passed to .map(), using our own hashing function.
The hash doesn't stay the same across sessions for the tokenizer.
Apparently this is because the compiled regex at tokenizer.pat is not well supported by our hashing function.
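As a stdlib-only analogy for this kind of cross-session instability (not the library's actual hasher), Python's built-in hash() for strings is randomized per interpreter process, so the "same" input hashes differently in different sessions; a cache keyed on such a hash would always miss:

```python
import os
import subprocess
import sys

def session_hash(value: str) -> int:
    # Run hash() in a fresh interpreter with hash randomization
    # forced on, simulating a new session each time.
    env = {**os.environ, "PYTHONHASHSEED": "random"}
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({value!r}))"],
        capture_output=True, text=True, check=True, env=env,
    )
    return int(out.stdout)

h1 = session_hash("tokenize")
h2 = session_hash("tokenize")
# h1 and h2 almost certainly differ, because each interpreter
# gets a fresh hash seed.
```

A deterministic hasher (e.g. one built on hashlib over a stable serialization) avoids this, which is why objects that don't serialize reproducibly, like some compiled regexes, break the cache.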
Hi. I believe the fix was for the nlp library. Is there a solution for handling compiled regex expressions in .map() with caching? I want to run a simple regex pattern over a big dataset, but I am running into the issue of the compiled expression not being cached.
Instead of opening a new issue, I thought I would put my query here. Let me know if a new issue would be more suitable. Thanks
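One common workaround, sketched here with illustrative names (the function and column name are not from this thread), is to keep the pattern as a plain string and compile it inside the mapped function, so the function doesn't close over a compiled regex object that the caching hasher may not serialize deterministically:

```python
import re

# Pattern stored as a plain, hashable string rather than a
# compiled regex captured by the mapped function.
PATTERN = r"\d+"

def redact_numbers(example):
    # re.compile caches compiled patterns internally, so
    # recompiling on each call is cheap.
    pat = re.compile(PATTERN)
    example["text"] = pat.sub("<num>", example["text"])
    return example

# With Hugging Face datasets this would be applied as:
#   dataset = dataset.map(redact_numbers)
```

Since the function body only references a string constant, its serialized form should be stable across sessions, letting the cache hit.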
The caching functionality doesn't work reliably when tokenizing a dataset. Here's a small example to reproduce it.
Roughly 3 out of 10 times, this example recomputes the tokenization.
Is this expected behaviour?