Caching doesn't work for map (non-deterministic) #501

Closed
wulu473 opened this issue Aug 12, 2020 · 4 comments

wulu473 commented Aug 12, 2020

The caching functionality doesn't work reliably when tokenizing a dataset. Here's a small example to reproduce it.

import nlp
import transformers

def main():
    ds = nlp.load_dataset("reddit", split="train[:500]")

    tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")

    def convert_to_features(example_batch):
        input_str = example_batch["body"]
        encodings = tokenizer(input_str, add_special_tokens=True, truncation=True)
        return encodings

    ds = ds.map(convert_to_features, batched=True)

if __name__ == "__main__":
    main()

Roughly 3/10 times, this example recomputes the tokenization.

Is this expected behaviour?

lhoestq self-assigned this Aug 13, 2020

lhoestq (Member) commented Aug 13, 2020

Thanks for reporting!

To store the cache file, we compute a hash of the function passed to .map, using our own hashing function.
For the tokenizer, that hash doesn't seem to stay the same across sessions.
Apparently this is because the regex at tokenizer.pat is not well supported by our hashing function.

I'm working on a fix.
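
To illustrate the idea described above, here is a minimal sketch of a cache key derived from a function and the values it captures. It is not the library's actual hashing function: `fingerprint`, `stable_bytes` and `make_cleaner` are hypothetical names, and the sketch only assumes that a compiled pattern can be described by its source string and flags. If any captured value serialized differently between sessions, the key would change and the cached result would be recomputed.

```python
import hashlib
import pickle
import re


def stable_bytes(obj):
    """Serialize a captured value so the bytes depend on its content only."""
    if isinstance(obj, re.Pattern):
        # Describe a compiled pattern by its source and flags, never by id().
        return pickle.dumps(("re.Pattern", obj.pattern, obj.flags))
    return pickle.dumps(obj)


def fingerprint(func):
    """Toy cache key: hash the bytecode plus every value the function closes over."""
    digest = hashlib.sha256()
    digest.update(func.__code__.co_code)
    for cell in func.__closure__ or ():
        digest.update(stable_bytes(cell.cell_contents))
    return digest.hexdigest()


def make_cleaner(pattern):
    """Build a batched map-style function that captures a compiled regex."""
    compiled = re.compile(pattern)

    def clean(batch):
        return {"body": [compiled.sub(" ", text) for text in batch["body"]]}

    return clean


same_a = fingerprint(make_cleaner(r"\s+"))
same_b = fingerprint(make_cleaner(r"\s+"))
other = fingerprint(make_cleaner(r"\d+"))
print(same_a == same_b)  # True: same code and same captured pattern
print(same_a == other)   # False: the captured pattern differs
```

The comparison only shows determinism within one process; the problem reported here appeared across separate sessions.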

wulu473 (Author) commented Aug 24, 2020

Thanks everyone. Works great now.

MaveriQ commented Aug 4, 2022

Hi. I believe the fix was for the nlp library. Is there a way to handle compiled regular expressions in .map() with caching? I want to run a simple regex pattern on a big dataset, but I keep running into the issue that the map call is not cached when it uses a compiled expression.

Instead of opening a new issue, I thought I would put my query here. Let me know if a new issue would be more suitable. Thanks
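
One possible workaround for the situation described above, sketched under assumptions: keep the pattern as a plain string and compile it inside the mapped function, so the function refers only to a string rather than to a compiled object. The thread does not confirm that this is necessary or sufficient on any given library version; `WHITESPACE_PATTERN` and `normalize_whitespace` are hypothetical names, and the "body" column is borrowed from the original example.

```python
import re

# Plain string at module level: simple to serialize and identical in every session.
WHITESPACE_PATTERN = r"\s+"


def normalize_whitespace(example_batch):
    """Collapse whitespace runs in the "body" column of a batch."""
    # Compile inside the function; re.compile keeps an internal cache, so
    # repeated calls with the same pattern string stay cheap.
    pattern = re.compile(WHITESPACE_PATTERN)
    return {"body": [pattern.sub(" ", text).strip() for text in example_batch["body"]]}


batch = {"body": ["hello   world", "  tabs\tand\nnewlines  "]}
print(normalize_whitespace(batch))
```

Such a function would then be passed to ds.map(normalize_whitespace, batched=True), as in the original report.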

mariosasko (Collaborator) commented

Hi @MaveriQ! This fix is also included in the datasets library. Can you provide a reproducer?
