Caching doesn't work for map (non-deterministic) #501

wulu473 · 2020-08-12T20:20:07Z

The caching functionality doesn't work reliably when tokenizing a dataset. Here's a small example to reproduce it.

import nlp
import transformers

def main():
    ds = nlp.load_dataset("reddit", split="train[:500]")

    tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")

    def convert_to_features(example_batch):
        input_str = example_batch["body"]
        encodings = tokenizer(input_str, add_special_tokens=True, truncation=True)
        return encodings

    ds = ds.map(convert_to_features, batched=True)

if __name__ == "__main__":
    main()

Roughly 3/10 times, this example recomputes the tokenization.

Is this expected behaviour?

lhoestq · 2020-08-13T08:44:39Z

Thanks for reporting!

To store the cache file, we compute a hash of the function given in .map, using our own hashing function.
For the tokenizer, the hash doesn't seem to stay the same across sessions.
Apparently this is because the regex at tokenizer.pat is not well supported by our hashing function.

I'm working on a fix.
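
For readers following along, here is a minimal sketch of the general idea: if a cache file name is derived from a hash, that hash has to be identical in every session, or the cached result is never found again. The sketch is only an illustration and is not the library's hashing code. The names fingerprint_regex and cache_file_name, the ".arrow" naming scheme, and the example pattern are all hypothetical. It hashes a compiled regex from its pattern text and flags only, which are plain values that do not change between runs.

```python
import hashlib
import re


def fingerprint_regex(compiled):
    """Hash a compiled pattern from its source text and flags only.

    Hypothetical helper: both inputs are plain values (a str and an int),
    so the digest is the same in every Python session.
    """
    payload = repr((compiled.pattern, compiled.flags)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cache_file_name(prefix, compiled):
    # Hypothetical naming scheme: a stable fingerprint yields a stable file
    # name, so a later session can locate and reuse the earlier result.
    return f"{prefix}-{fingerprint_regex(compiled)[:16]}.arrow"


if __name__ == "__main__":
    pat_a = re.compile(r"'s|'t| ?\w+")
    pat_b = re.compile(r"'s|'t| ?\w+")
    # Equal patterns map to the same cache file name.
    print(cache_file_name("cache", pat_a) == cache_file_name("cache", pat_b))
```

Running the script prints True, since both patterns have the same text and flags. A different pattern, or the same pattern compiled with different flags, produces a different file name, so a stale result would not be reused by mistake.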

wulu473 · 2020-08-24T16:35:00Z

Thanks everyone. Works great now.

MaveriQ · 2022-08-04T14:37:18Z

Hi. I believe the fix was for the nlp library. Is there a way to handle compiled regex expressions in .map() with caching? I want to run a simple regex pattern on a big dataset, but I am running into the issue of the compiled expression not being cached.

Instead of opening a new issue, I thought I would put my query here. Let me know if a new issue would be more suitable. Thanks.

mariosasko · 2022-08-08T11:02:23Z

Hi @MaveriQ! This fix is also included in the datasets library. Can you provide a reproducer?

lhoestq self-assigned this Aug 13, 2020

lhoestq mentioned this issue Aug 13, 2020

Fix tokenizers caching #502

Merged

wulu473 closed this as completed Aug 24, 2020
