
Llama-3 offset-mapping needs fixing #1553

Closed
davidb-cerebras opened this issue Jun 14, 2024 · 13 comments · Fixed by #1640

@davidb-cerebras

Opening a new issue to follow up on the previously opened issue here -- #1517

Here we can see the desired behavior for return_offsets_mapping: Mistral gives character indices corresponding to each token:

(Pdb) from transformers import AutoTokenizer
(Pdb) tok_mistral = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
(Pdb) tok_mistral(["Sample input"], return_offsets_mapping=True)
{'input_ids': [[1, 27797, 2787]], 'attention_mask': [[1, 1, 1]], 'offset_mapping': [[(0, 0), (0, 6), (6, 12)]]}
(Pdb) tok_mistral.convert_ids_to_tokens([1, 27797, 2787])
['<s>', '▁Sample', '▁input']
(Pdb) "Sample input"[0:6]
'Sample'
(Pdb) "Sample input"[6:12]
' input'

But for Llama-3 they are not correct:

(Pdb) tok_llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct") 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(Pdb) tok_llama3(["Sample input"], return_offsets_mapping=True)
{'input_ids': [[128000, 18031, 1988]], 'attention_mask': [[1, 1, 1]], 'offset_mapping': [[(0, 0), (0, 0), (6, 6)]]}

We can also see Llama-2 and GPT-2 behaving the same as Mistral, so Llama-3 is definitely the one with the unexpected behavior:

(Pdb) tok_llama2 = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
(Pdb) tok_llama2(["Sample input"], return_offsets_mapping=True)
{'input_ids': [[1, 21029, 1881]], 'attention_mask': [[1, 1, 1]], 'offset_mapping': [[(0, 0), (0, 6), (6, 12)]]}
(Pdb) tok_gpt2 = AutoTokenizer.from_pretrained("openai-community/gpt2") 
(Pdb) tok_gpt2(["Sample input"], return_offsets_mapping=True)
{'input_ids': [[36674, 5128]], 'attention_mask': [[1, 1]], 'offset_mapping': [[(0, 6), (6, 12)]]}
@davidb-cerebras (Author)

@ArthurZucker Is it possible to fix this in tokenizers?

@ArthurZucker (Collaborator)

Yep, you are right. I'll dive in a bit to see why we have this!

@davidb-cerebras (Author)

Awesome, thank you!

@maximilianmordig

@ArthurZucker Is there a workaround in the meantime?

@ArthurZucker (Collaborator)

Sorry, not yet! I am fixing a bunch of stuff. Maybe #1568?

@davidb-cerebras (Author) commented Jul 22, 2024

@maximilianmordig Cerebras has implemented a wrapper that corrects the buggy method, feel free to use the wrapper class here: https://github.com/Cerebras/modelzoo/blob/main/src/cerebras/modelzoo/data_preparation/data_preprocessing/custom_tokenizer_example/CustomLlama3Tokenizer.py
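
For anyone who cannot pull in the modelzoo package, a rough stand-alone sketch of the same idea (recomputing offsets by decoding growing prefixes of the input ids) might look like this; it is only an illustration, not the actual CustomLlama3Tokenizer code, and it assumes decode() round-trips the text exactly:

from transformers import AutoTokenizer

def recompute_offsets(tokenizer, text):
    # Recompute per-token (start, end) character offsets by decoding
    # progressively longer prefixes of the encoded ids. Only valid when
    # decode() reproduces the original text exactly (byte-level BPE, no
    # normalization); characters split across tokens need extra care.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    offsets, prev_end = [], 0
    for i in range(1, len(ids) + 1):
        end = len(tokenizer.decode(ids[:i]))
        offsets.append((prev_end, end))
        prev_end = end
    return offsets

tok_llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(recompute_offsets(tok_llama3, "Sample input"))  # expected: [(0, 6), (6, 12)]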

@github-actions (bot)

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

The github-actions bot added the Stale label on Aug 22, 2024.
@srinjoym-cerebras

Hey, any update on this?

@ArthurZucker (Collaborator)

Hey! Sorry, not yet. It's not on my stack, but I will investigate for the next release as there is a need from all of you! 🤗

@tcleberg

Is there anyone whose stack this is who could try to resolve this?

@ArthurZucker (Collaborator)

I think it's ignore_merges
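
For context, ignore_merges is a flag on the serialized BPE model in tokenizer.json. A quick way to see whether a given checkpoint sets it (a sketch, assuming a tokenizers version recent enough to serialize the flag):

import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# Serialize the backing tokenizers.Tokenizer and inspect the BPE model
# settings; recent Llama-3 tokenizer.json files carry an ignore_merges flag.
model_cfg = json.loads(tok.backend_tokenizer.to_str())["model"]
print(model_cfg.get("ignore_merges"))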

@ArthurZucker (Collaborator)

PR will fix!

@ArthurZucker (Collaborator)

I'll do a patch this week for this!~
