
Allow already tokenized sequences for response_template in DataCollatorForCompletionOnlyLM #622

Merged

Conversation

@ivsanro1 (Contributor) commented Aug 7, 2023

This PR fixes the issue described in #598.

Problem

Some tokenizers tokenize the same string differently depending on whether it has left context. An example is the tokenizer of meta-llama/Llama-2-7b-hf, which tokenizes the response_template differently in these two cases:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def print_tokens_with_ids(txt):
    tokens = tokenizer.tokenize(txt, add_special_tokens=False)
    token_ids = tokenizer.encode(txt, add_special_tokens=False)
    print(list(zip(tokens, token_ids)))

prompt = """### User: Hello\n\n### Assistant: Hi, how can I help you?"""
print_tokens_with_ids(prompt)  # [..., ('▁Hello', 15043), ('<0x0A>', 13), ('<0x0A>', 13), ('##', 2277), ('#', 29937), ('▁Ass', 4007), ('istant', 22137), (':', 29901), ...]

response_template = "### Assistant:"
print_tokens_with_ids(response_template)  # [('▁###', 835), ('▁Ass', 4007), ('istant', 22137), (':', 29901)]

This leads to an error when the instantiated DataCollatorForCompletionOnlyLM data collator searches for the response_template's token ids in the tokenized instruction dataset: because the template was tokenized without left context, its ids never match the ids in the dataset, and the collator cannot find the template.
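
To make the failure concrete, here is a minimal sketch of the kind of token-id subsequence search that comes up empty (the helper is illustrative, not the collator's actual implementation), reusing prompt and response_template from above:

# Illustrative only: a naive token-id subsequence search, similar in spirit
# to how the collator locates the template inside each tokenized example.
def contains_subsequence(haystack, needle):
    return any(haystack[i:i + len(needle)] == needle
               for i in range(len(haystack) - len(needle) + 1))

prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
template_ids = tokenizer.encode(response_template, add_special_tokens=False)
# The template encodes to [835, 4007, 22137, 29901], but inside the prompt
# the same text appears as [2277, 29937, 4007, 22137, 29901], so no match.
print(contains_subsequence(prompt_ids, template_ids))  # False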

Solution

Allow response_template in DataCollatorForCompletionOnlyLM to be passed directly as a list of token ids. The user can tokenize the template with enough left context and then slice off the context tokens, so that the remaining ids match how the template appears in the instruction dataset texts.
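
With this change, the intended usage looks like the following sketch (based on the documentation subsection added in this PR; the "\n" context and the [2:] slice are specific to this tokenizer, and other tokenizers may need different context and offsets):

from trl import DataCollatorForCompletionOnlyLM

# Tokenize the template with some left context ("\n" is enough here), then
# slice off the context tokens so the remaining ids match how the template
# appears inside the dataset texts.
response_template_with_context = "\n### Assistant:"
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)[2:]
# response_template_ids == [2277, 29937, 4007, 22137, 29901]

data_collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)

Passing the pre-tokenized ids sidesteps the context-dependent tokenization entirely, since the collator then matches token ids rather than re-tokenizing the template string.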

Extra additions in the PR

I added a test to check the fix, for which I had to use a non-official tokenizer (upstage/Llama-2-70b-instruct-v2 instead of meta-llama/Llama-2-7b-hf), because the official one requires being logged in with a Hugging Face account that has accepted the license and been granted access to the tokenizer, as described in the model card.

I also added a subsection to the documentation that explains the problem and how to solve it using the fix introduced in this PR.

@HuggingFaceDocBuilderDev commented Aug 7, 2023

The documentation is not available anymore as the PR was closed or merged.

@ivsanro1 (Contributor, Author) commented Aug 7, 2023

Sorry, I forgot to delete some unneeded code from the test. Now it should be ready for review.

@lvwerra (Member) left a comment

Thanks for the fix, this looks good to me!
