
Allow already tokenized sequences for response_template in DataCollatorForCompletionOnlyLM #622

Merged

Conversation

@ivsanro1 (Contributor) commented Aug 7, 2023

This PR fixes the issue described in #598.

Problem

Some tokenizers tokenize the same string differently depending on whether it has left context. An example is the tokenizer of meta-llama/Llama-2-7b-hf, which tokenizes the response_template differently in these two cases:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def print_tokens_with_ids(txt):
    tokens = tokenizer.tokenize(txt, add_special_tokens=False)
    token_ids = tokenizer.encode(txt, add_special_tokens=False)
    print(list(zip(tokens, token_ids)))

prompt = """### User: Hello\n\n### Assistant: Hi, how can I help you?"""
print_tokens_with_ids(prompt)  # [..., ('▁Hello', 15043), ('<0x0A>', 13), ('<0x0A>', 13), ('##', 2277), ('#', 29937), ('▁Ass', 4007), ('istant', 22137), (':', 29901), ...]

response_template = "### Assistant:"
print_tokens_with_ids(response_template)  # [('▁###', 835), ('▁Ass', 4007), ('istant', 22137), (':', 29901)]

This leads to an error when the instantiated DataCollatorForCompletionOnlyLM data collator searches for the response_template's token ids in the tokenized instruction dataset: because the template was tokenized without left context, its ids never match the ids in the dataset, and the collator cannot find the template.
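
To make the failure concrete, here is a minimal sketch of the kind of token-id subsequence search that comes up empty (the helper is illustrative, not the collator's actual implementation), reusing prompt and response_template from above:

# Illustrative only: a naive token-id subsequence search, similar in spirit
# to how the collator locates the template inside each tokenized example.
def contains_subsequence(haystack, needle):
    return any(haystack[i:i + len(needle)] == needle
               for i in range(len(haystack) - len(needle) + 1))

prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
template_ids = tokenizer.encode(response_template, add_special_tokens=False)
# The template encodes to [835, 4007, 22137, 29901], but inside the prompt
# the same text appears as [2277, 29937, 4007, 22137, 29901], so no match.
print(contains_subsequence(prompt_ids, template_ids))  # False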

Solution

Allow response_template in DataCollatorForCompletionOnlyLM to be passed directly as a list of token ids. The user can tokenize the template with enough left context and then slice off the context tokens, so that the remaining ids match how the template appears in the instruction dataset texts.
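
With this change, the intended usage looks like the following sketch (based on the documentation subsection added in this PR; the "\n" context and the [2:] slice are specific to this tokenizer, and other tokenizers may need different context and offsets):

from trl import DataCollatorForCompletionOnlyLM

# Tokenize the template with some left context ("\n" is enough here), then
# slice off the context tokens so the remaining ids match how the template
# appears inside the dataset texts.
response_template_with_context = "\n### Assistant:"
response_template_ids = tokenizer.encode(response_template_with_context, add_special_tokens=False)[2:]
# response_template_ids == [2277, 29937, 4007, 22137, 29901]

data_collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)

Passing the pre-tokenized ids sidesteps the context-dependent tokenization entirely, since the collator then matches token ids rather than re-tokenizing the template string.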

Extra additions in the PR

I added a test to check the fix, for which I had to use a non-official tokenizer (upstage/Llama-2-70b-instruct-v2 instead of meta-llama/Llama-2-7b-hf), because the official one requires being logged in with a Hugging Face account that has accepted the license and been granted access to the tokenizer, as described in the model card.

I also added a subsection to the documentation that explains the problem and how to solve it using the fix introduced in this PR.

@HuggingFaceDocBuilderDev commented Aug 7, 2023

The documentation is not available anymore as the PR was closed or merged.

@ivsanro1 (Contributor, Author) commented Aug 7, 2023

Sorry, I forgot to delete some unneeded code from the test. Now it should be ready for review.

@lvwerra (Member) left a comment

Thanks for the fix, this looks good to me!
