Allow already tokenized sequences for response_template in DataCollatorForCompletionOnlyLM #622
This PR fixes the issue described in #598.
Problem
Some tokenizers tokenize the same string differently depending on whether it has left context. One example is the tokenizer of meta-llama/Llama-2-7b-hf, which tokenizes the response_template differently when it appears on its own than when it follows other text.
This leads to an error when the instantiated data collator, DataCollatorForCompletionOnlyLM, looks for the response_template in the text of the instruction dataset: the tokens produced without context do not match the tokens in the dataset, so the template is not found.
Solution
Allow response_template in DataCollatorForCompletionOnlyLM to be given directly as token_ids. These can be prepared by tokenizing the template with enough left context and then slicing off the context tokens, so that they match how the template appears in instruction datasets, where it has left context.
Extra additions in the PR
I added a test to check the fix, for which I had to use a non-official tokenizer (upstage/Llama-2-70b-instruct-v2 instead of meta-llama/Llama-2-7b-hf) because the official one requires being logged in to a Hugging Face account that has accepted the license and been granted access to the tokenizer, as described in the model card.
I also added a subsection in the documentation that explains the problem and how to solve it using the fix, which only works with the changes in this PR.
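To make the mismatch described under Problem and the slicing idea described under Solution concrete, here is a minimal, self-contained sketch. It is not the actual DataCollatorForCompletionOnlyLM code: the helper names (find_subsequence, mask_before_response) are hypothetical, and all token ids are made up for illustration rather than taken from a real tokenizer.

```python
from typing import List, Optional

# Label value conventionally ignored by the loss, so masked tokens do not contribute.
IGNORE_INDEX = -100


def find_subsequence(haystack: List[int], needle: List[int]) -> Optional[int]:
    """Return the start index of the first occurrence of needle in haystack, or None."""
    if not needle:
        return None
    for start in range(len(haystack) - len(needle) + 1):
        if haystack[start:start + len(needle)] == needle:
            return start
    return None


def mask_before_response(input_ids: List[int], response_template_ids: List[int]) -> List[int]:
    """Build labels that ignore everything up to and including the response template."""
    start = find_subsequence(input_ids, response_template_ids)
    if start is None:
        raise ValueError("response template token ids not found in input ids")
    end = start + len(response_template_ids)
    return [IGNORE_INDEX] * end + input_ids[end:]


# Hypothetical token ids (assumption, for illustration only).
# Template tokenized on its own: the first token differs from the in-context form.
template_ids_no_context = [835, 4007, 22137, 29901]
# Template tokenized after some left context: the first two ids belong to the context.
template_ids_with_context = [29871, 13, 2277, 29937, 4007, 22137, 29901]
# Slice off the context tokens to keep only the in-context form of the template.
template_ids_sliced = template_ids_with_context[2:]

# A tokenized training example in which the template appears with left context.
example_input_ids = [1, 15043, 13, 2277, 29937, 4007, 22137, 29901, 6324, 2]

found_plain = find_subsequence(example_input_ids, template_ids_no_context)
found_sliced = find_subsequence(example_input_ids, template_ids_sliced)
labels = mask_before_response(example_input_ids, template_ids_sliced)
```

In this toy setup the ids produced without context are never found in the example, while the sliced ids are. With a real tokenizer, how many leading tokens to slice off depends on the tokenizer and on the context chosen, so it should be checked by inspecting the encoded ids.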