Correct minor errors in example notebooks for causal language modelling #926
What does this PR do?
Corrects minor errors in dataset preprocessing in the example notebooks at examples/causal_language_modeling. I believe there are two mistakes that can cause issues when people use the same code for a different model or task:

1. Whether special tokens are added when tokenizing the label depends on the tokenizer and its `add_special_tokens` argument. For example, with GPT2 a BOS token is not added, while for Llama-2 a BOS token will be added. Because of this, if you simply tokenize the label with `tokenizer(labels)` and then concatenate it with the input sequence, you can end up with a stray BOS token. You can quickly check this out yourself with a code block along the following lines:
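(Sketch only; the checkpoint names are assumptions, and any Llama-2 and GPT2 tokenizers show the same behaviour.)

```python
from transformers import AutoTokenizer

text = "Tweet text : @HMRCcustomers No this is my first job Label : "
label = "Neutral"

for model_name in ["meta-llama/Llama-2-7b-hf", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    prompt_ids = tokenizer(text)["input_ids"]
    # Tokenizing the label on its own may prepend another BOS token,
    # depending on the tokenizer's default for add_special_tokens.
    label_ids = tokenizer(label)["input_ids"]
    print(tokenizer.decode(prompt_ids + label_ids))
```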
The outputs for Llama-2 and GPT2 are

`<s> Tweet text : @HMRCcustomers No this is my first job Label : <s> Neutral`

and

`Tweet text : @HMRCcustomers No this is my first job Label : Neutral`

respectively. In some cases, the BOS and EOS tokens are the same, so this can lead to lower performance.

2. The final token after concatenating the label should be an EOS token, which may be different from the padding token (see the sketch below).

Correct me if I'm wrong!
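For reference, a minimal, self-contained sketch of how both fixes fit together (the variable names and the `gpt2` checkpoint are illustrative, not the notebook's exact code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tweet text : @HMRCcustomers No this is my first job Label : "
label = "Neutral"

prompt_ids = tokenizer(text)["input_ids"]
# add_special_tokens=False avoids a stray BOS token in the middle of the sequence (point 1).
label_ids = tokenizer(label, add_special_tokens=False)["input_ids"]

# End the sequence with the EOS token id explicitly, not the padding token id (point 2).
input_ids = prompt_ids + label_ids + [tokenizer.eos_token_id]
# Compute the loss only on the label and the EOS token.
labels = [-100] * len(prompt_ids) + label_ids + [tokenizer.eos_token_id]
```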