Correct minor errors in example notebooks for causal language modelling #926

SumanthRH · 2023-09-13T01:58:21Z

What does this PR do?

Corrects minor errors in dataset preprocessing in the example notebooks at examples/causal_language_modeling. I believe there are two mistakes that can cause issues when people reuse the same code for a different model or task:

1. Input tweet text and labels are concatenated after tokenization. However, with 🤗 tokenizers, whether a BOS token is added depends on the model when no explicit value is passed for the add_special_tokens argument. For example, the GPT-2 tokenizer does not add a BOS token, while the Llama 2 tokenizer does. Because of this, if you simply tokenize the label with tokenizer(labels) and then concatenate it with the input sequence, you can end up with a stray BOS token in the middle of the sequence. You can quickly check this yourself with the following code block:

from transformers import AutoTokenizer

# Load one tokenizer that adds no BOS by default (GPT-2) and one that does (Llama 2).
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
llama_tokenizer = AutoTokenizer.from_pretrained("stabilityai/StableBeluga2") # stable beluga has the same tokenizer as Llama 2
input_text = "@HMRCcustomers No this is my first job"
label = "Neutral"
input_formatted = f"Tweet text : {input_text} Label : "
# Tokenize prompt and label separately with default settings, then concatenate the token ids.
concat_seq_llama = llama_tokenizer(input_formatted)["input_ids"] + llama_tokenizer(label)["input_ids"]
concat_seq_gpt2 = gpt2_tokenizer(input_formatted)["input_ids"] + gpt2_tokenizer(label)["input_ids"]
# Decode the concatenated ids to see which special tokens ended up in each sequence.
concat_seq_llama_decoded = llama_tokenizer.decode(concat_seq_llama)
concat_seq_gpt2_decoded = gpt2_tokenizer.decode(concat_seq_gpt2)
print(concat_seq_llama_decoded)
print(concat_seq_gpt2_decoded)

The outputs for Llama 2 and GPT-2 are <s> Tweet text : @HMRCcustomers No this is my first job Label : <s> Neutral and Tweet text : @HMRCcustomers No this is my first job Label : Neutral respectively. Note the second <s> before the label in the Llama 2 output. In some cases, the BOS and EOS tokens are the same, so this stray token can lead to lower performance.
2. The final token after concatenating the label should be an EOS token, which may be different from the padding token. Correct me if I'm wrong!
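A minimal sketch of how the two points above could be handled together, not necessarily the exact change made in the notebooks. The helper name build_input_ids and the ToyTokenizer class are hypothetical, introduced only so the example runs offline. ToyTokenizer stands in for a real tokenizer that prepends BOS by default (as the Llama 2 tokenizer does). The only real tokenizer features assumed are the add_special_tokens argument and the eos_token_id attribute.

```python
from typing import List


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer that prepends BOS by default."""

    bos_token_id = 1
    eos_token_id = 2

    def __init__(self) -> None:
        self.vocab = {}

    def __call__(self, text: str, add_special_tokens: bool = True) -> dict:
        # Assign each whitespace-separated word a stable id, starting at 3.
        ids = [self.vocab.setdefault(w, len(self.vocab) + 3) for w in text.split()]
        if add_special_tokens:
            ids = [self.bos_token_id] + ids
        return {"input_ids": ids}


def build_input_ids(tokenizer, prompt: str, label: str) -> List[int]:
    """Concatenate prompt and label ids with no stray BOS and a final EOS."""
    # The prompt keeps the tokenizer's default special-token behaviour.
    prompt_ids = tokenizer(prompt)["input_ids"]
    # The label is tokenized without special tokens, so no second BOS appears.
    label_ids = tokenizer(label, add_special_tokens=False)["input_ids"]
    # End the sequence with EOS explicitly rather than with the padding token.
    return prompt_ids + label_ids + [tokenizer.eos_token_id]


tok = ToyTokenizer()
ids = build_input_ids(tok, "Tweet text : hello Label : ", "Neutral")
print(ids)
```

With a real tokenizer, the same helper would be called as build_input_ids(AutoTokenizer.from_pretrained("gpt2"), prompt, label). The result then ends in that model's own EOS id regardless of whether the model adds a BOS by default.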

review-notebook-app · 2023-09-13T01:58:25Z

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

younesbelkada

LGTM thanks, however I would like @pacman100 to have a look here if possible as he wrote those notebooks, just to be sure!

HuggingFaceDocBuilderDev · 2023-09-20T14:21:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

pacman100

Thank you @SumanthRH for fixing the example!

SumanthRH added 4 commits July 16, 2023 11:50

updated Readme

72c6ecf

Merge branch 'main' of https://github.com/huggingface/peft into main

2b2b286

Corrected label bos token error; switched to eos token from pad token

09dae19

reverted readme change

e641fac

younesbelkada approved these changes Sep 13, 2023


younesbelkada requested a review from pacman100 September 13, 2023 08:38

Merge remote-tracking branch 'upstream/main' into fix-notebooks

e79a57e

pacman100 approved these changes Oct 3, 2023


pacman100 merged commit 3d0edcc into huggingface:main Oct 3, 2023
11 checks passed

SumanthRH deleted the fix-notebooks branch October 13, 2023 19:06
