
Misleading warning message about pad_token_id when passing tokenizer instance to pipeline #29378

Closed
tilmanbeck opened this issue Feb 29, 2024 · 3 comments · Fixed by #29614

Comments

@tilmanbeck

tilmanbeck commented Feb 29, 2024

System Info

Python: 3.11.5
transformers: 4.37.2

In my setup I initialize a tokenizer and pass it to the pipeline. My expectation is that if I set the pad_token_id directly on the tokenizer instance, the pipeline should not print the warning Setting pad_token_id to eos_token_id:50256 for open-end generation. If I pass pad_token_id directly to the pipeline's __call__, the warning is indeed not printed. However, I would rather set pad_token_id once on the tokenizer than have to remember to pass it every time I use the pipeline's __call__ method.

Alternatively, I would like to pass it as a parameter at pipeline instantiation, but passing tokenizer parameters to the pipeline factory is, I think, currently not envisioned, as discussed in #12039, #24707 and #22995.

@Narsil

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here is a minimal code example to demonstrate the difference:

from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM, Conversation

msg = 'The capital of France '
modelname = 'microsoft/DialoGPT-small'

# hand over model & tokenizer instantiations to the pipeline
model = AutoModelForCausalLM.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)
tokenizer.pad_token_id = tokenizer.eos_token_id  # set pad_token_id on the tokenizer instance

chatbot = pipeline(task='conversational', model=model, tokenizer=tokenizer, framework='pt')
messages = [{"role": "system", "content": 'You are a helpful assistant'},
            {"role": "user", "content": msg}]
response = chatbot(Conversation(messages=messages))

This prints the warning message:
Setting 'pad_token_id' to 'eos_token_id':50256 for open-end generation.

The following code does not:

from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM, Conversation

msg = 'The capital of France '
modelname = 'microsoft/DialoGPT-small'

# let the pipeline instantiate model and tokenizer itself,
# and pass pad_token_id at call time instead
chatbot = pipeline(task='conversational', model=modelname, framework='pt')
messages = [{"role": "system", "content": 'You are a helpful assistant'},
            {"role": "user", "content": msg}]
response = chatbot(Conversation(messages=messages), pad_token_id=chatbot.tokenizer.eos_token_id)

Expected behavior

I would expect the first code snippet not to print the warning, since pad_token_id is set directly on the tokenizer instance.

@ArthurZucker
Collaborator

cc @Rocketknight1

@Rocketknight1
Member

This issue and #29379 feel a bit like generation issues to me - @gante do you see any obvious solutions? If not, don't worry, I'll investigate.

@gante
Member

gante commented Mar 12, 2024

Hi @tilmanbeck 👋

The root issue is that the tokenizer and the model are two separate objects, so pad_token_id needs to be set on both. In pipeline, IMO the model should inherit pad_token_id from the tokenizer when it is not set there. Opening a PR to fix that :)
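Until that fix lands, a workaround consistent with this explanation is to mirror the tokenizer's pad token onto the model yourself before building the pipeline. This is a minimal sketch, assuming only the model's generation_config attribute (the place generate() reads pad_token_id from); it is not the fix from #29614 itself:

from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM, Conversation

modelname = 'microsoft/DialoGPT-small'
model = AutoModelForCausalLM.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)
tokenizer.pad_token_id = tokenizer.eos_token_id

# copy the tokenizer's pad token onto the model's generation config,
# which is what generate() actually consults before emitting the warning
model.generation_config.pad_token_id = tokenizer.pad_token_id

chatbot = pipeline(task='conversational', model=model, tokenizer=tokenizer, framework='pt')
response = chatbot(Conversation(messages=[{"role": "user", "content": "The capital of France "}]))
# no "Setting pad_token_id to eos_token_id" warning is printed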
