Pipeline: use tokenizer pad token at generation time if the model pad token is unset. #29614

gante · 2024-03-12T15:27:14Z

What does this PR do?

The tagged issue describes the problem, the title describes the fix :D

Example of a script that no longer emits a warning, after this PR:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token_id = tokenizer.eos_token_id

llm = pipeline(task='text-generation', model=model, tokenizer=tokenizer, framework='pt')
response = llm('The capital of France ')
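The fallback named in the title can be sketched in isolation. This is a minimal illustration, not the code merged in this PR: `resolve_pad_token_id` is a hypothetical helper, and the only assumption is that the model's own pad token wins whenever it is set.

```python
from typing import Optional


def resolve_pad_token_id(
    model_pad_token_id: Optional[int],
    tokenizer_pad_token_id: Optional[int],
) -> Optional[int]:
    """Hypothetical helper: keep the model's pad token if set, else use the tokenizer's."""
    # Compare against None explicitly: 0 is a valid token id and must not be treated as "unset".
    if model_pad_token_id is not None:
        return model_pad_token_id
    return tokenizer_pad_token_id


# GPT-2-like case: the model defines no pad token, the tokenizer reuses EOS (id 50256).
print(resolve_pad_token_id(None, 50256))  # prints 50256

# The model already defines a pad token, so it is left untouched.
print(resolve_pad_token_id(0, 50256))  # prints 0
```

The explicit `is not None` check matters because a truthiness test would wrongly discard a pad token id of 0.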

HuggingFaceDocBuilderDev · 2024-03-12T15:53:32Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

amyeroberts

Thanks for working on this!

Just some questions about the implementation.

src/transformers/pipelines/automatic_speech_recognition.py

tests/pipelines/test_pipelines_text_generation.py

gante · 2024-03-14T12:09:41Z

src/transformers/pipelines/conversational.py

@@ -196,9 +196,7 @@ def new_user_input(self):
    build_pipeline_init_args(has_tokenizer=True),
    r"""
        min_length_for_response (`int`, *optional*, defaults to 32):
-            The minimum length (in number of tokens) for a response.
-        minimum_tokens (`int`, *optional*, defaults to 10):


minimum_tokens is an unused internal variable, probably a legacy version of min_length.

Initially, I removed it from the signature of the private _forward, as I was touching it. Then, I realized we could remove all traces since it is unused :)

gante · 2024-03-14T12:46:16Z

src/transformers/pipelines/automatic_speech_recognition.py

@@ -311,14 +311,14 @@ def _sanitize_parameters(

        forward_params = defaultdict(dict)
        if max_new_tokens is not None:
-            forward_params["generate_kwargs"]["max_new_tokens"] = max_new_tokens
+            forward_params["max_new_tokens"] = max_new_tokens


Note regarding this file's diff, also applicable to the diff in src/transformers/pipelines/image_to_text.py:

The conventional strategy to pass kwargs to generate is through **forward_params. Previously in this file, the generation kwargs were held as forward_params["generate_kwargs"], which prevented the use of the conventional strategy. There isn't really a reason to hold these kwargs separately: generate is the only sink for kwargs in models that can generate. Models that can't generate ~~will~~ should throw an exception regardless of the container for kwargs. As such, this diff aims at minimizing the differences in generate parameterization across pipelines :)
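To make the two container styles concrete, here is a small self-contained sketch. It does not reproduce the pipeline classes: `fake_generate`, `sanitize_nested`, `sanitize_flat`, `forward_nested` and `forward_flat` are hypothetical stand-ins, and the flat variant uses a plain dict for brevity.

```python
from collections import defaultdict


def fake_generate(inputs, **generate_kwargs):
    """Stand-in for a model's generate call; it simply echoes what it received."""
    return {"inputs": inputs, **generate_kwargs}


# Nested style (before the diff): generation kwargs live under a dedicated key.
def sanitize_nested(max_new_tokens=None):
    forward_params = defaultdict(dict)
    if max_new_tokens is not None:
        forward_params["generate_kwargs"]["max_new_tokens"] = max_new_tokens
    return forward_params


def forward_nested(inputs, generate_kwargs=None):
    # The forward step has to know about the extra container and unpack it.
    return fake_generate(inputs, **(generate_kwargs or {}))


# Flat style (after the diff): generation kwargs sit at the top level.
def sanitize_flat(max_new_tokens=None):
    forward_params = {}
    if max_new_tokens is not None:
        forward_params["max_new_tokens"] = max_new_tokens
    return forward_params


def forward_flat(inputs, **generate_kwargs):
    # Everything in forward_params flows straight through to generate.
    return fake_generate(inputs, **generate_kwargs)


nested_result = forward_nested("audio", **sanitize_nested(max_new_tokens=20))
flat_result = forward_flat("audio", **sanitize_flat(max_new_tokens=20))
print(nested_result == flat_result)  # prints True
```

Both styles deliver the same arguments to the generate stand-in. The difference is only whether the forward step needs a dedicated `generate_kwargs` parameter.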

amyeroberts · 2024-03-14

I'm not a fan of this - it's far cleaner to clearly outline what are generate kwargs and what are not. In the current pipelines the models might be the only sink, but that's not guaranteed.

amyeroberts · 2024-03-14

Actually - I realise what I've said about the forward kwargs is wrong here - we can just assume they're passed to the model. In this case, my preference is to still have "generate_kwargs" explicitly in the forward_kwargs, but I don't feel strongly and don't mind if you leave as-is.

gante · 2024-03-15

I think we can agree that regardless of the pattern we choose here, it should be applied to all pipelines with generative capabilities for consistency. Based on this premise, enforcing a separation of generate_kwargs this exact way will break backward compatibility, i.e. the following would not be possible:

from transformers import pipeline

llm = pipeline(task='text-generation', model="openai-community/gpt2")
response = llm('The capital of France ', max_length=50)

Nevertheless, I am aligned with you -- we should separate them! We can do it through generation_config.update(**kwargs), and perform the required validation with the aid of generation_config.validate(). One of the requirements to do so is to have a single big blob of keyword arguments to untangle, and thus these changes go in this direction.

Let me know if you agree, in which case I'll merge the PR and prepare this follow-up. [My instinct was to merge this PR now, but I've held it back -- I've merged too many not-100%-approved PRs recently 😉 ]
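The proposed follow-up (untangle one flat blob of keyword arguments through the generation config) can be sketched with a toy class. `ToyGenerationConfig` and `split_kwargs` are hypothetical stand-ins, not the transformers implementation. The sketch assumes that `update` applies the kwargs it recognises and hands back the rest, which is my understanding of how `generation_config.update(**kwargs)` behaves; treat that as an assumption.

```python
class ToyGenerationConfig:
    """Hypothetical stand-in for a generation config (not the transformers class)."""

    def __init__(self, max_length=20, temperature=1.0):
        self.max_length = max_length
        self.temperature = temperature

    def update(self, **kwargs):
        """Apply kwargs matching existing attributes; return the ones left unused."""
        unused = {}
        for key, value in kwargs.items():
            if key in vars(self):
                setattr(self, key, value)
            else:
                unused[key] = value
        return unused

    def validate(self):
        """Reject obviously invalid generation settings."""
        if self.max_length <= 0:
            raise ValueError("max_length must be positive")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")


def split_kwargs(config, **kwargs):
    """Route generation kwargs into the config and return everything else."""
    leftover = config.update(**kwargs)
    config.validate()
    return leftover


config = ToyGenerationConfig()
# `return_timestamps` stands in for any argument that is not a generation setting.
leftover = split_kwargs(config, max_length=50, return_timestamps=True)
print(config.max_length)  # prints 50
print(leftover)  # prints {'return_timestamps': True}
```

With this shape, callers can keep passing `max_length=50` at the top level, while the pipeline still ends up with a clear split between generation settings and everything else.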

amyeroberts · 2024-03-15

Yeah, let's merge atm so this is unblocked and then we can iterate on something different :)

gante · 2024-03-14T12:48:21Z

@amyeroberts ready for a re-review :)

amyeroberts

Thanks for iterating on this!

Happy with the changes in general, but I think we should maintain the clear separation between "generate_kwargs" and forward_kwargs. It makes it easier to understand what controls what in the pipeline.

fix warning

3f3ca17

gante requested a review from amyeroberts March 12, 2024 15:27

amyeroberts reviewed Mar 12, 2024

gante added 3 commits March 14, 2024 12:01

set pad_token_id in the generalist forward; remove test

0acda97

extra condition

658f3d0

remove unused arg; remove unwanted diff

30c3b2e

gante commented Mar 14, 2024

gante added 2 commits March 14, 2024 12:14

move to init

399da6d

standardize interface

cef3ae0

gante commented Mar 14, 2024

gante requested a review from amyeroberts March 14, 2024 12:48

amyeroberts reviewed Mar 14, 2024

amyeroberts approved these changes Mar 14, 2024

gante merged commit 53d8912 into huggingface:main Mar 15, 2024
21 checks passed

gante deleted the fix_29378 branch March 15, 2024 13:00

ArthurZucker mentioned this pull request Apr 1, 2024

Misleading warning message about padding_side when passing tokenizer instance to pipeline #29379

Closed

4 tasks

gante mentioned this pull request Apr 15, 2024

4.39.3; ZeroShotClassificationPipeline broken. #30181

Closed

4 tasks

itazap pushed a commit that referenced this pull request May 14, 2024

Pipeline: use tokenizer pad token at generation time if the model pad token is unset. (#29614)

b016d45
