
Passing tokenizer call kwargs (like truncation) in pipeline #25994

Closed
BramVanroy opened this issue Sep 5, 2023 · 7 comments
Labels
Feature request, Good Second Issue

Comments

@BramVanroy (Collaborator)

System Info

  • transformers version: 4.32.1
  • Platform: Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.34
  • Python version: 3.11.4
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.3
  • Accelerate version: 0.22.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)

Who can help?

@ArthurZucker for the tokenizers and @Narsil for the pipeline.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am trying to figure out how I can truncate text in a pipeline (without explicitly writing a preprocessing step for my data). I've looked at the documentation and searched the net. A lot of people seem to be asking this question (both on forums and Stack Overflow), but none of the solutions I could find work anymore. Below I try a number of them, and none succeed. How can I enable truncation in the pipeline?

from transformers import pipeline

model_name = "bert-base-cased"
text = "Luke, I am not your [MASK]. " * 512  # Make sure text is longer than max model length

# As-is (error, size mismatch -- no truncation seems to happen)
pipe = pipeline("fill-mask", model=model_name)
result = pipe([text])

# truncation in init (error, unrecognized keyword)
pipe = pipeline("fill-mask", model=model_name, truncation=True)
result = pipe([text])

# truncation in call (error, unrecognized keyword)
pipe = pipeline("fill-mask", model=model_name)
result = pipe([text], truncation=True)

# truncation as tokenizer kwargs in tuple (error, size mismatch)
tokenizer_tuple = (model_name, {"truncation": True})
pipe = pipeline("fill-mask", model=model_name, tokenizer=tokenizer_tuple)
result = pipe([text])

# Truncation as tokenize_kwargs (https://github.com/huggingface/transformers/issues/21971#issuecomment-1456725779)
# Unexpected keyword error
pipe = pipeline("fill-mask", model=model_name, tokenize_kwargs={"truncation": True})
result = pipe([text])

Expected behavior

A fix if this is currently not implemented or broken, but definitely also a documentation upgrade clarifying how tokenizer kwargs should be passed to a pipeline - both init and call kwargs!

@Narsil (Contributor) commented Sep 7, 2023

Indeed it's not implemented:

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/fill_mask.py#L96

The name would be tokenizer_kwargs, though, to be consistent with the rest.
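
For illustration, the call-site usage Narsil describes might look roughly like this once implemented (a sketch; that tokenizer_kwargs is accepted as a call-time argument is an assumption here, not confirmed by the thread):

from transformers import pipeline

pipe = pipeline("fill-mask", model="bert-base-cased")
text = "Luke, I am not your [MASK]. " * 512

# Hypothetical once implemented: kwargs forwarded to the tokenizer call
# during the pipeline's preprocess step
result = pipe(text, tokenizer_kwargs={"truncation": True})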

@nmcahill (Contributor)

Hello! I'm going to take a crack at this one if that's cool.

@BramVanroy (Collaborator, Author)

Have at it! Would be great to have this implemented. @nmcahill

@ArthurZucker added the Feature request and Good Second Issue labels on Nov 8, 2023
@thedamnedrhino (Contributor)

I need the same functionality in the "text-generation" pipeline. I'd like to take a go at it!

@mirix commented Feb 6, 2024

Any solution for this?

@ArthurZucker (Collaborator)

I am pretty sure the solution is to pass the kwargs to the pipeline call, in this case for text-generation. #28362 fixed this, so closing.
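
If #28362 works the way the comment suggests, call-time usage for text-generation would look roughly like this (a sketch; that truncation is forwarded to the tokenizer after that PR is an assumption here, not verified against it):

from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2")
prompt = "Tell me a story. " * 512  # longer than the model's context window

# Assumption: truncation is forwarded to the tokenizer during preprocessing,
# while max_new_tokens controls only the generation length
result = pipe(prompt, truncation=True, max_new_tokens=20)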

@mirix commented Feb 8, 2024

The problem is that max_length (a tokenizer argument) gets confused with max_new_tokens (a generation argument).

So the pipeline complains about a duplicate argument and warns that max_new_tokens will take precedence.

Anyway, I have implemented a preprocessing function, so I have worked around the issue.
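
For reference, one possible shape for such a preprocessing step (a sketch; the truncate_prompt helper, the gpt2 model, and the 512-token limit are illustrative, not from the thread):

from transformers import AutoTokenizer, pipeline

model_name = "gpt2"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline("text-generation", model=model_name, tokenizer=tokenizer)

def truncate_prompt(text, max_length=512):
    # Tokenize with truncation, then decode back to a plain string the pipeline can consume
    ids = tokenizer(text, truncation=True, max_length=max_length)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

result = pipe(truncate_prompt("Tell me a story. " * 512), max_new_tokens=20)

This keeps the tokenizer-side max_length out of the pipeline call entirely, so it can no longer collide with max_new_tokens.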
