
Passing tokenizer call kwargs (like truncation) in pipeline #25994

Closed
BramVanroy opened this issue Sep 5, 2023 · 7 comments
Labels
Feature request, Good Second Issue

Comments

@BramVanroy (Collaborator)

System Info

  • transformers version: 4.32.1
  • Platform: Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.34
  • Python version: 3.11.4
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.3
  • Accelerate version: 0.22.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)

Who can help?

@ArthurZucker for the tokenizers and @Narsil for the pipeline.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am trying to figure out how I can truncate text in a pipeline (without explicitly writing a preprocessing step for my data). I've looked at the documentation and searched the net. A lot of people seem to be asking this question (both on forums and Stack Overflow), but none of the solutions I could find work anymore. Below I try a number of them, and none succeed. How can I enable truncation in the pipeline?

from transformers import pipeline

model_name = "bert-base-cased"
text = "Luke, I am not your [MASK]. " * 512  # Make sure text is longer than max model length

# As-is (error, size mismatch -- no truncation seems to happen)
pipe = pipeline("fill-mask", model=model_name)
result = pipe([text])

# truncation in init (error, unrecognized keyword)
pipe = pipeline("fill-mask", model=model_name, truncation=True)
result = pipe([text])

# truncation in call (error, unrecognized keyword)
pipe = pipeline("fill-mask", model=model_name)
result = pipe([text], truncation=True)

# truncation as tokenizer kwargs in tuple (error, size mismatch)
tokenizer_tuple = (model_name, {"truncation": True})
pipe = pipeline("fill-mask", model=model_name, tokenizer=tokenizer_tuple)
result = pipe([text])

# Truncation as tokenize_kwargs (https://github.com/huggingface/transformers/issues/21971#issuecomment-1456725779)
# Unexpected keyword error
pipe = pipeline("fill-mask", model=model_name, tokenize_kwargs={"truncation": True})
result = pipe([text])

Expected behavior

A fix if this is currently not implemented or broken, but definitely also a documentation upgrade clarifying how tokenizer kwargs should be passed to a pipeline - both init and call kwargs!

@Narsil (Contributor) commented Sep 7, 2023

Indeed it's not implemented:

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/fill_mask.py#L96

The name would be tokenizer_kwargs, though, to be consistent with the rest.
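
For illustration, the call-site usage Narsil describes might look roughly like this once implemented (a sketch; that tokenizer_kwargs is accepted as a call-time argument is an assumption here, not confirmed by the thread):

from transformers import pipeline

pipe = pipeline("fill-mask", model="bert-base-cased")
text = "Luke, I am not your [MASK]. " * 512

# Hypothetical once implemented: kwargs forwarded to the tokenizer call
# during the pipeline's preprocess step
result = pipe(text, tokenizer_kwargs={"truncation": True})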

@nmcahill (Contributor)

Hello! I'm going to take a crack at this one if that's cool.

@BramVanroy (Collaborator, Author)

Have at it! Would be great to have this implemented. @nmcahill

@ArthurZucker added the Feature request and Good Second Issue labels on Nov 8, 2023
@thedamnedrhino (Contributor)

I need the same functionality in the "text-generation" pipeline. I'd like to take a go at it!

@mirix commented Feb 6, 2024

Any solution for this?

@ArthurZucker (Collaborator)

I am pretty sure the solution is to pass the kwargs to the pipeline call, in this case for text-generation. #28362 fixed this, so closing.
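
If #28362 works the way the comment suggests, call-time usage for text-generation would look roughly like this (a sketch; that truncation is forwarded to the tokenizer after that PR is an assumption here, not verified against it):

from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2")
prompt = "Tell me a story. " * 512  # longer than the model's context window

# Assumption: truncation is forwarded to the tokenizer during preprocessing,
# while max_new_tokens controls only the generation length
result = pipe(prompt, truncation=True, max_new_tokens=20)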

@mirix commented Feb 8, 2024

The problem is that max_length (a tokenizer argument) gets confused with max_new_tokens (a generation argument).

So the pipeline complains about a duplicate argument and warns that max_new_tokens will take precedence.

Anyway, I have implemented a preprocessing function, so I have worked around the issue.
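
For reference, one possible shape for such a preprocessing step (a sketch; the truncate_prompt helper, the gpt2 model, and the 512-token limit are illustrative, not from the thread):

from transformers import AutoTokenizer, pipeline

model_name = "gpt2"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
pipe = pipeline("text-generation", model=model_name, tokenizer=tokenizer)

def truncate_prompt(text, max_length=512):
    # Tokenize with truncation, then decode back to a plain string the pipeline can consume
    ids = tokenizer(text, truncation=True, max_length=max_length)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

result = pipe(truncate_prompt("Tell me a story. " * 512), max_new_tokens=20)

This keeps the tokenizer-side max_length out of the pipeline call entirely, so it can no longer collide with max_new_tokens.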
