Passing tokenizer call kwargs (like truncation) in pipeline #25994
Comments
Indeed it's not implemented: The name would be …
Hello! I'm going to take a crack at this one if that's cool.
Have at it! Would be great to have this implemented. @nmcahill
Need the same functionality in the "text-generation" pipeline. Would like to take a go!
Any solution for this?
I am pretty sure the solution is to add the kwargs to …
The problem is that max_length (a tokenizer argument) gets confused with max_new_tokens (a generation argument), so the pipeline complains about a duplicated argument and warns that max_new_tokens will take precedence. Anyway, I have implemented a preprocessing function, so I've worked around the issue.
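The preprocessing workaround mentioned above isn't shown in the thread; a minimal sketch of how it might look for text-generation follows. The model name and the truncate_prompt helper are illustrative (not from the thread) — the idea is simply to truncate with the tokenizer before the text ever reaches the pipeline, so max_length never collides with the pipeline's generation arguments.

```python
from transformers import AutoTokenizer, pipeline

# Tiny test checkpoint, used purely for illustration; any causal LM works.
model_name = "sshleifer/tiny-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline("text-generation", model=model_name, tokenizer=tokenizer)

def truncate_prompt(text, max_tokens=512):
    """Truncate the prompt with the tokenizer *before* it reaches the
    pipeline, so max_length never clashes with max_new_tokens."""
    ids = tokenizer(text, truncation=True, max_length=max_tokens)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

long_prompt = "word " * 5000  # far longer than the model's context window
out = generator(truncate_prompt(long_prompt, max_tokens=64), max_new_tokens=10)
print(out[0]["generated_text"])
```

The cost of this approach is one extra encode/decode round trip per input, but it works with any pipeline and any transformers version.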
System Info

transformers version: 4.32.1

Who can help?
@ArthurZucker for the tokenizers and @Narsil for the pipeline.
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I am trying to figure out how I can truncate text in a pipeline (without explicitly writing a preprocessing step for my data). I've looked at the documentation and searched the net. A lot of people seem to be asking this question (both on forums and Stack Overflow), but none of the solutions I could find work anymore. Below I try a number of them, but none work. How can I enable truncation in the pipeline?
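For context, support for call-time tokenizer kwargs varies by pipeline: the text-classification pipeline forwards unrecognised call kwargs to its tokenizer, so the following pattern works there (the checkpoint name is the stock SST-2 sentiment model, chosen for illustration), even though other pipelines such as text-generation do not accept the same kwargs.

```python
from transformers import pipeline

# Stock SST-2 sentiment checkpoint, used purely for illustration.
clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# The text-classification pipeline forwards unrecognised call kwargs to the
# tokenizer, so truncation can be requested per call; without it, an input
# longer than the model's 512-token limit would raise an error.
long_text = "this movie was great " * 500
result = clf(long_text, truncation=True, max_length=128)
print(result)  # -> list of {"label": ..., "score": ...} dicts
```

Whether a given pipeline accepts these kwargs depends on its _sanitize_parameters implementation, which is exactly the inconsistency this issue is about.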
Expected behavior
A fix if this is currently not implemented or broken, but definitely also a documentation upgrade to clarify how tokenizer kwargs should be passed to a pipeline - both init and call kwargs!