Add tokenizer kwargs to fill mask pipeline. #26234
Conversation
LGTM. @ArthurZucker
Good contribution! Thanks 🚀 Just make sure to run make fix-copies and make style.
Thanks. Okay, I will look into that. First time contributing to HF.
Try to install the styling package with …
Replace single tick with double
Anything else you can think of? I can't seem to get the setup and quality checks to pass...
Hey @nmcahill, the recommended tool to run here is … I've taken the liberty to add code wrappers around your example and run …
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks @LysandreJik!
The test in transformers/tests/pipelines/test_pipelines_fill_mask.py, lines 219 to 229 at commit 3e68944:
The results we get are shown at the end. @nmcahill Could you check if the …? Thank you in advance.
FYI: both …
What's strange is that I expect to see … I know these tests passed when my PR was approved. I will need to get some time at night to dig into this...
It might simply be that the terminal output is not showing the word … Other than that oddity, though, the fact that it is returning scores at all instead of failing on long input text is the true point of this unit test. I am not sure why the tokens and scores have changed since I tested this locally, but I'm tempted to change the unit test to check for results at all rather than checking for a particular set of tokens/scores. Would that work for everyone?
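If the test were loosened that way, it might look roughly like the following sketch. This is a hypothetical variant, not the actual test in test_pipelines_fill_mask.py; the tiny checkpoint name and input text are placeholders.

```python
from transformers import pipeline


def check_fill_mask_with_truncation():
    # Hypothetical looser check: only assert that a very long input
    # returns predictions at all when truncation is requested, rather
    # than comparing against an exact set of tokens and scores.
    fill_masker = pipeline(task="fill-mask", model="sshleifer/tiny-distilroberta-base")
    long_text = "My name is <mask> and I live in " + "New York. " * 500

    outputs = fill_masker(long_text, tokenizer_kwargs={"truncation": True})

    assert isinstance(outputs, list) and len(outputs) > 0
    for prediction in outputs:
        assert "score" in prediction and "sequence" in prediction
```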
Hi @nmcahill Thank you for the response 🤗 This might be the hardware difference, we use GPU T4. As long as sequence is the correct one (i.e. being truncated here), we can adjust the values in other fields. My main concern here is that it looks like we pass tokenizer_kwargs={"truncation": True} but it doesn't seem to have the effect in this test. Take your time on this, but if you could not allocate some time in the following weeks, let me know 🙏
So the behavior, if truncation is set to False and the input string is very long, would be that model.forward throws an error. The truncation happens between the tokenizer and the model, so the bit that is actually truncated is the vector called "input_ids" in the model inputs; the input string never gets truncated, so there is no need to check that visually.
To prove to yourself that truncation=True works, try setting it to False and seeing if model.forward fails.
If it doesn't fail with truncation=False then I'll definitely try fixing it. But as far as I can tell, I think this is probably working as expected.
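To illustrate the point above, here is a small sketch showing that truncation acts on the tokenized input_ids rather than on the input string, and that skipping truncation is what makes model.forward fail on over-long inputs. The checkpoint name and repetition count are illustrative assumptions, not taken from the PR.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

# An input string far longer than the model's 512-token limit.
long_text = "My name is <mask> and I live in " + "New York. " * 500

# With truncation=True the tokenizer clips input_ids to model_max_length;
# the original string itself is never modified.
inputs = tokenizer(long_text, truncation=True, return_tensors="pt")
print(inputs["input_ids"].shape)  # e.g. torch.Size([1, 512])
model(**inputs)  # runs fine

# With truncation=False the full sequence is kept, and model.forward
# fails because it exceeds the maximum position embeddings.
untruncated = tokenizer(long_text, truncation=False, return_tensors="pt")
print(untruncated["input_ids"].shape)  # far longer than 512
# model(**untruncated)  # raises an IndexError in the position embeddings
```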
OK, got it. I was confused as I saw you added the expected output as …
This PR addresses #25994 by adding tokenizer_kwargs as an input preprocessing parameter to the fill-mask pipeline.
Attn: @BramVanroy @Narsil.
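For context, a minimal sketch of how the new parameter is meant to be used once this lands; the checkpoint name and input text are illustrative, not taken from the PR.

```python
from transformers import pipeline

# Build a fill-mask pipeline; the checkpoint here is only an example.
fill_masker = pipeline(task="fill-mask", model="distilroberta-base")

# Without truncation, an over-long input would make model.forward fail.
# tokenizer_kwargs forwards extra arguments to the tokenizer at preprocessing time.
long_text = "My name is <mask> and I live in " + "New York. " * 500
outputs = fill_masker(long_text, tokenizer_kwargs={"truncation": True})

for prediction in outputs:
    print(prediction["token_str"], prediction["score"])
```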