
Add tokenizer kwargs to fill mask pipeline. #26234

Merged: 10 commits into huggingface:main on Oct 3, 2023

Conversation

@nmcahill (Contributor):

This PR addresses #25994 by adding `tokenizer_kwargs` as an input preprocessing parameter to the fill-mask pipeline.

Attn: @BramVanroy @Narsil.
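
For context, a minimal sketch of the usage this change enables; the checkpoint name is assumed for illustration, and the call pattern mirrors the test added in this PR:

```python
from transformers import pipeline

# Any fill-mask checkpoint works; distilroberta-base is assumed here for illustration.
unmasker = pipeline("fill-mask", model="distilroberta-base")

# Without truncation, an input longer than the model's maximum length would error out;
# tokenizer_kwargs forwards truncation=True to the tokenizer during preprocessing.
long_text = "My name is <mask>" + "Lorem ipsum dolor sit amet, consectetur adipiscing elit," * 100
outputs = unmasker(long_text, tokenizer_kwargs={"truncation": True})
```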

@Narsil (Contributor) left a review comment.

@ArthurZucker (Collaborator) left a review comment:

Good contribution! Thanks 🚀 Just make sure to run `make fix-copies` and `make style`.

@nmcahill (Contributor Author):

Thanks. Okay, I will look into that. First time contributing to HF.

@ArthurZucker (Collaborator):

Try installing the styling package with `pip install -U "transformers[quality]"` 😉

@nmcahill (Contributor Author) commented Oct 2, 2023:

Anything else you can think of? I can't seem to get the setup and quality checks to pass...

@LysandreJik (Member):

Hey @nmcahill, the recommended tool to run here is `make fixup`, which takes care of everything under the hood and does so quite fast.

I've taken the liberty of adding code wrappers around your example and running `make fixup` on your PR directly so that we can merge this PR and include it in today's release. Thank you for your contribution!

@HuggingFaceDocBuilderDev:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@LysandreJik merged commit b5ca8fc into huggingface:main on Oct 3, 2023.
@nmcahill (Contributor Author) commented Oct 3, 2023:

Thanks @LysandreJik!

@ydshieh (Collaborator) commented Dec 5, 2023:

The test `tests/pipelines/test_pipelines_fill_mask.py::FillMaskPipelineTests::test_large_model_pt` started failing after this PR was merged. It is the new test block added in this PR:

```python
outputs = unmasker(
    "My name is <mask>" + "Lorem ipsum dolor sit amet, consectetur adipiscing elit," * 100,
    tokenizer_kwargs={"truncation": True},
)
self.assertEqual(
    nested_simplify(outputs, decimals=6),
    [
        {"sequence": "My name is grouped", "score": 2.2e-05, "token": 38015, "token_str": " grouped"},
        {"sequence": "My name is accuser", "score": 2.1e-05, "token": 25506, "token_str": " accuser"},
    ],
)
```

The results we get are shown at the end.

@nmcahill Could you check whether `tokenizer_kwargs={"truncation": True}` does its job here?

Thank you in advance.

```
(Pdb) nested_simplify(outputs, decimals=6),
([{'score': 0.281868, 'token': 6, 'token_str': ',',
   'sequence': 'My name is,Lorem ipsum dolor sit amet, consectetur adipiscing elit, [... the clause repeats un-truncated ...] Lorem'},
  {'score': 0.095431, 'token': 46686, 'token_str': ':,',
   'sequence': 'My name is:,Lorem ipsum dolor sit amet, consectetur adipiscing elit, [... the clause repeats un-truncated ...] Lorem'}],)
```
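
For reference, a quick way one might verify whether tokenizer-level truncation takes effect outside the pipeline; the checkpoint name below is an assumption for illustration, not necessarily the one used by the failing test:

```python
from transformers import AutoTokenizer

# Checkpoint assumed for illustration; substitute the one the failing test loads.
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

long_text = "My name is <mask>" + "Lorem ipsum dolor sit amet, consectetur adipiscing elit," * 100
encoded = tokenizer(long_text, truncation=True)

# With truncation applied, the encoded length is capped at the model's maximum.
print(len(encoded["input_ids"]), tokenizer.model_max_length)
assert len(encoded["input_ids"]) <= tokenizer.model_max_length
```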

@ydshieh (Collaborator) commented Dec 5, 2023:

FYI: both `test_large_model_pt` and `test_large_model_tf` are failing.

@nmcahill (Contributor Author) commented Dec 5, 2023:

What's strange is that I expect to see `"My name is <mask>Lorem ipsum dolor sit amet,..."`, which would be the expected output of `"My name is <mask>" + "Lorem ipsum dolor sit amet, consectetur adipiscing elit," * 100`, but instead I see `"My name is,Lorem ipsum dolor sit amet,"`.

I know these tests passed when my PR was approved. I will need to get some time at night to dig into this...

@nmcahill (Contributor Author) commented Dec 5, 2023:

It might simply be that the terminal output is not showing the word `<mask>`, though.

Other than that oddity, the fact that the pipeline returns scores at all, instead of failing on long input text, is the true point of this unit test... I am not sure why the tokens and scores have changed since I tested this locally, but I'm tempted to change the unit test to check that results come back at all rather than checking for a particular set of tokens and scores (see the sketch below). Would that work for everyone?
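
Something like the following relaxed check, as a sketch only; it assumes the default fill-mask `top_k` of 5, so five candidates should come back:

```python
outputs = unmasker(
    "My name is <mask>" + "Lorem ipsum dolor sit amet, consectetur adipiscing elit," * 100,
    tokenizer_kwargs={"truncation": True},
)
# Assert that results come back at all, with the expected structure,
# instead of pinning exact tokens and scores.
self.assertEqual(len(outputs), 5)
for output in outputs:
    self.assertEqual(set(output.keys()), {"sequence", "score", "token", "token_str"})
```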

@ydshieh (Collaborator) commented Dec 6, 2023:

> I am not sure why the tokens and scores have changed

Hi @nmcahill, thank you for the response 🤗

This might be due to a hardware difference; we use a T4 GPU. As long as `sequence` is the correct one (i.e., truncated here), we can adjust the values in the other fields.

My main concern here is that it looks like we pass `tokenizer_kwargs={"truncation": True}`, but it doesn't seem to have any effect in this test.

Take your time on this, but if you cannot allocate some time in the coming weeks, let me know 🙏

@nmcahill (Contributor Author) commented Dec 6, 2023 via email.

@ydshieh (Collaborator) commented Dec 6, 2023:

OK, got it. I was confused because I saw you added the expected output as `"sequence": "My name is grouped"`, which led me to think the sequence was truncated. But this is not the case, as you mentioned above.
