Add tokenizer kwargs to fill mask pipeline. #26234
Conversation
LGTM. @ArthurZucker
Good contribution! Thanks 🚀 Just make sure to run make fix-copies and make style.
Thanks. Okay, I will look into that. First time contributing to HF.
Try to install the styling package with …
Replace single tick with double
Anything else you can think of? I can't seem to get the setup and quality checks to pass...
Hey @nmcahill, the recommended tool to run here is … I've taken the liberty to add code wrappers around your example and run …
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks @LysandreJik!
The test in transformers/tests/pipelines/test_pipelines_fill_mask.py, lines 219 to 229 at commit 3e68944:
The results we get are shown at the end. @nmcahill Could you check if the …? Thank you in advance.
FYI: both …
What's strange is that I expect to see … I know these tests passed when my PR was approved. I will need to get some time at night to dig into this...
It might simply be that the terminal output is not showing the word … Other than that oddity, though, the fact that it is returning scores at all instead of failing on long input text is the true point of this unit test. I am not sure why the tokens and scores have changed since I tested this locally, but I'm tempted to change the unit test to check for results at all rather than checking for a particular set of tokens/scores. Would that work for everyone?
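If the test were loosened that way, it might look roughly like the following sketch. This is a hypothetical variant, not the actual test in test_pipelines_fill_mask.py; the tiny checkpoint name and input text are placeholders.

```python
from transformers import pipeline


def check_fill_mask_with_truncation():
    # Hypothetical looser check: only assert that a very long input
    # returns predictions at all when truncation is requested, rather
    # than comparing against an exact set of tokens and scores.
    fill_masker = pipeline(task="fill-mask", model="sshleifer/tiny-distilroberta-base")
    long_text = "My name is <mask> and I live in " + "New York. " * 500

    outputs = fill_masker(long_text, tokenizer_kwargs={"truncation": True})

    assert isinstance(outputs, list) and len(outputs) > 0
    for prediction in outputs:
        assert "score" in prediction and "sequence" in prediction
```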
Hi @nmcahill Thank you for the response 🤗 This might be the hardware difference, we use GPU T4. As long as sequence is the correct one (i.e. being truncated here), we can adjust the values in other fields. My main concern here is that it looks like we pass tokenizer_kwargs={"truncation": True} but it doesn't seem to have the effect in this test. Take your time on this, but if you could not allocate some time in the following weeks, let me know 🙏
So the behavior, if truncation is set to False and the input string is very long, would be that model.forward throws an error. The truncation happens between the tokenizer and the model, so the bit that is actually truncated is the vector called "input_ids" in the model inputs; the input string never gets truncated, so there is no need to check that visually.
To prove to yourself that truncation=True works, try setting it to False and seeing if model.forward fails.
If it doesn't fail with truncation=False then I'll definitely try fixing it. But as far as I can tell, I think this is probably working as expected.
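To illustrate the point above, here is a small sketch showing that truncation acts on the tokenized input_ids rather than on the input string, and that skipping truncation is what makes model.forward fail on over-long inputs. The checkpoint name and repetition count are illustrative assumptions, not taken from the PR.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

# An input string far longer than the model's 512-token limit.
long_text = "My name is <mask> and I live in " + "New York. " * 500

# With truncation=True the tokenizer clips input_ids to model_max_length;
# the original string itself is never modified.
inputs = tokenizer(long_text, truncation=True, return_tensors="pt")
print(inputs["input_ids"].shape)  # e.g. torch.Size([1, 512])
model(**inputs)  # runs fine

# With truncation=False the full sequence is kept, and model.forward
# fails because it exceeds the maximum position embeddings.
untruncated = tokenizer(long_text, truncation=False, return_tensors="pt")
print(untruncated["input_ids"].shape)  # far longer than 512
# model(**untruncated)  # raises an IndexError in the position embeddings
```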
OK, got it. I was confused as I saw you added the expected output as …
This PR addresses #25994 by adding tokenizer_kwargs as an input preprocessing parameter to the fill-mask pipeline.
Attn: @BramVanroy @Narsil.
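For context, a minimal sketch of how the new parameter is meant to be used once this lands; the checkpoint name and input text are illustrative, not taken from the PR.

```python
from transformers import pipeline

# Build a fill-mask pipeline; the checkpoint here is only an example.
fill_masker = pipeline(task="fill-mask", model="distilroberta-base")

# Without truncation, an over-long input would make model.forward fail.
# tokenizer_kwargs forwards extra arguments to the tokenizer at preprocessing time.
long_text = "My name is <mask> and I live in " + "New York. " * 500
outputs = fill_masker(long_text, tokenizer_kwargs={"truncation": True})

for prediction in outputs:
    print(prediction["token_str"], prediction["score"])
```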