mBART support for run_summarization.py #15125

banda-larga · 2022-01-12T14:30:19Z

Adds support for multilingual tokenizers and mBART to run_summarization.py.

Addresses: run_summarization.py not working with mBART.

sgugger

I don't think we should duplicate everything from the translation script, only take what is necessary for the summarization script to work with MBart.

sgugger · 2022-01-12T17:00:55Z

examples/pytorch/summarization/run_summarization.py

+    source_lang: str = field(default=None, metadata={"help": "Source language id for summarization."})
+    target_lang: str = field(default=None, metadata={"help": "Target language id for summarization."})


If the language is different for the inputs and the outputs, it is a translation task, not a summarization task, so there should only be one language argument.

Yes, that's true. Fixed.
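A minimal sketch of what a single language argument could look like, assuming the dataclass-based argument style shown in the diff above. The class name `SummarizationLangArguments` is hypothetical and the help text is illustrative; the merged script may word it differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SummarizationLangArguments:
    # Hypothetical container for illustration, not the script's real argument class.
    # One language id covers both the input text and the summary, because
    # summarization does not change the language.
    lang: Optional[str] = field(default=None, metadata={"help": "Language id for summarization."})


# Example: an mBART-style language code used for both source and target.
args = SummarizationLangArguments(lang="en_XX")
print(args.lang)
```

Keeping a single field avoids the possibility of passing two different languages, which would turn the task into translation.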

sgugger · 2022-01-12T17:01:20Z

examples/pytorch/summarization/run_summarization.py

+            "help": "The token to force as the first generated token after the :obj:`decoder_start_token_id`."
+            "Useful for multilingual models like :doc:`mBART <../model_doc/mbart>` where the first generated token "
+            "needs to be the target language token.(Usually it is the target language token)"


Use Markdown here; there is no need for `:obj:` or `:doc:`. Also, the link should be resolved.
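As a rough sketch of the requested change, the help string could drop the reStructuredText roles and use plain backticks instead. The class name `ForcedBosArguments` is hypothetical, and the documentation link is left out here rather than guessed.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ForcedBosArguments:
    # Hypothetical container for illustration; only the help text style matters here.
    forced_bos_token: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "The token to force as the first generated token after the `decoder_start_token_id`. "
                "Useful for multilingual models like mBART, where the first generated token "
                "needs to be the target language token."
            )
        },
    )


# The help text is plain Markdown, with no :obj: or :doc: roles left in it.
help_text = fields(ForcedBosArguments)[0].metadata["help"]
print(help_text)
```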

sgugger · 2022-01-12T17:02:03Z

examples/pytorch/summarization/run_summarization.py

-        # remove pairs where at least one record is None
-        inputs, targets = [], []
-        for i in range(len(examples[text_column])):
-            if examples[text_column][i] is not None and examples[summary_column][i] is not None:
-                inputs.append(examples[text_column][i])
-                targets.append(examples[summary_column][i])


Why is this removed?

Yes, my code was slightly out of date. Fixed it.

sgugger

Thanks for adapting! Can you address the last two comments and run `make style` on your branch to fix the failing quality check?

sgugger · 2022-01-12T18:46:11Z

examples/pytorch/summarization/run_summarization.py

+    # Get the language codes for input/target.
+    source_lang = data_args.lang.split("_")[0]
+    target_lang = data_args.lang.split("_")[0]


Those are not used in the rest of the example if I'm not mistaken.

stas00 · 2022-01-13T22:14:57Z

So, it appears that this PR introduced a new issue by forcing all models to include --lang. @banda-larga, would you like to fix it in a new PR and assert on a missing --lang only for the mbart model type? See: #15150 (comment)

While at it, please also integrate a better error message, as proposed in #15150.

The new PR will then also need to revert #15149.

Note to self: make sure

RUN_SLOW=1 pytest tests/deepspeed/test_model_zoo.py::TestDeepSpeedModelZoo::test_zero_to_fp32_zero2_sum_pegasus

doesn't fail, as this PR broke it. Please tag me on the new PR and I will do the checking; you don't need to figure this part out.

Bonus points for adding a new examples test that would have failed with this PR. Are we not testing run_summarization.py in the torch_examples CI?

Thanks.
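One way the check proposed above could look, as a sketch only: require --lang when the model type is multilingual and let every other model run without it. The helper `check_lang_argument` and the set `MULTILINGUAL_MODEL_TYPES` are hypothetical names, and the real script may key the check on something else, such as the tokenizer class.

```python
from typing import Optional

# Assumption: model types that need a language code. Illustrative, not exhaustive.
MULTILINGUAL_MODEL_TYPES = {"mbart"}


def check_lang_argument(model_type: str, lang: Optional[str]) -> None:
    """Raise a descriptive error only when a multilingual model lacks a language id."""
    if model_type in MULTILINGUAL_MODEL_TYPES and lang is None:
        raise ValueError(
            f"Model type '{model_type}' is multilingual and requires the --lang argument "
            "(for example --lang en_XX)."
        )


# Monolingual models are unaffected by a missing --lang.
check_lang_argument("pegasus", None)
# Multilingual models pass once a language id is supplied.
check_lang_argument("mbart", "en_XX")
```

With this shape, existing commands for models such as Pegasus keep working, and only mBART users see the error message.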

sgugger · 2022-01-13T22:32:50Z

This is a tiny bit urgent, so I'm not waiting to remove the two lines that break every existing command using run_summarization. I fixed that in this commit.

stas00 · 2022-01-13T22:45:31Z

OK, so nothing else needs to be done. @banda-larga, please ignore my comments above.

Update run_summarization.py

7fc4e44

LysandreJik requested review from patil-suraj and sgugger January 12, 2022 16:59

sgugger reviewed Jan 12, 2022

Fixed languages and added missing code

bd305a6

banda-larga changed the title ~~mBART support run_summarization.py~~ mBART support for run_summarization.py Jan 12, 2022

sgugger approved these changes Jan 12, 2022

banda-larga and others added 2 commits January 12, 2022 20:06

fixed obj, docs, removed source_lang and target_lang

910a111

make style, run_summarization.py reformatted

f06d925

banda-larga force-pushed the master branch from df75ed9 to f06d925 Compare January 12, 2022 21:29

sgugger merged commit 9a94bb8 into huggingface:master Jan 12, 2022

patil-suraj mentioned this pull request Jan 13, 2022

The pytorch example summarization/run_summarization.py do not work with MBart #14360

Closed

4 tasks

This was referenced Jan 13, 2022

[deepspeed tests] fix summarization #15149

Merged

[summarization example] better error message #15150

Closed
