Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mBART support for run_summarization.py #15125

Merged
merged 4 commits into from
Jan 12, 2022
Merged

Conversation

banda-larga
Copy link
Contributor

@banda-larga banda-larga commented Jan 12, 2022

Added support for multilingual tokenizer and mBART for run_summarization.py

-> run_summarization.py not working with mBART

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should duplicate everything from the translation script, only take what is necessary for the summarization script to work with MBart.

Comment on lines 123 to 124
source_lang: str = field(default=None, metadata={"help": "Source language id for summarization."})
target_lang: str = field(default=None, metadata={"help": "Target language id for summarization."})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the language is different for the inputs and the outputs it is a translation task, not a summarization task, so there should only be one language argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's true, fixed

Comment on lines 235 to 237
"help": "The token to force as the first generated token after the :obj:`decoder_start_token_id`."
"Useful for multilingual models like :doc:`mBART <../model_doc/mbart>` where the first generated token "
"needs to be the target language token.(Usually it is the target language token)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use MarkDown here, no need for obj or docs. Also the link should be resolved.

Comment on lines 440 to 445
# remove pairs where at least one record is None
inputs, targets = [], []
for i in range(len(examples[text_column])):
if examples[text_column][i] is not None and examples[summary_column][i] is not None:
inputs.append(examples[text_column][i])
targets.append(examples[summary_column][i])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes my code was slightly old, fixed it

@banda-larga banda-larga changed the title mBART support run_summarization.py mBART support for run_summarization.py Jan 12, 2022
Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adapting! Can you address the last two comments and run make style on your branch to get rid of the quality check?

Comment on lines 450 to 452
# Get the language codes for input/target.
source_lang = data_args.lang.split("_")[0]
target_lang = data_args.lang.split("_")[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are not used in the rest of the example if I'm not mistaken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@stas00
Copy link
Contributor

stas00 commented Jan 13, 2022

So, it appears that this PR introduced a new issue by forcing all models to include --lang, @banda-larga would you like to fix it in a new PR and assert on lack of --lang only with mbart model type? see: #15150 (comment)

and while at it to integrate a better error message as proposed here #15150

will then also need to revert this #15149 as part of the new PR.

and note to self: make sure

RUN_SLOW=1 pytest tests/deepspeed/test_model_zoo.py::TestDeepSpeedModelZoo::test_zero_to_fp32_zero2_sum_pegasus

doesn't fail as this PR broke it. please tag me to the new PR and I will do the checking - you don't need to figure this part out.

Bonus points: adding a new examples test that should have failed with this PR - are we not testing run_summarization.py in torch_examples CI?

Thanks.

@sgugger
Copy link
Collaborator

sgugger commented Jan 13, 2022

This is a tiny bit urgent, so not waiting to remove the two lines that break every existing command using run_summarization. I fixed that in this commit.

@stas00
Copy link
Contributor

stas00 commented Jan 13, 2022

ok, so nothing else needs to be done. @banda-larga please ignore my comments above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants