-
Notifications
You must be signed in to change notification settings - Fork 27.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mBART support for run_summarization.py #15125
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should duplicate everything from the translation script, only take what is necessary for the summarization script to work with MBart.
source_lang: str = field(default=None, metadata={"help": "Source language id for summarization."}) | ||
target_lang: str = field(default=None, metadata={"help": "Target language id for summarization."}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the language is different for the inputs and the outputs it is a translation task, not a summarization task, so there should only be one language argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's true, fixed
"help": "The token to force as the first generated token after the :obj:`decoder_start_token_id`." | ||
"Useful for multilingual models like :doc:`mBART <../model_doc/mbart>` where the first generated token " | ||
"needs to be the target language token.(Usually it is the target language token)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use MarkDown here, no need for obj or docs. Also the link should be resolved.
# remove pairs where at least one record is None | ||
inputs, targets = [], [] | ||
for i in range(len(examples[text_column])): | ||
if examples[text_column][i] is not None and examples[summary_column][i] is not None: | ||
inputs.append(examples[text_column][i]) | ||
targets.append(examples[summary_column][i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes my code was slightly old, fixed it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adapting! Can you address the last two comments and run make style
on your branch to get rid of the quality check?
# Get the language codes for input/target. | ||
source_lang = data_args.lang.split("_")[0] | ||
target_lang = data_args.lang.split("_")[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are not used in the rest of the example if I'm not mistaken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
So, it appears that this PR introduced a new issue by forcing all models to include and while at it to integrate a better error message as proposed here #15150 will then also need to revert this #15149 as part of the new PR. and note to self: make sure
doesn't fail as this PR broke it. please tag me to the new PR and I will do the checking - you don't need to figure this part out. Bonus points: adding a new examples test that should have failed with this PR - are we not testing Thanks. |
This is a tiny bit urgent, so not waiting to remove the two lines that break every existing command using |
ok, so nothing else needs to be done. @banda-larga please ignore my comments above. |
Added support for multilingual tokenizer and mBART for run_summarization.py
-> run_summarization.py not working with mBART