[PyTorch Bart] Split Bart into different models #9343
Conversation
Great job splitting the four models! Given that the slow tests pass, this LGTM once the modifications we discussed regarding the small blenderbot model are applied.
Fantastic job 🎉
Exceptional work! I have a few nits, which, from the way they are duplicated for each model, sound like issues in the templates. If you could fix the templates as you fix the nits, that would be truly amazing.
"BlenderbotSmallEncoder", # Building part of bigger (tested) model. | ||
"BlenderbotSmallDecoder", # Building part of bigger (tested) model. | ||
"BlenderbotEncoder", # Building part of bigger (tested) model. | ||
"BlenderbotDecoder", # Building part of bigger (tested) model. | ||
"MBartEncoder", # Building part of bigger (tested) model. | ||
"MBartDecoder", # Building part of bigger (tested) model. | ||
"PegasusEncoder", # Building part of bigger (tested) model. | ||
"PegasusDecoder", # Building part of bigger (tested) model. |
As a follow-up PR for me, all Encoder and Decoder models should be ignored for the checks of tested and auto-configured models.
One important comment I forgot to add to my review: I don't think we should adapt the …
* first try * remove old template * finish bart * finish mbart * delete unnecessary line * init pegasus * save intermediate * correct pegasus * finish pegasus * remove cookie cutter leftover * add marian * finish blenderbot * replace in file * correctly split blenderbot * delete "old" folder * correct "add statement" * adapt config for tf comp * correct configs for tf * remove ipdb * fix more stuff * fix mbart * push pegasus fix * fix mbart * more fixes * fix research projects code * finish docs for bart, mbart, and marian * delete unnecessary file * correct attn typo * correct configs * remove pegasus for seq class * correct peg docs * correct peg docs * finish configs * further improve docs * add copied from statements to mbart * fix copied from in mbart * add copy statements to marian * add copied from to marian * add pegasus copied from * finish pegasus * finish copied from * Apply suggestions from code review * make style * backward comp blenderbot * apply lysandres and sylvains suggestions * apply suggestions * push last fixes * fix docs * fix tok tests * fix imports code style * fix doc
What does this PR do?
This PR splits all Bart-like models into their own respective classes for PyTorch models only. This is more in line with the general philosophy of the library to have self-contained model files.
As discussed with @jplu, the TF models will be separated in a future PR as there are still some issues and improvements (TF serving) blocking the separation - see #9313.
In short, after this PR the following "model-specific" config parameters are removed from all Bart-like configs:
extra_pos_embeddings
normalize_embedding
add_final_layer_norm
normalize_before
do_blenderbot_90_layernorm
static_position_embeddings
add_bias_logits
force_bos_token_to_be_generated (this one has to be kept for Bart, though)

Each "bart" model (Pegasus, Bart, MBart, Marian, Blenderbot, BlenderbotSmall) will get its own modeling_....py file. At the moment the models have the following configurations:
[Table from the original PR: for each model — bart, mbart, marian, pegasus, blenderbot90M (BlenderbotSmall), blenderbot3B + rest (Blenderbot) — the settings of extra_pos_embeddings, normalize_before, add_final_layer_norm, do_blenderbot_90_layernorm, normalize_embedding, static_position_embeddings, add_bias_logits, and force_bos_token_to_be_generated.]
We can see that add_bias_logits is actually never used, so I think the best option is to just delete that functionality. Also, one can see that no two models have exactly the same usage of the above params, so we'll make 6 different modeling_....py files (see the sketch below).
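For illustration, a minimal sketch of what this looks like from a user's perspective after the split (module paths follow the repository's models/ layout; class names are the library's public ones):

```python
# One self-contained modeling file per model, e.g.
#   src/transformers/models/bart/modeling_bart.py
#   src/transformers/models/mbart/modeling_mbart.py
#   src/transformers/models/marian/modeling_marian.py
#   src/transformers/models/pegasus/modeling_pegasus.py
#   src/transformers/models/blenderbot/modeling_blenderbot.py
#   src/transformers/models/blenderbot_small/modeling_blenderbot_small.py
from transformers import (
    BartForConditionalGeneration,
    BlenderbotForConditionalGeneration,
    BlenderbotSmallForConditionalGeneration,
    MarianMTModel,
    MBartForConditionalGeneration,
    PegasusForConditionalGeneration,
)
```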
Resulting Improvements:

The model files are much more readable and should be much easier for the user to navigate. There are no more confusing config parameters, such as normalize_before, where the user doesn't know what to set anyway.

All Bart-like features that are irrelevant for the other models are removed. Those features are a) never mentioned in the respective papers and b) don't make any sense since the models weren't trained with them, so using them leads to nonsensical outputs. E.g. Marian was never supposed to be a "mask-filling" model, yet it has "mask-filling" functionality when doing something like the sketch below:
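A rough sketch of the kind of call that was possible (the checkpoint name is just an example of a Marian translation model; the snippet is illustrative, not taken from the PR):

```python
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")

input_ids = tokenizer(["My friends are <mask> but they eat too many carbs."], return_tensors="pt").input_ids

# Before this PR the missing decoder_input_ids were silently created by shifting input_ids,
# so this Bart-style "mask filling" ran on a pure translation model and produced nonsense.
# After this PR, Marian no longer auto-creates decoder_input_ids for this use case.
outputs = model(input_ids=input_ids)
```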
The big gain here is that users are better guided on how to use the model and wonder less about whether the model is used correctly and whether there is a bug in the model.

Docstrings are improved with more model-specific examples and fewer comparisons to Bart. E.g. Pegasus, Marian, and Blenderbot never really mention BART in their papers and have no direct relation to BART IMO, so these models should not be compared to BART in the docs; it's confusing for the user.

Some smaller improvements: memory usage is slightly reduced for beam search, and gradient checkpointing is added (a short sketch follows below).

All previous tests are copied, and some additional tests are added for each model.
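A hedged sketch of enabling the new gradient checkpointing support; gradient_checkpointing_enable() assumes a reasonably recent transformers version, since the way checkpointing is switched on has changed across releases:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.gradient_checkpointing_enable()  # recompute activations during backward to save memory
model.train()

batch = tokenizer(["Gradient checkpointing trades compute for memory."], return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
```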
Possible drawback
Breaking changes
🚨🚨 Important: We cannot keep 100% backward compatibility here or the PR won't make much sense 🚨🚨
Since all models were packed into a single model file, a lot of different model designs are possible at the moment. E.g. Pegasus was only ever used with sinusoidal position embeddings (as mentioned in the paper), but since it's merged into modeling_bart.py, one could theoretically use Pegasus with learned position embeddings. This is not done in any config on the model hub, however, and will not be possible anymore after the PR. Also, Marian's model design never normalized the word embeddings, but this is possible with the current design. Again, no config on the model hub does that, so this will also not be possible anymore after the PR. In short: all model designs that were never foreseen in the original model and that are never used on the model hub at the moment won't be allowed anymore after the PR. If we did not make this change, we would have to keep all those normalize_before configs, which in turn would mean that the modeling code of all Bart-like models would be the same again.
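As a small illustration of this (a sketch, not part of the PR): after the split, the position-embedding type is no longer a configurable switch for Pegasus.

```python
from transformers import PegasusConfig

config = PegasusConfig()
# The old Bart-era switches (static_position_embeddings, normalize_before, ...) are gone;
# the Pegasus design (sinusoidal positions, pre-layernorm) is fixed in modeling_pegasus.py.
print("static_position_embeddings" in config.to_dict())  # False
```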
Blenderbot needs to be divided into two models IMO. Blenderbot 90M not only has a very different architecture (see the table above), but also uses a different tokenizer. I created a new BlenderbotSmallModel class. Thus I need to update one Blenderbot config online, changing its class. This means that from this PR onward the following is not supported anymore:
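Sketch of the call in question (the 90M checkpoint name is the one used on the hub at the time):

```python
from transformers import BlenderbotForConditionalGeneration, BlenderbotSmallForConditionalGeneration

# No longer supported after this PR - the 90M checkpoint is not a "big" Blenderbot model:
model = BlenderbotForConditionalGeneration.from_pretrained("facebook/blenderbot-90M")

# Instead, the 90M checkpoint belongs to the new BlenderbotSmall classes:
model = BlenderbotSmallForConditionalGeneration.from_pretrained("facebook/blenderbot-90M")
```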
That's a big breaking change, but I don't see another way. If we keep the small Blenderbot in the "normal" Blenderbot, we have to keep config params such as normalize_before, which I really don't want to do... I think the best option here is to add a warning (or even an error) by overriding from_pretrained(...) in BlenderbotForConditionalGeneration, so that loading the 90M config with the "big" Blenderbot class (as in the sketch above) throws an error or gives a warning. There are no fine-tuned Blenderbot models on the hub and this is the only config. I think it's the right approach to separate the model here.
Barthez has essentially an mbart architecture, but has bart defined as its model_type in the configs. Here I'd also like to change the configs online to make sure the correct model is loaded when using AutoModelForSeq2SeqLM. I should also contact the author here.
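For context, a small sketch of why the model_type matters for the auto classes (the repo name is only an example of a Barthez checkpoint and is not part of this PR):

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# AutoModelForSeq2SeqLM resolves the concrete class from config.model_type, so a Barthez
# config declaring "bart" is loaded with the Bart classes instead of the MBart ones.
config = AutoConfig.from_pretrained("moussaKam/barthez")
print(config.model_type)  # should read "mbart" once the configs are fixed online
model = AutoModelForSeq2SeqLM.from_pretrained("moussaKam/barthez")
print(type(model).__name__)  # MBartForConditionalGeneration with the corrected model_type
```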
Bart allowed decoder_input_ids to be created automatically by shifting the input_ids to the right. Thus, in Bart one can do the following:
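(Sketch with an illustrative checkpoint; the point is that no decoder_input_ids are passed.)

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

text = "My friends are <mask> but they eat too many carbs."
input_ids = tokenizer([text], return_tensors="pt").input_ids

# No decoder_input_ids are given: Bart builds them internally by shifting input_ids to the
# right, which matches the denoising objective Bart and MBart were pre-trained with.
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
print(tokenizer.decode(probs.topk(5).indices))
```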
This is a very special case and should only be used for Bart-like denoising pre-training or mask-filling. The only models that were trained in this fashion, and thus can do mask-filling, are Bart and MBart. All other models cannot do mask-filling, so decoder_input_ids should never be created by shifting input_ids. => This feature is therefore removed from Pegasus, Marian, Blenderbot, and BlenderbotSmall.

Those are all breaking changes. Blenderbot is the big one; the others should be fine. To be sure, I wrote some scripts that verify that no model on the model hub whose name contains one of the keywords bart, mbart, pegasus, blenderbot, opus-mt, barthez has incorrect/unexpected parameter settings after the PR.

TODO:
Change the Barthez configs online to use model_type mbart instead of bart.

Future TODO: