from_pretrained: check that the pretrained model is for the right model architecture #10586
Conversation
Hi @vimarshc, thank you for opening this PR! Could you:
Awesome! Would you like to attempt to add a test for this check? We need to use tiny models so that it's fast - I made suggestions here. If you're not sure how to do it, please let me know and I will add a test.
Hi @stas00, Regardless, I shall try to add the test for the assertion I've already added, along with the changes mentioned by @LysandreJik, in the next 24 hours.
OK, so your change works for the model and for the config - roughly as in the sketch below (the tiny checkpoint identifiers are assumptions for illustration, not the exact snippets):
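```python
# A reconstruction sketch of the behavior being verified; the checkpoint
# identifiers are assumptions, standing in for tiny test models.
from transformers import BertConfig, BertModel

TINY_BERT = "prajjwal1/bert-tiny"            # assumed tiny BERT checkpoint
TINY_T5 = "patrickvonplaten/t5-tiny-random"  # assumed tiny T5 checkpoint

# Matching architecture: both of these succeed.
model = BertModel.from_pretrained(TINY_BERT)
config = BertConfig.from_pretrained(TINY_BERT)

# Mismatched architecture: the new check raises an AssertionError.
BertModel.from_pretrained(TINY_T5)
```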
As you discovered - and I didn't know this - the tokenizer doesn't seem to need the config file, so it doesn't look like there is a way to check that the tokenizer being downloaded is of the right kind. I will ask.

And yes, it's great if you can add the test - thank you. I restyled your PR to fit our style guide.
Hi @stas00,
I'm puzzled: why did you undo my fix? If you want to restore it, it was:
Hi,
Hi, I pushed after taking a pull from master, and yet it's showing a merge conflict - not sure how that got there.
You messed up your PR branch, so this PR now contains dozens of unrelated changes. You can do a soft reset to the last good SHA, e.g. along the lines of the sketch below. Just save your newly added test code somewhere first.
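```bash
# A sketch of the recovery, not the exact commands from the thread.
# <last-good-sha> is a placeholder for the last commit that belongs in
# this PR (find it with `git log --oneline`).
git reset --soft <last-good-sha>   # rewind history, keep file changes staged
git reset                          # unstage everything
git checkout -- .                  # drop the unrelated changes (test code saved elsewhere!)
git push --force origin HEAD       # rewrite the remote PR branch
# then re-apply the saved test code and commit it again
```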
I think you picked the wrong SHA and ended up with an even worse situation.
Added test to check from_pretrained throws assert error when passed an incompatible model name
tests/test_modeling_common.py
```python
model = BertModel.from_pretrained(TINY_BERT)
self.assertIsNotNone(model)

self.assertRaises(AssertionError, BertModel.from_pretrained, TINY_T5)
```
Suggested change:
```diff
-self.assertRaises(AssertionError, BertModel.from_pretrained, TINY_T5)
+with self.assertRaises(Exception) as context:
+    BertModel.from_pretrained(TINY_T5)
+self.assertTrue("You tried to initiate a model of type" in str(context.exception))
```
Let's check the actual assert message here, just in case it asserts on something else, in which case this test would be misleading.

Just please test that it works. Thank you.
Modified assert in from_pretrained with f-strings. Modified test to ensure desired assert message is being generated
Looks very good. Thank you for bearing with my requests.

It looks like this check found some bugs in our code, so we will need to resolve those before merging. I will update you when this is done.
OK, so looking at the errors, we need to solve two issues.

Issue 1. Some models don't have the `model_type` info in their config. @vimarshc, I suppose you need to edit the code to skip this assert if we don't have the data - roughly like the sketch below. You can verify that your change works against the failing test.
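```python
# A sketch of the guard, assuming it lives inside the from_pretrained
# workflow (not the exact diff from this PR): only run the architecture
# check when both sides actually declare a model_type.
config_model_type = getattr(config, "model_type", None)
expected_model_type = getattr(cls.config_class, "model_type", None)

if config_model_type and expected_model_type:
    assert config_model_type == expected_model_type, (
        f"You tried to initiate a model of type '{expected_model_type}' "
        f"with a pretrained model of type '{config_model_type}'."
    )
```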
I looked at the config.json generated by this test, and it's far from being complete.

Issue 2. This one looks trickier:

We will ask for help with this one.
@patrickvonplaten, @patil-suraj - your help is needed here. BlenderbotSmall has an inconsistency: it declares its model type as "blenderbot-small", but the pretrained checkpoints' configs all use "blenderbot", so the new sanity check this PR is trying to add fails.
What shall we do? It's possible that that part of the config object needs to be re-designed, so that there is a top architecture/type and then perhaps sub-types?
Hi @stas00,
Looks good, @vimarshc. So we are down to one failing test:
I wonder if we could sort of cheat and do something like this (a sketch of the idea - the exact code and names are assumptions, not from the PR):
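```python
# A sketch of the "cheat": treat the checkpoint as compatible if its
# model_type is a substring of the class's model_type, e.g. "blenderbot"
# inside "blenderbot-small". The shape is assumed, not taken from the PR.
assert config.model_type in cls.config_class.model_type, (
    f"You tried to initiate a model of type '{cls.config_class.model_type}' "
    f"with a pretrained model of type '{config.model_type}'."
)
```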
so this will check whether the main type matches as a substring of a sub-type. It's not a precise solution, but it will probably catch the majority of mismatches. Actually, for t5/mt5 it's reversed, so I think this check could fail in some situations - in which case we could perhaps check whether one is a substring of the other in either direction?

So this proposes a sort of fuzzy match.
@stas00 You are right. Before the BART refactor, all Blenderbot models shared the "blenderbot" type.
I updated the config https://huggingface.co/facebook/blenderbot-90M/blob/main/config.json. And actually, there's a new version of this model - it's the same model, but with the proper name - and the BlenderbotSmall test uses it.
Hi @stas00,
That's an excellent counter-example! As I proposed, it might only mostly work ;) But it looks like your original solution will now work after @patil-suraj's fix.

Some unrelated test is failing - I rebased this branch; let's see if it will be green now.
Thank you, Suraj! Since it's sort of related to this PR, do you want to push the change in here, or do it in another PR?
Oh bummer, we have 2 more in TF land:
same issue for both tests:
@LysandreJik, who can help resolve this one? Thank you!
Yes, I'll take a look as soon as possible!
I fixed the tests related to FlauBERT. The Flax test is a flaky one that @patrickvonplaten is working on solving, and it should not block this PR.
Overall LGTM. I would like to merge this after v4.4.0, which comes out tomorrow, so that we have time to test it on the master branch before putting it in a release.
Thank you for taking care of this, @LysandreJik. I suppose we will take care of potentially doing the same for the tokenizer validation in another PR.
With the tokenizer it'll likely be a bit more complex, as it is perfectly possible to have decoupled models/tokenizers - e.g., a BERT model with a different tokenizer, as is the case for BERTweet (config.json).
Indeed, I think this will require a change where there is a required type field saved for the tokenizer - not to validate per se, but to indicate to the user that they got either the wrong tokenizer class or the wrong tokenizer identifier, since the above error is invalid; it's the correct identifier, as can be seen from something like:
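```python
# Hypothetical illustration - the exact snippet isn't shown above, and
# the BERTweet identifier is an assumption based on the earlier comment.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
```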
which works. (And it erroneously says "model identifier" when there is no model here, but that's an unrelated minor issue.) Of course, there are many other ways I have seen this mismatch fail - usually a lot noisier, when some file is missing.
@LysandreJik, I rebased this PR and it looks good. v4.4.0 is out, so we can probably merge this one now. Thank you.
So should I create a new issue for doing the same for the tokenizers? I think it'd be much more complicated, since we don't save any tokenizer data at the moment that puts the tokenizer in any category/architecture.
Hi,
I'm glad to hear it was a good experience for you, @vimarshc. I'm not quite sure yet how to tackle the same for tokenizers. I will try to remember to tag you if we can think of an idea on how to approach this task.
from_pretrained: check that the pretrained model is for the right model architecture (huggingface#10586)

* Added check to ensure model name passed to from_pretrained and model are the same
* Added test to check from_pretrained throws assert error when passed an incompatible model name
* Modified assert in from_pretrained with f-strings. Modified test to ensure desired assert message is being generated
* Added check to ensure config and model has model_type
* Fix FlauBERT heads

Co-authored-by: vimarsh chaturvedi <vimarsh chaturvedi>
Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
What does this PR do?
Adds checks to the from_pretrained workflow to verify that the model name passed belongs to the model class being instantiated.

The same checks still need to be added for the tokenizer.
Fixes #10293
Before submitting
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.