
from_pretrained: check that the pretrained model is for the right model architecture #10586

Merged: 7 commits into huggingface:master on Mar 18, 2021

Conversation

@vimarshc (Contributor) commented Mar 8, 2021

What does this PR do?

This PR adds a check to the from_pretrained workflow to verify that the pretrained model name passed in belongs to the model architecture being instantiated.
The same check still needs to be added for the Tokenizer.

Fixes #10293

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@LysandreJik (Member):

Hi @vimarshc, thank you for opening this PR! Could you:

  • rebase your PR on the most recent master so that the failing tests don't fail anymore
  • run make fixup at the root of your repository to fix the code quality issue (more information on this in step 5 of this document)

@stas00 (Contributor) commented Mar 8, 2021

Awesome!

Would you like to attempt to add a test for this check?

We need to use tiny models so it's fast; I made suggestions here:
#10293 (comment)

If you're not sure how to do it please let me know and I will add a test.

@vimarshc (Contributor, Author) commented Mar 9, 2021

Hi @stas00,
I'd like to add the tests myself if that's OK. I also need to add the same check to the Tokenizer's from_pretrained, but that isn't as straightforward: the Tokenizer's from_pretrained is written with some assumptions in mind, and I'm not entirely sure where to add the check. Here's the from_pretrained method for Tokenizers.

Regardless, I'll try to add the test for the assertion I've already added, along with the changes mentioned by @LysandreJik, in the next 24 hours.

@stas00 (Contributor) commented Mar 9, 2021

OK, so your change works for the model and the config:

PYTHONPATH=src python -c 'from transformers import PegasusForConditionalGeneration; PegasusForConditionalGeneration.from_pretrained("patrickvonplaten/t5-tiny-random")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/modeling_utils.py", line 975, in from_pretrained
    config, model_kwargs = cls.config_class.from_pretrained(
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/configuration_utils.py", line 387, in from_pretrained
    assert (
AssertionError: You tried to initiate a model of type 'pegasus' with a pretrained model of type 't5'

same for:

PYTHONPATH=src python -c 'from transformers import PegasusConfig; PegasusConfig.from_pretrained("patrickvonplaten/t5-tiny-random")'

As you discovered (and I didn't know), the tokenizer doesn't seem to need the config file, so it doesn't look like there is a way to check that the tokenizer being downloaded is of the right kind. I will ask.

And yes, it's great if you can add the test - thank you.

I restyled your PR to fit our style guide. We don't use that format; you need to run the code through make fixup or make style (slower) before committing, otherwise CI may fail. This is what @LysandreJik was requesting.
https://github.com/huggingface/transformers/blob/master/CONTRIBUTING.md#start-contributing-pull-requests

So please git pull your branch to get my updates.

@stas00 changed the title from "Issue 10293: Checks for from_pretrained" to "from_pretrained: check that the pretrained model is for the right model architecture" on Mar 9, 2021
@vimarshc (Contributor, Author) commented Mar 9, 2021

Hi @stas00,
Thanks for the update.
I'll pull the changes, add the test, and go through the checklist before pushing. I'll try to push in a few hours.

@stas00 (Contributor) commented Mar 9, 2021

I'm puzzled: why did you undo my fix? If you want to restore it, it was:

--- a/src/transformers/configuration_utils.py
+++ b/src/transformers/configuration_utils.py
@@ -384,6 +384,9 @@ class PretrainedConfig(object):

         """
         config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
+        assert (
+            config_dict["model_type"] == cls.model_type
+        ), f"You tried to initiate a model of type '{cls.model_type}' with a pretrained model of type '{config_dict['model_type']}'"
         return cls.from_dict(config_dict, **kwargs)

     @classmethod

@vimarshc (Contributor, Author) commented Mar 9, 2021

Hi,
Apologies.
I rebased my branch and assumed I had to force push, which deleted your changes.

@vimarshc (Contributor, Author) commented Mar 9, 2021

Hi,
I have added the tests.
Everything seems to be working fine.

However, I pushed after pulling from master, and yet it's showing a merge conflict. Not sure how that got there.

@stas00 (Contributor) commented Mar 9, 2021

You messed up your PR branch, so this PR now contains dozens of unrelated changes.

You can do a soft reset to the last good sha, e.g.:

git reset --soft d70a770
git commit
git push -f

Just save your newly added test code somewhere first.

@stas00 (Contributor) commented Mar 9, 2021

I think you picked the wrong sha and ended up with an even worse situation. Try d70a770 as I suggested.

model = BertModel.from_pretrained(TINY_BERT)
self.assertIsNotNone(model)

self.assertRaises(AssertionError, BertModel.from_pretrained, TINY_T5)
Inline review comment (Contributor) on the test lines above:

Suggested change:

-    self.assertRaises(AssertionError, BertModel.from_pretrained, TINY_T5)
+    with self.assertRaises(Exception) as context:
+        BertModel.from_pretrained(TINY_T5)
+    self.assertTrue("You tried to initiate a model of type" in str(context.exception))

Let's check the actual assert message here, just in case it asserts on something else, in which case this test would be misleading.

Just please test that it works. Thank you.
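For reference, here is a self-contained sketch of what the suggested test could look like. The tiny checkpoint identifiers are taken from elsewhere in this thread; whether the merged test uses exactly these names and this structure is an assumption.

import unittest

from transformers import BertModel

# Tiny checkpoints used purely for speed; these identifiers appear earlier in
# this thread and may differ from what the final test actually uses.
TINY_BERT = "prajjwal1/bert-tiny"
TINY_T5 = "patrickvonplaten/t5-tiny-random"


class FromPretrainedArchCheckTest(unittest.TestCase):
    def test_model_type_mismatch(self):
        # A matching checkpoint should load fine.
        model = BertModel.from_pretrained(TINY_BERT)
        self.assertIsNotNone(model)

        # A t5 checkpoint loaded into a BERT class should raise, and the
        # message should come from the new sanity check, not something else.
        with self.assertRaises(Exception) as context:
            BertModel.from_pretrained(TINY_T5)
        self.assertTrue("You tried to initiate a model of type" in str(context.exception))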

Commit pushed: "Modified assert in from_pretrained with f-strings. Modified test to ensure desired assert message is being generated"
@stas00 (Contributor) left a review:

Looks very good. Thank you for bearing with my requests.

It looks like this check found some bugs in our code, so we will need to resolve those before merging. I will update you when this is done.

@stas00 requested a review from @LysandreJik on March 9, 2021, 21:34
@stas00 (Contributor) commented Mar 9, 2021

OK, so looking at the errors, we need to solve two issues:

Issue 1.

        assert (
>           config_dict["model_type"] == cls.model_type
        ), f"You tried to initiate a model of type '{cls.model_type}' with a pretrained model of type '{config_dict['model_type']}'"
E       KeyError: 'model_type'

So some models don't have the model_type key.

@vimarshc, I suppose you need to edit the code to skip this assert if we don't have the data.

You can verify that your change works with this test:

pytest -sv tests/test_trainer.py::TrainerIntegrationTest -k test_early_stopping_callback

I looked at the config.json generated by this test and it's:

{
  "a": 0,
  "architectures": [
    "RegressionPreTrainedModel"
  ],
  "b": 0,
  "double_output": false,
  "transformers_version": "4.4.0.dev0"
}

So it's far from being complete.
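A minimal sketch of how the assert could be guarded, assuming it stays in PretrainedConfig.from_pretrained as in the diff above (the merged code may differ):

config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

# Only enforce the check when both the class and the loaded config declare a
# model type; configs like the test-only one above have no "model_type" key.
if cls.model_type and config_dict.get("model_type"):
    assert config_dict["model_type"] == cls.model_type, (
        f"You tried to initiate a model of type '{cls.model_type}' "
        f"with a pretrained model of type '{config_dict['model_type']}'"
    )

return cls.from_dict(config_dict, **kwargs)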

Issue 2

This one looks trickier:

E       AssertionError: You tried to initiate a model of type 'blenderbot-small' with a pretrained model of type 'blenderbot'

We will ask for help with this one.

@stas00 (Contributor) commented Mar 9, 2021

@patrickvonplaten, @patil-suraj - your help is needed here.

BlenderbotSmall has an inconsistency. It declares its model type as "blenderbot-small":

src/transformers/models/auto/configuration_auto.py:        ("blenderbot-small", BlenderbotSmallConfig),
src/transformers/models/auto/configuration_auto.py:        ("blenderbot-small", "BlenderbotSmall"),
src/transformers/models/blenderbot_small/configuration_blenderbot_small.py:    model_type = "blenderbot-small"

but the pretrained models all use model_type: blenderbot: https://huggingface.co/facebook/blenderbot-90M/blob/main/config.json

So the new sanity check that this PR is trying to add fails.

        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
>       assert (
            config_dict["model_type"] == cls.model_type
        ), f"You tried to initiate a model of type '{cls.model_type}' with a pretrained model of type '{config_dict['model_type']}'"
E       AssertionError: You tried to initiate a model of type 'blenderbot-small' with a pretrained model of type 'blenderbot'

What shall we do?

It's possible that this part of the config object needs to be redesigned, so that there is a top-level architecture/type and then perhaps sub-types?
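Purely as an illustration of that idea (these field names are hypothetical and do not exist in the library), a config could declare both a family and a variant:

# Hypothetical sketch only -- neither this field layout nor this check exists in transformers.
config_dict = {
    "model_family": "blenderbot",       # top-level architecture/type
    "model_type": "blenderbot-small",   # concrete sub-type
}

# A sanity check could then match on the family first and fall back to the
# exact model_type only when no family is declared.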

@vimarshc (Contributor, Author):
Hi @stas00
Will add the check you mentioned today.

@stas00 (Contributor) commented Mar 10, 2021

Looks good, @vimarshc

So we are down to one failing test:

tests/test_modeling_blenderbot_small.py::Blenderbot90MIntegrationTests::test_90_generation_from_short_input

@stas00 (Contributor) commented Mar 11, 2021

I wonder if we could sort of cheat and do:

if not cls.model_type in config_dict["model_type"]: assert ...

This will check whether the main type matches as a substring of the sub-type. It's not a precise solution, but it will probably catch the majority of mismatches.

Actually, for t5/mt5 it's reversed: the model_type values are t5 and mt5, but both may have T5ForConditionalGeneration as the architecture
(https://huggingface.co/google/mt5-base/blob/main/config.json#L16), since MT5ForConditionalGeneration is a copy of T5ForConditionalGeneration whose only difference is having model_type = "mt5".

So I think this check could fail in some situations. In that case we could perhaps check whether one is a substring of the other, in either direction:

if not (cls.model_type in config_dict["model_type"] or config_dict["model_type"] in cls.model_type): assert ...

So this proposes a sort of fuzzy match.
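Spelled out, that fuzzy match might look roughly like this inside the same from_pretrained check (a sketch of the idea only; as noted just below, it has counter-examples):

loaded_type = config_dict["model_type"]

# Fuzzy match: accept if either type is a substring of the other,
# e.g. a "t5" checkpoint loaded by an "mt5" class or the other way around.
if not (cls.model_type in loaded_type or loaded_type in cls.model_type):
    raise AssertionError(
        f"You tried to initiate a model of type '{cls.model_type}' "
        f"with a pretrained model of type '{loaded_type}'"
    )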

@patil-suraj (Contributor) commented Mar 12, 2021

> BlenderbotSmall has an inconsistency. It declares its model type as "blenderbot-small":

@stas00 You are right. Before the BART refactor all blenderbot models shared the same model class, but the config was not updated after the refactor. The model_type on the hub should be blenderbot-small. I will fix that.

@patil-suraj (Contributor):
I updated the config https://huggingface.co/facebook/blenderbot-90M/blob/main/config.json.

And actually, there's a new version of blenderbot-90M: https://huggingface.co/facebook/blenderbot_small-90M

It's actually the same model, but with the proper name. The BlenderbotSmall test uses blenderbot-90M, which should be changed to use this new model.

@vimarshc (Contributor, Author):
Hi @stas00,
The fuzzy match approach will not work for the case 'distilbert' vs 'bert'.

@stas00 (Contributor) commented Mar 12, 2021

> Hi @stas00,
> The fuzzy match approach will not work for the case 'distilbert' vs 'bert'.

That's an excellent counter-example! I did propose it as something that would only mostly work ;)

But it looks like your original solution will now work after @patil-suraj's fix.

Some unrelated test is failing; I rebased this branch, so let's see if it will be green now.

@stas00 (Contributor) commented Mar 12, 2021

> I updated the config https://huggingface.co/facebook/blenderbot-90M/blob/main/config.json.
>
> And actually, there's a new version of blenderbot-90M: https://huggingface.co/facebook/blenderbot_small-90M
>
> It's actually the same model, but with the proper name. The BlenderbotSmall test uses blenderbot-90M, which should be changed to use this new model.

Thank you, Suraj!

Since it's sort of related to this PR, do you want to push the change in here, or do it in another PR?

@stas00 (Contributor) commented Mar 12, 2021

Oh bummer, we have 2 more in TF land:

FAILED tests/test_modeling_tf_flaubert.py::TFFlaubertModelTest::test_compile_tf_model
FAILED tests/test_modeling_tf_flaubert.py::TFFlaubertModelTest::test_save_load

same issue for both tests:

E           AssertionError: You tried to initiate a model of type 'xlm' with a pretrained model of type 'flaubert'

@LysandreJik, who can help resolve this one? Thank you!

@LysandreJik (Member):
Yes, I'll take a look as soon as possible!

@LysandreJik (Member):
I fixed the tests related to FlauBERT. The Flax test is a flaky one that @patrickvonplaten is working on, and it should not block this PR.

@LysandreJik (Member) left a review:

Overall LGTM. I would like to merge this after v4.4.0, which comes out tomorrow, so that we have time to test it on the master branch before putting it in a release.

@stas00 (Contributor) commented Mar 15, 2021

Thank you for taking care of this, @LysandreJik

I suppose we will handle the same validation for the Tokenizer in another PR.

@LysandreJik (Member):
With the tokenizer it'll likely be a bit more complex, as it is perfectly possible to have decoupled models/tokenizers, e.g., a BERT model with a different tokenizer, as is the case with BERTweet (config.json).

@stas00 (Contributor) commented Mar 15, 2021

Indeed, I think this will require a change where a required tokenizer_config.json identifies which architecture it belongs to. It should still be possible to mix a model and a tokenizer from different architectures, but it shouldn't fail with random misleading errors like:

python -c 'from transformers import BartTokenizer; BartTokenizer.from_pretrained("prajjwal1/bert-tiny")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/tokenization_utils_base.py", line 1693, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'prajjwal1/bert-tiny'. Make sure that:

- 'prajjwal1/bert-tiny' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'prajjwal1/bert-tiny' is the correct path to a directory containing relevant tokenizer files

Instead, it should indicate to the user that they got either the wrong tokenizer class or the wrong tokenizer identifier, since the above error is misleading: the identifier is correct.

As can be seen from:

python -c 'from transformers import BertTokenizer; BertTokenizer.from_pretrained("prajjwal1/bert-tiny")'

which works.

(It also erroneously says "model identifier" when there is no model involved here, but that's an unrelated minor issue.)

And of course there are many other ways I have seen this mismatch fail, usually much noisier when some file is missing.
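One possible shape for such a check, purely as a sketch: it assumes a tokenizer_config.json that always records the tokenizer class that saved it, which is not something the library requires today, and the helper name is made up for illustration.

import json

def check_tokenizer_class(cls, tokenizer_config_path):
    # Hypothetical helper: compare the class trying to load the files against
    # the class recorded at save time, and fail with a clear message.
    with open(tokenizer_config_path) as f:
        tokenizer_config = json.load(f)
    saved_class = tokenizer_config.get("tokenizer_class")
    if saved_class is not None and saved_class != cls.__name__:
        raise ValueError(
            f"You tried to load a '{cls.__name__}' tokenizer from files saved "
            f"by '{saved_class}'. Use the matching tokenizer class or AutoTokenizer."
        )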

@stas00 (Contributor) commented Mar 18, 2021

@LysandreJik, I rebased this PR and it looks good. v4.4.0 is out so we can probably merge this one now.

Thank you.

@LysandreJik (Member):
Indeed, this is great! Thanks a lot @vimarshc and @stas00 for working on this.

@LysandreJik merged commit 094afa5 into huggingface:master on Mar 18, 2021
@stas00 (Contributor) commented Mar 18, 2021

So should I create a new issue for doing the same for the Tokenizers? I think it'd be much more complicated, since at the moment we don't save any tokenizer data that puts the tokenizer in any category/architecture.

@vimarshc (Contributor, Author):
Hi,
Thanks, @stas00, for providing the guidance to close this issue. This is my first contribution to transformers, so you can imagine my excitement. :D
I understand that a similar change for the Tokenizer will be a bit more complicated. I'd love to take a shot at fixing that as well. :)

@stas00 (Contributor) commented Mar 19, 2021

I'm glad to hear it was a good experience for you, @vimarshc.

I'm not quite sure yet how to tackle the same thing for tokenizers. I will try to remember to tag you if we come up with an idea on how to approach this task.

Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request on Jul 15, 2021: from_pretrained: check that the pretrained model is for the right model architecture (huggingface#10586)

* Added check to ensure model name passed to from_pretrained and model are the same

* Added test to check from_pretrained throws assert error when passed an incompatible model name

* Modified assert in from_pretrained with f-strings. Modified test to ensure desired assert message is being generated

* Added check to ensure config and model has model_type

* Fix FlauBERT heads

Co-authored-by: vimarsh chaturvedi <vimarsh chaturvedi>
Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Linked issue: [pretrained] model classes aren't checking the arch of the pretrained model it loads