
Fixing issue where generic model types wouldn't load properly with the pipeline #18392

Conversation

@Narsil (Contributor) commented Aug 1, 2022

What does this PR do?

When the situation in #17929 occurs, we can provide a better error message, since the problem is detectable at load time
and the fix should happen within transformers.

I found 3 odd cases, each of which has been dealt with differently (a sketch of the resulting load-time check follows the list):

  • translation actually uses translation_XX_to_YY and also relies on task_specific_params for some model configs.
    I tried cleaning that up and using task_specific_params only once, but the rabbit hole is deep, and it would have meant more
    code changes than this PR should hold. This is left for a subsequent PR.
    The issue is that translation_XX_to_YY is not a normalized task name and is not within NO_TOKENIZER_TASKS nor NO_FEATURE_EXTRACTION_TASKS, so the configuration for whether we should load or not doesn't work.
  • feature-extraction. That one is extremely special, since ALL models could in theory use that pipeline, so we cannot enforce or detect anything statically about what should be loaded or not.
  • automatic-speech-recognition has the speech-encoder-decoder type of model, which does not define any tokenizer class, so type(config) is NOT within TOKENIZER_MAPPING (correctly), but the first version of the check would fail when deciding statically if we should load the tokenizer or not. The fix was to check whether the user passed a tokenizer (if a tokenizer is passed, we should never try to do anything anyway).
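
A minimal sketch of that load-time check, written here only to show the three rules together; the constants below are illustrative stand-ins for the real NO_TOKENIZER_TASKS / TOKENIZER_MAPPING structures, and the helper name is made up for this example:

# Illustrative stand-ins only; the real values live in the transformers source.
NO_TOKENIZER_TASKS = {"image-classification", "zero-shot-image-classification"}
TOKENIZER_MAPPED_CONFIGS = {"BertConfig", "GPT2Config"}

def should_autoload_tokenizer(task, model_config, tokenizer=None):
    if tokenizer is not None:
        # A tokenizer was passed explicitly (e.g. for speech-encoder-decoder
        # models, which define no tokenizer class): never try to guess one.
        return False
    if task in NO_TOKENIZER_TASKS:
        return False
    # Regular case: auto-load only if the config type maps to a tokenizer.
    return type(model_config).__name__ in TOKENIZER_MAPPED_CONFIGS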

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@sgugger

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev commented Aug 1, 2022

The documentation is not available anymore as the PR was closed or merged.

Narsil added 9 commits August 1, 2022 14:54
translation needs to use its non-normalized name (translation_XX_to_YY),
so that the task_specific_params are correctly overloaded.
This can be removed and cleaned up in a later PR.

`speech-encoder-decoder` actually REQUIRES passing a `tokenizer` manually,
so the error needs to be discarded when the `tokenizer` is already
there.
@Narsil Narsil requested a review from sgugger August 2, 2022 10:16
@sgugger (Collaborator) left a comment


Thanks for your PR! I'm all for better and clearer error messages, but I want to emphasize this is in no way a failure of the library. All the Encoder/Decoder classes are generic and can be used with multiple different encoders/decoders. Therefore, they do not have a default tokenizer/feature extractor class associated with them, and it's up to the users to set those in the config.json (or tokenizer_config.json/preprocessor_config.json) so that the Auto classes then work properly.

Thus, I added some tweaks.

Comment on lines 676 to 677
# Feature extraction is very special, it can't be statically known
# if it needs feature_extractor/tokenizer or not
Collaborator:

Looks like that comment is valid for the two tests, so should go above.

Comment on lines 671 to 674
raise EnvironmentError(
    f"There is a problem in `transformers`. The task {task} requires a tokenizer, however the model"
    f" {type(model_config)} seems to not support tokenizer. This is likely a misconfiguration in the library,"
    " please report this issue."
Collaborator:

The issue seems to stem from the generic classes Encoder/Decoder which are not directly usable with the auto APIs and thus the pipeline. So I would tweak the message to say that the user needs to manually build their tokenizer and pass it to the pipeline.
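
For illustration, a hedged sketch of what that would look like for a user (the repository names below are hypothetical placeholders, not real checkpoints):

from transformers import AutoTokenizer, pipeline

# Build the tokenizer yourself and hand it to the pipeline explicitly;
# "my-org/..." are placeholder names for a generic encoder/decoder checkpoint
# and the tokenizer it was trained with.
tokenizer = AutoTokenizer.from_pretrained("my-org/decoder-tokenizer")
asr = pipeline(
    task="automatic-speech-recognition",
    model="my-org/speech-encoder-decoder-checkpoint",
    tokenizer=tokenizer,
)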

Contributor Author:

Okay! I knew about the speech ones, not the vision ones. So it's the same.

Comment on lines 685 to 687
f"There is a problem in `transformers`. The task {task} requires a feature extractor, however the model"
f" {type(model_config)} seems to not support feature-extractors. This is likely a misconfiguration in the"
" library, please report this issue."
Collaborator:

Same here.

Comment on lines 175 to 189
@slow
@require_torch
def test_large_model_misconfigured(self):
    # XXX this should be a fast test, but the triggering arch
    # VisionTextDualEncoderModel is missing for small tests
    # https://huggingface.co/hf-internal-testing
    # This test will also start to fail, once this architecture
    # correctly defines AutoFeatureExtractor. At this point
    # we can safely remove this test as we don't really want
    # to keep an invalid model around just for this.
    with self.assertRaises(EnvironmentError):
        pipeline(
            task="zero-shot-image-classification",
            model="Bingsu/vitB32_bert_ko_small_clip",
        )
Collaborator:

I wouldn't include this test as it is. It's easy to build a small model in hf-internal-testing that is misconfigured and does not provide a tokenizer class while using a generic encoder/decoder arch.

Contributor Author:

True.

@Narsil (Contributor Author) commented Aug 2, 2022

@sgugger

It seems I had completely misunderstood what was going on there. I thought it was a misconfiguration, while it's more of a normal state of things (I wasn't aware we had added those generic models for vision too).

My newly proposed PR then actually fixes the underlying issue originally reported in #17929.

The way I did it is to keep some manual bookkeeping for these "multi-model" configurations (is that the right name?).

Then, if we are actually using one of these models, we attempt to load the tokenizer/feature_extractor based exclusively on whether or not the task requires one.
This should fix the original issue, and it so happens we had a test that could easily be updated to cover those use cases.

What do you think of this approach?

If a user created a model and forgot to upload one of the necessary components, the pipeline will simply fail while attempting to load it. I think that sort of failure mode is easy enough to understand, and users should be able to recover on their own, so there is no need for special error messages now.

I am still keeping the regular way of detecting whether we need the tokenizer for other types of configs, but then we will still fail if the AutoTokenizer/AutoFeatureExtractor is not correctly configured. Switching entirely to NO_TOKENIZER_TASKS detection seems easier in the long run, but I didn't want to make such a change in a small PR (feature-extraction will still be a corner case).
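
As a rough illustration of the bookkeeping described above (a sketch under my own assumptions, not the actual PR diff; the task set is an example subset):

from transformers import SpeechEncoderDecoderConfig, VisionTextDualEncoderConfig

# Generic "multi-model" configs that carry no tokenizer/feature-extractor
# class of their own.
MULTI_MODEL_CONFIGS = (SpeechEncoderDecoderConfig, VisionTextDualEncoderConfig)

# Example subset of tasks that never need a tokenizer.
NO_TOKENIZER_TASKS = {"image-classification", "zero-shot-image-classification"}

def needs_tokenizer(task, model_config):
    if isinstance(model_config, MULTI_MODEL_CONFIGS):
        # For these generic configs, rely exclusively on the task.
        return task not in NO_TOKENIZER_TASKS
    # Other configs keep the existing mapping-based detection (not shown here).
    return True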

@Narsil Narsil changed the title [Pipeline] Improving error message when transformers model is not correctly configured Fixing issue where generic model types wouldn't load properly with the pipeline Aug 2, 2022
@Narsil Narsil requested a review from sgugger August 4, 2022 12:10
@sgugger (Collaborator) left a comment

Thanks for iterating on your PR. LGTM, with just the problem of importing from specific models in the pipeline file. I'd prefer to avoid it so the library stays as compartmentalized as possible. I made some suggestions for alternatives.

@@ -25,6 +25,8 @@

from numpy import isin

from transformers import SpeechEncoderDecoderConfig, VisionTextDualEncoderConfig
Collaborator:

Let's not add new dependencies between pipeline and specific models if we can avoid it. Here we can detect the proper model types I believe.

If those are kept, they should be relative imports like the rest.
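
One possible shape for that alternative, sketched as an assumption rather than the exact change that was merged: compare against the config class names, so the pipelines module never imports the specific config classes.

# Name-based detection keeps pipelines decoupled from the model subpackages.
MULTI_MODEL_CONFIG_NAMES = {"SpeechEncoderDecoderConfig", "VisionTextDualEncoderConfig"}

def is_multi_model_config(model_config) -> bool:
    # type(...).__name__ avoids importing SpeechEncoderDecoderConfig or
    # VisionTextDualEncoderConfig into the pipelines module.
    return type(model_config).__name__ in MULTI_MODEL_CONFIG_NAMES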

Contributor Author:

Perfectly understandable. It's done.

src/transformers/pipelines/__init__.py: outdated review comments, resolved.
@Narsil Narsil merged commit 586dcf6 into huggingface:main Aug 5, 2022
@Narsil Narsil deleted the improve_error_message_when_transformers_is_misconfigured branch August 5, 2022 06:45
@sgugger sgugger mentioned this pull request Aug 5, 2022
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
…e pipeline (huggingface#18392)

* Adding a better error message when the model is improperly configured within transformers.

* Update src/transformers/pipelines/__init__.py

* Black version.

* Overriding task aliases so that tokenizer+feature_extractor values are correct.

* Fixing task aliases by overriding their names early

* X.

* Fixing feature-extraction.

* black again.

* Normalizing `translation` too.

* Fixing last few corner cases.

translation needs to use its non-normalized name (translation_XX_to_YY),
so that the task_specific_params are correctly overloaded.
This can be removed and cleaned up in a later PR.

`speech-encoder-decoder` actually REQUIRES passing a `tokenizer` manually,
so the error needs to be discarded when the `tokenizer` is already
there.

* doc-builder fix.

* Fixing the real issue.

* Removing dead code.

* Do not import the actual config classes.