Fixing issue where generic model types wouldn't load properly with the pipeline #18392
Conversation
The documentation is not available anymore as the PR was closed or merged.
`translation` needs to use its non-normalized name (`translation_XX_to_YY`) so that the `task_specific_params` are correctly overloaded; this can be removed and cleaned up in a later PR. `speech-encoder-decoder` actually REQUIRES passing a `tokenizer` manually, so the error needs to be discarded when the `tokenizer` is already there.
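For context, this is the kind of `task_specific_params` block involved; the values below are illustrative (in the style of the usual T5 configs), not taken from a specific checkpoint:

```python
# Illustrative task_specific_params block: the keys use the non-normalized
# task names such as "translation_en_to_fr", which is why the pipeline must
# keep that name around to pick up these per-task overrides.
task_specific_params = {
    "translation_en_to_fr": {
        "prefix": "translate English to French: ",
        "max_length": 300,
    },
    "summarization": {
        "prefix": "summarize: ",
        "max_length": 200,
    },
}
```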
Thanks for your PR! I'm all for better and clearer error messages, but I want to emphasize this is by no means a failure of the library. All the Encoder/Decoder classes are generic and can be used with multiple different encoders/decoders. Therefore, they do not have a default tokenizer/feature extractor class associated with them, and it's up to the users to set those in the config.json (or tokenizer_config.json/preprocessor_config.json) so the Auto classes then work properly.
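As a hedged illustration of that setup (the model id is a placeholder and `Wav2Vec2CTCTokenizer` is just an example class), one way to point the Auto classes at a concrete tokenizer for a generic encoder/decoder checkpoint is:

```python
# Illustrative only: record the concrete tokenizer class in the config so that
# AutoTokenizer can resolve it for a generic encoder/decoder checkpoint.
from transformers import SpeechEncoderDecoderConfig

config = SpeechEncoderDecoderConfig.from_pretrained("my-org/my-asr-model")  # placeholder id
config.tokenizer_class = "Wav2Vec2CTCTokenizer"  # example tokenizer class name
config.save_pretrained("./my-asr-model")  # re-save so the saved config carries the hint
```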
I thus added some tweaks.
# Feature extraction is very special, it can't be statically known
# if it needs feature_extractor/tokenizer or not
Looks like that comment is valid for the two tests, so should go above.
raise EnvironmentError(
    f"There is a problem in `transformers`. The task {task} requires a tokenizer, however the model"
    f" {type(model_config)} seems to not support tokenizer. This is likely a misconfiguration in the library,"
    " please report this issue."
The issue seems to stem from the generic Encoder/Decoder classes, which are not directly usable with the auto APIs and thus the pipeline. So I would tweak the message to say that the user needs to manually build their tokenizer and pass it to the pipeline.
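For illustration, this is the kind of workaround such a message could point users to (the checkpoint ids below are placeholders, not from the PR):

```python
# Sketch of the suggested workaround: build the tokenizer yourself and hand it
# to the pipeline. Both checkpoint names here are example placeholders.
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("facebook/wav2vec2-base-960h")  # example tokenizer
asr = pipeline(
    task="automatic-speech-recognition",
    model="my-org/my-speech-encoder-decoder",  # placeholder generic checkpoint
    tokenizer=tokenizer,
)
```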
Okay! I knew about the speech ones, not the vision ones. So it's the same.
f"There is a problem in `transformers`. The task {task} requires a feature extractor, however the model" | ||
f" {type(model_config)} seems to not support feature-extractors. This is likely a misconfiguration in the" | ||
" library, please report this issue." |
Same here.
@slow
@require_torch
def test_large_model_misconfigured(self):
    # XXX this should be a fast test, but the triggering arch
    # VisionTextDualEncoderModel is missing for small tests
    # https://huggingface.co/hf-internal-testing
    # This test will also start to fail once this architecture
    # correctly defines AutoFeatureExtractor. At that point
    # we can safely remove this test, as we don't really want
    # to keep an invalid model around just for this.
    with self.assertRaises(EnvironmentError):
        pipeline(
            task="zero-shot-image-classification",
            model="Bingsu/vitB32_bert_ko_small_clip",
        )
I wouldn't include this test as it is. It's easy to build a small model in hf-internal-testing that is misconfigured and does not provide a tokenizer class while using a generic encoder/decoder arch.
True.
It seems I had completely misunderstood what was going on there. I thought it was a misconfiguration, while it's more of a normal state of things (I wasn't aware we had added those generic models for vision too).

My new proposed PR then actually fixes the underlying issue initially reported in #17929. The way I did it is to keep some manual bookkeeping for these "multi model" configurations (is the name right?). Then, if we are actually using one of these models, we attempt to load both the tokenizer and the feature extractor. What do you think of this approach?

If a user created a model and forgot to upload one of the necessary components, the pipeline will simply fail while attempting to load it. I think that sort of failure mode should be easy enough to understand, and users should be able to recover on their own, so there is no need for specific error messages now. I am still keeping the regular way to detect whether we need the tokenizer for other types of configs, but then we will still fail if the AutoTokenizer/FeatureExtractor is not correctly configured. I think maybe switching entirely to
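A rough sketch of what that bookkeeping might look like (helper names and structure are mine, not the merged implementation):

```python
# Rough sketch: for configs that combine several sub-models, load both
# preprocessors unconditionally and let from_pretrained fail on its own
# if one of them is missing from the repo.
from transformers import (
    AutoFeatureExtractor,
    AutoTokenizer,
    SpeechEncoderDecoderConfig,
    VisionTextDualEncoderConfig,
)

MULTI_MODEL_CONFIGS = (SpeechEncoderDecoderConfig, VisionTextDualEncoderConfig)

def load_preprocessors(model_name, model_config):
    if isinstance(model_config, MULTI_MODEL_CONFIGS):
        return (
            AutoTokenizer.from_pretrained(model_name),
            AutoFeatureExtractor.from_pretrained(model_name),
        )
    return None, None  # other configs keep the task-based detection
```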
`transformers` model is not correctly configured
Thanks for iterating on your PR. LGTM, with just the problem of importing from specific models in the pipeline file. I'd prefer to avoid it so the library stays as compartmentalized as possible. I made some suggestions for alternatives.
@@ -25,6 +25,8 @@

from numpy import isin

from transformers import SpeechEncoderDecoderConfig, VisionTextDualEncoderConfig
Let's not add new dependencies between the pipeline and specific models if we can avoid it. Here we can detect the proper model types, I believe.
If those are kept, they should be relative imports like the rest.
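As an illustration of the first option, detection without importing the concrete config classes could look roughly like this (a sketch, not the code in the PR):

```python
# Illustrative: detect the generic multi-model configs by class name so the
# pipelines module does not need to import the concrete config classes.
def is_multi_model_config(model_config):
    return type(model_config).__name__ in {
        "SpeechEncoderDecoderConfig",
        "VisionTextDualEncoderConfig",
    }
```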
Perfectly understandable. It's done.
Fixing issue where generic model types wouldn't load properly with the pipeline (huggingface#18392)

* Adding a better error message when the model is improperly configured within transformers.
* Update src/transformers/pipelines/__init__.py
* Black version.
* Overriding task aliases so that tokenizer+feature_extractor values are correct.
* Fixing task aliases by overriding their names early
* X.
* Fixing feature-extraction.
* black again.
* Normalizing `translation` too.
* Fixing last few corner cases. translation need to use its non normalized name (translation_XX_to_YY, so that the task_specific_params are correctly overloaded). This can be removed and cleaned up in a later PR. `speech-encode-decoder` actually REQUIRES to pass a `tokenizer` manually so the error needs to be discarded when the `tokenizer` is already there.
* doc-builder fix.
* Fixing the real issue.
* Removing dead code.
* Do not import the actual config classes.
What does this PR do?
When this occurs (#17929), we can provide a better error message, since this is detectable at load time and the fix should happen within `transformers`.

Found out 3 odd cases which have been dealt with differently:

- `translation` actually uses `translation_XX_to_YY` and also relies on `task_specific_params` for some model configs. I tried cleaning that up and using `task_specific_params` only once, but the rabbit hole is deep, and it would have meant more code changes than this PR should hold. Waiting for a subsequent PR. The issue is that `translation_XX_to_YY` is not a normalized task name and is not within `NO_TOKENIZER_TASKS` nor `NO_FEATURE_EXTRACTION_TASKS`, so the configuration on whether we should load or not doesn't work.
- `feature-extraction`. That one is extremely special, since ALL models could in theory use that pipeline, and so we cannot enforce or detect anything statically on what should be loaded or not.
- `automatic-speech-recognition` has this `speech-encoder-decoder` type of model, which does not define any `tokenizer` class, so the `type(config)` is NOT within `TOKENIZER_MAPPING` (correctly), but the first version of the check would fail when deciding statically whether we should load the tokenizer or not. The fix was to check if the user passed a tokenizer (if a tokenizer is passed, we should never try to load one anyway); a rough sketch of that decision is given below.

Fixes # (issue)
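A minimal sketch of the load-time decision described in the last bullet, with illustrative names rather than the actual pipeline internals:

```python
# Minimal sketch (not the library's real code) of the decision: only try to
# auto-load a tokenizer when the task needs one, the user did not pass one,
# and the config type has a registered tokenizer class.
NO_TOKENIZER_TASKS = {"image-classification", "feature-extraction"}  # example values

def should_load_tokenizer(task, model_config, tokenizer, tokenizer_mapping):
    if tokenizer is not None:
        # The user passed a tokenizer explicitly: never second-guess it.
        return False
    if task in NO_TOKENIZER_TASKS:
        # The task never needs a tokenizer.
        return False
    # Otherwise the task wants a tokenizer; it can only be auto-loaded when
    # the config type has a registered tokenizer class.
    return type(model_config) in tokenizer_mapping
```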
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@sgugger
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.