Deprecate #36741 and map Causal to Conditional #36917
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the "Ready for review" button.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
TLDR of our internal thread:
- we need an `AutoForAny`; today `AutoForCausalLM` is a dumpster used to map anything
- we cannot break, so for now it will remain this way
- we also want the text-only parts to be loadable on their own; similarly, you would want to be able to load only the Image and Text parts -> `ImageTextToText` (see the sketch below)

Let's patch this!
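For illustration, a minimal sketch of the status quo described above, using a Gemma3 checkpoint id as an example (the `AutoForAny` class from the thread does not exist yet):

```python
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText

# Status quo: AutoModelForCausalLM is the catch-all that many Hub models,
# including multimodal remote-code ones, are mapped under.
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")

# The explicit multimodal mapping already exists but is far less visible,
# so remote-code authors rarely point users to it.
model = AutoModelForImageTextToText.from_pretrained("google/gemma-3-4b-it")
```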
What does this PR do?
Fixes #36886, fixes #36926, and fixes loading in SmolAgents (from feedback in internal Slack).
NOTE: this is a temporary fix. In the long term we will converge under one Auto class for all multimodals, which we don't have yet. As such we will keep `CausalLM` as a temporary dump for custom-code users. 🔴 We will break this pattern in the near future and might enforce text-only models under this mapping.

After #36741, we unintentionally broke model loading for most remote-code users, because many Vision/Audio/Omni LLMs on the Hub use the `CausalLM` mapping with `AutoTokenizer`. This happens because `AutoImageTextToText` is less visible, and also because we have no mapping for other modalities.

This PR deprecates the previous fix and properly maps Gemma3 4B+ models to their `ConditionalGeneration` class, which aligns with the info in the model card. As discussed internally, all vision-audio-multimodal models will converge under `AutoModelForCausalLM` in the future, to maintain consistency and stop adding a new mapping for each new modality.
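A quick way to check the new behavior (a sketch; the checkpoint id and the expected class name follow the Gemma3 model card and modeling code):

```python
from transformers import AutoModelForCausalLM

# With this PR, the CausalLM auto class resolves Gemma3 4B+ checkpoints to the
# full multimodal class rather than the text-only backbone.
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")
print(type(model).__name__)  # expected: Gemma3ForConditionalGeneration
```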
Why? All Audio LMs are already in the causal mapping due to the lack of `AutoAudioTextToText` or `AutoAudioToText` mappings. Vision LMs have also been inconsistently mapped under `CausalLM`.

Consequences: No breaking changes for users, including those using remote code. A warning is raised only if a model has both its full config and its text config mapped in `CausalLM`, which is a super rare case (gemma-3 was the exception).
Edge cases: Other models like `llava-1.5` cannot be loaded under `AutoModelForCausalLM` anymore after this PR, but their checkpoint keys wouldn't match anyway, and we never got user issues about it. This was an existing issue, not introduced by this PR. It might be resolved by a bigger refactor for vLLM, after adding base models for all VLMs and correct `base-prefix-keys`.
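A sketch of the `llava-1.5` edge case, assuming the `llava-hf/llava-1.5-7b-hf` checkpoint id; the exact exception depends on the failure path:

```python
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText

# After this PR, llava-style checkpoints are expected to fail under the
# catch-all mapping (their checkpoint keys would not match anyway).
try:
    AutoModelForCausalLM.from_pretrained("llava-hf/llava-1.5-7b-hf")
except Exception as err:
    print(f"expected failure: {err}")

# The dedicated multimodal mapping remains the supported route.
model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-1.5-7b-hf")
```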
I verified that the Gemma3 case works as expected, without raising warnings (I just changed the mapping class). The only difference from the previous fix is that the vision tower is loaded as well, which might affect advanced users who manipulate configs or access model layers manually.
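For those advanced users, a minimal sketch of the difference; the `vision_tower` attribute name follows the Gemma3 modeling code, and its exact location on the module tree may differ across transformers versions:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")

# Unlike the previous fix, the vision tower is now loaded as part of the model,
# so code that walks the module tree will see the extra submodule.
vision_tower = getattr(model, "vision_tower", None)
print(type(model).__name__, vision_tower is not None)
```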