Labels
bug · help wanted · multi-modality (#4194) · v1
Description
In V1, we expect the output of `get_multimodal_embedding` to correspond to the `PlaceholderRange`, which is in turn constructed based on `PromptUpdateDetails.features`. However, the current V1 code doesn't validate this, causing the model to crash during inference under high load (e.g. #14897, #14963).
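For reference, here is a minimal sketch of the kind of shape check that is currently missing. The function name and the plain `placeholder_lengths` argument are illustrative assumptions, not the actual V1 code:

```python
from typing import Sequence

import torch


def check_mm_embedding_shapes(
    mm_embeddings: Sequence[torch.Tensor],
    placeholder_lengths: Sequence[int],
) -> None:
    """Each embedding must provide exactly one row per placeholder position."""
    for emb, length in zip(mm_embeddings, placeholder_lengths):
        if emb.shape[0] != length:
            raise ValueError(
                f"Expected {length} embedding rows to fill the placeholder "
                f"range, but the model returned {emb.shape[0]}")
```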
From a quick look at the code, these models output embedding sizes which are inconsistent with the placeholder range:
- Fuyu (fixed by [Bugfix] Added `embed_is_patch` mask for fuyu model #15731)
- Gemma3 (fixed by [Bugfix] Re-enable Gemma3 for V1 #14980)
- Idefics3 (fixed by [Bugfix] `embed_is_patch` for Idefics3 #15696)
- InternVL-based models (fixed by [Bugfix] Fix embedding assignment for InternVL-based models #15086)
- MiniCPM-V (fixed by [Model] MiniCPM-V/O supports V1 #15487)
(Basically, any model that has image newline/column tokens after applying the HF processor needs a mask to map image patch features to image embeddings, as described below.)
To fix this, we can follow these steps:
- Update the multi-modal processor to output a mask indicating which positions in the `PlaceholderRange`-aligned embeddings the patch features (output by the vision encoder) should be assigned to. This mask can be called `embed_is_patch`.
- Use `scatter_patch_features` to scatter the patch features into the image embedding tensor.
- When merging multimodal embeddings, use `select_patch_features` to recover the patch features from the image embeddings. The number of patch features should correspond to the number of image tokens (which is a subset of the feature tokens in `PromptUpdateDetails`). A rough sketch of this flow is shown below.
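To make the flow concrete, here is a minimal, self-contained sketch of the scatter/select idea. The function bodies and the NaN sentinel for non-patch positions are illustrative assumptions, not the actual vLLM helpers:

```python
import torch


def scatter_patch_features_sketch(
    patch_features: torch.Tensor,  # (num_patches, hidden_size) from the vision encoder
    embed_is_patch: torch.Tensor,  # (num_feature_tokens,) bool mask over the placeholder range
) -> torch.Tensor:
    """Place patch features at the masked positions; non-patch positions
    (e.g. image newline/column tokens) are filled with a NaN sentinel."""
    num_feature_tokens = embed_is_patch.shape[0]
    hidden_size = patch_features.shape[-1]
    out = patch_features.new_full((num_feature_tokens, hidden_size), torch.nan)
    out[embed_is_patch] = patch_features
    return out


def select_patch_features_sketch(
    image_embeds: torch.Tensor,    # (num_feature_tokens, hidden_size)
    embed_is_patch: torch.Tensor,  # (num_feature_tokens,) bool
) -> torch.Tensor:
    """Recover only the patch features, to be merged at the image-token positions."""
    return image_embeds[embed_is_patch]


# Example: 4 patch features with an image-newline token after every 2 patches.
patches = torch.randn(4, 8)
mask = torch.tensor([True, True, False, True, True, False])
aligned = scatter_patch_features_sketch(patches, mask)    # shape (6, 8)
recovered = select_patch_features_sketch(aligned, mask)   # shape (4, 8)
assert torch.equal(recovered, patches)
```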
Follow-up work:
- [V1] Scatter and gather placeholders in the model runner #15712 (assigned to @DarkLight1337)
- Directly use individual token IDs instead of a range of IDs (assigned to @ywang96)
 