[TPU][V1] Capture multimodal encoder during model compilation #15051
Conversation
PS: I want to add tests to this PR, just having some issues getting a test with … working.
@NickLucche thanks for doing this! In general, LGTM, left some questions/comments.
        
          
Review comment on vllm/v1/worker/tpu_model_runner.py (Outdated):
Does position_ids not require this zeroing out?
Great observation, thanks!
This is definitely happening; I'm just unsure whether the fact that it's being added to a padding token makes its position not meaningful. @robertgshaw2-redhat
Values at runtime:
position_ids = [512, 513, 514, 515, ..., 594, 595, (padding starts =>) 468, 469, 470, ..., 511]
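For illustration, a minimal sketch of the zeroing under discussion, assuming a persistent positions buffer whose padded tail keeps stale values from an earlier step (the buffer and variable names below are hypothetical, not the runner's actual fields):

```python
import torch

# Hypothetical persistent positions buffer reused across steps; the slots past
# the scheduled tokens still hold values from an earlier iteration, which is
# where the trailing 468..511 run above would come from.
positions = torch.arange(468, 468 + 128, dtype=torch.int64)

num_scheduled_tokens = 84   # real tokens this step
padded_num_tokens = 128     # bucket size the graph was compiled for

# Fill the live slots with the current positions...
positions[:num_scheduled_tokens] = torch.arange(512, 512 + num_scheduled_tokens)
# ...and optionally zero the padded tail so it no longer carries stale positions.
positions[num_scheduled_tokens:padded_num_tokens] = 0
```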
@NickLucche do you want to revive and test this now that the Pallas MHA backend has landed? It should be sufficient to show a benchmark comparing llava with a single image before and after this PR
This pull request has merge conflicts that must be resolved before it can be merged.
@mgoin Totally, I had this lined up for right after the main branch is stable. The main issue is working around the two causes of recompilation I listed above, now that we don't allow any in the CI anymore.
There's still some recompilation on smaller graphs to address, but this PR should now be ready (thanks @mgoin and @DarkLight1337 for the work on mm scatter-gather and the scheduler). Test with …
Hi @NickLucche, I debugged the recompilation issue on my end; here are the findings and the fix. You can check this commit, which includes the fix.
I think we have the recompilation issue, and it's not straightforward to debug, because the code paths for precompile and execution are not the same. I think we can abstract the device operations into a function and then call that same function during both precompile and execution. To debug this, I added more recompilation check code throughout the …
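For reference, a minimal sketch of this kind of recompilation check using torch_xla's debug metrics (a sketch only: the helper name and call sites are assumptions, and it assumes met.metric_data("CompileTime") reports the number of recorded compilations):

```python
import torch_xla.debug.metrics as met


def xla_compile_count() -> int:
    """Number of XLA compilations recorded so far (0 if none yet)."""
    data = met.metric_data("CompileTime")
    return data[0] if data is not None else 0


# Snapshot after warm-up/precompilation, then assert nothing new gets compiled
# while serving real requests.
baseline = xla_compile_count()
# ... run one execute_model step ...
if xla_compile_count() > baseline:
    raise RuntimeError("Unexpected XLA recompilation during execution")
```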
Thanks for the fix and reporting the debugging method too!
Carrying on @mgoin's work here: #14254.
This PR pre-compiles the multimodal encoder for various mm_items batch sizes. We're mostly concerned with LLaVA-style vision-language models (fixed input resolution), but the pre-compilation code already accounts for other modalities (audio, etc.). We're not addressing models with dynamic image sizes just yet (Pixtral, etc.).
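As a rough illustration of the pre-compilation idea (a sketch only: the encoder handle, image resolution, and bucket sizes below are assumptions, not this PR's actual code):

```python
import torch
import torch_xla.core.xla_model as xm

# Assumed padding buckets for the number of multimodal items per batch.
MM_ITEM_BUCKETS = [1, 2, 4, 8]


def precompile_mm_encoder(vision_encoder, image_shape=(3, 336, 336)):
    """Run the encoder once per bucket with dummy inputs so XLA compiles and
    caches one graph per (padded) batch size ahead of serving."""
    device = xm.xla_device()
    for num_items in MM_ITEM_BUCKETS:
        dummy_pixels = torch.zeros((num_items, *image_shape),
                                   dtype=torch.bfloat16, device=device)
        with torch.no_grad():
            vision_encoder(dummy_pixels)
        xm.mark_step()  # materialize the trace so the compiled graph is cached
```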
Server:
Client:
Benchmark:
Update: The issues listed below have all been addressed by separate PRs. In particular, we work around issue 1 by only processing whole mm_items, as the gather_mm_placeholders operation is intrinsically dynamic.
I do want to discuss a few issues we have with code that manipulates on-device tensors and may cause re-compilation (here as well as on main), related to this PR:
Issue 1: in _gather_encoder_outputs (vllm/vllm/v1/worker/tpu_model_runner.py, lines 530 to 537 in 46c759c).
The last line slices an on-device tensor with a varying shape. Padding here is non-obvious because image features have to be aligned with the image placeholders in input_ids.
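To make the shape issue concrete, here is an illustrative pattern (not the actual runner code): a slice whose length depends on per-request values gives XLA a different output shape, and therefore a new graph, per distinct length, whereas gathering whole, bucket-padded mm_items keeps the shape static.

```python
import torch


def gather_dynamic(encoder_output: torch.Tensor, start: int, end: int) -> torch.Tensor:
    # Output shape depends on (end - start): every distinct length traced on an
    # XLA device becomes a separate compiled graph.
    return encoder_output[start:end]


def gather_whole_items(encoder_output: torch.Tensor, num_items: int,
                       tokens_per_item: int) -> torch.Tensor:
    # Workaround in the spirit of this PR: always gather whole mm_items, so the
    # slice length is one of the pre-compiled bucket sizes rather than arbitrary.
    return encoder_output[:num_items * tokens_per_item]
```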
Issue 2: in vllm/vllm/model_executor/models/llava.py (lines 773 to 778 in 46c759c).
This dynamically filters out the on-device patches. The code lives inside the model definition and is duplicated across different models.
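An illustrative sketch of the pattern (not the llava.py code itself): boolean-mask indexing on a device tensor has a data-dependent output shape, while a shape-stable alternative keeps the full padded tensor and zeroes out the invalid rows.

```python
import torch


def filter_patches_dynamic(patch_embeds: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    # Data-dependent output shape: XLA cannot know how many patches survive,
    # so each distinct count triggers another compilation.
    return patch_embeds[valid]


def filter_patches_static(patch_embeds: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    # Shape-stable alternative: zero the padding rows and let a padded
    # scatter/gather downstream drop them, keeping one graph for all inputs.
    return patch_embeds * valid.unsqueeze(-1).to(patch_embeds.dtype)
```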
Finally, this is not a real re-compilation issue, but I just wanted to point out that the CLIP-based model I've tested will have its graph broken at vllm/vllm/attention/layer.py (lines 313 to 316 in 46c759c), as the XLA debug logs reveal an access to an on-device tensor that forces compilation to start.
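For context, a minimal illustration of this kind of graph break (not the actual layer.py code): any host-side read of a device tensor, e.g. .item() or using it in a Python if, forces XLA to execute the pending graph at that point; running with PT_XLA_DEBUG=1 should surface these early executions in the logs.

```python
import torch


def forward_with_graph_break(q: torch.Tensor, scale_t: torch.Tensor) -> torch.Tensor:
    # Reading the tensor's value on the host splits the trace: everything
    # queued so far is compiled and executed before the branch can be taken.
    if scale_t.item() > 0:      # host read of a device tensor -> graph break
        q = q * scale_t
    return q


def forward_without_graph_break(q: torch.Tensor, scale: float) -> torch.Tensor:
    # Keeping the condition on plain Python values lets the whole computation
    # stay inside a single XLA graph.
    return q * scale if scale > 0 else q
```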