
Conversation

@ayushsatyam146
Contributor

@ayushsatyam146 ayushsatyam146 commented Oct 5, 2025

Part of #26149

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the Ultravox, Voxtral, and Whisper models to use the new merge_by_field_config multimodal interface. The changes involve setting this flag to True in the model classes and removing the flatten_bn utility, as its functionality is now handled by the new data processing pipeline. The code modifications are consistent and appear correct for this refactoring. I did not identify any issues of high or critical severity.
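The refactor pattern described above can be sketched in miniature. This is an illustrative sketch, not the actual vLLM diff: the class names and method are hypothetical stand-ins for the pattern of setting the merge_by_field_config class attribute and dropping manual flatten_bn calls.

```python
# Illustrative sketch (assumed names, not the real vLLM classes): the
# refactor opts a model into the new multimodal interface via a class
# attribute, so per-field inputs arrive pre-flattened.

class AudioModelBefore:
    """Old path: inputs arrive batched as B x N and must be flattened."""

    def _parse_inputs(self, audio_features):
        # flatten_bn collapsed the batch/num-items dims by hand;
        # shown here as a plain list flatten for illustration.
        return [item for batch in audio_features for item in batch]


class AudioModelAfter:
    """New path: the data processing pipeline pre-flattens per-field inputs."""

    merge_by_field_config = True  # opt into the new multimodal interface

    def _parse_inputs(self, audio_features):
        # Already flattened: one entry per multimodal item.
        return audio_features
```

The net effect is that the per-model code no longer needs to know about the B x N layout at all.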

@ayushsatyam146 ayushsatyam146 force-pushed the merge_by_field_config-X branch from a7615f9 to 81fe190 Compare October 5, 2025 17:32
@chatgpt-codex-connector

💡 Codex Review

# Audio lens and token_len are already in the correct shape
audio_lens = audio_input["lens"]
audio_token_len = audio_input["token_len"]
embeddings = self._audio_features_to_embeddings(audio_features, audio_lens)
# We should flatten and concatenate embeddings based on token lengths
# For example, with token_len = [4, 2, 3], flattened_embeddings will be
# concat(embeddings[0][:4], embeddings[1][:2], embeddings[2][:3])
# Create a mask of valid indices based on token lengths
max_len = embeddings.shape[1]
indices = torch.arange(max_len, device=embeddings.device).expand(
    embeddings.shape[0], -1
)
mask = indices < audio_token_len[:, None]
# Apply mask and flatten
flattened_embeddings = embeddings[mask]
# Return one tensor per input audio
embed_lens = [
    token_len_item.sum().item() for token_len_item in audio_input["token_len"]
]
return flattened_embeddings.split(embed_lens)
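The mask-and-flatten step in the snippet above can be illustrated without torch. This is a minimal pure-Python sketch of the same logic, using nested lists in place of padded tensors; the real code operates on torch tensors with a boolean mask.

```python
# Pure-Python illustration of the mask-and-flatten step: keep the first
# token_len[i] entries of each padded item, concatenated together.
def flatten_by_token_len(embeddings, token_len):
    """Equivalent to masking positions where index < token_len, then flattening."""
    out = []
    for emb, n in zip(embeddings, token_len):
        out.extend(emb[:n])  # valid (unpadded) positions only
    return out


# Three padded per-item "embeddings" (padded to length 4)
embeddings = [
    [10, 11, 12, 13],
    [20, 21, 0, 0],
    [30, 31, 32, 0],
]
token_len = [4, 2, 3]
# With token_len = [4, 2, 3] the result is
# embeddings[0][:4] + embeddings[1][:2] + embeddings[2][:3]
```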

P1: Splitting Ultravox embeddings per chunk instead of per audio

In _process_audio_input, the lens and token lengths are now taken as-is (audio_token_len = audio_input["token_len"]) because the new merge_by_field_config path already flattens the B×N dimensions. However, embed_lens is still built by iterating over audio_input["token_len"] and summing each element. With the flattened inputs, this loop now produces one entry per chunk rather than one entry per audio item, so flattened_embeddings.split(embed_lens) returns one tensor per chunk, while _get_prompt_updates still emits only one PromptReplacement per audio based on audio_num_chunks.

The number and ordering of multimodal embeddings therefore no longer match the placeholders inserted into the prompt, which will misalign audio embeddings with their token replacements (and can raise when the counts differ). Consider rebuilding embed_lens using the original per-audio grouping (e.g. via audio_num_chunks) so the function continues to return one tensor per audio item.
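The suggested regrouping can be sketched as follows. This is a hedged illustration, not the actual fix: the names token_len and audio_num_chunks are taken from the review text above, and the helper is a hypothetical stand-in operating on plain lists rather than tensors.

```python
# Sketch of the suggested fix: with flattened per-chunk token lengths,
# regroup them into one total length per audio item using the per-audio
# chunk counts, so split() yields one tensor per audio instead of per chunk.
def per_audio_embed_lens(token_len, audio_num_chunks):
    """Sum chunk-level token lengths into one length per audio item."""
    embed_lens, i = [], 0
    for n_chunks in audio_num_chunks:
        embed_lens.append(sum(token_len[i:i + n_chunks]))
        i += n_chunks
    return embed_lens


# Two audios: the first split into 2 chunks, the second into 1
token_len = [4, 2, 3]        # per-chunk token counts (flattened)
audio_num_chunks = [2, 1]
# -> [6, 3]: one entry per audio, matching one PromptReplacement per audio
```

Splitting the flattened embeddings with these per-audio lengths restores the one-tensor-per-audio contract that _get_prompt_updates relies on.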


@DarkLight1337
Member

Have you tested each model using the example script?

Signed-off-by: Ayush Satyam <ayushsatyam146@gmail.com>
@ayushsatyam146 ayushsatyam146 force-pushed the merge_by_field_config-X branch from 81fe190 to 730e6e7 Compare October 6, 2025 05:44
@ayushsatyam146
Contributor Author

@DarkLight1337 I have made the required changes. Sorry, I couldn't run the example scripts myself due to GPU constraints.

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Member

@DarkLight1337 DarkLight1337 left a comment


Voxtral and Whisper are both failing locally, I have fixed Voxtral and will fix Whisper later

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Member

@DarkLight1337 DarkLight1337 left a comment


Fixed Whisper as well, let's merge this

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) October 7, 2025 03:47
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 7, 2025
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Oct 7, 2025
@DarkLight1337 DarkLight1337 merged commit 5f7e8a9 into vllm-project:main Oct 7, 2025
55 checks passed
southfreebird pushed a commit to southfreebird/vllm that referenced this pull request Oct 7, 2025
…#26261)

Signed-off-by: Ayush Satyam <ayushsatyam146@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
