
Conversation

@pansicheng (Contributor) commented on Mar 22, 2025:

FIX #14677 (link existing issues this PR will resolve)

@github-actions (bot) commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@DarkLight1337 (Member) commented:

This breaks the processing correctness tests in test_common.py, PTAL.

@mergify bot added the multi-modality label (Related to multi-modality, #4194) on Mar 23, 2025.
@pansicheng (Contributor, Author) commented:

> This breaks the processing correctness tests in test_common.py, PTAL.

Thank you for pointing this out.

After investigating, I found the issue primarily stems from differences in input preprocessing for the phi3v model. Referencing: Phi-3.5-vision-instruct's processing_phi3_v.py#L407, here's the breakdown:

No images: Phi3VProcessor tokenizes the prompt directly:

if not len(images):
    model_inputs = self.tokenizer(texts, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length)

Images present: The prompt is split using the regex r"<\|image_\d+\|>" before tokenizing each chunk:

pattern = r"<\|image_\d+\|>"  
prompt_chunks = [self.tokenizer(chunk).input_ids for chunk in re.split(pattern, texts)]  
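For illustration, here is a minimal, self-contained sketch of what that split step does on a sample prompt (the prompt below is my own example, not from the model repo):

import re

# The same pattern as above, applied to a sample two-image prompt.
pattern = r"<\|image_\d+\|>"
prompt = "<|image_1|> Describe the image. <|image_2|> Compare it with this one."

chunks = re.split(pattern, prompt)
print(chunks)
# ['', ' Describe the image. ', ' Compare it with this one.']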

I've now aligned the implementation in phi3v.py and adjusted the test cases to follow the logic defined in the model's official processing_phi3_v.py.

Please let me know if there's anything further I can clarify!

@pansicheng changed the title from "fix tests/models/embedding/vision_language/test_phi3v.py" to "fix test_phi3v" on Mar 23, 2025.
@DarkLight1337 (Member) commented:

Could you explain what the problem is with the existing processor? We should rely on _get_prompt_updates as much as possible to detect and replace the image placeholders.

@DarkLight1337 (Member) commented on lines 157 to 171 (Mar 23, 2025):


We call tokenizer directly to tokenize the prompt in online inference, so we cannot rely on special cases like this.

@pansicheng (Contributor, Author) commented on Mar 25, 2025:

> Could you explain what the problem is with the existing processor? We should rely on _get_prompt_updates as much as possible to detect and replace the image placeholders.

Sorry for the delayed reply. Thank you for your patience. Here is what I have observed:

HF Processor:

  • Prompt Splitting: HF splits the prompt on the regex r"<\|image_\d+\|>", separating the text-only chunks from the <|image_N|> placeholders. For example, the prompt "<|image_1|> Text" is split into ["", " Text"].
  • Tokenization Flow: Each text segment (e.g., "" and " Text") is tokenized separately, and the resulting token IDs are interleaved with image_ids_pad (a rough sketch follows this list).
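A rough sketch of that interleaving, with a made-up helper name and a stand-in pad id (not the actual HF processor code), could look like this:

import re

def interleave_with_image_pads(tokenizer, prompt, num_image_tokens, pad_id=-1):
    """Sketch of the HF flow: split on the placeholder regex, tokenize each
    text chunk separately, and interleave image pad tokens between chunks."""
    pattern = r"<\|image_\d+\|>"
    chunks = [tokenizer(chunk).input_ids for chunk in re.split(pattern, prompt)]
    # One pad block per image; pad_id stands in for the real image pad token.
    image_ids_pad = [[pad_id] * n for n in num_image_tokens]
    input_ids = []
    for text_ids, pad_ids in zip(chunks, image_ids_pad + [[]]):
        input_ids.extend(text_ids + pad_ids)
    return input_ids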

VLLM Processor:

  • Monolithic Tokenization: VLLM uses _apply_hf_processor_text_only to tokenize the entire prompt as one string (without splitting on the <|image_N|> placeholders). After tokenization, the placeholder tokens are replaced with image_ids_pad.

I am attempting to align VLLM’s workflow with HF’s prompt splitting approach in both phi3v.py and test_common.py.

The current modification mistakenly splits prompts using _apply_hf_processor_text_only regardless of whether they are truly text-only. This discrepancy from HF's behavior necessitates introducing a parameter to differentiate between real text-only prompts and prompts meant to separate images.

Regarding the test scenario using processor.apply(token_prompt, mm_data=mm_data, hf_processor_mm_kwargs={}), it might not be fully compatible with phi3v due to HF’s prompt splitting. I would sincerely appreciate any guidance or suggestions for this context.

@DarkLight1337 (Member) commented on Mar 26, 2025:

How does this lead to different results? In BaseMultiModalProcessor._apply_prompt_updates, we apply prompt replacements by converting the tokens back to text, applying those replacements based on text, and finally tokenizing the result back into tokens.
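Roughly, the flow looks like the following simplified sketch (made-up names, not the actual vLLM implementation):

def apply_prompt_updates_sketch(tokenizer, token_ids, replacements):
    """Simplified sketch: decode the tokens to text, apply the replacements
    on the text, then re-encode the result back into tokens."""
    text = tokenizer.decode(token_ids)
    for placeholder, replacement_text in replacements.items():
        text = text.replace(placeholder, replacement_text)
    # Re-encoding is the step where a context-dependent tokenizer (such as
    # SentencePiece) can emit tokens that differ from the original sequence.
    return tokenizer.encode(text)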

Perhaps it would be best if you show a full example.

@mergify (bot) commented on Mar 26, 2025:

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @pansicheng.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Mar 26, 2025.
@pansicheng (Contributor, Author) commented:

> How does this lead to different results? In BaseMultiModalProcessor._apply_prompt_updates, we apply prompt replacements by converting the tokens back to text, applying those replacements based on text, and finally tokenizing the result back into tokens.
>
> Perhaps it would be best if you show a full example.

Here is an example:

prompt="<|image_1|> Select the portion of the image that isolates the object of the given label: The label of the object is stop sign"
<|image_1|>: [529, 29989, 3027, 29918, 29896, 29989, 29958]
vllm:   [1, 529, 29989, 3027, 29918, 29896, 29989, 29958, 7605, 278, 11910, 310, 278, 1967, 393, 11695, 1078, 278, 1203, 310, 278, 2183, 3858, 29901, 450, 3858, 310, 278, 1203, 338, 5040, 1804]
         "  <|image_1|>                                   Select the portion ..."
            |<- image_ids_pad                         ->|
hf:     [[[1],       [1, 29871, 7605, 278, 11910, 310, 278, 1967, 393, 11695, 1078, 278, 1203, 310, 278, 2183, 3858, 29901, 450, 3858, 310, 278, 1203, 338, 5040, 1804]]]
          ""     ^   " Select the portion ..."
                 |
            image_ids_pad

get_replacement_phi3v will add a "1" (the BOS token) after [529, 29989, 3027, 29918, 29896, 29989, 29958];
the remaining difference is the "29871" introduced by tokenizing " Select the portion ..." on its own.
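If it helps, the comparison above can be reproduced along these lines (the checkpoint name is an assumption on my part, and the exact ids depend on the tokenizer version):

import re
from transformers import AutoTokenizer

# Assumed checkpoint; trust_remote_code in case the repo requires custom code.
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct", trust_remote_code=True)

prompt = ("<|image_1|> Select the portion of the image that isolates the "
          "object of the given label: The label of the object is stop sign")

# vLLM-style: tokenize the whole prompt at once.
whole = tokenizer(prompt).input_ids

# HF-style: split on the placeholder first, then tokenize each chunk.
chunks = [tokenizer(c).input_ids
          for c in re.split(r"<\|image_\d+\|>", prompt)]

print(whole)   # ... 29958, 7605, ... (no 29871 before "Select")
print(chunks)  # [[1], [1, 29871, 7605, ...]] (second chunk starts with 29871)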

@DarkLight1337 (Member) commented:

Thanks for the example. Isn't that why the original code had bos_token_id in get_replacement_phi3v? So I guess the discrepancy is coming from elsewhere.

@pansicheng (Contributor, Author) commented:

> Thanks for the example. Isn't that why the original code had bos_token_id in get_replacement_phi3v? So I guess the discrepancy is coming from elsewhere.

Yes, I also believe that's the reason for adding bos_token_id in get_replacement_phi3v, but this doesn't handle all situations, such as the two test cases below:

"<|image_1|> Select the portion of the image ..."
" Select the portion of the image ..."

processor.tokenizer("<|image_1|> Select the portion of the image that isolates the object of the given label: The label of the object is stop sign")
[1, 529, 29989, 3027, 29918, 29896, 29989, 29958,        7605, 278, 11910, 310, 278, 1967, 393, 11695, 1078, 278, 1203, 310, 278, 2183, 3858, 29901, 450, 3858, 310, 278, 1203, 338, 5040, 1804]

processor.tokenizer.encode(
  " Select the portion of the image that isolates the object of the given label: The label of the object is stop sign"
)
[1,                                               29871, 7605, 278, 11910, 310, 278, 1967, 393, 11695, 1078, 278, 1203, 310, 278, 2183, 3858, 29901, 450, 3858, 310, 278, 1203, 338, 5040, 1804]

The fundamental reason is that the phi3v tokenizer cannot guarantee that the same substring, when it appears in different strings, is tokenized into the same token id sequence.
Therefore, my current attempt is to modify vLLM's processing logic to align with the processing logic in the model files.

@DarkLight1337 (Member) commented on Mar 26, 2025:

I suggest we try to edit the tokens in the end so that the overall result is the same. Otherwise, there will be too many cases (text input, token input, text input with cache, token input with cache; where the token input is from online serving and created by directly applying tokenizer to the text) which makes the code a mess if we try to handle them separately.

@pansicheng (Contributor, Author) commented:

> I suggest we try to edit the tokens in the end so that the overall result is the same. Otherwise, there will be too many cases (text input, token input, text input with cache, token input with cache; where the token input is from online serving and created by directly applying tokenizer to the text) which makes the code a mess if we try to handle them separately.

I've limited the modifications to _apply_prompt_updates; please take a look.
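Roughly, the idea is to normalize the token sequence after the replacements have been applied. The sketch below is illustrative only (not the exact code in this PR; whether the stray space token is dropped or inserted depends on which side is taken as the reference, and the id 29871 comes from the example above):

def normalize_after_image_tokens(token_ids, image_token_id, space_token_id=29871):
    """Illustrative sketch: drop a lone SentencePiece space token that
    immediately follows an image placeholder token, so that text-derived
    and token-derived inputs end up identical."""
    out = []
    for i, tok in enumerate(token_ids):
        if tok == space_token_id and i > 0 and token_ids[i - 1] == image_token_id:
            # Skip the stray space introduced by re-tokenizing the text chunk.
            continue
        out.append(tok)
    return out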

@DarkLight1337 (Member) commented:

The multi-modal tests pass, which is great! Can you update the entrypoints tests with respect to the updated token count?

@pansicheng (Contributor, Author) commented:

> The multi-modal tests pass, which is great! Can you update the entrypoints tests with respect to the updated token count?

It seems that the multi-modal tests and the entrypoints tests are complete; could you please assist with the readthedocs build?

@DarkLight1337 (Member) left a review comment:


Thanks for bearing with me; the PR looks good now. I'll just force-merge the PR since the relevant tests have passed.

@vllm-bot merged commit 7fd8c0f into vllm-project:main on Mar 30, 2025
13 of 14 checks passed
@pansicheng deleted the fix/14677 branch on March 30, 2025 at 09:20.
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>

Labels

multi-modality (Related to multi-modality, #4194)


Development

Successfully merging this pull request may close these issues.

[Bug]: Unit test tests/models/embedding/vision_language/test_phi3v.py failing

3 participants