
Conversation

@noooop (Collaborator) commented Jun 5, 2025

model_max_length should consider max_model_len in tokenizer_config

e.g. BAAI/bge-reranker-base:
  • config.json: "max_position_embeddings": 514
  • tokenizer_config.json: "model_max_length": 512

This happens when XLMRobertaForSequenceClassification is used for cross-encoding.

Because RobertaEmbedding offsets position IDs by padding_idx, an input at the maximum length of 514 produces a position ID of 515, which exceeds max_position_embeddings.

from vllm import LLM

model = LLM(model="BAAI/bge-reranker-base", task="score")

text_1 = "What is the capital of France?"
text_2 = "The capital of Brazil is Brasilia." * 1000

outputs = model.score(text_1, text_2, truncate_prompt_tokens=-1)


'''
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
'''
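
For reference, a minimal sketch of why the overflow happens, mirroring Hugging Face's create_position_ids_from_input_ids convention (this is illustrative, not vLLM's exact code):

```python
import torch

def create_position_ids(input_ids: torch.Tensor, padding_idx: int = 1) -> torch.Tensor:
    # RoBERTa-style models number non-pad positions starting at padding_idx + 1,
    # so a sequence of length L ends at position padding_idx + L.
    mask = (input_ids != padding_idx).int()
    incremental_indices = torch.cumsum(mask, dim=1) * mask
    return incremental_indices.long() + padding_idx

# With padding_idx = 1 and a 514-token input, the largest position ID is 515,
# which is out of range for a position-embedding table of size 514
# (valid indices 0..513), hence the device-side assert.
input_ids = torch.full((1, 514), 100)  # arbitrary non-pad token IDs
print(create_position_ids(input_ids).max().item())  # 515
```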

@gemini-code-assist bot (Contributor) left a comment


Hello @noooop, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

As gemini-code-assist, I've reviewed this pull request. The core purpose of this PR is to address a specific bug encountered when using XLM-RoBERTa based models (like BAAI/bge-reranker-base) within the cross_encoding task, particularly when handling long inputs and using the truncate_prompt_tokens parameter. The issue manifests as a RuntimeError: CUDA error: device-side assert triggered, which the author's description and code example demonstrate. This error occurs because, for models like XLM-R that use a padding_idx, the calculated position_ids can exceed the model's max_position_embeddings when the input length is close to the maximum, leading to the device-side assertion failure. The proposed fix involves adjusting the maximum length used in the truncation validation logic to account for the padding index, thereby preventing the position ID overflow.

Highlights

  • Bugfix for XLM-R Truncation: This pull request fixes a CUDA error (device-side assert triggered) that occurs specifically with XLM-RoBERTa based models when performing cross-encoding with truncation, especially near the maximum model length.
  • Addresses Position ID Overflow: The root cause of the bug is identified as position IDs exceeding the model's maximum allowed value due to the handling of padding tokens in models that utilize a padding_idx.
  • Adjusts Truncation Validation Logic: The fix modifies the internal _validate_truncation_size function to subtract 2 * padding_idx from the model's maximum length when performing the validation, creating a safer boundary for truncation with affected models.

Changelog

  • vllm/entrypoints/llm.py
    • Added a comment (# fix XLMRobertaForSequenceClassification) to indicate the purpose of the change (line 1146).
    • Retrieved the pad_token_id from the model's Hugging Face configuration, defaulting to 0 if not found (lines 1147-1148).
    • Modified the call to _validate_truncation_size (lines 1151-1153) to use self.llm_engine.model_config.max_model_len - 2 * padding_idx as the maximum allowed length for truncation validation, instead of just max_model_len.
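
The diff itself is not reproduced in this summary; as a rough sketch of the bound adjustment it describes (the helper name and attribute access below are assumptions based on the changelog, not the merged code):

```python
def truncation_upper_bound(max_model_len: int, hf_config) -> int:
    """Sketch of the adjusted limit described above, not the merged diff."""
    # fix XLMRobertaForSequenceClassification: leave headroom for the
    # padding_idx offset that RoBERTa-style position IDs carry.
    pad_token_id = getattr(hf_config, "pad_token_id", None)
    padding_idx = pad_token_id if pad_token_id is not None else 0
    return max_model_len - 2 * padding_idx
```

For example, with max_model_len = 514 and a pad token id of 1 this evaluates to 512; with a pad token id of 0 it is a no-op, which is the discrepancy the review below points out.
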
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Padding shifts the count,
Position IDs overflow,
Subtract two times pad.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the frontend label Jun 5, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request aims to fix a RuntimeError: CUDA error: device-side assert triggered that occurs with XLM-RoBERTa models used for sequence classification in cross-encoding tasks, specifically when long prompts are truncated. The error is attributed to position_ids exceeding max_position_embeddings due to the way RobertaEmbedding handles padding_idx.

The proposed fix adjusts the max_model_len passed to _validate_truncation_size by subtracting 2 * padding_idx. While this approach might prevent the crash, there are a few points to consider for ensuring correctness and optimality, especially regarding the source of padding_idx and the factor of 2.

Thank you for tackling this tricky issue!

Summary of Findings

  • Correctness of padding_idx adjustment: The subtraction of 2 * padding_idx might be overly conservative. Standard derivation suggests max_model_len - padding_idx would be the correct truncation length to prevent position ID overflows. Additionally, the example model BAAI/bge-reranker-base has pad_token_id: 0 in its config, which would make the current fix a no-op for it. This needs clarification.
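
For concreteness, a small worked check under one stated assumption (positions for a length-L prompt run from padding_idx + 1 to padding_idx + L; padding_idx = 1 is the usual XLM-R pad token id and is assumed here, not read from the model's config):

```python
# Illustrative check only; not vLLM code.
max_position_embeddings = 514
padding_idx = 1  # assumed XLM-R pad token id; the value actually read from config is what matters

def max_position_id(prompt_len: int) -> int:
    # Largest position ID produced for a prompt of the given length.
    return padding_idx + prompt_len

print(max_position_id(514))  # 515 -> out of range (valid IDs 0..513), the reported crash
print(max_position_id(512))  # 513 -> largest valid ID, so truncating to 512 keeps things in range
```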

Merge Readiness

The PR addresses an important runtime error. However, the exact mechanism and the numerical adjustment (- 2 * padding_idx) warrant further discussion to ensure correctness and avoid over-truncation. The discrepancy with the example model (BAAI/bge-reranker-base) also needs to be clarified. I recommend addressing these points before merging. I am unable to approve pull requests, so please ensure other reviewers take a look and approve before merging.

@noooop noooop changed the title [Bugfix] fix XLMRobertaForSequenceClassification cross_encoding truncate_prompt_tokens [Bugfix] fix XLMRobertaForSequenceClassification for cross_encoding using truncate_prompt_tokens Jun 5, 2025
@noooop (Collaborator, Author) commented Jun 5, 2025

cc @maxdebayser

github-actions bot commented Jun 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@noooop noooop changed the title [Bugfix] fix XLMRobertaForSequenceClassification for cross_encoding using truncate_prompt_tokens [Bugfix] fix XLMRobertaForSequenceClassification for cross_encoding max_model_len Jun 5, 2025
@noooop noooop changed the title [Bugfix] fix XLMRobertaForSequenceClassification for cross_encoding max_model_len [Bugfix] model_max_length should consider max_model_len in tokenizer_config Jun 6, 2025
@noooop (Collaborator, Author) commented Jun 6, 2025

cc @DarkLight1337

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 6, 2025 09:34
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 6, 2025
@vllm-bot vllm-bot merged commit 2ffb9b6 into vllm-project:main Jun 8, 2025
63 of 64 checks passed
@noooop noooop deleted the fix_truncate branch July 10, 2025 04:46
@yselivonchyk commented:

This patch unfortunately doesn't seem to work with verl due to

line 1521, in execute_model
    assert end_idx <= self.max_model_len, (
AssertionError: Sampled token IDs exceed the max model length.
Total number of tokens: 1025 > max_model_len: 1024

@noooop (Collaborator, Author) commented Jul 16, 2025

This patch unfortunately doesn't seem to work with verl due to

line 1521, in execute_model
    assert end_idx <= self.max_model_len, (
AssertionError: Sampled token IDs exceed the max model length.
Total number of tokens: 1025 > max_model_len: 1024

Can #20322 solve this problem?
