
Conversation

@noooop (Collaborator) commented Jun 5, 2025

model_max_length should consider max_model_len in tokenizer_config

e.g. BAAI/bge-reranker-base:
  • config.json: "max_position_embeddings": 514
  • tokenizer_config.json: "model_max_length": 512

This happens when XLMRobertaForSequenceClassification is used for cross-encoding.

Because RobertaEmbedding offsets position IDs by padding_idx, an input at the maximum length of 514 produces a position ID of 515, which exceeds max_position_embeddings.

from vllm import LLM

model = LLM(model="BAAI/bge-reranker-base", task="score")

text_1 = "What is the capital of France?"
text_2 = "The capital of Brazil is Brasilia." * 1000

outputs = model.score(text_1, text_2, truncate_prompt_tokens=-1)


'''
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
'''
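
For reference, a minimal sketch of why the overflow happens, mirroring Hugging Face's create_position_ids_from_input_ids convention (this is illustrative, not vLLM's exact code):

```python
import torch

def create_position_ids(input_ids: torch.Tensor, padding_idx: int = 1) -> torch.Tensor:
    # RoBERTa-style models number non-pad positions starting at padding_idx + 1,
    # so a sequence of length L ends at position padding_idx + L.
    mask = (input_ids != padding_idx).int()
    incremental_indices = torch.cumsum(mask, dim=1) * mask
    return incremental_indices.long() + padding_idx

# With padding_idx = 1 and a 514-token input, the largest position ID is 515,
# which is out of range for a position-embedding table of size 514
# (valid indices 0..513), hence the device-side assert.
input_ids = torch.full((1, 514), 100)  # arbitrary non-pad token IDs
print(create_position_ids(input_ids).max().item())  # 515
```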

@gemini-code-assist bot (Contributor) left a comment


Hello @noooop, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

As gemini-code-assist, I've reviewed this pull request. The core purpose of this PR is to address a specific bug encountered when using XLM-RoBERTa based models (like BAAI/bge-reranker-base) within the cross_encoding task, particularly when handling long inputs and using the truncate_prompt_tokens parameter. The issue manifests as a RuntimeError: CUDA error: device-side assert triggered, which the author's description and code example demonstrate. This error occurs because, for models like XLM-R that use a padding_idx, the calculated position_ids can exceed the model's max_position_embeddings when the input length is close to the maximum, leading to the device-side assertion failure. The proposed fix involves adjusting the maximum length used in the truncation validation logic to account for the padding index, thereby preventing the position ID overflow.

Highlights

  • Bugfix for XLM-R Truncation: This pull request fixes a CUDA error (device-side assert triggered) that occurs specifically with XLM-RoBERTa based models when performing cross-encoding with truncation, especially near the maximum model length.
  • Addresses Position ID Overflow: The root cause of the bug is identified as position IDs exceeding the model's maximum allowed value due to the handling of padding tokens in models that utilize a padding_idx.
  • Adjusts Truncation Validation Logic: The fix modifies the internal _validate_truncation_size function to subtract 2 * padding_idx from the model's maximum length when performing the validation, creating a safer boundary for truncation with affected models.

Changelog

  • vllm/entrypoints/llm.py
    • Added a comment (# fix XLMRobertaForSequenceClassification) to indicate the purpose of the change (line 1146).
    • Retrieved the pad_token_id from the model's Hugging Face configuration, defaulting to 0 if not found (lines 1147-1148).
    • Modified the call to _validate_truncation_size (lines 1151-1153) to use self.llm_engine.model_config.max_model_len - 2 * padding_idx as the maximum allowed length for truncation validation, instead of just max_model_len.
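
The diff itself is not reproduced in this summary; as a rough sketch of the bound adjustment it describes (the helper name and attribute access below are assumptions based on the changelog, not the merged code):

```python
def truncation_upper_bound(max_model_len: int, hf_config) -> int:
    """Sketch of the adjusted limit described above, not the merged diff."""
    # fix XLMRobertaForSequenceClassification: leave headroom for the
    # padding_idx offset that RoBERTa-style position IDs carry.
    pad_token_id = getattr(hf_config, "pad_token_id", None)
    padding_idx = pad_token_id if pad_token_id is not None else 0
    return max_model_len - 2 * padding_idx
```

For example, with max_model_len = 514 and a pad token id of 1 this evaluates to 512; with a pad token id of 0 it is a no-op, which is the discrepancy the review below points out.
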
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Padding shifts the count,
Position IDs overflow,
Subtract two times pad.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the frontend label Jun 5, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request aims to fix a RuntimeError: CUDA error: device-side assert triggered that occurs with XLM-RoBERTa models used for sequence classification in cross-encoding tasks, specifically when long prompts are truncated. The error is attributed to position_ids exceeding max_position_embeddings due to the way RobertaEmbedding handles padding_idx.

The proposed fix adjusts the max_model_len passed to _validate_truncation_size by subtracting 2 * padding_idx. While this approach might prevent the crash, there are a few points to consider for ensuring correctness and optimality, especially regarding the source of padding_idx and the factor of 2.

Thank you for tackling this tricky issue!

Summary of Findings

  • Correctness of padding_idx adjustment: The subtraction of 2 * padding_idx might be overly conservative. Standard derivation suggests max_model_len - padding_idx would be the correct truncation length to prevent position ID overflows. Additionally, the example model BAAI/bge-reranker-base has pad_token_id: 0 in its config, which would make the current fix a no-op for it. This needs clarification.
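
For concreteness, a small worked check under one stated assumption (positions for a length-L prompt run from padding_idx + 1 to padding_idx + L; padding_idx = 1 is the usual XLM-R pad token id and is assumed here, not read from the model's config):

```python
# Illustrative check only; not vLLM code.
max_position_embeddings = 514
padding_idx = 1  # assumed XLM-R pad token id; the value actually read from config is what matters

def max_position_id(prompt_len: int) -> int:
    # Largest position ID produced for a prompt of the given length.
    return padding_idx + prompt_len

print(max_position_id(514))  # 515 -> out of range (valid IDs 0..513), the reported crash
print(max_position_id(512))  # 513 -> largest valid ID, so truncating to 512 keeps things in range
```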

Merge Readiness

The PR addresses an important runtime error. However, the exact mechanism and the numerical adjustment (- 2 * padding_idx) warrant further discussion to ensure correctness and avoid over-truncation. The discrepancy with the example model (BAAI/bge-reranker-base) also needs to be clarified. I recommend addressing these points before merging. I am unable to approve pull requests, so please ensure other reviewers take a look and approve before merging.

@noooop noooop changed the title [Bugfix] fix XLMRobertaForSequenceClassification cross_encoding truncate_prompt_tokens [Bugfix] fix XLMRobertaForSequenceClassification for cross_encoding using truncate_prompt_tokens Jun 5, 2025
@noooop (Collaborator, Author) commented Jun 5, 2025

cc @maxdebayser

github-actions bot commented Jun 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@noooop noooop changed the title [Bugfix] fix XLMRobertaForSequenceClassification for cross_encoding using truncate_prompt_tokens [Bugfix] fix XLMRobertaForSequenceClassification for cross_encoding max_model_len Jun 5, 2025
@noooop noooop changed the title [Bugfix] fix XLMRobertaForSequenceClassification for cross_encoding max_model_len [Bugfix] model_max_length should consider max_model_len in tokenizer_config Jun 6, 2025
@noooop (Collaborator, Author) commented Jun 6, 2025

cc @DarkLight1337

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 6, 2025 09:34
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 6, 2025
@vllm-bot vllm-bot merged commit 2ffb9b6 into vllm-project:main Jun 8, 2025
63 of 64 checks passed
@noooop noooop deleted the fix_truncate branch July 10, 2025 04:46
@yselivonchyk commented:

This patch unfortunately doesn't seem to work with verl due to

line 1521, in execute_model
    assert end_idx <= self.max_model_len, (
AssertionError: Sampled token IDs exceed the max model length.
Total number of tokens: 1025 > max_model_len: 1024

@noooop (Collaborator, Author) commented Jul 16, 2025

This patch unfortunately doesn't seem to work with verl due to

line 1521, in execute_model
    assert end_idx <= self.max_model_len, (
AssertionError: Sampled token IDs exceed the max model length.
Total number of tokens: 1025 > max_model_len: 1024

Can #20322 solve this problem?
