[Bugfix] model_max_length should consider max_model_len in tokenizer_config #19201
Conversation
Hello @noooop, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
As gemini-code-assist, I've reviewed this pull request. The core purpose of this PR is to address a specific bug encountered when using XLM-RoBERTa based models (like `BAAI/bge-reranker-base`) within the `cross_encoding` task, particularly when handling long inputs and using the `truncate_prompt_tokens` parameter. The issue manifests as a `RuntimeError: CUDA error: device-side assert triggered`, which the author's description and code example demonstrate. This error occurs because, for models like XLM-R that use a `padding_idx`, the calculated `position_ids` can exceed the model's `max_position_embeddings` when the input length is close to the maximum, leading to the device-side assertion failure. The proposed fix adjusts the maximum length used in the truncation validation logic to account for the padding index, thereby preventing the position ID overflow.
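The author's original reproduction snippet is not shown on this page; below is a hypothetical sketch of what such a reproduction might look like, assuming vLLM's offline `LLM.score` API and its `truncate_prompt_tokens` keyword (exact task names and keyword arguments may differ across vLLM versions).

```python
# Hypothetical reproduction sketch: the author's original example is not
# shown on this page. Assumes vLLM's offline LLM.score API and its
# truncate_prompt_tokens keyword; exact names may differ across versions.
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-base", task="score")

# A pair long enough that truncation lands at the model's maximum length.
query = "what is panda? " * 200
document = "The giant panda is a bear species endemic to China. " * 100

# truncate_prompt_tokens=-1 asks vLLM to truncate to max_model_len; on
# XLM-R based cross-encoders this is where the CUDA device-side assert
# described above is reported.
outputs = llm.score(query, document, truncate_prompt_tokens=-1)
print(outputs)
```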
Highlights
- Bugfix for XLM-R Truncation: This pull request fixes a CUDA error (`device-side assert triggered`) that occurs specifically with XLM-RoBERTa based models when performing cross-encoding with truncation, especially near the maximum model length.
- Addresses Position ID Overflow: The root cause of the bug is identified as position IDs exceeding the model's maximum allowed value due to the handling of padding tokens in models that utilize a `padding_idx`.
- Adjusts Truncation Validation Logic: The fix modifies the internal `_validate_truncation_size` function to subtract `2 * padding_idx` from the model's maximum length when performing the validation, creating a safer boundary for truncation with affected models.
Changelog
- `vllm/entrypoints/llm.py`
  - Added a comment (`# fix XLMRobertaForSequenceClassification`) to indicate the purpose of the change (line 1146).
  - Retrieved the `pad_token_id` from the model's Hugging Face configuration, defaulting to 0 if not found (lines 1147-1148).
  - Modified the call to `_validate_truncation_size` (lines 1151-1153) to use `self.llm_engine.model_config.max_model_len - 2 * padding_idx` as the maximum allowed length for truncation validation, instead of just `max_model_len` (see the sketch after this list).
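A minimal, self-contained sketch of the adjustment described in the changelog, reconstructed from this summary rather than copied from the diff; the helper name `adjusted_truncation_limit` stands in for the inline code in `vllm/entrypoints/llm.py`, and `padding_idx = 1` in the example is an assumption (the usual XLM-R value), not taken from the PR.

```python
from typing import Optional


def adjusted_truncation_limit(max_model_len: int,
                              pad_token_id: Optional[int]) -> int:
    """Upper bound used for truncation validation, per this PR's change.

    Subtracts 2 * padding_idx from max_model_len so that RoBERTa-style
    position ids (which are offset by padding_idx) cannot exceed
    max_position_embeddings; padding_idx defaults to 0 when the config has
    no pad_token_id, as described in the changelog.
    """
    padding_idx = pad_token_id if pad_token_id is not None else 0
    return max_model_len - 2 * padding_idx


# Values from BAAI/bge-reranker-base as cited in this thread
# (max_position_embeddings = 514); padding_idx = 1 is an assumption.
print(adjusted_truncation_limit(514, 1))  # -> 512
```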
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Padding shifts the count,
Position IDs overflow,
Subtract two times pad.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request aims to fix a `RuntimeError: CUDA error: device-side assert triggered` that occurs with XLM-RoBERTa models used for sequence classification in cross-encoding tasks, specifically when long prompts are truncated. The error is attributed to `position_ids` exceeding `max_position_embeddings` due to the way `RobertaEmbedding` handles `padding_idx`.
The proposed fix adjusts the `max_model_len` passed to `_validate_truncation_size` by subtracting `2 * padding_idx`. While this approach might prevent the crash, there are a few points to consider for ensuring correctness and optimality, especially regarding the source of `padding_idx` and the factor of `2`.
Thank you for tackling this tricky issue!
Summary of Findings
- Correctness of the `padding_idx` adjustment: The subtraction of `2 * padding_idx` might be overly conservative. A standard derivation suggests `max_model_len - padding_idx` would be the correct truncation length to prevent position ID overflows. Additionally, the example model `BAAI/bge-reranker-base` has `pad_token_id: 0` in its config, which would make the current fix a no-op for it. This needs clarification (see the position-ID sketch after this list).
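To make the competing bounds concrete, here is a small sketch of the RoBERTa-style position-ID computation, following the Hugging Face convention where non-padding positions run from `padding_idx + 1` to `seq_len + padding_idx`; `padding_idx = 1` is the usual XLM-R value and is an assumption here, which differs from the `pad_token_id: 0` cited above.

```python
# RoBERTa-style position ids (Hugging Face convention): positions for
# non-padding tokens run from padding_idx + 1 up to seq_len + padding_idx.
def max_position_id(seq_len: int, padding_idx: int) -> int:
    return seq_len + padding_idx


max_position_embeddings = 514  # BAAI/bge-reranker-base config.json
padding_idx = 1                # usual XLM-R value; an assumption here

for seq_len in (512, 513, 514):
    top = max_position_id(seq_len, padding_idx)
    # The largest valid embedding index is max_position_embeddings - 1.
    print(seq_len, top, top > max_position_embeddings - 1)
# 512 stays in range (513); 513 and 514 produce 514 and 515, which exceed
# the largest valid index (513) and would trip the device-side assert.
```

Under these assumptions the safe input length works out to 512, which matches both the tokenizer's `model_max_length` for this model and the PR's `max_model_len - 2 * padding_idx` adjustment; whether either formula generalizes to other `padding_idx` values is the open question raised above.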
Merge Readiness
The PR addresses an important runtime error. However, the exact mechanism and the numerical adjustment (`- 2 * padding_idx`) warrant further discussion to ensure correctness and avoid over-truncation. The discrepancy with the example model (`BAAI/bge-reranker-base`) also needs to be clarified. I recommend addressing these points before merging. I am unable to approve pull requests, so please ensure other reviewers take a look and approve before merging.
cc @maxdebayser
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
This patch unfortunately doesn't seem to work with
Can #20322 solve this problem?
model_max_length should consider max_model_len in tokenizer_config
e.g. BAAI/bge-reranker-base:
- config.json: "max_position_embeddings": 514
- tokenizer_config.json: "model_max_length": 512
When XLMRobertaForSequenceClassification is used for cross-encoding, RobertaEmbedding offsets position ids by padding_idx, so for an input at the maximum length of 514 the position_ids reach 515, exceeding max_position_embeddings.
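For reference, a minimal script (assuming only the standard transformers loading APIs) that surfaces the two config values cited above, plus the pad_token_id that drives the position-ID offset:

```python
# Illustrative check of the two values cited above (needs network access or
# a local Hugging Face cache for BAAI/bge-reranker-base).
from transformers import AutoConfig, AutoTokenizer

name = "BAAI/bge-reranker-base"
config = AutoConfig.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

print(config.max_position_embeddings)  # 514, from config.json
print(tokenizer.model_max_length)      # 512, from tokenizer_config.json
print(config.pad_token_id)             # used as padding_idx by RobertaEmbedding
```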