
Update vLLM to 0.10.0 and add max_model_len #172

Merged
pan-x-c merged 8 commits into agentscope-ai:main from hiyuchang:dev/merge_vllm_v0.10.0 on Aug 8, 2025

Conversation

@hiyuchang
Collaborator

Description

  1. Upgrade vLLM to 0.10.0
  2. Add the max_model_len parameter (a usage sketch follows below)
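
The sketch below shows how max_model_len is typically passed when constructing a vLLM engine. LLM, SamplingParams, max_model_len, and max_tokens are real vLLM APIs; the model name and concrete values are illustrative assumptions, not taken from this repository's configuration.

```python
# Illustrative only: max_model_len caps the combined prompt + response length
# handled by the engine. The model name and numbers below are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model, not from this PR
    max_model_len=4096,                # total budget for prompt + response tokens
)
params = SamplingParams(max_tokens=1024)  # response budget within max_model_len
outputs = llm.generate(["Hello, world"], params)
print(outputs[0].outputs[0].text)
```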

Checklist

Please check the following items before the code is ready for review.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

Contributor

@gemini-code-assist bot left a comment


Summary of Changes

Hello @hiyuchang, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've made significant changes to how token limits are managed within the system, primarily by introducing a max_model_len parameter. This new parameter consolidates the concept of maximum sequence length, replacing the previously separate max_prompt_tokens and max_response_tokens in many configurations. This change simplifies token management and aligns with the capabilities of the upgraded vllm library, which I've also updated to version 0.10.0. The goal is to provide a more direct and efficient way to control the total length of sequences processed by the models.

Highlights

  • Introduction of max_model_len: I've introduced a new max_model_len parameter across the codebase to define the maximum total token length for a model's sequence, which includes both prompt and response tokens. This provides a more unified and explicit way to manage token limits.
  • Deprecation of max_prompt_tokens: The max_prompt_tokens parameter has been deprecated and largely removed from configurations and code. Where necessary, existing max_prompt_tokens and max_response_tokens values are used to derive max_model_len for backward compatibility, but direct usage of max_model_len is now preferred (a sketch of this derivation follows these highlights).
  • vLLM Version Upgrade: I've updated the vllm dependency in pyproject.toml to version 0.10.0, ensuring compatibility with the latest features and performance improvements from the vllm library.
  • Configuration File Updates: I've updated numerous example configuration files (.yaml files) to reflect the new max_model_len parameter, replacing the deprecated max_prompt_tokens where applicable and adjusting values to align with the new token length management.
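
A minimal sketch of the backward-compatibility rule described above, assuming it reduces to a simple derivation. The field names mirror the PR description, but resolve_max_model_len itself is a hypothetical helper, not the repository's actual implementation.

```python
from typing import Optional

def resolve_max_model_len(
    max_model_len: Optional[int],
    max_prompt_tokens: Optional[int],
    max_response_tokens: Optional[int],
) -> Optional[int]:
    """Prefer an explicit max_model_len; otherwise derive it from the
    deprecated prompt/response limits when both are present."""
    if max_model_len is not None:
        return max_model_len
    if max_prompt_tokens is not None and max_response_tokens is not None:
        return max_prompt_tokens + max_response_tokens
    return None  # fall back to the model's own default length
```
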
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces the max_model_len parameter to replace the deprecated max_prompt_tokens, and upgrades the vllm dependency. The changes are mostly in configuration files and the related Python code that parses them.

My review has found a few issues:

  • There are typos and inconsistencies in the new configuration key in some example YAML files (model_model_len and max_token_len instead of max_model_len).
  • In a couple of example files, the max_response_tokens parameter, which is still required for generation, was unintentionally removed.
  • A debug print statement was left in the vllm_model.py file.

I've provided specific suggestions to fix these issues. Overall, the changes are in the right direction but need these corrections to be complete.

@pan-x-c changed the title from "Add max_model_len" to "Update vLLM to 0.10.0 and add max_model_len" on Aug 8, 2025
@hiyuchang
Collaborator Author

/unittest-all

@pan-x-c linked an issue on Aug 8, 2025 that may be closed by this pull request
@pan-x-c requested a review from Copilot on August 8, 2025 at 04:27

Copilot AI left a comment


Pull Request Overview

This PR upgrades vLLM from 0.9.1-0.9.2 to 0.10.0 and introduces a new max_model_len parameter to replace the deprecated max_prompt_tokens field. The change simplifies token length management by using a single parameter instead of separate prompt and response token limits.

  • Updates vLLM version constraint to include 0.10.0
  • Replaces max_prompt_tokens with max_model_len across configuration files and the codebase (a config sketch follows this list)
  • Updates token length calculations to use the new parameter
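
As a rough illustration of the shape this leaves the configuration in, the dataclass below mirrors the field names from the PR description; it is an assumed sketch, not the actual definition in trinity/common/config.py.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenerationConfig:
    """Hypothetical config shape after this PR; field names follow the PR text."""
    max_model_len: Optional[int] = None        # total prompt + response budget
    max_prompt_tokens: Optional[int] = None    # deprecated, kept for compatibility
    max_response_tokens: Optional[int] = None  # per-response generation budget
```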

Reviewed Changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 3 comments.

Summary per file:

File                                    | Description
pyproject.toml                          | Updates the vLLM version constraint to allow 0.10.0
trinity/common/config.py                | Adds the max_model_len field and deprecates max_prompt_tokens
trinity/common/models/vllm_model.py     | Updates model initialization to use max_model_len
trinity/common/models/api/vllm_patch.py | Updates the version check to support vLLM 0.10.0
trinity/manager/                        | Updates UI components and config generation to use the new parameter
examples/                               | Updates all example configurations to use max_model_len
docs/                                   | Updates documentation to reflect the new parameter
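
For the vllm_patch.py version check mentioned above, one common pattern is to compare the installed vllm.__version__ against a supported range. The sketch below assumes a 0.9.1–0.10.0 range based on the review summary; the actual check in the repository may be written differently.

```python
from packaging.version import Version
import vllm

# Assumed supported range: 0.9.1 through 0.10.0 (per the review summary above).
SUPPORTED_MIN = Version("0.9.1")
SUPPORTED_MAX = Version("0.10.0")

def check_vllm_version() -> None:
    """Raise if the installed vLLM falls outside the assumed supported range."""
    installed = Version(vllm.__version__)
    if not (SUPPORTED_MIN <= installed <= SUPPORTED_MAX):
        raise RuntimeError(
            f"vLLM {installed} is not supported; expected a version between "
            f"{SUPPORTED_MIN} and {SUPPORTED_MAX} inclusive."
        )
```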

@pan-x-c merged commit 068da40 into agentscope-ai:main on Aug 8, 2025
2 checks passed
vadimkantorov added a commit to vadimkantorov/Trinity-RFT that referenced this pull request on Aug 18, 2025
…y large prompts and preventing vllm from throwing exception

To prevent vLLM from throwing exceptions like:

```
ERROR 08-17 23:32:15 scheduler.py:86] ValueError: The decoder prompt (length 42861) is longer than the maximum model length of 32768. Make sure that `max_model_len` is no smaller than the number of text tokens.
```

`truncate_prompt_tokens=config.max_model_len-1` is used so that room remains for at least one output token.

A similar setting was used before agentscope-ai#172 and was removed without an explanation that I could find.
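
A minimal sketch of the workaround described in this commit message. truncate_prompt_tokens is a real vLLM SamplingParams field that keeps only the last N prompt tokens; the max_model_len value here is an assumption matching the error message above, and the response budget is illustrative.

```python
from vllm import SamplingParams

max_model_len = 32768  # assumed engine limit, matching the error message above

params = SamplingParams(
    max_tokens=512,  # illustrative response budget
    # Keep only the last (max_model_len - 1) prompt tokens so that at least
    # one token of the budget remains available for generation.
    truncate_prompt_tokens=max_model_len - 1,
)
```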

Development

Successfully merging this pull request may close these issues.

Migration to newer versions of vllm

2 participants