Update vLLM to 0.10.0 and add max_model_len #172
Conversation
Summary of Changes
Hello @hiyuchang, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've made significant changes to how token limits are managed within the system, primarily by introducing a max_model_len parameter. This new parameter consolidates the concept of maximum sequence length, replacing the previously separate max_prompt_tokens and max_response_tokens in many configurations. This change simplifies token management and aligns with the capabilities of the upgraded vllm library, which I've also updated to version 0.10.0. The goal is to provide a more direct and efficient way to control the total length of sequences processed by the models.
Highlights
- Introduction of `max_model_len`: I've introduced a new `max_model_len` parameter across the codebase to define the maximum total token length for a model's sequence, which includes both prompt and response tokens. This provides a more unified and explicit way to manage token limits.
- Deprecation of `max_prompt_tokens`: The `max_prompt_tokens` parameter has been deprecated and largely removed from configurations and code. Where necessary, existing `max_prompt_tokens` and `max_response_tokens` values will be used to derive `max_model_len` for backward compatibility (see the sketch after this list), but direct usage of `max_model_len` is now preferred.
- `vLLM` Version Upgrade: I've updated the `vllm` dependency in `pyproject.toml` to version `0.10.0`, ensuring compatibility with the latest features and performance improvements from the `vllm` library.
- Configuration File Updates: I've updated numerous example configuration files (`.yaml` files) to reflect the new `max_model_len` parameter, replacing the deprecated `max_prompt_tokens` where applicable and adjusting values to align with the new token length management.
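Since the backward-compatibility rule above is only described in prose, here is a minimal Python sketch of how `max_model_len` could be derived from the legacy pair of limits. The `ModelConfig` class, its field names, and the resolution order are illustrative assumptions, not the actual code in `trinity/common/config.py`.

```python
# Minimal sketch (assumed names, not the real trinity/common/config.py) of
# resolving max_model_len from the deprecated prompt/response token limits.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:  # hypothetical stand-in for the repository's config class
    max_model_len: Optional[int] = None
    max_prompt_tokens: Optional[int] = None   # deprecated
    max_response_tokens: Optional[int] = None

    def resolve_max_model_len(self) -> Optional[int]:
        # Prefer the new, explicit setting when it is provided.
        if self.max_model_len is not None:
            return self.max_model_len
        # Backward compatibility: derive it from the legacy pair if both exist.
        if self.max_prompt_tokens is not None and self.max_response_tokens is not None:
            return self.max_prompt_tokens + self.max_response_tokens
        # Otherwise leave it unset and let vLLM infer it from the model config.
        return None
```

Under this assumption, an older config with `max_prompt_tokens: 4096` and `max_response_tokens: 2048` would resolve to a `max_model_len` of 6144.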
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces the max_model_len parameter to replace the deprecated max_prompt_tokens, and upgrades the vllm dependency. The changes are mostly in configuration files and the related Python code that parses them.
My review has found a few issues:
- There are typos and inconsistencies in the new configuration key (`model_model_len`, `max_token_len`) in some example YAML files.
- In a couple of example files, the `max_response_tokens` parameter was unintentionally removed, which is still required for generation.
- A debug `print` statement was left in the `vllm_model.py` file.
I've provided specific suggestions to fix these issues. Overall, the changes are in the right direction but need these corrections to be complete.
/unittest-all
Pull Request Overview
This PR upgrades vLLM from 0.9.1-0.9.2 to 0.10.0 and introduces a new max_model_len parameter to replace the deprecated max_prompt_tokens field. The change simplifies token length management by using a single parameter instead of separate prompt and response token limits.
- Updates vLLM version constraint to include 0.10.0
- Replaces `max_prompt_tokens` with `max_model_len` across configuration files and codebase
- Updates token length calculations to use the new parameter
Reviewed Changes
Copilot reviewed 42 out of 42 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pyproject.toml | Updates vLLM version constraint to allow 0.10.0 |
| trinity/common/config.py | Adds max_model_len field and deprecates max_prompt_tokens |
| trinity/common/models/vllm_model.py | Updates model initialization to use max_model_len |
| trinity/common/models/api/vllm_patch.py | Updates version check to support vLLM 0.10.0 |
| trinity/manager/ | Updates UI components and config generation to use new parameter |
| examples/ | Updates all example configurations to use max_model_len |
| docs/ | Updates documentation to reflect new parameter |
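To illustrate the `trinity/common/models/vllm_model.py` row above, a vLLM engine typically receives this limit at construction time. The snippet below is a hedged sketch only; the model name and value are assumptions and are not taken from the repository.

```python
# Illustrative sketch: passing max_model_len when constructing a vLLM engine.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed example model, not from the repo
    max_model_len=4096,                # total budget shared by prompt and response tokens
)
```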
…y large prompts and preventing vllm from throwing exception

To prevent vllm from throwing exceptions like:

```
ERROR 08-17 23:32:15 scheduler.py:86] ValueError: The decoder prompt (length 42861) is longer than the maximum model length of 32768. Make sure that `max_model_len` is no smaller than the number of text tokens.
```

`truncate_prompt_tokens=config.max_model_len-1` is used to ensure at least one output token. A similar setting was used before agentscope-ai#172, and got removed without an explanation that I could find.
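To make the comment above concrete, the following is a minimal sketch of how such a truncation cap can be applied through vLLM's `SamplingParams`. The numeric values are assumptions, and this is not the code that was removed after agentscope-ai#172.

```python
# Sketch with assumed values: capping prompt length so vLLM truncates
# oversized prompts instead of raising the ValueError quoted above.
from vllm import SamplingParams

max_model_len = 32768  # assumed context window, matching the error message above

sampling_params = SamplingParams(
    max_tokens=512,                            # example response token budget
    truncate_prompt_tokens=max_model_len - 1,  # leave room for at least one output token
)
```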
Description
- Update `vllm` to `0.10.0`
- Add `max_model_len` parameter

Checklist
Please check the following items before code is ready to be reviewed.