
Conversation

yewentao256 (Member) commented on Jun 28, 2025

Purpose

Fixes #20138

Test

(profile, create kv cache, warmup model) took 44.33 seconds
Adding requests: 100%|| 200/200 [00:00<00:00, 5003.64it/s]
Adding requests: 100%|| 200/200 [00:00<00:00, 5111.95it/s]
Processed prompts: 100%|| 200/200 [00:03<00:00, 64.25it/s]
Processed prompts: 100%|| 200/200 [00:02<00:00, 66.82it/s]
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' _____. I\'m 26 years old and I love art. I want to do art."'
DP rank 0, Prompt: 'Hello, my name is', Generated text: " _____. I'm 26 years old and I love art. I want"
DP rank 1, Prompt: 'The president of the United States is', Generated text: ' not a dictator. That is not a part of his job description.\n\nThis is the'
DP rank 0, Prompt: 'The president of the United States is', Generated text: ' not a dictator. That is not a part of his job description.\n'
DP rank 1, Prompt: 'The capital of France is', Generated text: ' a major international city and an important financial centre. A favourite with tourists and business travell'
DP rank 0, Prompt: 'The capital of France is', Generated text: ' a major international city and an important financial centre. A favourite with tourists'
DP rank 1, Prompt: 'The future of AI is', Generated text: ' closely tied to the broader field of robotics, with the potential for increasingly sophistic'
DP rank 0, Prompt: 'The future of AI is', Generated text: ' closely tied to the broader field of robotics, with the potential for increasing'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Mark. I am 40 years old. I am an artist. I am a musician'
DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Mark. I am 40 years old. I am an artist. I'

Signed-off-by: yewentao256 <zhyanwentao@126.com>
github-actions (bot) commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they run only the fastcheck CI, which executes a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist (bot, Contributor) left a comment

Summary of Changes

Hello @yewentao256, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a CI issue encountered during distributed 4-GPU testing by enhancing the data_parallel.py example with new configurable parameters. These parameters allow for more precise control over GPU memory allocation and sequence limits during model warm-up, mitigating potential out-of-memory errors and improving the stability of distributed inference tests.

Highlights

  • CI Issue Resolution: This pull request aims to fix a continuous integration (CI) issue related to distributed 4-GPU testing, specifically addressing potential memory constraints that might cause test failures.
  • New Configuration Arguments: I've introduced two new command-line arguments to examples/offline_inference/data_parallel.py: --max-num-seqs and --gpu-memory-utilization. These provide finer control over memory usage during distributed inference.
  • Memory Management: The --max-num-seqs argument (default 64) lets users limit the maximum number of sequences processed in a single iteration during engine warm-up, which can significantly reduce peak memory consumption. The --gpu-memory-utilization argument (default 0.8) sets the fraction of GPU memory vLLM is allowed to allocate, providing more headroom to avoid CUDA OOM errors (see the sketch after this list).
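To make these knobs concrete, here is a minimal, hypothetical sketch of how such flags can be wired into an offline-inference script. It is not the PR's actual diff: the model name and the generate call are placeholders, although max_num_seqs and gpu_memory_utilization are real parameters of vLLM's LLM constructor.

import argparse

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="facebook/opt-125m")
parser.add_argument(
    "--max-num-seqs",
    type=int,
    default=64,
    help="Maximum number of sequences to be processed in a single iteration.",
)
parser.add_argument(
    "--gpu-memory-utilization",
    type=float,
    default=0.8,
    help="Fraction of GPU memory vLLM may allocate (0.0-1.0).",
)
args = parser.parse_args()

# Both knobs are forwarded to the engine: a smaller max_num_seqs shrinks the
# warm-up batch, and a lower gpu_memory_utilization leaves headroom on the
# device, reducing the chance of CUDA OOM during initialization.
llm = LLM(
    model=args.model,
    max_num_seqs=args.max_num_seqs,
    gpu_memory_utilization=args.gpu_memory_utilization,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)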
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP 8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

mergify (bot) added the documentation label (Improvements or additions to documentation) on Jun 28, 2025
gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces two new command-line arguments, --max-num-seqs and --gpu-memory-utilization, to the data parallel example script. This provides more control over memory usage, which is intended to fix a CI test failure.

Signed-off-by: yewentao256 <zhyanwentao@126.com>
njhill (Member) left a comment

Thanks @yewentao256!

Comment on lines +67 to +72
parser.add_argument(
    "--max-num-seqs",
    type=int,
    default=64,
    help=("Maximum number of sequences to be processed in a single iteration."),
)
njhill (Member) commented:

Are both of these args required to avoid the OOM? 64 is quite small for batch mode; it would be good if we could fix this with just the gpu_memory_utilization reduction...

A Collaborator replied:

Yeah, it also seems that we need much more memory during initialization than before. I was about to investigate this further but didn't get time to do so. Wondering if @yewentao256 could dig into it?

yewentao256 (Member, Author) replied:

Yeah, I am happy to dig further, but what is the expected outcome? To reduce the memory usage? I am afraid that is something of a tradeoff between speed and memory efficiency.
Basically, the original cause of this OOM issue is #18724, which I think is reasonable to adopt. @houseroad
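As a concrete starting point for that investigation, here is a rough, hypothetical sketch (not part of the PR) that compares device-level memory use before and after engine construction; the model name and flag values are placeholders.

import torch

from vllm import LLM

def used_gib() -> float:
    # Device-wide usage from cudaMemGetInfo, so this also captures memory
    # allocated by vLLM worker subprocesses.
    free, total = torch.cuda.mem_get_info()
    return (total - free) / 1024**3

before = used_gib()
llm = LLM(
    model="facebook/opt-125m",  # placeholder model
    max_num_seqs=64,
    gpu_memory_utilization=0.8,
)
print(f"Device memory in use: {before:.2f} GiB -> {used_gib():.2f} GiB")

Measuring at the device level (rather than with torch.cuda.max_memory_allocated) matters here because vLLM may run engine workers in separate processes.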

njhill (Member) commented on Jun 28, 2025

I unblocked the 4-GPUs test so that we can verify it passes.

DarkLight1337 (Member) left a comment

The test passes so I'm merging this to unblock CI first. Let's fix the underlying issue in another PR.

vllm-bot merged commit d45417b into vllm-project:main on Jun 28, 2025
14 checks passed
yewentao256 deleted the wye-fix-ci-issue-distributed-gpu-test branch on June 30, 2025 at 16:12.

Labels

documentation Improvements or additions to documentation


Development

Successfully merging this pull request may close these issues.

[Bug]: Distributed Tests (4 GPUs) failing in main branch CI
