@huydhn
Contributor

@huydhn huydhn commented Sep 29, 2025

This PR has two parts: the query vllm/merges_percentage and a new HUD page, metrics/vllm.tsx, that displays vLLM CI metrics. There are 2 KPIs to start with:

  1. The % of force merges with CI failures. Its meaning is clear.
  2. The % of manual merges, where a vLLM maintainer merged the pull request manually without using GitHub auto-merge, but there weren't any failures in the pull request at the time of the merge.
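
To make the two KPIs concrete, here is a minimal sketch of how they could be computed from a list of merged pull requests. It is not the actual vllm/merges_percentage query. The `MergedPR` type and its fields are hypothetical, and two assumptions are baked in: a force merge is a merge done without GitHub auto-merge while at least one CI failure was present, and both percentages use all merged PRs as the denominator.

```python
from dataclasses import dataclass


@dataclass
class MergedPR:
    # Hypothetical record shape; not the schema used by the real query.
    number: int
    auto_merged: bool     # True if merged through GitHub auto-merge
    has_ci_failure: bool  # True if any CI failure existed at merge time


def merge_percentages(prs):
    """Return both KPIs as percentages of all merged PRs (assumed denominator)."""
    total = len(prs)
    if total == 0:
        return {"force_merge_pct": 0.0, "manual_merge_pct": 0.0}

    # Only merges done by hand (no auto-merge) count toward either KPI.
    manual = [pr for pr in prs if not pr.auto_merged]
    # KPI 1 (assumed definition): merged by hand despite CI failures.
    force = sum(1 for pr in manual if pr.has_ci_failure)
    # KPI 2: merged by hand with no failures at merge time.
    clean_manual = sum(1 for pr in manual if not pr.has_ci_failure)

    return {
        "force_merge_pct": 100.0 * force / total,
        "manual_merge_pct": 100.0 * clean_manual / total,
    }
```

Under these assumptions the two buckets are disjoint, so their sum is the share of all merges that bypassed auto-merge.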

What is a vLLM CI failure?

As vLLM CI is on Buildkite, a CI failure means a failed Buildkite job that (1) is not a soft fail, and (2) fails on its latest retry at the time of the merge. We can get this information by joining the GitHub pull_request with Buildkite vllm_buildkite_jobs on the pull request number.
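
The failure definition above can be sketched in a few lines. This is only an illustration under stated assumptions, not the query in this PR: the `BuildkiteJob` fields are hypothetical stand-ins for whatever vllm_buildkite_jobs stores, retries are assumed to share a job name, and the latest retry is assumed to be the one that finished last before the merge.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class BuildkiteJob:
    # Hypothetical fields; the real vllm_buildkite_jobs schema may differ.
    pr_number: int
    name: str
    state: str            # e.g. "passed" or "failed"
    soft_failed: bool
    finished_at: datetime


def failed_jobs_at_merge(jobs, pr_number, merged_at):
    """Names of jobs that count as CI failures for a PR at merge time."""
    latest = {}
    for job in jobs:
        # Join on the pull request number and ignore runs after the merge.
        if job.pr_number != pr_number or job.finished_at > merged_at:
            continue
        # Keep only the latest retry of each job (assumed: same job name).
        prev = latest.get(job.name)
        if prev is None or job.finished_at > prev.finished_at:
            latest[job.name] = job
    # A failure is a latest retry that failed and is not a soft fail.
    return sorted(
        name
        for name, job in latest.items()
        if job.state == "failed" and not job.soft_failed
    )
```

A pull request would then count as having CI failures at merge time when this list is non-empty.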

Testing

https://torchci-git-vllm-metrics-fbopensource.vercel.app/metrics/vllm

cc @rzabarazesh @yeqcharlotte @simon-mo

Signed-off-by: Huy Do <huydhn@gmail.com>
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 29, 2025
@huydhn huydhn marked this pull request as ready for review September 29, 2025 08:22
@huydhn huydhn requested a review from a team September 29, 2025 08:22
Signed-off-by: Huy Do <huydhn@gmail.com>
@huydhn
Contributor Author

huydhn commented Sep 29, 2025

As there are questions from @jeanschmidt, @zhewenl, and @rzabarazesh in the context of vllm-project/vllm#25670, I want to share my thoughts on how this fits into the GHA migration journey:

  1. IMO, the 2 big milestones of the GHA migration are (1) infra running on GHA, and (2) DevX features. (1) is the bigger area because PT Dev Infra doesn't have the capacity to maintain 2 infra stacks, one for GHA and one for Buildkite. In addition, the Meta OSS team supports only GHA at the moment. So this is a strong argument to go ahead with the GHA migration.
  2. On the other hand, there are some small areas of (2) that could be done as stopgaps before the migration finishes, mainly because they are cheap to redo. This PR is one of them. IMO, we don't need to wait for the GHA migration to finish to have some high-level KPIs about vLLM CI. As a counterexample, creating a metrics page is easy, but fully supporting vLLM CI on HUD https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50 is a much bigger task that can only be unblocked after the GHA migration.

@yangw-dev
Contributor

Optional: when hovering over the chart, it is not very clear which color represents what

@yangw-dev
Contributor

Probably good to add some description of the metrics in the UI too

What is a vLLM CI failure?
As vLLM CI is on Buildkite, a CI failure means a failed Buildkite job that (1) is not a soft fail, and (2) fails on its latest retry at the time of the merge. We can get this information by joining the GitHub pull_request with Buildkite vllm_buildkite_jobs on the pull request number

@yangw-dev yangw-dev self-requested a review September 29, 2025 19:49
@yeqcharlotte

Find it hard to interpret the data here.

Force merge is super low in the dashboard
image

But we are seeing that almost none of the vLLM commits are green:
image

@huydhn
Contributor Author

huydhn commented Sep 29, 2025

Find it hard to interpret the data here.

We will need a trunk health metric, like the one available on https://app.hex.tech/533fe68e-dcd8-4a52-a101-aefba762f581/app/vLLM-CI-030kdEgDv6lSlh1UPYOkWP/latest, to supplement the metrics here. For example, the signals from vllm-project/vllm#25706, merged 3 hours ago, were all OK, but its trunk commit vllm-project/vllm@d5ab285 failed. The discrepancy you see could mean 2 things:

  • The target determination on the PR is busted and misses relevant tests
  • Or a recent failure broke vLLM trunk. These PRs are green because they are based on an older commit that doesn't include the breaking change

Signed-off-by: Huy Do <huydhn@gmail.com>
@huydhn huydhn merged commit 5398e1a into main Sep 30, 2025
5 checks passed
@huydhn huydhn deleted the vllm-metrics branch September 30, 2025 00:07