@jakub-sochacki
Contributor

  • Add HPU image build step that runs only on main branch
  • Fetch compatible vLLM commit from vllm-gaudi's VLLM_STABLE_COMMIT
  • Use dynamic image tagging based on stable commit, not BUILDKITE_COMMIT
  • HPU images always pushed to vllm-ci-postmerge-repo registry
  • Don't use static docker_image_hpu variable (it's determined dynamically)
  • Implement image existence check before building to avoid duplicates
  • Build HPU images with both VLLM_COMMIT and VLLM_GAUDI_COMMIT args

This enables nightly HPU benchmarks by ensuring Docker images are built with vLLM versions compatible with the vllm-gaudi plugin, addressing version synchronization issues between the two repositories.
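
The "image existence check before building" item above could look roughly like the sketch below. It is not the code from this PR: `DOCKER_BIN`, `image_exists`, and `build_if_missing` are hypothetical names, and the sketch assumes `docker manifest inspect` can reach the registry with the credentials already configured in the step.

```shell
#!/bin/bash
# Sketch only: DOCKER_BIN and both function names are hypothetical.
# DOCKER_BIN can be overridden, e.g. to stub docker out in tests.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# Succeeds when the registry already holds a manifest for the given tag.
image_exists() {
  "$DOCKER_BIN" manifest inspect "$1" >/dev/null 2>&1
}

# Build and push only when the tag is missing from the registry.
# Arguments after the tag are passed through to `docker build`.
build_if_missing() {
  local tag="$1"
  shift
  if image_exists "$tag"; then
    echo "Image $tag already exists, skipping build"
    return 0
  fi
  "$DOCKER_BIN" build --tag "$tag" "$@" && "$DOCKER_BIN" push "$tag"
}
```

Because the tag is derived from the stable vLLM commit rather than from BUILDKITE_COMMIT, many pipeline runs map to the same tag, which is why skipping an existing image avoids duplicate builds.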

@louie-tsai louie-tsai left a comment

Looks good to me. Thanks.

- |
#!/bin/bash
# Fetch the compatible vLLM commit for vllm-gaudi
VLLM_STABLE_COMMIT=$(curl -s https://raw.githubusercontent.com/vllm-project/vllm-gaudi/main/last-good-commit-for-vllm-gaudi/VLLM_STABLE_COMMIT | tr -d '\n')
Collaborator

Can we use the following instead?

git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
export VLLM_COMMIT_HASH=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null)
cd ..

I tried the link, and it does not seem to work.

@Jacob-Intel Jacob-Intel Oct 29, 2025

I used the wrong branch (main/last-good-commit-for-vllm-gaudi); it should be vllm/last-good-commit-for-vllm-gaudi. Since I prefer not to clone the repo here, I will update the command (URL) to:
VLLM_STABLE_COMMIT=$(curl -s https://raw.githubusercontent.com/vllm-project/vllm-gaudi/vllm/last-good-commit-for-vllm-gaudi/VLLM_STABLE_COMMIT | tr -d '\n')
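
A wrong branch name in the URL makes the server return an error page, and with plain `curl -s` that text would silently end up in the variable. A minimal sketch of a more defensive fetch follows. The function names are hypothetical, and the check assumes the VLLM_STABLE_COMMIT file holds a full 40-character lowercase hex SHA, which this thread does not confirm.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical; the URL is the one quoted above.
STABLE_COMMIT_URL="https://raw.githubusercontent.com/vllm-project/vllm-gaudi/vllm/last-good-commit-for-vllm-gaudi/VLLM_STABLE_COMMIT"

# True when the argument is a 40-character lowercase hex string
# (assumed format of the VLLM_STABLE_COMMIT file).
is_commit_sha() {
  [ "${#1}" -eq 40 ] || return 1
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}

# Print the stable commit, or fail if the download or the validation fails.
# -f makes curl exit non-zero on HTTP errors such as 404.
fetch_stable_commit() {
  local sha
  sha=$(curl -fsSL "$STABLE_COMMIT_URL") || return 1
  sha=$(printf '%s' "$sha" | tr -d '[:space:]')
  if ! is_commit_sha "$sha"; then
    echo "Unexpected VLLM_STABLE_COMMIT content: $sha" >&2
    return 1
  fi
  printf '%s\n' "$sha"
}

# Usage (commented out so that sourcing this file makes no network call):
# VLLM_STABLE_COMMIT=$(fetch_stable_commit) || exit 1
```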

jakub-sochacki and others added 5 commits October 30, 2025 14:32
Signed-off-by: jakub-sochacki <jakub.sochacki@intel.com>
@xuechendi
Collaborator

@khluu, could you help with this PR? Thanks so much.

queue: cpu_queue_postmerge_us_east_1
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- |
Collaborator

@khluu khluu Oct 30, 2025

Can we add this script into vllm repo and call it here instead of sending the whole script as part of commands?

Collaborator

Yes, a PR has been submitted to the vllm repo as well:
vllm-project/vllm#26919

Collaborator

I don't see the script in vllm-project/vllm#26919

Collaborator

Oh, which part do you want us to move to vllm-project?
Because HPU is a plugin, we use the lines below to build our Docker image. Is this the part you suggest moving into vllm?

# Fetch the compatible vLLM commit for vllm-gaudi
VLLM_STABLE_COMMIT=$(curl -s https://raw.githubusercontent.com/vllm-project/vllm-gaudi/vllm/last-good-commit-for-vllm-gaudi/VLLM_STABLE_COMMIT | tr -d '\n')

git clone https://github.com/vllm-project/vllm-gaudi.git /tmp/vllm-gaudi

docker build \
  --file /tmp/vllm-gaudi/tests/pytorch_ci_hud_benchmark/Dockerfile.hpu \
  --build-arg max_jobs=16 \
  --build-arg VLLM_COMMIT=$VLLM_STABLE_COMMIT \
  --build-arg VLLM_GAUDI_COMMIT=main \
  --tag "$HPU_IMAGE_TAG" \
  --progress plain .

VLLM_STABLE_COMMIT is necessary because the latest vLLM commits might break vllm-gaudi, so we track the last known-good vLLM commit SHA.

The rest uses the Dockerfile.hpu that already exists in vllm-gaudi, rather than one in vllm, to build the Docker image.

Collaborator

Oh, what I mean is: can we store the bash script from the command in a file in the vllm-project/vllm repo, and then just call that script here, like https://github.com/vllm-project/ci-infra/blob/main/buildkite/test-template-ci.j2#L722 does?

Contributor Author

@khluu I understand that scripts like https://github.com/vllm-project/vllm/blob/main/.buildkite/scripts/hardware_ci/run-xpu-test.sh are meant to run in the CI. Our goal is to build images in the CI but run performance benchmarks on a nightly / 12h cadence. Do you suggest moving the Docker build to a separate script in vllm?

Collaborator

@xuechendi xuechendi Oct 31, 2025

@khluu, I discussed this with Jakub.
Can we merge this PR as it is now? The PyTorch-integration PR is already merged, so we want to see how it goes, in case we need more fixes to clear the path.

=> The reason we have to do it now is that the PyTorch-integration PR also uses VLLM_STABLE_COMMIT as part of the image name for indexing, which is why this PR needs to use that commit ID to tag the image.

In the next PR, we will remove the whole VLLM_STABLE_COMMIT mechanism and use BUILDKITE_COMMIT directly, and we will also submit another PyTorch-integration PR that uses BUILDKITE_COMMIT to index hpu_docker_image there.

The HPU Docker build can then be simplified to a single step: building the image from vllm-gaudi / Dockerfile.hpu.

@xuechendi
Collaborator

@jakub-sochacki , please help to rebase

docker build \
--file /tmp/vllm-gaudi/tests/pytorch_ci_hud_benchmark/Dockerfile.hpu \
--build-arg max_jobs=16 \
--build-arg VLLM_COMMIT=$VLLM_STABLE_COMMIT \
Collaborator

@jakub-sochacki, is VLLM_COMMIT needed here? I noticed you have added the same step in the Dockerfile, right?

Contributor Author

This ensures that the same vLLM commit is used in the Dockerfile and in the Docker image tag: HPU_IMAGE_TAG="${REGISTRY}:${VLLM_STABLE_COMMIT}-hpu"
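
One way to keep the tag and the build argument from drifting apart is to derive both from a single variable, as in the sketch below. `hpu_image_tag` is a hypothetical helper, not part of this PR; the tag format is the one quoted above.

```shell
#!/bin/bash
# Sketch only: hpu_image_tag is a hypothetical helper.
# Prints "<registry>:<commit>-hpu", the tag format quoted above.
hpu_image_tag() {
  printf '%s:%s-hpu\n' "$1" "$2"
}

# Usage (commented out so that sourcing this file has no side effects).
# The same VLLM_STABLE_COMMIT value feeds both the tag and the build arg:
# HPU_IMAGE_TAG=$(hpu_image_tag "$REGISTRY" "$VLLM_STABLE_COMMIT")
# docker build \
#   --file /tmp/vllm-gaudi/tests/pytorch_ci_hud_benchmark/Dockerfile.hpu \
#   --build-arg VLLM_COMMIT="$VLLM_STABLE_COMMIT" \
#   --tag "$HPU_IMAGE_TAG" .
```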

@jakub-sochacki
Contributor Author

@jakub-sochacki , please help to rebase

It's already rebased and up to date.

@khluu khluu merged commit 76ef876 into vllm-project:main Oct 31, 2025
1 check passed