[CI/Build][Intel] Add HPU image build with vllm-gaudi compatibility #191

jakub-sochacki · 2025-10-15T15:49:06Z

Add HPU image build step that runs only on main branch
Fetch compatible vLLM commit from vllm-gaudi's VLLM_STABLE_COMMIT
Use dynamic image tagging based on stable commit, not BUILDKITE_COMMIT
HPU images always pushed to vllm-ci-postmerge-repo registry
Don't use static docker_image_hpu variable (it's determined dynamically)
Implement image existence check before building to avoid duplicates
Build HPU images with both VLLM_COMMIT and VLLM_GAUDI_COMMIT args

This enables nightly HPU benchmarks by ensuring Docker images are built with vLLM versions compatible with the vllm-gaudi plugin, addressing version synchronization issues between the two repositories.
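A minimal sketch of the "image existence check before building" item from the list above. It is not the code from this PR: the helper names `image_exists` and `build_if_missing` are hypothetical, and it assumes `docker manifest inspect` can reach the registry from the build agent.

```bash
#!/bin/bash
# Hypothetical helpers for illustration; not taken from the PR's template.

# Succeeds if the registry already holds a manifest for the given image tag.
image_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# Runs the given build command only when the tag is not in the registry yet.
# Usage: build_if_missing <image-tag> <command> [args...]
build_if_missing() {
  tag="$1"
  shift
  if image_exists "$tag"; then
    echo "Image $tag already exists, skipping build"
    return 0
  fi
  echo "Image $tag not found, building"
  "$@"
}
```

Because the tag is derived from the stable vLLM commit rather than from BUILDKITE_COMMIT, many main-branch builds can map to the same tag, which is why a guard of this kind avoids duplicate builds.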

buildkite/test-template-ci.j2

louie-tsai

Looks good to me. Thanks.

xuechendi · 2025-10-29T00:58:52Z

buildkite/test-template-ci.j2

+      - |
+        #!/bin/bash
+        # Fetch the compatible vLLM commit for vllm-gaudi
+        VLLM_STABLE_COMMIT=$(curl -s https://raw.githubusercontent.com/vllm-project/vllm-gaudi/main/last-good-commit-for-vllm-gaudi/VLLM_STABLE_COMMIT | tr -d '\n')


Can we use the following instead?

git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
export VLLM_COMMIT_HASH=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null)
cd ..

I tried the link, and it does not seem to work.

I just used the wrong branch (main/last-good-commit-for-vllm-gaudi); it should be vllm/last-good-commit-for-vllm-gaudi. Since I prefer not to clone the repo here, I will update the command (URL) to:
VLLM_STABLE_COMMIT=$(curl -s https://raw.githubusercontent.com/vllm-project/vllm-gaudi/vllm/last-good-commit-for-vllm-gaudi/VLLM_STABLE_COMMIT | tr -d '\n')
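A wrong branch in the URL is easy to miss because `curl -s` exits with status 0 on an HTTP 404 and prints the error body, so the variable ends up holding text that is not a commit. A minimal sketch of a stricter fetch, assuming the file contains a full 40-character SHA; the helper names `is_commit_sha` and `fetch_stable_commit` are hypothetical and not part of this PR.

```bash
#!/bin/bash
# Hypothetical helpers for illustration; names are not from the PR.

# Succeeds only if the argument is exactly 40 lowercase hex characters.
is_commit_sha() {
  [ "${#1}" -eq 40 ] || return 1
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}

# Prints the commit stored at the given URL, or fails with a message.
# -f makes curl exit non-zero on HTTP errors such as 404.
fetch_stable_commit() {
  body=$(curl -fsSL "$1") || {
    echo "Failed to fetch $1" >&2
    return 1
  }
  # Strip the trailing newline and any other whitespace.
  commit=$(printf '%s' "$body" | tr -d '[:space:]')
  if ! is_commit_sha "$commit"; then
    echo "Unexpected content at $1" >&2
    return 1
  fi
  printf '%s\n' "$commit"
}
```

A build step could then call `VLLM_STABLE_COMMIT=$(fetch_stable_commit "$URL") || exit 1` so that a bad URL stops the step instead of producing an oddly tagged image.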

Signed-off-by: jakub-sochacki <jakub.sochacki@intel.com>

xuechendi · 2025-10-30T14:41:56Z

@khluu, could you help with this PR? Thanks so much.

khluu · 2025-10-30T22:35:15Z

buildkite/test-template-ci.j2

+      queue: cpu_queue_postmerge_us_east_1
+    commands:
+      - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
+      - |


Can we add this script into vllm repo and call it here instead of sending the whole script as part of commands?

like this one https://github.com/vllm-project/vllm/blob/main/.buildkite/scripts/hardware_ci/run-xpu-test.sh

Yes, PR is submitted to vllm repo as well:
vllm-project/vllm#26919

I don't see the script in vllm-project/vllm#26919

Oh, which part do you want us to move to vllm-project?
Because HPU is a plugin, we use the lines below to build our Docker image. Is this the part you suggest should live in vllm?

# Fetch the compatible vLLM commit for vllm-gaudi
VLLM_STABLE_COMMIT=$(curl -s https://raw.githubusercontent.com/vllm-project/vllm-gaudi/vllm/last-good-commit-for-vllm-gaudi/VLLM_STABLE_COMMIT | tr -d '\n')
git clone https://github.com/vllm-project/vllm-gaudi.git /tmp/vllm-gaudi
docker build \
  --file /tmp/vllm-gaudi/tests/pytorch_ci_hud_benchmark/Dockerfile.hpu \
  --build-arg max_jobs=16 \
  --build-arg VLLM_COMMIT=$VLLM_STABLE_COMMIT \
  --build-arg VLLM_GAUDI_COMMIT=main \
  --tag "$HPU_IMAGE_TAG" \
  --progress plain .

VLLM_STABLE_COMMIT is necessary because the latest vLLM commits might break vllm-gaudi, so we track the last known-good vLLM commit SHA.

The rest uses the Dockerfile.hpu that exists in vllm-gaudi, instead of one in vllm, to build the Docker image.

Oh, what I mean is: can we store the bash script from the command in a file in the vllm-project/vllm repo, then just call that script here, like https://github.com/vllm-project/ci-infra/blob/main/buildkite/test-template-ci.j2#L722
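A minimal sketch of what moving the inline commands into a script file might look like. The function name `build_hpu_image`, the argument order, and the `DOCKER` override are assumptions for illustration; only the `docker build` flags are taken from the snippet quoted in this thread.

```bash
#!/bin/bash
# Hypothetical standalone build script; the function name, argument order,
# and DOCKER override are illustrative assumptions, not part of the PR.

# Set DOCKER=echo for a dry run that prints the command instead of running it.
DOCKER="${DOCKER:-docker}"

# Builds the HPU image from the Dockerfile.hpu in a vllm-gaudi checkout.
# Usage: build_hpu_image <image-tag> <vllm-commit> <vllm-gaudi-checkout-dir>
build_hpu_image() {
  "$DOCKER" build \
    --file "$3/tests/pytorch_ci_hud_benchmark/Dockerfile.hpu" \
    --build-arg max_jobs=16 \
    --build-arg VLLM_COMMIT="$2" \
    --build-arg VLLM_GAUDI_COMMIT=main \
    --tag "$1" \
    --progress plain .
}
```

A script file built around this function would end with `build_hpu_image "$@"`, and the template step could then shrink to a single command that invokes the script, which seems to be the shape being requested here.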

@khluu I understand that scripts like https://github.com/vllm-project/vllm/blob/main/.buildkite/scripts/hardware_ci/run-xpu-test.sh are meant to run in the CI. Our goal is to build images in the CI but run performance benchmarks nightly / on a 12h cadence. Do you suggest moving the Docker build to a separate script in vllm?

@khluu, I discussed this with Jakub.
Can we merge this PR as it is now? The PyTorch-integration PR is already merged, so we want to see how it goes, in case we need more fixes to clear the path.

=> The reason we have to do it this way now is that the PyTorch-integration PR also uses VLLM_STABLE_COMMIT as part of the image name for indexing, which is why this PR needs to tag the image with that commit ID.

In the next PR, we will remove the whole VLLM_STABLE_COMMIT mechanism and use BUILDKITE_COMMIT directly, and we will also submit another PyTorch-integration PR to index hpu_docker_image by BUILDKITE_COMMIT there.

The HPU Docker build can then be simplified to a single step: building the image from vllm-gaudi's Dockerfile.hpu.

xuechendi · 2025-10-31T01:38:05Z

@jakub-sochacki , please help to rebase

xuechendi · 2025-10-31T03:02:53Z

buildkite/test-template-ci.j2

+        docker build \
+          --file /tmp/vllm-gaudi/tests/pytorch_ci_hud_benchmark/Dockerfile.hpu \
+          --build-arg max_jobs=16 \
+          --build-arg VLLM_COMMIT=$VLLM_STABLE_COMMIT \


@jakub-sochacki, is VLLM_COMMIT needed here? I noticed you have added the same step in the Dockerfile, right?

This ensures that the same vLLM commit is used both in the Dockerfile and in the Docker image tag: HPU_IMAGE_TAG="${REGISTRY}:${VLLM_STABLE_COMMIT}-hpu"
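A minimal sketch of deriving the tag from the same value that is passed as the build argument, so the two cannot drift apart. The helper name `hpu_image_tag` is hypothetical; the `<registry>:<commit>-hpu` format follows the HPU_IMAGE_TAG expression quoted above.

```bash
#!/bin/bash
# Hypothetical helper for illustration; not part of the PR.

# Prints the HPU image tag for a registry and a vLLM commit.
# Usage: hpu_image_tag <registry> <vllm-commit>
hpu_image_tag() {
  printf '%s:%s-hpu\n' "$1" "$2"
}
```

A consumer such as a nightly benchmark job could call the same helper with the same commit to locate the image, for example `HPU_IMAGE_TAG=$(hpu_image_tag "$REGISTRY" "$VLLM_STABLE_COMMIT")`.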

jakub-sochacki · 2025-10-31T14:13:49Z

@jakub-sochacki , please help to rebase

It's already rebased and up to date.

jakub-sochacki force-pushed the enable-gaudi3 branch from 68db113 to 2846d06 Compare October 17, 2025 12:40

xuechendi reviewed Oct 22, 2025

louie-tsai approved these changes Oct 27, 2025

huydhn approved these changes Oct 28, 2025

xuechendi reviewed Oct 29, 2025

jakub-sochacki force-pushed the enable-gaudi3 branch from 2846d06 to deb205f Compare October 29, 2025 14:23

jakub-sochacki and others added 5 commits October 30, 2025 14:32

draft intel gaudi 3 integration

6c32ec8

Signed-off-by: jakub-sochacki <jakub.sochacki@intel.com>

Add HPU image build with vllm-gaudi compatibility

f472a8d

Signed-off-by: jakub-sochacki <jakub.sochacki@intel.com>

fix: correct branch path for vllm-gaudi VLLM_STABLE_COMMIT file

1d5e989

Signed-off-by: jakub-sochacki <jakub.sochacki@intel.com>

Use vllm-gaudi Dockerfile for HPU builds

8e5f332

Signed-off-by: jakub-sochacki <jakub.sochacki@intel.com>

Fix vllm-gaudi Dockerfile path

bb802f2

Signed-off-by: jakub-sochacki <jakub.sochacki@intel.com>

jakub-sochacki force-pushed the enable-gaudi3 branch from 8268810 to bb802f2 Compare October 30, 2025 12:32

xuechendi mentioned this pull request Oct 30, 2025

[CI/Build][Intel] Enable performance benchmarks for Intel Gaudi 3 vllm-project/vllm#26919

Merged

5 tasks

khluu approved these changes Oct 30, 2025

xuechendi reviewed Oct 31, 2025

khluu merged commit 76ef876 into vllm-project:main Oct 31, 2025
1 check passed

Jacob-Intel · Oct 29, 2025 (edited)

6 participants
