
Commit 93c0b8d

1pikachu, jikunshang, yma11, and wendyliu235 committed
CI for vLLM with vllm-xpu-kernel (vllm-project#372)
* layernorm use vllm_xpu_kernels
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* [ww34] switch silu_and_mul, reshape_and_cache_flash, rope to xpu kernel
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* update activation kernels
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* try remove ipex
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* switch to xpu kernel for w8a16 gemm (vllm-project#323)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* enable cutlass chunked-prefill (vllm-project#330)
  * enable cutlass chunked-prefill
  Signed-off-by: Yan Ma <yan.ma@intel.com>
  * add required pkg for xpu-kernels compilation
  Signed-off-by: Yan Ma <yan.ma@intel.com>
  ---------
  Signed-off-by: Yan Ma <yan.ma@intel.com>
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* enable topk/grouped_gemm based on llama4 (vllm-project#354)
  * enable topk/grouped_gemm based on llama4
  Signed-off-by: Yan Ma <yan.ma@intel.com>
  * address comments
  Signed-off-by: Yan Ma <yan.ma@intel.com>
  ---------
  Signed-off-by: Yan Ma <yan.ma@intel.com>
  Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* enable CI

* replace lora kernels (vllm-project#347)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* remove ipex (vllm-project#370)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* update QA CI branch
* update QA CI yaml
* update QA CI yaml
* update QA CI yaml
* update QA CI yaml
* update QA CI yaml
* update QA CI yaml
* fix conflict

---------
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: Liu, Wenjun <wenjun.liu@intel.com>
1 parent 9ce9f32 commit 93c0b8d

2 files changed: +188 -1 lines changed

.github/workflows/ci.yaml

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
+name: Run Intel XPU BMG CI
+
+on:
+  pull_request:
+    branches:
+      - '**xpu-kernels**'
+    types: [opened, synchronize, reopened] #
+
+jobs:
+  run-xpu-BMG-CI:
+    if: |
+      github.event_name == 'pull_request' ||
+      (github.event_name == 'issue_comment' &&
+      github.event.issue.pull_request &&
+      contains(github.event.comment.body, '/BMG_CI'))
+    runs-on: BMG
+
+    steps:
+      - name: Fix workspace permissions
+        run: |
+          sudo chown -R $(whoami):$(whoami) "$GITHUB_WORKSPACE"
+          sudo chmod -R 755 "$GITHUB_WORKSPACE"
+          sudo rm -f "$GITHUB_WORKSPACE/.git/index.lock" || true
+
+      - name: Checkout QA_ci (test code) #
+        uses: actions/checkout@v4
+        with:
+          ref: QA_ci
+          path: qa_ci_code
+
+      - name: Checkout PR + Release Branch (DUT code)
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ github.event.pull_request.head.ref }}
+          path: target_code
+          fetch-depth: 0 #
+
+      - name: Merge PR into Release Branch
+        if: github.event_name == 'pull_request'
+        run: |
+          cd target_code
+          echo "Merging PR branch into base: ${{ github.base_ref }}"
+          git fetch origin ${{ github.base_ref }}
+          git merge origin/${{ github.base_ref }} --no-commit
+        shell: bash
+
+      - name: Build docker image
+        run: |
+          echo "start to build image"
+          cd target_code
+          if [ -n "${{ github.event.pull_request.number }}" ]; then
+            image_name="vllm_xpu_ci_${{ github.event.pull_request.number }}"
+          else
+            image_name="vllm_xpu_ci_$(echo $GITHUB_REF | awk -F '/' '{print $3}')"
+          fi
+          image_name=$(echo "$image_name" | tr '[:upper:]' '[:lower:]')
+          #!/bin/bash
+
+          # Configuration
+          MAX_RETRIES=6 # Maximum number of retry attempts
+          TIMEOUT=1800 # 30-minute timeout per attempt (in seconds)
+          LOG_FILE="docker_build.log" # Log file path
+
+          # Proxy configurations - add more if needed
+          PROXIES=(
+            "http://child-prc.intel.com:913" # First fallback
+            "http://proxy.ims.intel.com:911" # Primary proxy
+            "http://child-prc.intel.com:913" # First fallback
+          )
+
+          # No-proxy configuration
+          NO_PROXY=".intel.com,intel.com,localhost,127.0.0.1"
+          #docker builder prune -f #clean cache
+          docker builder prune --all --force
+
+          #Loop through proxy configurations
+          for (( attempt=1; attempt<=$MAX_RETRIES; attempt++ )); do
+            proxy_index=$(( (attempt-1) % ${#PROXIES[@]} ))
+            proxy=${PROXIES[$proxy_index]}
+            echo "=== Attempt $attempt/$MAX_RETRIES (Proxy: $proxy) ===" | tee -a "$LOG_FILE"
+
+            if [ $attempt -eq 1 ]; then
+              # First attempt without no_proxy
+              timeout $TIMEOUT docker build \
+                --build-arg http_proxy=$proxy \
+                --build-arg https_proxy=$proxy \
+                -f docker/Dockerfile.xpu \
+                -t "$image_name" \
+                --shm-size=4g . 2>&1 | tee -a "$LOG_FILE"
+            else
+              # Subsequent attempts with no_proxy
+              timeout $TIMEOUT docker build \
+                --build-arg http_proxy=$proxy \
+                --build-arg https_proxy=$proxy \
+                --build-arg no_proxy="$NO_PROXY" \
+                -f docker/Dockerfile.xpu \
+                -t "$image_name" \
+                --shm-size=4g . 2>&1 | tee -a "$LOG_FILE"
+            fi
+
+            # Check if build succeeded
+            if [ ${PIPESTATUS[0]} -eq 0 ]; then
+              echo "=== Build succeeded on attempt $attempt ===" | tee -a "$LOG_FILE"
+              exit 0
+            fi
+          done
+
+          echo "=== ERROR: All $MAX_RETRIES attempts failed. Check $LOG_FILE for details. ===" | tee -a "$LOG_FILE"
+          exit 1
+
+      - name: Prepare environment (clean up old processes and containers)
+        run: |
+          echo "Killing any process on port 8000..."
+          lsof -t -i:8000 | xargs -r kill -9 || true
+
+          echo "Killing old vllm server processes..."
+          pkill -f "python3 -m vllm.entrypoints.openai.api_server" || true
+
+          echo "Removing old container if exists..."
+          docker rm -f vllm_internal_ci || true
+
+      - name: Run benchmark inside local Docker image
+        run: |
+          # Reuse the image_name from previous step
+          if [ -n "${{ github.event.pull_request.number }}" ]; then
+            image_name="vllm_xpu_ci_${{ github.event.pull_request.number }}"
+          else
+            image_name="vllm_xpu_ci_$(echo $GITHUB_REF | awk -F '/' '{print $3}')"
+          fi
+          image_name=$(echo "$image_name" | tr '[:upper:]' '[:lower:]')
+
+          echo "Running benchmark using image: $image_name"
+          docker run -t --rm --name vllm_internal_ci --shm-size 10g \
+            --net=host \
+            --ipc=host \
+            --privileged \
+            -v ${HOME}/actions-runner/_work/vllm-xpu/vllm-xpu/qa_ci_code:/WORKSPACE \
+            -v /dev/dri/by-path:/dev/dri/by-path \
+            -v ${HOME}/.cache:/root/.cache/ \
+            -e http_proxy=${http_proxy:-"http://proxy-dmz.intel.com:912"} \
+            -e https_proxy=${http_proxy:-"http://proxy-dmz.intel.com:912"} \
+            -e no_proxy=${no_proxy:-"127.0.0.1,localhost"} \
+            --device /dev/dri:/dev/dri \
+            -w /workspace \
+            --entrypoint='' \
+            --mount type=bind,source="$HOME/.secrets/my_token",target=/run/secrets/my_token,readonly \
+            $image_name \
+            bash -c "bash /WORKSPACE/.buildkite/nightly-benchmarks/scripts/CI_run_server_benchmarks.sh BMG_KERNEL || true; chown -R \$(id -u):\$(id -g) /WORKSPACE"
+
+      - name: Validate server benchmark results
+        run: |
+          python3 ${HOME}/actions-runner/_work/vllm-xpu/vllm-xpu/qa_ci_code/.buildkite/nightly-benchmarks/scripts/analyze_benchmark_results_final.py --test-selector BMG_KERNEL
+          cat ${HOME}/actions-runner/_work/vllm-xpu/vllm-xpu/qa_ci_code/benchmarks/results/benchmark_analysis_final.json
+
+      - name: Fix permissions
+        run: sudo chmod -R 755 ${{ runner.workspace }}/vllm-xpu/qa_ci_code/benchmarks/results/
+
+      - name: Debug path
+        run: ls -la ${{ runner.workspace }}/vllm-xpu/qa_ci_code/benchmarks/results/
+
+      - name: Upload benchmark results
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: benchmark-results
+          path: ${{ runner.workspace }}/vllm-xpu/qa_ci_code/benchmarks/results/
+
+      - name: Analyze and validate benchmark results
+        if: always()
+        run: |
+          RESULTS_FILE="$HOME/actions-runner/_work/vllm-xpu/vllm-xpu/qa_ci_code/benchmarks/results/benchmark_analysis_final.json"
+          if [ ! -f "$RESULTS_FILE" ]; then
+            echo "❌ Benchmark analysis file not found!"
+            exit 1
+          fi
+
+          echo "📊 Benchmark Results:"
+          cat "$RESULTS_FILE"
+          FAILURES=$(jq -r '.[] | select(.function != "pass") | .case_name' "$RESULTS_FILE")
+
+          if [ -n "$FAILURES" ]; then
+            echo "❌ Failed cases detected:"
+            echo "$FAILURES"
+            exit 1
+          else
+            echo "✅ All benchmarks passed"
+          fi

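Review note on the "Build docker image" step: the retry loop picks a proxy with `(attempt-1) % ${#PROXIES[@]}`, and the `PROXIES` array lists `child-prc.intel.com:913` twice (both times commented "First fallback"), so the entry commented "Primary proxy" is only tried on the second and fifth attempts. The sketch below models that schedule in Python to make the rotation easy to inspect. It is an illustration only: `build_args` and `proxy_schedule` are hypothetical helper names that do not exist in the workflow, and the constants are copied from the diff above.

```python
# Values copied from the workflow's "Build docker image" step.
PROXIES = [
    "http://child-prc.intel.com:913",  # commented "First fallback" in the workflow
    "http://proxy.ims.intel.com:911",  # commented "Primary proxy"
    "http://child-prc.intel.com:913",  # same URL as the first entry
]
MAX_RETRIES = 6
NO_PROXY = ".intel.com,intel.com,localhost,127.0.0.1"


def build_args(attempt: int) -> list[str]:
    """Return the --build-arg flags the loop passes on a 1-based attempt.

    Hypothetical helper: it mirrors the shell loop, it is not part of the CI.
    """
    proxy = PROXIES[(attempt - 1) % len(PROXIES)]
    args = [
        "--build-arg", f"http_proxy={proxy}",
        "--build-arg", f"https_proxy={proxy}",
    ]
    if attempt > 1:  # the first attempt is made without no_proxy
        args += ["--build-arg", f"no_proxy={NO_PROXY}"]
    return args


def proxy_schedule() -> list[str]:
    """List the proxy used by each attempt, in attempt order."""
    return [PROXIES[(n - 1) % len(PROXIES)] for n in range(1, MAX_RETRIES + 1)]


if __name__ == "__main__":
    for n, proxy in enumerate(proxy_schedule(), start=1):
        print(n, proxy, "with no_proxy" if n > 1 else "without no_proxy")
```

Under these values, four of the six attempts go through `child-prc.intel.com:913`. If the intent was to try the primary proxy first, reordering the array and dropping the duplicate entry would be one way to get there, though that is a guess at the intent rather than something the commit states.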
requirements/xpu.txt

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@ torchaudio
 torchvision
 --extra-index-url=https://download.pytorch.org/whl/xpu
 
-intel-extension-for-pytorch @ https://intel-extension-for-pytorch.s3.us-east-1.amazonaws.com/ipex_dev/xpu/intel_extension_for_pytorch-2.8.10.post1%2Bxpu-cp312-cp312-linux_x86_64.whl
+vllm-xpu-kernels @ https://ubit-artifactory-ba.intel.com/artifactory/aipc_releases-ba-local/gpu/new/validation/IPEX/nightly/PVC/UBUNTU/VLLM_nightly/vllm_kernel/20251014/e50cca090e/vllm_xpu_kernels-0.0.1-cp312-cp312-linux_x86_64.whl

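Review note on `requirements/xpu.txt`: both the removed and the added line pin a wheel by direct URL, so the wheel filename itself fixes the interpreter and platform the requirement can install on. The sketch below splits such a line into its name and wheel tags using only the standard library. `parse_wheel_pin` is a hypothetical helper written for this note, and it assumes the `name @ url` form and the usual `dist-version[-build]-python-abi-platform.whl` filename layout.

```python
from urllib.parse import unquote, urlsplit

# The two requirement lines touched by this commit, copied from the diff.
OLD_PIN = (
    "intel-extension-for-pytorch @ https://intel-extension-for-pytorch"
    ".s3.us-east-1.amazonaws.com/ipex_dev/xpu/"
    "intel_extension_for_pytorch-2.8.10.post1%2Bxpu-cp312-cp312-linux_x86_64.whl"
)
NEW_PIN = (
    "vllm-xpu-kernels @ https://ubit-artifactory-ba.intel.com/artifactory/"
    "aipc_releases-ba-local/gpu/new/validation/IPEX/nightly/PVC/UBUNTU/"
    "VLLM_nightly/vllm_kernel/20251014/e50cca090e/"
    "vllm_xpu_kernels-0.0.1-cp312-cp312-linux_x86_64.whl"
)


def parse_wheel_pin(line: str) -> dict[str, str]:
    """Split a 'name @ wheel-url' requirement into name, version, and tags.

    Hypothetical helper for illustration; it does not validate the tags.
    """
    name, url = (part.strip() for part in line.split(" @ ", 1))
    # Take the last path segment and decode percent-escapes such as %2B ('+').
    filename = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel URL: {url}")
    fields = filename[: -len(".whl")].split("-")
    if len(fields) not in (5, 6):  # an optional build tag adds a sixth field
        raise ValueError(f"unexpected wheel filename: {filename}")
    return {
        "name": name,
        "version": fields[1],
        "python": fields[-3],
        "abi": fields[-2],
        "platform": fields[-1],
    }


if __name__ == "__main__":
    for pin in (OLD_PIN, NEW_PIN):
        print(parse_wheel_pin(pin))
```

Both pins carry the `cp312` and `linux_x86_64` tags, so the swap does not change which interpreter or platform the requirements file targets. The new URL points at an `intel.com` Artifactory host, which may not be reachable from outside Intel's network.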