
Commit 519785a

bbartels and QierLi authored and committed
Bump Flashinfer to 0.3.1 (#24868)
Signed-off-by: bbartels <benjamin@bartels.dev>

[gpt-oss] Add IncompleteDetails to ResponsesRepsonse (#24561)
Signed-off-by: Andrew Xia <axia@meta.com>

[gpt-oss][1a] create_responses stream outputs BaseModel type, api server is SSE still (#24759)
Signed-off-by: Andrew Xia <axia@meta.com>

[Performance] Remove redundant clone() calls in cutlass_mla (#24891)

[Bug] Fix Cutlass Scaled MM Compilation Error (#24887)
Signed-off-by: yewentao256 <zhyanwentao@126.com>

[ci] fix wheel names for arm wheels (#24898)
Signed-off-by: simon-mo <simon.mo@hey.com>

[Tests] fix initialization of kv hash in tests (#24273)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>

[Compile] Fix noop_elimination pass and add tests for noop_elimination (#24880)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

Propagate entire tokens to connector for resumed preemptions
Signed-off-by: Qier Li <kevin44036@gmail.com>

Fix pre-commit
Signed-off-by: Qier Li <kevin44036@gmail.com>

Rename field and nullify empty lists
Signed-off-by: Qier Li <kevin44036@gmail.com>

Update vllm/v1/core/sched/scheduler.py
Co-authored-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Qier Li <kevin44036@gmail.com>

Add unit test for preemption resumption
Signed-off-by: Qier Li <kevin44036@gmail.com>
1 parent 49bfc53 · commit 519785a

27 files changed: +473 −215 lines

.buildkite/release-pipeline.yaml

Lines changed: 4 additions & 12 deletions
@@ -1,24 +1,22 @@
 steps:
   # aarch64 + CUDA builds. PyTorch 2.8 aarch64 + CUDA wheel is only available on CUDA 12.9
   - label: "Build arm64 wheel - CUDA 12.9"
+    depends_on: ~
     id: build-wheel-arm64-cuda-12-9
     agents:
       queue: arm64_cpu_queue_postmerge
     commands:
       # #NOTE: torch_cuda_arch_list is derived from upstream PyTorch build files here:
       # https://github.com/pytorch/pytorch/blob/main/.ci/aarch64_linux/aarch64_ci_build.sh#L7
-      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg torch_cuda_arch_list='8.7 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
+      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg VLLM_MAIN_CUDA_VERSION=12.9 --build-arg torch_cuda_arch_list='8.7 9.0 10.0+PTX 12.0' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
       - "mkdir artifacts"
       - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
       - "bash .buildkite/scripts/upload-wheels.sh"
     env:
       DOCKER_BUILDKIT: "1"

-  - block: "Build CUDA 12.8 wheel"
-    key: block-build-cu128-wheel
-
   - label: "Build wheel - CUDA 12.8"
-    depends_on: block-build-cu128-wheel
+    depends_on: ~
     id: build-wheel-cuda-12-8
     agents:
       queue: cpu_queue_postmerge

@@ -30,12 +28,8 @@ steps:
     env:
       DOCKER_BUILDKIT: "1"

-  - block: "Build CUDA 12.6 wheel"
-    key: block-build-cu126-wheel
-    depends_on: ~
-
   - label: "Build wheel - CUDA 12.6"
-    depends_on: block-build-cu126-wheel
+    depends_on: ~
     id: build-wheel-cuda-12-6
     agents:
       queue: cpu_queue_postmerge

@@ -102,8 +96,6 @@ steps:
     depends_on:
       - create-multi-arch-manifest
       - build-wheel-cuda-12-8
-      - build-wheel-cuda-12-6
-      - build-wheel-cuda-12-9
     id: annotate-release-workflow
     agents:
       queue: cpu_queue_postmerge
.buildkite/scripts/annotate-release.sh

Lines changed: 22 additions & 7 deletions
@@ -14,18 +14,33 @@ buildkite-agent annotate --style 'info' --context 'release-workflow' << EOF
 To download the wheel:
 \`\`\`
 aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
+aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .
+
 aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu126/vllm-${RELEASE_VERSION}+cu126-cp38-abi3-manylinux1_x86_64.whl .
-aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu118/vllm-${RELEASE_VERSION}+cu118-cp38-abi3-manylinux1_x86_64.whl .
+aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu129/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
 \`\`\`

 To download and upload the image:

 \`\`\`
-docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}
-docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT} vllm/vllm-openai
-docker tag vllm/vllm-openai vllm/vllm-openai:latest
-docker tag vllm/vllm-openai vllm/vllm-openai:v${RELEASE_VERSION}
-docker push vllm/vllm-openai:latest
-docker push vllm/vllm-openai:v${RELEASE_VERSION}
+docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64
+docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64
+
+docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 vllm/vllm-openai:x86_64
+docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:latest-x86_64
+docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
+docker push vllm/vllm-openai:latest-x86_64
+docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
+
+docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64 vllm/vllm-openai:aarch64
+docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:latest-aarch64
+docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
+docker push vllm/vllm-openai:latest-aarch64
+docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
+
+docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64 --amend
+docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64 --amend
+docker manifest push vllm/vllm-openai:latest
+docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}
 \`\`\`
 EOF

.buildkite/test-pipeline.yaml

Lines changed: 1 addition & 0 deletions
@@ -394,6 +394,7 @@ steps:
     - pytest -v -s compile/test_async_tp.py
     - pytest -v -s compile/test_fusion_all_reduce.py
     - pytest -v -s compile/test_decorator.py
+    - pytest -v -s compile/test_noop_elimination.py

 - label: PyTorch Fullgraph Smoke Test # 15min
   timeout_in_minutes: 30

csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm100_fp8_dispatch.cuh

Lines changed: 23 additions & 19 deletions
@@ -146,6 +146,7 @@ void cutlass_gemm_caller_blockwise(torch::Tensor& out, torch::Tensor const& a,

   using ElementAB = typename Gemm::ElementAB;
   using ElementD = typename Gemm::ElementD;
+  using ElementBlockScale = typename Gemm::ElementBlockScale;

   int32_t m = a.size(0), n = b.size(1), k = a.size(1);

@@ -166,26 +167,29 @@
           ScaleConfig::tile_atom_to_shape_SFB(make_shape(n, m, k, 1)) :
           ScaleConfig::tile_atom_to_shape_SFB(make_shape(m, n, k, 1));

-  auto a_ptr = static_cast<ElementAB*>(a.data_ptr());
-  auto b_ptr = static_cast<ElementAB*>(b.data_ptr());
-  auto a_scales_ptr = static_cast<float*>(a_scales.data_ptr());
-  auto b_scales_ptr = static_cast<float*>(b_scales.data_ptr());
+  auto a_ptr = static_cast<ElementAB const*>(a.data_ptr());
+  auto b_ptr = static_cast<ElementAB const*>(b.data_ptr());
+  auto a_scales_ptr = static_cast<ElementBlockScale const*>(a_scales.data_ptr());
+  auto b_scales_ptr = static_cast<ElementBlockScale const*>(b_scales.data_ptr());

-  auto mainloop_args = [&](){
-    // layout_SFA and layout_SFB cannot be swapped since they are deduced.
-    if (swap_ab) {
-      return typename GemmKernel::MainloopArguments{
-        b_ptr, b_stride, a_ptr, a_stride,
-        b_scales_ptr, layout_SFA, a_scales_ptr, layout_SFB
-      };
-    }
-    else {
-      return typename GemmKernel::MainloopArguments{
-        a_ptr, a_stride, b_ptr, b_stride,
-        a_scales_ptr, layout_SFA, b_scales_ptr, layout_SFB
-      };
-    }
-  }();
+  typename GemmKernel::MainloopArguments mainloop_args{};
+  mainloop_args.layout_SFA = layout_SFA;
+  mainloop_args.layout_SFB = layout_SFB;
+  if (swap_ab) {
+    mainloop_args.ptr_A = b_ptr;
+    mainloop_args.dA = b_stride;
+    mainloop_args.ptr_B = a_ptr;
+    mainloop_args.dB = a_stride;
+    mainloop_args.ptr_SFA = b_scales_ptr;
+    mainloop_args.ptr_SFB = a_scales_ptr;
+  } else {
+    mainloop_args.ptr_A = a_ptr;
+    mainloop_args.dA = a_stride;
+    mainloop_args.ptr_B = b_ptr;
+    mainloop_args.dB = b_stride;
+    mainloop_args.ptr_SFA = a_scales_ptr;
+    mainloop_args.ptr_SFB = b_scales_ptr;
+  }
   auto prob_shape = swap_ab ? cute::make_shape(n, m, k, 1) : cute::make_shape(m, n, k, 1);

   auto c_ptr = static_cast<ElementD*>(out.data_ptr());

csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm120_fp8_dispatch.cuh

Lines changed: 15 additions & 11 deletions
@@ -125,6 +125,7 @@ void cutlass_gemm_caller_blockwise(torch::Tensor& out, torch::Tensor const& a,

   using ElementAB = typename Gemm::ElementAB;
   using ElementD = typename Gemm::ElementD;
+  using ElementBlockScale = typename Gemm::ElementBlockScale;

   int32_t m = a.size(0), n = b.size(1), k = a.size(1);

@@ -143,17 +144,20 @@ void cutlass_gemm_caller_blockwise(torch::Tensor& out, torch::Tensor const& a,
   LayoutSFB layout_SFB =
       ScaleConfig::tile_atom_to_shape_SFB(make_shape(m, n, k, 1));

-  auto a_ptr = static_cast<ElementAB*>(a.data_ptr());
-  auto b_ptr = static_cast<ElementAB*>(b.data_ptr());
-  auto a_scales_ptr = static_cast<float*>(a_scales.data_ptr());
-  auto b_scales_ptr = static_cast<float*>(b_scales.data_ptr());
-
-  auto mainloop_args = [&](){
-    return typename GemmKernel::MainloopArguments{
-      a_ptr, a_stride, b_ptr, b_stride,
-      a_scales_ptr, layout_SFA, b_scales_ptr, layout_SFB
-    };
-  }();
+  auto a_ptr = static_cast<ElementAB const*>(a.data_ptr());
+  auto b_ptr = static_cast<ElementAB const*>(b.data_ptr());
+  auto a_scales_ptr = static_cast<ElementBlockScale const*>(a_scales.data_ptr());
+  auto b_scales_ptr = static_cast<ElementBlockScale const*>(b_scales.data_ptr());
+
+  typename GemmKernel::MainloopArguments mainloop_args{};
+  mainloop_args.ptr_A = a_ptr;
+  mainloop_args.dA = a_stride;
+  mainloop_args.ptr_B = b_ptr;
+  mainloop_args.dB = b_stride;
+  mainloop_args.ptr_SFA = a_scales_ptr;
+  mainloop_args.layout_SFA = layout_SFA;
+  mainloop_args.ptr_SFB = b_scales_ptr;
+  mainloop_args.layout_SFB = layout_SFB;
   auto prob_shape = cute::make_shape(m, n, k, 1);

   auto c_ptr = static_cast<ElementD*>(out.data_ptr());

csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm90_fp8_dispatch.cuh

Lines changed: 15 additions & 11 deletions
@@ -115,6 +115,7 @@ void cutlass_gemm_caller_blockwise(torch::Tensor& out, torch::Tensor const& a,

   using ElementAB = typename Gemm::ElementAB;
   using ElementD = typename Gemm::ElementD;
+  using ElementBlockScale = typename Gemm::ElementBlockScale;

   int32_t m = a.size(0), n = b.size(1), k = a.size(1);

@@ -135,17 +136,20 @@ void cutlass_gemm_caller_blockwise(torch::Tensor& out, torch::Tensor const& a,
   LayoutSFB layout_SFB =
       ScaleConfig::tile_atom_to_shape_SFB(make_shape(m, n, k, 1));

-  auto a_ptr = static_cast<ElementAB*>(a.data_ptr());
-  auto b_ptr = static_cast<ElementAB*>(b.data_ptr());
-  auto a_scales_ptr = static_cast<float*>(a_scales.data_ptr());
-  auto b_scales_ptr = static_cast<float*>(b_scales.data_ptr());
-
-  auto mainloop_args = [&](){
-    return typename GemmKernel::MainloopArguments{
-      a_ptr, a_stride, b_ptr, b_stride,
-      a_scales_ptr, layout_SFA, b_scales_ptr, layout_SFB
-    };
-  }();
+  auto a_ptr = static_cast<ElementAB const*>(a.data_ptr());
+  auto b_ptr = static_cast<ElementAB const*>(b.data_ptr());
+  auto a_scales_ptr = static_cast<ElementBlockScale const*>(a_scales.data_ptr());
+  auto b_scales_ptr = static_cast<ElementBlockScale const*>(b_scales.data_ptr());
+
+  typename GemmKernel::MainloopArguments mainloop_args{};
+  mainloop_args.ptr_A = a_ptr;
+  mainloop_args.dA = a_stride;
+  mainloop_args.ptr_B = b_ptr;
+  mainloop_args.dB = b_stride;
+  mainloop_args.ptr_SFA = a_scales_ptr;
+  mainloop_args.layout_SFA = layout_SFA;
+  mainloop_args.ptr_SFB = b_scales_ptr;
+  mainloop_args.layout_SFB = layout_SFB;
   auto prob_shape = cute::make_shape(m, n, k, 1);

   auto c_ptr = static_cast<ElementD*>(out.data_ptr());

docker/Dockerfile

Lines changed: 3 additions & 1 deletion
@@ -196,6 +196,7 @@ ARG SCCACHE_S3_NO_CREDENTIALS=0

 # Flag to control whether to use pre-built vLLM wheels
 ARG VLLM_USE_PRECOMPILED=""
+ARG VLLM_MAIN_CUDA_VERSION=""

 # if USE_SCCACHE is set, use sccache to speed up compilation
 RUN --mount=type=cache,target=/root/.cache/uv \
@@ -213,6 +214,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
         && export SCCACHE_IDLE_TIMEOUT=0 \
         && export CMAKE_BUILD_TYPE=Release \
         && export VLLM_USE_PRECOMPILED="${VLLM_USE_PRECOMPILED}" \
+        && export VLLM_MAIN_CUDA_VERSION="${VLLM_MAIN_CUDA_VERSION}" \
         && export VLLM_DOCKER_BUILD_CONTEXT=1 \
         && sccache --show-stats \
         && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
@@ -375,7 +377,7 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
 # Install FlashInfer from source
 ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git"
 # Keep this in sync with "flashinfer" extra in setup.py
-ARG FLASHINFER_GIT_REF="v0.3.0"
+ARG FLASHINFER_GIT_REF="v0.3.1"
 # Flag to control whether to compile FlashInfer AOT kernels
 # Set to "true" to enable AOT compilation:
 #   docker build --build-arg FLASHINFER_AOT_COMPILE=true ...

setup.py

Lines changed: 3 additions & 5 deletions
@@ -56,8 +56,6 @@ def load_module_from_path(module_name, path):
     # fallback to cpu
     VLLM_TARGET_DEVICE = "cpu"

-MAIN_CUDA_VERSION = "12.8"
-

 def is_sccache_available() -> bool:
     return which("sccache") is not None and \
@@ -507,15 +505,15 @@ def get_vllm_version() -> str:
             version += f"{sep}precompiled"
         else:
             cuda_version = str(get_nvcc_cuda_version())
-            if cuda_version != MAIN_CUDA_VERSION:
+            if cuda_version != envs.VLLM_MAIN_CUDA_VERSION:
                 cuda_version_str = cuda_version.replace(".", "")[:3]
                 # skip this for source tarball, required for pypi
                 if "sdist" not in sys.argv:
                     version += f"{sep}cu{cuda_version_str}"
     elif _is_hip():
         # Get the Rocm Version
         rocm_version = get_rocm_version() or torch.version.hip
-        if rocm_version and rocm_version != MAIN_CUDA_VERSION:
+        if rocm_version and rocm_version != envs.VLLM_MAIN_CUDA_VERSION:
             version += f"{sep}rocm{rocm_version.replace('.', '')[:3]}"
     elif _is_tpu():
         version += f"{sep}tpu"
@@ -664,7 +662,7 @@ def _read_requirements(filename: str) -> list[str]:
             "mistral_common[audio]"],  # Required for audio processing
         "video": [],  # Kept for backwards compatibility
         # FlashInfer should be updated together with the Dockerfile
-        "flashinfer": ["flashinfer-python==0.3.0"],
+        "flashinfer": ["flashinfer-python==0.3.1"],
         # Optional deps for AMD FP4 quantization support
         "petit-kernel": ["petit-kernel"],
     },
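
The setup.py hunks above drop the hard-coded MAIN_CUDA_VERSION constant and compare against envs.VLLM_MAIN_CUDA_VERSION instead, which is fed by the new VLLM_MAIN_CUDA_VERSION build argument added to the release pipeline and Dockerfile earlier in this commit. The snippet below is a minimal standalone sketch of that env-backed pattern, not vLLM's actual envs module; the fallback default of "12.8" (the previously hard-coded value) and the helper names are illustrative assumptions.

```python
import os

def get_main_cuda_version() -> str:
    # Assumption: fall back to "12.8", the value of the removed MAIN_CUDA_VERSION
    # constant, when VLLM_MAIN_CUDA_VERSION is not set in the environment.
    return os.environ.get("VLLM_MAIN_CUDA_VERSION", "12.8")

def cuda_version_suffix(nvcc_cuda_version: str) -> str:
    # Only builds that deviate from the "main" CUDA version get a +cuXYZ
    # local version suffix, mirroring the logic in get_vllm_version() above.
    if nvcc_cuda_version != get_main_cuda_version():
        return "+cu" + nvcc_cuda_version.replace(".", "")[:3]
    return ""

if __name__ == "__main__":
    print(cuda_version_suffix("12.6"))                   # -> "+cu126"
    print(cuda_version_suffix(get_main_cuda_version()))  # -> "" (no suffix)
```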

tests/compile/backend.py

Lines changed: 5 additions & 1 deletion
@@ -64,4 +64,8 @@ def check_after_ops(self, ops: Sequence[OpOverload]):
             num_pre = len(list(find_op_nodes(op, self.graph_pre_pass)))
             num_post = len(list(find_op_nodes(op, self.graph_post_pass)))
             assert num_pre == 0, f"Unexpected op {op.name()} in pre-pass graph"
-            assert num_post > 0, f"Op {op.name()} not found in post-pass graph"
+            assert num_post > 0, f"Op {op.name()} not found in post-pass graph"
+
+    def op_count(self, op: OpOverload, before=False) -> int:
+        graph = self.graph_pre_pass if before else self.graph_post_pass
+        return len(list(find_op_nodes(op, graph)))
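
The new op_count helper on the test backend returns how many nodes matching a given OpOverload appear in the captured FX graph, either before (before=True) or after the pass under test has run. Below is a hedged usage sketch for a noop-elimination style test; how the backend object is constructed is an assumption, and only the op_count calls follow the signature added above.

```python
import torch

def check_redundant_slices_removed(backend) -> None:
    # "backend" is assumed to be an already-traced TestBackend instance wrapping
    # the noop-elimination pass; only the op_count calls below use the new API.
    slice_op = torch.ops.aten.slice.Tensor  # an OpOverload to count

    before = backend.op_count(slice_op, before=True)  # count in the pre-pass graph
    after = backend.op_count(slice_op)                # default: post-pass graph

    assert before > 0, "expected the traced graph to contain slice ops"
    assert after < before, "noop elimination should remove redundant slices"
```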
