Merged
164 commits
6dd55af
[Doc] Update docs on handling OOM (#15357)
DarkLight1337 Mar 24, 2025
9d72daf
[V1][Perf] Simpler request output queues (#15156)
njhill Mar 24, 2025
623e2ed
[BugFix][V1] Quick fix for min_tokens with multiple EOS (#15407)
njhill Mar 24, 2025
23fdab0
[Hardware][TPU] Skip failed compilation test (#15421)
lsy323 Mar 24, 2025
8279201
[Build] Cython compilation support fix (#14296)
gshtras Mar 24, 2025
f533b58
[ROCm][Kernel] MoE weights padding (#14454)
gshtras Mar 24, 2025
ebcebee
[V1][Spec Decode] Enable spec decode for top-p & top-k sampling (#15063)
WoosukKwon Mar 25, 2025
911c8eb
[Minor][Spec Decode] Remove compiled_softmax (#15416)
WoosukKwon Mar 25, 2025
97cfa65
Add pipeline parallel support to `TransformersModel` (#12832)
hmellor Mar 25, 2025
6db9457
[Misc] Remove LoRA log (#15388)
jeejeelee Mar 25, 2025
b5269db
Revert "Fix non-contiguous input passed to Marlin kernel (#15319)" (#…
tlrmchlsmth Mar 25, 2025
10b34e3
[Bugfix] Fixed the issue of not being able to input video and image s…
chaunceyjiang Mar 25, 2025
a09ad90
[V1] guidance backend for structured output + `auto` fallback mode (#…
russellb Mar 25, 2025
25f560a
[V1][Spec Decode] Update target_logits in place for rejection samplin…
WoosukKwon Mar 25, 2025
051da7e
Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin…
houseroad Mar 25, 2025
4157f56
[Hardware][TPU][Bugfix] Fix v1 mp profiler (#15409)
lsy323 Mar 25, 2025
4f044b1
[Kernel][CPU] CPU MLA (#14744)
gau-nernst Mar 25, 2025
3e2f37a
Dockerfile.ppc64le changes to move to UBI (#15402)
Shafi-Hussain Mar 25, 2025
a9e879b
[Misc] Clean up MiniCPM-V/O code (#15337)
DarkLight1337 Mar 25, 2025
5994430
[Misc] Remove redundant `num_embeds` (#15443)
DarkLight1337 Mar 25, 2025
3f04a7f
[Doc] Update V1 user guide for multi-modality (#15460)
DarkLight1337 Mar 25, 2025
a608160
[Kernel] Fix conflicting macro names for gguf kernels (#15456)
SzymonOzog Mar 25, 2025
d0cfec7
[bugfix] fix inductor cache on max_position_embeddings (#15436)
youkaichao Mar 25, 2025
0a049c7
[CI/Build] Add tests for the V1 tpu_model_runner. (#14843)
yarongmu-google Mar 25, 2025
5d8e1c9
[Bugfix] Support triton==3.3.0+git95326d9f for RTX 5090 (Unsloth + vL…
oteroantoniogom Mar 25, 2025
5f063a8
[bugfix] add supports_v1 platform interface (#15417)
joerunde Mar 25, 2025
e977c11
Add workaround for shared field_names in pydantic model class (#13925)
maxdebayser Mar 25, 2025
a0dd7dc
[TPU][V1] Fix Sampler recompilation (#15309)
NickLucche Mar 25, 2025
6aa196c
[V1][Minor] Use `SchedulerInterface` type for engine scheduler field …
njhill Mar 25, 2025
082ab86
[V1] Support long_prefill_token_threshold in v1 scheduler (#15419)
houseroad Mar 25, 2025
ac3cd6e
[core] add bucket padding to tpu_model_runner (#14995)
Chenyaaang Mar 25, 2025
a5cfbab
[Core] LoRA: V1 Scheduler optimization (#15422)
varun-sundar-rabindranath Mar 25, 2025
ff38f0a
[CI/Build] LoRA: Delete long context tests (#15503)
varun-sundar-rabindranath Mar 26, 2025
e42389f
Transformers backend already supports V1 (#15463)
hmellor Mar 26, 2025
997c881
[Model] Support multi-image for Molmo (#15438)
DarkLight1337 Mar 26, 2025
23114d3
[Misc] Warn about v0 in benchmark_paged_attn.py (#15495)
tlrmchlsmth Mar 26, 2025
33437bc
[BugFix] Fix nightly MLA failure (FA2 + MLA chunked prefill, i.e. V1,…
LucasWilkinson Mar 26, 2025
6c663df
[misc] LoRA - Skip LoRA kernels when not required (#15152)
varun-sundar-rabindranath Mar 26, 2025
5aefd6a
Fix raw_request extraction in load_aware_call decorator (#15382)
daniel-salib Mar 26, 2025
781d056
[Feature] Enhance EAGLE Architecture with Proper RMS Norms (#14990)
luyuzhe111 Mar 26, 2025
5ebf667
[FEAT][ROCm] Integrate Fused MoE Kernels from AITER (#14967)
vllmellm Mar 26, 2025
99f536f
[Misc] Enhance warning information to user-defined chat template (#15…
wwl2755 Mar 26, 2025
4ec2cee
[Misc] improve example script output (#15528)
reidliu41 Mar 26, 2025
cf5c8f1
Separate base model from `TransformersModel` (#15467)
hmellor Mar 26, 2025
1aa162e
Apply torchfix (#15532)
cyyever Mar 26, 2025
c091c0a
Improve validation of TP in Transformers backend (#15540)
hmellor Mar 26, 2025
1711b92
[Model] Add Reasoning Parser for Granite Models (#14202)
alex-jw-brooks Mar 26, 2025
e64afa4
multi-node offline DP+EP example (#15484)
youkaichao Mar 26, 2025
0af4d76
Fix weight loading for some models in Transformers backend (#15544)
hmellor Mar 26, 2025
733e7c9
[Refactor] Remove unnecessary backend parameter in structured output …
aarnphm Mar 26, 2025
35fad35
[V1][Sampler] Faster top-k only implementation (#15478)
njhill Mar 26, 2025
27df519
Support SHA256 as hash function in prefix caching (#15297)
dr75 Mar 26, 2025
dd8a29d
Applying some fixes for K8s agents in CI (#15493)
Alexei-V-Ivanov-AMD Mar 26, 2025
b2e85e2
[V1] TPU - Revert to exponential padding by default (#15565)
alexm-redhat Mar 26, 2025
9d119a8
[V1] TPU CI - Fix test_compilation.py (#15570)
alexm-redhat Mar 26, 2025
94c662f
Merge remote-tracking branch 'upstream/main'
gshtras Mar 26, 2025
970a0b4
Merge remote-tracking branch 'upstream/main' into upstream_merge_2025…
gshtras Mar 26, 2025
7a88827
Use Cache Hinting for fused_moe kernel (#15511)
wrmedford Mar 26, 2025
e74ff40
[TPU] support disabling xla compilation cache (#15567)
yaochengji Mar 27, 2025
7a6d45b
Support FIPS enabled machines with MD5 hashing (#15299)
MattTheCuber Mar 27, 2025
9239bf7
[Kernel] CUTLASS grouped gemm fp8 MoE kernel (#13972)
ElizaWszola Mar 27, 2025
ce78f9a
Add automatic tpu label to mergify.yml (#15560)
mgoin Mar 27, 2025
69db16a
add platform check back (#15578)
Chenyaaang Mar 27, 2025
8095341
[misc] LoRA: Remove unused long context test data (#15558)
varun-sundar-rabindranath Mar 27, 2025
7f301dd
[Doc] Update V1 user guide for fp8 kv cache support (#15585)
wayzeng Mar 27, 2025
fb22be5
[moe][quant] add weight name case for offset (#15515)
MengqingCao Mar 27, 2025
54aa619
[V1] Refactor num_computed_tokens logic (#15307)
comaniac Mar 27, 2025
dcf2a59
Allow torchao quantization in SiglipMLP (#15575)
jerryzh168 Mar 27, 2025
ecff830
[ROCm] Env variable to trigger custom PA (#15557)
gshtras Mar 27, 2025
619d3de
[TPU] [V1] fix cases when max_num_reqs is set smaller than MIN_NUM_SE…
yaochengji Mar 27, 2025
df8d3d1
[Misc] Restrict ray version dependency and update PP feature warning …
ruisearch42 Mar 27, 2025
e1e0fd7
[TPU] Avoid Triton Import (#15589)
robertgshaw2-redhat Mar 27, 2025
f4c98b4
[Misc] Consolidate LRUCache implementations (#15481)
Avabowler Mar 27, 2025
43ed414
[Quantization] Fp8 Channelwise Dynamic Per Token GroupedGEMM (#15587)
robertgshaw2-redhat Mar 27, 2025
e6c9053
[Misc] Clean up `scatter_patch_features` (#15559)
DarkLight1337 Mar 27, 2025
3f532cb
[Misc] Use model_redirect to redirect the model name to a local folde…
noooop Mar 27, 2025
6278bc8
Fix incorrect filenames in vllm_compile_cache.py (#15494)
zou3519 Mar 27, 2025
8063dfc
[Doc] update --system for transformers installation in docker doc (#1…
reidliu41 Mar 27, 2025
ac5bc61
[Model] MiniCPM-V/O supports V1 (#15487)
DarkLight1337 Mar 27, 2025
8958217
[Bugfix] Fix use_cascade_attention handling for Alibi-based models on…
h-sugi Mar 27, 2025
07bf813
[Doc] Link to onboarding tasks (#15629)
DarkLight1337 Mar 27, 2025
2471815
[Misc] Replace `is_encoder_decoder_inputs` with `split_enc_dec_inputs…
DarkLight1337 Mar 27, 2025
66aa4c0
[Feature] Add middleware to log API Server responses (#15593)
terrytangyuan Mar 27, 2025
13ac9ca
[Misc] Avoid direct access of global `mm_registry` in `compute_encode…
DarkLight1337 Mar 27, 2025
46450b8
Use absolute placement for Ask AI button (#15628)
hmellor Mar 27, 2025
4098b72
[Bugfix][TPU][V1] Fix recompilation (#15553)
NickLucche Mar 27, 2025
32d6692
Correct PowerPC to modern IBM Power (#15635)
clnperez Mar 27, 2025
112b3e5
[CI] Update rules for applying `tpu` label. (#15634)
russellb Mar 27, 2025
15dac21
[V1] AsyncLLM data parallel (#13923)
njhill Mar 27, 2025
bd45912
[TPU] Lazy Import (#15656)
robertgshaw2-redhat Mar 28, 2025
726efc6
[Quantization][V1] BitsAndBytes support V1 (#15611)
jeejeelee Mar 28, 2025
4e0f607
[Bugfix] Fix failure to launch in Tensor Parallel TP mode on macOS. (…
kebe7jun Mar 28, 2025
b4245a4
[Doc] Fix dead links in Job Board (#15637)
wwl2755 Mar 28, 2025
8a49eea
[CI][TPU] Temporarily Disable Quant Test on TPU (#15649)
robertgshaw2-redhat Mar 28, 2025
4ae17bf
Revert "Use Cache Hinting for fused_moe kernel (#15511)" (#15645)
wrmedford Mar 28, 2025
e7f720e
[Misc]add coding benchmark for speculative decoding (#15303)
CXIAAAAA Mar 28, 2025
4d0ec37
[Quantization][FP8] Adding support for fp8 gemm layer input in fp8 (#…
gshtras Mar 28, 2025
cec8c7d
Refactor error handling for multiple exceptions in preprocessing (#15…
JasonZhu1313 Mar 28, 2025
8693e47
[Bugfix] Fix `mm_hashes` forgetting to be passed (#15668)
DarkLight1337 Mar 28, 2025
355f663
[V1] Remove legacy input registry (#15673)
DarkLight1337 Mar 28, 2025
2d9045f
[TPU][CI] Fix TPUModelRunner Test (#15667)
robertgshaw2-redhat Mar 28, 2025
32b14ba
[Refactor][Frontend] Keep all logic about reasoning into one class (#…
gaocegege Mar 28, 2025
280d074
[CPU][CI] Improve CPU Dockerfile (#15690)
bigPYJ1151 Mar 28, 2025
70f2c2a
[Bugfix] Fix 'InductorAdaptor object has no attribute 'cache_dir' (#1…
jeejeelee Mar 28, 2025
a10314c
[Misc] Fix test_sleep to use query parameters (#14373)
lizzzcai Mar 28, 2025
3bbaacb
[Bugfix][Frontend] Eliminate regex based check in reasoning full gene…
gaocegege Mar 28, 2025
fd5fd26
[Frontend] update priority for --api-key and VLLM_API_KEY (#15588)
reidliu41 Mar 28, 2025
0b41675
[Docs] Add "Generation quality changed" section to troubleshooting (#…
hmellor Mar 28, 2025
91276c5
[Model] Adding torch compile annotations to chatglm (#15624)
jeejeelee Mar 28, 2025
3b00ff9
[Bugfix][v1] xgrammar structured output supports Enum. (#15594)
chaunceyjiang Mar 28, 2025
812a41c
Merge remote-tracking branch 'upstream/main' into upstream_merge_2025…
gshtras Mar 28, 2025
82ade59
Merge remote-tracking branch 'origin/main' into upstream_merge_2025_0…
gshtras Mar 28, 2025
a4dba75
Merge remote-tracking branch 'upstream/main' into upstream_merge_2025…
gshtras Mar 28, 2025
541d1df
[Bugfix] `embed_is_patch` for Idefics3 (#15696)
DarkLight1337 Mar 28, 2025
7329ff5
[V1] Support disable_any_whtespace for guidance backend (#15584)
russellb Mar 28, 2025
2914006
[doc] add missing imports (#15699)
reidliu41 Mar 28, 2025
432cf22
[Bugfix] Fix regex compile display format (#15368)
kebe7jun Mar 28, 2025
47e9038
Fix cpu offload testing for gptq/awq/ct (#15648)
mgoin Mar 28, 2025
70e1322
[Minor] Remove TGI launching script (#15646)
WoosukKwon Mar 28, 2025
c6bc003
[Misc] Remove unused utils and clean up imports (#15708)
DarkLight1337 Mar 28, 2025
d03308b
[Misc] Remove stale func in KVTransferConfig (#14746)
ShangmingCai Mar 28, 2025
038bede
[TPU] [Perf] Improve Memory Usage Estimation (#15671)
robertgshaw2-redhat Mar 28, 2025
04437e3
[Bugfix] [torch.compile] Add Dynamo metrics context during compilatio…
ProExpertProg Mar 28, 2025
c3f687a
[V1] TPU - Fix the chunked prompt bug (#15713)
alexm-redhat Mar 28, 2025
26df46e
[Misc] cli auto show default value (#15582)
reidliu41 Mar 28, 2025
f3f8d8f
implement prometheus fast-api-instrumentor for http service metrics (…
daniel-salib Mar 29, 2025
cff8991
[Docs][V1] Optimize diagrams in prefix caching design (#15716)
simpx Mar 29, 2025
c802f54
[ROCm][AMD][Build] Update AMD supported arch list (#15632)
gshtras Mar 29, 2025
de1cb38
[Model] Support Skywork-R1V (#15397)
pengyuange Mar 29, 2025
762b424
[Docs] Document v0 engine support in reasoning outputs (#15739)
gaocegege Mar 29, 2025
6d531ad
[Misc][V1] Misc code streamlining (#15723)
njhill Mar 29, 2025
1286211
[Bugfix] LoRA V1: add and fix entrypoints tests (#15715)
varun-sundar-rabindranath Mar 29, 2025
7a79920
[CI] Speed up V1 structured output tests (#15718)
russellb Mar 29, 2025
8427f70
Use numba 0.61 for python 3.10+ to support numpy>=2 (#15692)
cyyever Mar 29, 2025
5b800f0
[Bugfix] set VLLM_WORKER_MULTIPROC_METHOD=spawn for vllm.entrypoionts…
jinzhen-lin Mar 29, 2025
da461f3
[TPU][V1][Bugfix] Fix w8a8 recompiilation with GSM8K (#15714)
NickLucche Mar 29, 2025
7c1f760
[Kernel][TPU][ragged-paged-attn] vLLM code change for PR#8896 (#15659)
yarongmu-google Mar 29, 2025
73aa704
[doc] update doc (#15740)
reidliu41 Mar 29, 2025
4965ec4
[FEAT] [ROCm] Add AITER int8 scaled gemm kernel (#15433)
tjtanaa Mar 29, 2025
94744ba
[V1] [Feature] Collective RPC (#15444)
wwl2755 Mar 29, 2025
6fa7cd3
[Feature][Disaggregated] Support XpYd disaggregated prefill with Moon…
ShangmingCai Mar 29, 2025
c67abd6
[V1] Support interleaved modality items (#15605)
ywang96 Mar 29, 2025
2bc4be4
[V1][Minor] Simplify rejection sampler's parse_output (#15741)
WoosukKwon Mar 29, 2025
3c0ff91
[Bugfix] Fix Mllama interleaved images input support (#15564)
Isotr0py Mar 29, 2025
0455337
[CI] xgrammar structured output supports Enum. (#15757)
chaunceyjiang Mar 30, 2025
6909a76
[Bugfix] Fix Mistral guided generation using xgrammar (#15704)
juliendenize Mar 30, 2025
44c3a5a
[doc] update conda to usage link in installation (#15761)
reidliu41 Mar 30, 2025
7fd8c0f
fix test_phi3v (#15321)
pansicheng Mar 30, 2025
803d5c3
[V1] Override `mm_counts` for dummy data creation (#15703)
DarkLight1337 Mar 30, 2025
248e76c
fix: lint fix a ruff checkout syntax error (#15767)
yihong0618 Mar 30, 2025
bb103b2
[Bugfix] Added `embed_is_patch` mask for fuyu model (#15731)
kylehh Mar 30, 2025
70fedd0
fix: Comments to English for better dev experience (#15768)
yihong0618 Mar 30, 2025
9b459ec
[V1][Scheduler] Avoid calling `_try_schedule_encoder_inputs` for ever…
WoosukKwon Mar 30, 2025
18ed313
[Misc] update the comments (#15780)
lcy4869 Mar 31, 2025
effc5d2
[Benchmark] Update Vision Arena Dataset and HuggingFaceDataset Setup …
JenZhao Mar 31, 2025
e858294
[Feature][ROCm]Enable fusion pass for torch.compile on ROCm (#15050)
charlifu Mar 31, 2025
b932c04
Recommend developing with Python 3.12 in developer guide (#15811)
hmellor Mar 31, 2025
e7ae3bf
fix: better install requirement for install in setup.py (#15796)
yihong0618 Mar 31, 2025
555aa21
[V1] Fully Transparent Implementation of CPU Offloading (#15354)
youkaichao Mar 31, 2025
3aa2b6a
[Model] Update support for NemotronNAS models (#15008)
Naveassaf Mar 31, 2025
c2e7507
[Bugfix] Fix Crashing When Loading Modules With Batchnorm Stats (#15813)
alex-jw-brooks Mar 31, 2025
037bcd9
[Bugfix] Fix missing return value in load_weights method of adapters.…
noc-turne Mar 31, 2025
e294861
Merge remote-tracking branch 'upstream/main' into upstream_merge_2025…
gshtras Mar 31, 2025
f264100
Merge branch 'main' into upstream_merge_2025_03_31
gshtras Mar 31, 2025
2 changes: 1 addition & 1 deletion .buildkite/release-pipeline.yaml
@@ -82,7 +82,7 @@ steps:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain -f Dockerfile.cpu ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
env:
DOCKER_BUILDKIT: "1"
7 changes: 4 additions & 3 deletions .buildkite/run-amd-test.sh
@@ -134,9 +134,10 @@ if [[ $commands == *"--shard-id="* ]]; then
# assign shard-id for each shard
commands_gpu=${commands//"--shard-id= "/"--shard-id=${GPU} "}
echo "Shard ${GPU} commands:$commands_gpu"
echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
docker run \
--device /dev/kfd --device /dev/dri \
--network host \
--device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
--network=host \
--shm-size=16gb \
--rm \
-e HIP_VISIBLE_DEVICES="${GPU}" \
@@ -166,7 +167,7 @@ else
echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
docker run \
--device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
--network host \
--network=host \
--shm-size=16gb \
--rm \
-e HIP_VISIBLE_DEVICES=0 \
18 changes: 11 additions & 7 deletions .buildkite/run-cpu-test.sh
@@ -8,15 +8,19 @@ set -ex
CORE_RANGE=${CORE_RANGE:-48-95}
NUMA_NODE=${NUMA_NODE:-1}

# Try building the docker image
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test-"$BUILDKITE_BUILD_NUMBER" -f Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 -f Dockerfile.cpu .

# Setup cleanup
remove_docker_container() { set -e; docker rm -f cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" || true; }
remove_docker_container() {
set -e;
docker rm -f cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" || true;
docker image rm cpu-test-"$BUILDKITE_BUILD_NUMBER" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 || true;
}
trap remove_docker_container EXIT
remove_docker_container

# Try building the docker image
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --tag cpu-test-"$BUILDKITE_BUILD_NUMBER" --target vllm-test -f Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" --tag cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 --target vllm-test -f Dockerfile.cpu .

# Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"
@@ -36,8 +40,8 @@ function cpu_tests() {
# Run basic model test
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pip install -r vllm/requirements/test.txt
pip install -r vllm/requirements/cpu.txt
pytest -v -s tests/kernels/test_cache.py -m cpu_model
pytest -v -s tests/kernels/test_mla_decode_cpu.py -m cpu_model
pytest -v -s tests/models/decoder_only/language -m cpu_model
pytest -v -s tests/models/embedding/language -m cpu_model
pytest -v -s tests/models/encoder_decoder/language -m cpu_model
9 changes: 6 additions & 3 deletions .buildkite/run-tpu-v1-test.sh
@@ -22,17 +22,20 @@ docker run --privileged --net host --shm-size=16G -it \
&& export VLLM_USE_V1=1 \
&& export VLLM_XLA_CHECK_RECOMPILATION=1 \
&& echo TEST_1 \
&& python3 /workspace/vllm/tests/tpu/test_compilation.py \
&& pytest -v -s /workspace/vllm/tests/tpu/test_compilation.py \
&& echo TEST_2 \
&& pytest -v -s /workspace/vllm/tests/v1/tpu/test_basic.py \
&& echo TEST_3 \
&& pytest -v -s /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine \
&& echo TEST_4 \
&& pytest -s -v /workspace/vllm/tests/tpu/test_quantization_accuracy.py \
&& echo TEST_5 \
&& python3 /workspace/vllm/examples/offline_inference/tpu.py" \
&& python3 /workspace/vllm/examples/offline_inference/tpu.py \
&& echo TEST_6 \
&& pytest -s -v /workspace/vllm/tests/v1/tpu/worker/test_tpu_model_runner.py \
&& echo TEST_7 \
&& pytest -s -v /workspace/vllm/tests/v1/tpu/test_sampler.py" \


# TODO: This test fails because it uses RANDOM_SEED sampling
# && VLLM_USE_V1=1 pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \

16 changes: 10 additions & 6 deletions .buildkite/test-pipeline.yaml
@@ -138,21 +138,23 @@ steps:
- examples/offline_inference/rlhf.py
- examples/offline_inference/rlhf_colocate.py
- tests/examples/offline_inference/data_parallel.py
- tests/v1/test_async_llm_dp.py
commands:
# test with tp=2 and external_dp=2
- VLLM_USE_V1=0 torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
- torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
# test with internal dp
- python3 ../examples/offline_inference/data_parallel.py
- TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/test_async_llm_dp.py
- pytest -v -s distributed/test_utils.py
- pytest -v -s compile/test_basic_correctness.py
- pytest -v -s distributed/test_pynccl.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py
# TODO: create a dedicated test section for multi-GPU example tests
# when we have multiple distributed example tests
- pushd ../examples/offline_inference
- VLLM_ENABLE_V1_MULTIPROCESSING=0 python3 rlhf.py
- VLLM_ENABLE_V1_MULTIPROCESSING=0 RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
- python3 rlhf.py
- RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
- popd

- label: Metrics, Tracing Test # 10min
@@ -295,7 +297,7 @@ steps:
source_file_dependencies:
- vllm/lora
- tests/lora
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_minicpmv_tp.py --ignore=lora/test_transfomers_model.py
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_minicpmv_tp.py --ignore=lora/test_transfomers_model.py
parallelism: 4

- label: PyTorch Fullgraph Smoke Test # 9min
@@ -441,6 +443,7 @@ steps:
- pytest -v -s models/encoder_decoder/audio_language -m core_model
- pytest -v -s models/encoder_decoder/language -m core_model
- pytest -v -s models/encoder_decoder/vision_language -m core_model
- pytest -v -s models/decoder_only/vision_language/test_interleaved.py

- label: Multi-Modal Models Test (Extended) 1 # 48m
optional: true
@@ -526,8 +529,11 @@ steps:
- vllm/worker/worker.py
- vllm/worker/model_runner.py
- entrypoints/llm/test_collective_rpc.py
- tests/v1/test_async_llm_dp.py
- vllm/v1/engine/
commands:
- VLLM_ENABLE_V1_MULTIPROCESSING=0 pytest -v -s entrypoints/llm/test_collective_rpc.py
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/test_async_llm_dp.py
- pytest -v -s entrypoints/llm/test_collective_rpc.py
- pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
@@ -604,8 +610,6 @@ steps:
# FIXIT: find out which code initialize cuda before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
# This test runs llama 13B, so it is required to run on 4 GPUs.
- pytest -v -s -x lora/test_long_context.py
# There is some Tensor Parallelism related processing logic in LoRA that
# requires multi-GPU testing for validation.
- pytest -v -s -x lora/test_chatglm3_tp.py
30 changes: 30 additions & 0 deletions .github/mergify.yml
@@ -88,6 +88,36 @@ pull_request_rules:
add:
- v1

- name: label-tpu
description: Automatically apply tpu label
# Keep this list in sync with `label-tpu-remove` conditions
conditions:
- or:
- files~=tpu.py
- files~=_tpu
- files~=tpu_
- files~=/tpu/
- files~=pallas
actions:
label:
add:
- tpu

- name: label-tpu-remove
description: Automatically remove tpu label
# Keep this list in sync with `label-tpu` conditions
conditions:
- and:
- -files~=tpu.py
- -files~=_tpu
- -files~=tpu_
- -files~=/tpu/
- -files~=pallas
actions:
label:
remove:
- tpu

- name: ping author on conflicts and add 'needs-rebase' label
conditions:
- conflict
30 changes: 29 additions & 1 deletion CMakeLists.txt
@@ -34,7 +34,7 @@ set(PYTHON_SUPPORTED_VERSIONS "3.9" "3.10" "3.11" "3.12")
set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0")

# Supported AMD GPU architectures.
set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201")
set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201")

#
# Supported/expected torch versions for CUDA/ROCm.
@@ -235,6 +235,7 @@ set(VLLM_EXT_SRC
"csrc/activation_kernels.cu"
"csrc/layernorm_kernels.cu"
"csrc/layernorm_quant_kernels.cu"
"csrc/cuda_view.cu"
"csrc/quantization/gptq/q_gemm.cu"
"csrc/quantization/compressed_tensors/int8_quant_kernels.cu"
"csrc/quantization/fp8/common.cu"
@@ -462,6 +463,33 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
set(FP4_ARCHS)
endif()

#
# CUTLASS MoE kernels

# The MoE kernel cutlass_moe_mm requires CUDA 12.3 or later (and only works
# on Hopper). get_cutlass_moe_mm_data should only be compiled if it's possible
# to compile MoE kernels that use its output.
cuda_archs_loose_intersection(SCALED_MM_ARCHS "9.0a;" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND SCALED_MM_ARCHS)
set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu"
"csrc/quantization/cutlass_w8a8/moe/moe_data.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM90=1")
message(STATUS "Building grouped_mm_c3x for archs: ${SCALED_MM_ARCHS}")
else()
if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND SCALED_MM_ARCHS)
message(STATUS "Not building grouped_mm_c3x kernels as CUDA Compiler version is "
"not >= 12.3, we recommend upgrading to CUDA 12.3 or later "
"if you intend on running FP8 quantized MoE models on Hopper.")
else()
message(STATUS "Not building grouped_mm_c3x as no compatible archs found "
"in CUDA target architectures")
endif()
endif()

#
# Machete kernels

143 changes: 0 additions & 143 deletions Dockerfile.base_navi

This file was deleted.
