Merged
Changes from all commits (141 commits)
dfada85
[Frontend] Expose custom args in OpenAI APIs (#16862)
afeldman-nm Jun 19, 2025
36239f7
Fix FA2 fallback for Blackwell V1 (#19781)
mgoin Jun 19, 2025
8d1e89d
[Misc][ROCm] Enforce no unused variable in ROCm C++ files (#19796)
houseroad Jun 19, 2025
4959915
[Quantization] Modify the logic of BNB double quantization (#19742)
jeejeelee Jun 19, 2025
799397e
Support embedding models in V1 (#16188)
maxdebayser Jun 19, 2025
b1098b4
[Bugfix] Fix the linter (#19826)
houseroad Jun 19, 2025
e2148dc
[Bugfix] Add check_health to v1 async client. (#19821)
kouroshHakha Jun 19, 2025
83ca9ae
Mark invariant normalizer in Gemma as non-persistent (#19788)
yhtang Jun 19, 2025
2de12be
[ROCm] [AITER] [Bugfix] Patch for AITER commit `648764942e552a8bb5fe1…
tjtanaa Jun 19, 2025
aa20d10
[Misc] [ROCm] Prevent surplus tensor reshape (#19803)
zsolt-borbely-htec Jun 19, 2025
c7b370c
raise exception for pin_lora (#19809)
andyxning Jun 19, 2025
6021999
[Minor] Allow redirecting model path for HfRunner in test (#19795)
Isotr0py Jun 19, 2025
1d0ae26
Add xLAM tool parser support (#17148)
zuxin666 Jun 19, 2025
466166d
[Frontend] Add optional token-level progress bar to `LLM.beam_search`…
NekoMimiUnagi Jun 19, 2025
4719460
Fixing Chunked Prefill Test. (#19762)
Alexei-V-Ivanov-AMD Jun 19, 2025
6f68c49
[Doc] Update V1 user guide for embedding models (#19842)
22quinn Jun 19, 2025
01220ce
[CI][CPU] Improve dummy Triton interfaces and fix the CPU CI (#19838)
bigPYJ1151 Jun 19, 2025
ead2110
[Core][Bugfix] Fix Online MM Beam Search (#19688)
alex-jw-brooks Jun 19, 2025
ea10dd9
[Frontend] early return chat format resolution when specified (#19735)
xzbdmw Jun 19, 2025
10d82f9
[Benchmark][Bugfix] Fix Dataset Length Calculation (#19868)
robertgshaw2-redhat Jun 20, 2025
ee9a153
[CI/Build][Bugfix] Fix deadlock on v1 engine test CI (#19872)
Isotr0py Jun 20, 2025
b6bad3d
[CI][Neuron] Fail and exit on first error (#19622)
elaineyz Jun 20, 2025
5aa4a01
[Benchmark] Fix `Value of type "SampleRequest" is not indexable` (#18…
b8zhong Jun 20, 2025
e41bf15
[Chore]: qwen3-moe-type-hints-mistake (#19860)
Xerxes-cn Jun 20, 2025
e3a3e4d
[Bugfix] Enable PP with AITER+V1 (#19822)
qli88 Jun 20, 2025
5e666f7
[Bugfix][Ray] Set the cuda context eagerly in the ray worker (#19583)
kouroshHakha Jun 20, 2025
089a306
[Misc] update cuda version (#19526)
reidliu41 Jun 20, 2025
e384f2f
[Misc] refactor example - openai_transcription_client (#19851)
reidliu41 Jun 20, 2025
71d1219
[Kernel] correct cpu worker function parameter type (#19745)
andyxning Jun 20, 2025
7771d1d
[Fix] import regex instead of re (#19875)
tdoublep Jun 20, 2025
f1e840e
[Model] GPT2ForSequenceClassification model (#19663)
nie3e Jun 20, 2025
7e8977f
[custom_op][vllm-plugin] update custom_op class to use op_registry (#…
xuechendi Jun 20, 2025
2e3e3c8
Export NaNs in logits to scheduler_stats if output is corrupted (#18777)
vladmihailescu Jun 20, 2025
79f2f1c
[CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tes…
bigPYJ1151 Jun 20, 2025
71baf85
[Kernel] mark TorchSDPABackend swap_blocks NotImplementedError (#19749)
andyxning Jun 20, 2025
e773a9e
[Misc] Clean up useless code (#19889)
wangxiyuan Jun 20, 2025
8ca81bb
Fix: Check the type of params to be a Sequence not list. (#19910)
rabinadk1 Jun 20, 2025
6f170f1
[Bugfix] Fix bnb 8bit model weights loading (#19917)
Isotr0py Jun 21, 2025
c3bf9ba
[New model support]Support Tarsier2 (#19887)
princepride Jun 21, 2025
caa680f
[doc] add contact us in community (#19922)
reidliu41 Jun 21, 2025
2c5302f
[Multimodal] Optimize Qwen2/2.5-VL startup time (#19756)
WoosukKwon Jun 21, 2025
3b1e4c6
[Docs] Add GPT2ForSequenceClassification to supported models in docs …
nie3e Jun 21, 2025
4c409ca
[Misc] add vllm_config in __init__ (#19866)
andyxning Jun 22, 2025
2bb246b
[MISC] add cpu_kvcache_space_bytes to CacheConfig (#19812)
andyxning Jun 22, 2025
202c5df
[Benchmark] fix request loss if "ping" is returned (#19535)
sywangyi Jun 22, 2025
c305a21
[CI/Build] Auto tag perf benchmarks related PRs (#19943)
22quinn Jun 22, 2025
ec0db6f
[doc] use snippets for contact us (#19944)
reidliu41 Jun 22, 2025
c76a506
[Misc] Update model-specific PR tagging (#19949)
Jun 22, 2025
2c11a29
[Misc] Simplify vllm bench cli subcommand implementation (#19948)
yeqcharlotte Jun 22, 2025
e91386c
[Chore] dedup logs (#19955)
aarnphm Jun 22, 2025
33d51f5
[BugFix] Add an env to disable moe chunking to work around compile in…
yeqcharlotte Jun 22, 2025
c4cf260
[Perf][CLI] Improve overall startup time (#19941)
aarnphm Jun 22, 2025
4a0f788
[Core] feat: Implement Priority Scheduling in V1 Engine (#19057)
amitm02 Jun 23, 2025
f39ab2d
[Misc] Configurable timeout for execute_model RPC calls via env var (…
jinqinn Jun 23, 2025
493c275
Fix(models/siglip): Add compatibility for Gemma models quantized by l…
Flink-ddd Jun 23, 2025
f17aec0
[doc] Fold long code blocks to improve readability (#19926)
reidliu41 Jun 23, 2025
2ebff5b
[P/D][NixlConnector] Support `tp_size > num_kv_heads` deployments (#1…
NickLucche Jun 23, 2025
1bcd15e
[BugFix][P/D] Fix for cases where _recving_transfers can be cleaned u…
lk-chen Jun 23, 2025
5111642
[Doc] Update V1 status for decoder-only embedding models (#19952)
Isotr0py Jun 23, 2025
b82e0f8
[doc] use MkDocs collapsible blocks - supplement (#19973)
reidliu41 Jun 23, 2025
a6e6604
[Bugfix] Fix CI bitsandbytes failure (#19969)
jeejeelee Jun 23, 2025
53243e5
[doc] improve readability for long commands (#19920)
reidliu41 Jun 23, 2025
c3649e4
[Docs] Fix syntax highlighting of shell commands (#19870)
lgeiger Jun 23, 2025
68aaeb3
[EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low l…
tlrmchlsmth Jun 23, 2025
61f4fc5
[Bugfix][v1] Fix step pooler implementation and step pooling usage in…
Isotr0py Jun 23, 2025
d0132f0
[Misc] Add type alias `ReqId` and `EngineId` for better readability (…
lk-chen Jun 23, 2025
e6327c9
[Feature] Support sequence parallelism for static fp8 quantization (#…
cascade812 Jun 23, 2025
a3bc76e
[CI/Build] Push latest tag for cpu and neuron docker image (#19897)
22quinn Jun 23, 2025
dd2ccf8
Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend (#19395)
Jun-Howie Jun 23, 2025
4671ac6
[Bugfix][Benchmark] Fix Marlin benchmark (#19929)
22quinn Jun 23, 2025
33d5e29
[TPU] Fix tpu model runner test (#19995)
Chenyaaang Jun 23, 2025
a738dbb
Update test case parameter to have the throughput above 8.0 (#19994)
QiliangCui Jun 24, 2025
ee5ad8d
[Misc][Tools][Benchmark] Add profile to autotune script (#19711)
Chenyaaang Jun 24, 2025
0eed516
[doc] Fix broken link in the installation for CPU (#19980)
yankay Jun 24, 2025
3014c92
add some examples for other benchmark scripts (#19893)
reidliu41 Jun 24, 2025
9a3b883
[PERF] Speedup of MRoPE prepare inputs (#19939)
vadiklyutiy Jun 24, 2025
53da4cd
[Bugfix][CPU] Fix InputBatch for pooling models in the CPU v1 (#20014)
bigPYJ1151 Jun 24, 2025
26d34eb
refactor example - qwen3_reranker (#19847)
reidliu41 Jun 24, 2025
981eeca
[Fix][V1] Remove --scheduling-policy oracle (#20010)
amitm02 Jun 24, 2025
a045b7e
[Perf] Improve/Fix-regression for FA3 in High QPS regimes (#19463)
LucasWilkinson Jun 24, 2025
c635c5f
[Misc][Benchmarking] Add variable request-rate ("ramp-up") to the ben…
dtransposed Jun 24, 2025
8619e71
[BugFix] Fix multi-node offline data parallel (#19937)
njhill Jun 24, 2025
91f7d9d
[P/D] Asynchronously do _nixl_handshake (#19836)
lk-chen Jun 24, 2025
c6e3bba
[Feature] Integrate new deepgemm (#19820)
yewentao256 Jun 24, 2025
ead3698
[Easy] Remove submodule added in #19463 (#20039)
b8zhong Jun 24, 2025
c01d1c5
use .dev for version comparison with pytorch nightly release (#20031)
BoyuanFeng Jun 24, 2025
0d06b53
cmake: Update vllm_flash_attn for vllm_kernels (#20032)
seemethere Jun 24, 2025
1afa994
[Llama4] Update `attn_temperature_tuning` (#19997)
b8zhong Jun 25, 2025
a6c4b87
Revert "[Feature] Integrate new deepgemm (#19820)" (#20049)
yewentao256 Jun 25, 2025
2273ec3
Revert "Fix(models/siglip): Add compatibility for Gemma models quanti…
Isotr0py Jun 25, 2025
3443aaf
Move to a faster base64 implementation (#19984)
h-avsha Jun 25, 2025
7108934
[Frontend] speed up import time of vllm.config (#18036)
davidxia Jun 25, 2025
879f69b
[Refactor] Remove duplicate `ceil_div` (#20023)
yewentao256 Jun 25, 2025
f59fc60
[Feat][CLI] enforce-include-usage (#19695)
max-wittig Jun 25, 2025
015fab8
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. …
bnellnm Jun 25, 2025
ba7ba35
[Chore] debloat some initial logs (#19438)
aarnphm Jun 25, 2025
0f9e735
[BugFix] Fix full-cuda-graph illegal memory access in FA3 (#20057)
LucasWilkinson Jun 25, 2025
c53fec1
[doc] add reference link for Intel XPU (#20064)
reidliu41 Jun 25, 2025
bf51815
[Doc] Guide for Incremental Compilation Workflow (#19109)
mgoin Jun 25, 2025
8359f4c
[V1][Speculative Decoding] Fix DeepSeek MTP (#20022)
cjackal Jun 25, 2025
e795d72
[Frontend] Add `/v1/audio/translations` OpenAI API endpoint (#19615)
NickLucche Jun 25, 2025
02c97d9
[Quantization] Add compressed-tensors emulations support for NVFP4 (#…
dsikka Jun 25, 2025
23a04e0
[Fix] Support cls pooling in ModernBertPooler (#20067)
lsz05 Jun 25, 2025
8b8c209
static_scaled_fp8_quant should not run when scale.numel is not 1 (#20…
eldarkurtic Jun 25, 2025
4734704
[PD] let toy proxy handle /chat/completions (#19730)
lk-chen Jun 25, 2025
c40692b
[Misc] Add parallel state `node_count` function (#20045)
njhill Jun 25, 2025
4e0db57
Fix the path to the testing script. (#20082)
QiliangCui Jun 25, 2025
9f0608f
[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine (…
izhuhaoran Jun 25, 2025
2cc2069
[TPU][Bugfix] fix kv cache padding (#20048)
yaochengji Jun 25, 2025
55c65ab
[P/D] Avoid stranding blocks in P when aborted in D's waiting queue (…
njhill Jun 25, 2025
2d7620c
[TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN (#19919)
Chenyaaang Jun 25, 2025
296ce95
[CI] Add SM120 to the Dockerfile (#19794)
mgoin Jun 25, 2025
754b00e
[Bugfix] Fix Mistral tool-parser regex for nested JSON (#20093)
mgoin Jun 26, 2025
2582683
[PD] Skip `tp_size` exchange with rank0 (#19413)
NickLucche Jun 26, 2025
9502c38
[Benchmark][Bug] Fix multiple bugs in bench and add args to spec_deco…
ekagra-ranjan Jun 26, 2025
65397e4
[Bugfix] Allow `CUDA_VISIBLE_DEVICES=''` in `Platform.device_id_to_ph…
eicherseiji Jun 26, 2025
1d7c29f
[Doc] Update docs for New Model Implementation (#20115)
DarkLight1337 Jun 26, 2025
d188913
[Refactor] Remove unused library (#20099)
yewentao256 Jun 26, 2025
0567c82
[CPU] Fix torch version in x86 CPU backend (#19258)
bigPYJ1151 Jun 26, 2025
167aca4
[Misc] Use collapsible blocks for benchmark examples. (#20017)
reidliu41 Jun 26, 2025
84c260c
[Docs] Improve frameworks/helm.md (#20113)
windsonsea Jun 26, 2025
27c065d
[Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break a…
tjtanaa Jun 26, 2025
1f5d178
Revert "[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 …
mgoin Jun 26, 2025
c894c5d
[Bug Fix] Fix address/port already in use error for deep_ep test (#20…
yewentao256 Jun 26, 2025
0907d50
[Doc] Automatically signed-off by PyCharm (#20120)
noooop Jun 26, 2025
6393b03
[Doc] Auto sign-off for VSCode (#20132)
DarkLight1337 Jun 26, 2025
34878a0
[Doc] Rename page titles (#20130)
DarkLight1337 Jun 26, 2025
0bceac9
Spam folks if config.py changes (#20131)
tlrmchlsmth Jun 26, 2025
b69781f
[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention b…
jikunshang Jun 26, 2025
04e1642
[TPU] add kv cache update kernel (#19928)
yaochengji Jun 26, 2025
5623088
[Refactor] Rename commnication utils (#20091)
yewentao256 Jun 26, 2025
07b8fae
[Doc] correct LoRA capitalization (#20135)
kyolebu Jun 26, 2025
e9fd658
[Feature] Expert Parallelism Load Balancer (EPLB) (#18343)
abmfy Jun 26, 2025
71799fd
[CI Failure] Fix OOM with test_oot_registration_embedding (#20144)
mgoin Jun 27, 2025
a57d57f
[Quantization] Bump to use latest `compressed-tensors` (#20033)
dsikka Jun 27, 2025
2d7779f
[Perf] SM100 FP8 GEMM Optimizations after cutlass_profiler (#20071)
ilmarkov Jun 27, 2025
44d2e6a
[Bugfix] Build moe_data for both sm100 and sm90 (#20086)
mgoin Jun 27, 2025
0740e29
[Feature] add quick all reduce (#19744)
lihaoyang-amd Jun 27, 2025
8b64c89
[CI] Sync test dependency with test.in for torch nightly (#19632)
yangw-dev Jun 27, 2025
e110930
[Fix] Fix gemma CI test failing on main (#20124)
tdoublep Jun 27, 2025
cd4cfee
[Model][1/N] Automatic conversion of CrossEncoding model (#20012)
noooop Jun 27, 2025
2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/nightly-annotation.md
@@ -16,7 +16,7 @@ Please download the visualization scripts in the post
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code:

```console
```bash
export HF_TOKEN=<your HF token>
apt update
apt install -y git
2 changes: 2 additions & 0 deletions .buildkite/release-pipeline.yaml
@@ -102,6 +102,7 @@ steps:
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest"
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
env:
DOCKER_BUILDKIT: "1"
@@ -117,6 +118,7 @@ steps:
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:latest --progress plain -f docker/Dockerfile.neuron ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:latest"
- "docker push public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:$(buildkite-agent meta-data get release-version)"
env:
DOCKER_BUILDKIT: "1"
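The two added `docker push` lines follow the usual build-once, push-both-tags pattern, so `latest` always tracks the most recent versioned release image. A minimal generic sketch of that pattern (the repository name and version below are placeholders, not the real ECR repo):

```bash
# Build one image with both a version tag and "latest", then push both tags.
# "example/repo" and "1.2.3" are illustrative placeholders only.
docker build -t example/repo:1.2.3 -t example/repo:latest .
docker push example/repo:1.2.3
docker push example/repo:latest
```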
3 changes: 2 additions & 1 deletion .buildkite/scripts/hardware_ci/run-neuron-test.sh
@@ -54,10 +54,11 @@ docker run --rm -it --device=/dev/neuron0 --network bridge \
--name "${container_name}" \
${image_name} \
/bin/bash -c "
set -e; # Exit on first error
python3 /workspace/vllm/examples/offline_inference/neuron.py;
python3 -m pytest /workspace/vllm/tests/neuron/1_core/ -v --capture=tee-sys;
for f in /workspace/vllm/tests/neuron/2_core/*.py; do
echo 'Running test file: '$f;
echo \"Running test file: \$f\";
python3 -m pytest \$f -v --capture=tee-sys;
done
"
2 changes: 2 additions & 0 deletions .buildkite/scripts/hardware_ci/run-tpu-v1-test.sh
@@ -159,6 +159,8 @@ run_and_track_test 14 "test_tpu_qkv_linear.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_tpu_qkv_linear.py"
run_and_track_test 15 "test_spmd_model_weight_loading.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_spmd_model_weight_loading.py"
run_and_track_test 16 "test_kv_cache_update_kernel.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_kv_cache_update_kernel.py"

# After all tests have been attempted, exit with the overall status.
if [ "$overall_script_exit_code" -ne 0 ]; then
1 change: 1 addition & 0 deletions .buildkite/scripts/hardware_ci/run-xpu-test.sh
@@ -28,4 +28,5 @@ docker run \
sh -c '
VLLM_USE_V1=0 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m
VLLM_USE_V1=0 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m -tp 2
VLLM_USE_V1=1 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager
'
4 changes: 2 additions & 2 deletions .buildkite/scripts/tpu/config_v6e_1.env
@@ -4,8 +4,8 @@ CONTAINER_NAME=vllm-tpu

# vllm config
MODEL=meta-llama/Llama-3.1-8B-Instruct
MAX_NUM_SEQS=512
MAX_NUM_BATCHED_TOKENS=512
MAX_NUM_SEQS=256
MAX_NUM_BATCHED_TOKENS=1024
TENSOR_PARALLEL_SIZE=1
MAX_MODEL_LEN=2048
DOWNLOAD_DIR=/mnt/disks/persist
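For reference, the values in this env file map directly onto standard vLLM engine flags, so the updated defaults (fewer concurrent sequences, a larger per-batch token budget) can be reproduced outside the benchmark harness roughly as below; this is a sketch, and the harness's actual launch command may differ:

```bash
# Sketch only: standard vLLM CLI flags matching the updated env values above.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 256 \
  --max-num-batched-tokens 1024 \
  --tensor-parallel-size 1 \
  --max-model-len 2048 \
  --download-dir /mnt/disks/persist
```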
2 changes: 1 addition & 1 deletion .buildkite/scripts/tpu/docker_run_bm.sh
@@ -68,7 +68,7 @@ docker run \

echo "run script..."
echo
docker exec "$CONTAINER_NAME" /bin/bash -c ".buildkite/scripts/hardware_ci/run_bm.sh"
docker exec "$CONTAINER_NAME" /bin/bash -c ".buildkite/scripts/tpu/run_bm.sh"

echo "copy result back..."
VLLM_LOG="$LOG_ROOT/$TEST_NAME"_vllm_log.txt
45 changes: 43 additions & 2 deletions .buildkite/test-pipeline.yaml
@@ -41,6 +41,16 @@ steps:
# TODO: add `--strict` once warnings in docstrings are fixed
- mkdocs build

- label: Pytorch Nightly Dependency Override Check # 2min
# if this test fails, it means the nightly torch version is not compatible with some
# of the dependencies. Please check the error message and add the package to whitelist
# in /vllm/tools/generate_nightly_torch_test.py
soft_fail: true
source_file_dependencies:
- requirements/nightly_torch_test.txt
commands:
- bash standalone_tests/pytorch_nightly_dependency.sh

- label: Async Engine, Inputs, Utils, Worker Test # 24min
mirror_hardwares: [amdexperimental]
source_file_dependencies:
@@ -89,7 +99,7 @@ steps:
- VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py

- label: Chunked Prefill Test
mirror_hardwares: [amdexperimental]
mirror_hardwares: [amdexperimental, amdproduction]
source_file_dependencies:
- vllm/
- tests/basic_correctness/test_chunked_prefill
@@ -168,6 +178,23 @@ steps:
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
- popd

- label: EPLB Algorithm Test
working_dir: "/vllm-workspace/tests"
source_file_dependencies:
- vllm/distributed/eplb
- tests/distributed/test_eplb_algo.py
commands:
- pytest -v -s distributed/test_eplb_algo.py

- label: EPLB Execution Test # 5min
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
- vllm/distributed/eplb
- tests/distributed/test_eplb_execute.py
commands:
- pytest -v -s distributed/test_eplb_execute.py

- label: Metrics, Tracing Test # 10min
mirror_hardwares: [amdexperimental, amdproduction]
num_gpus: 2
@@ -271,6 +298,15 @@ steps:
commands:
- pytest -v -s prefix_caching


- label: Platform Tests (CUDA)
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- vllm/
- tests/cuda
commands:
- pytest -v -s cuda/test_cuda_context.py

- label: Samplers Test # 36min
mirror_hardwares: [amdexperimental]
source_file_dependencies:
@@ -606,13 +642,18 @@ steps:
- vllm/executor/
- vllm/model_executor/models/
- tests/distributed/
- tests/examples/offline_inference/data_parallel.py
commands:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code

- label: Distributed Tests (2 GPUs) # 40min
mirror_hardwares: [amdexperimental]
Expand Down Expand Up @@ -736,7 +777,7 @@ steps:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models.txt

- label: Weight Loading Multiple GPU Test - Large Models # optional
mirror_hardwares: [amdexperimental]
mirror_hardwares: [amdexperimental]
working_dir: "/vllm-workspace/tests"
num_gpus: 2
gpu: a100
4 changes: 4 additions & 0 deletions .github/CODEOWNERS
@@ -18,6 +18,10 @@
/vllm/entrypoints @aarnphm
CMakeLists.txt @tlrmchlsmth

# Any change to the VllmConfig changes can have a large user-facing impact,
# so spam a lot of people
/vllm/config.py @simon-mo @WoosukKwon @youkaichao @robertgshaw2-redhat @mgoin @tlrmchlsmth @houseroad @hmellor

# vLLM V1
/vllm/v1 @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat
/vllm/v1/structured_output @mgoin @russellb @aarnphm
15 changes: 14 additions & 1 deletion .github/mergify.yml
@@ -45,6 +45,7 @@ pull_request_rules:
- files~=^vllm/entrypoints/openai/tool_parsers/llama.*\.py
- files~=^vllm/model_executor/models/.*llama.*\.py
- files~=^vllm/transformers_utils/configs/.*llama.*\.py
- title~=(?i)llama
actions:
label:
add:
@@ -65,6 +66,19 @@
add:
- multi-modality

- name: label-performance
description: Automatically apply performance label
conditions:
- or:
- files~=^benchmarks/
- files~=^vllm/benchmarks/
- files~=^tests/benchmarks/
- files~=^\.buildkite/nightly-benchmarks/
actions:
label:
add:
- performance

- name: label-qwen
description: Automatically apply qwen label
conditions:
@@ -74,7 +88,6 @@
- files~=^vllm/model_executor/models/.*qwen.*\.py
- files~=^vllm/reasoning/.*qwen.*\.py
- title~=(?i)Qwen
- body~=(?i)Qwen
actions:
label:
add:
10 changes: 10 additions & 0 deletions .pre-commit-config.yaml
@@ -53,6 +53,11 @@ repos:
files: ^requirements/test\.(in|txt)$
- repo: local
hooks:
- id: format-torch-nightly-test
name: reformat nightly_torch_test.txt to be in sync with test.in
language: python
entry: python tools/generate_nightly_torch_test.py
files: ^requirements/test\.(in|txt)$
- id: mypy-local
name: Run mypy for local Python installation
entry: tools/mypy.sh 0 "local"
@@ -115,6 +120,11 @@ repos:
entry: python tools/check_spdx_header.py
language: python
types: [python]
- id: check-root-lazy-imports
name: Check root lazy imports
entry: python tools/check_init_lazy_imports.py
language: python
types: [python]
- id: check-filenames
name: Check for spaces in all filenames
entry: bash
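The two new local hooks can be exercised directly with the standard pre-commit CLI; the hook IDs below are taken from the diff above:

```bash
# Run every configured hook once against the whole tree
pre-commit run --all-files

# Or target just the newly added hooks by ID
pre-commit run format-torch-nightly-test --all-files
pre-commit run check-root-lazy-imports --all-files
```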
22 changes: 20 additions & 2 deletions CMakeLists.txt
@@ -513,6 +513,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
CUDA_ARCHS "${FP4_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_NVFP4=1")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1")
message(STATUS "Building NVFP4 for archs: ${FP4_ARCHS}")
else()
message(STATUS "Not building NVFP4 as no compatible archs were found.")
@@ -547,8 +548,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# if it's possible to compile MoE kernels that use its output.
cuda_archs_loose_intersection(SCALED_MM_ARCHS "9.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND SCALED_MM_ARCHS)
set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu"
"csrc/quantization/cutlass_w8a8/moe/moe_data.cu")
set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_ARCHS}")
@@ -566,6 +566,16 @@
endif()
endif()

# moe_data.cu is used by all CUTLASS MoE kernels.
cuda_archs_loose_intersection(CUTLASS_MOE_DATA_ARCHS "9.0a;10.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND CUTLASS_MOE_DATA_ARCHS)
set(SRCS "csrc/quantization/cutlass_w8a8/moe/moe_data.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${CUTLASS_MOE_DATA_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
endif()

#
# Machete kernels

@@ -638,6 +648,14 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# if CUDA endif
endif()

if (VLLM_GPU_LANG STREQUAL "HIP")
# Add QuickReduce kernels
list(APPEND VLLM_EXT_SRC
"csrc/custom_quickreduce.cu"
)
# if ROCM endif
endif()

message(STATUS "Enabling C extension.")
define_gpu_extension_target(
_C
2 changes: 2 additions & 0 deletions README.md
@@ -154,11 +154,13 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs

## Contact Us

<!-- --8<-- [start:contact-us] -->
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues) or [Discussions](https://github.com/vllm-project/vllm/discussions)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu)
<!-- --8<-- [end:contact-us] -->

## Media Kit

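The `--8<--` comments added to the README are snippet-section markers (pymdownx.snippets), which let the docs include this block instead of duplicating it. A minimal sketch of the include side, assuming the docs build enables that extension (the including file name is hypothetical):

```md
<!-- e.g. in a hypothetical docs/community/contact.md -->
--8<-- "README.md:contact-us"
```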