[FEAT] [AITER] Support AITER operators: Fused MoE, Linear, Norm #436

Merged · 782 commits · Feb 19, 2025

Changes from all commits
8a1f938
[Doc] Update Quantization Hardware Support Documentation (#12025)
tjtanaa Jan 14, 2025
ff39141
[HPU][misc] add comments for explanation (#12034)
youkaichao Jan 14, 2025
bb354e6
[Bugfix] Fix various bugs in multi-modal processor (#12031)
DarkLight1337 Jan 14, 2025
1f18adb
[Kernel] Revert the API change of Attention.forward (#12038)
heheda12345 Jan 14, 2025
2e0e017
[Platform] Add output for Attention Backend (#11981)
wangxiyuan Jan 14, 2025
a2d2acb
[Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (#12040)
heheda12345 Jan 14, 2025
c9d6ff5
Explain where the engine args go when using Docker (#12041)
hmellor Jan 14, 2025
16f8680
Docs lint
gshtras Jan 14, 2025
eb4abfd
Merge remote-tracking branch 'origin/main' into upstream_merge_25_01_13
gshtras Jan 14, 2025
87054a5
[Doc]: Update the Json Example of the `Engine Arguments` document (#1…
maang-h Jan 14, 2025
5976f48
Merge pull request #358 from ROCm/upstream_merge_25_01_13
gshtras Jan 14, 2025
a3a3ee4
[Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_…
jeejeelee Jan 14, 2025
42f5e7c
[Kernel] Support MulAndSilu (#11624)
jeejeelee Jan 15, 2025
1a51b9f
[HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in se…
kzawora-intel Jan 15, 2025
9ddac56
[Platform] move current_memory_usage() into platform (#11369)
shen-shanshan Jan 15, 2025
b7ee940
[V1][BugFix] Fix edge case in VLM scheduling (#12065)
WoosukKwon Jan 15, 2025
0794e74
[Misc] Add multipstep chunked-prefill support for FlashInfer (#10467)
elfiegg Jan 15, 2025
f218f9c
[core] Turn off GPU communication overlap for Ray executor (#12051)
ruisearch42 Jan 15, 2025
ad34c0d
[core] platform agnostic executor via collective_rpc (#11256)
youkaichao Jan 15, 2025
3f9b7ab
[Doc] Update examples to remove SparseAutoModelForCausalLM (#12062)
kylesayrs Jan 15, 2025
994fc65
[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCache…
heheda12345 Jan 15, 2025
cbe9439
Fix: cases with empty sparsity config (#12057)
rahul-tuli Jan 15, 2025
ad388d2
Type-fix: make execute_model output type optional (#12020)
youngkent Jan 15, 2025
3adf0ff
[Platform] Do not raise error if _Backend is not found (#12023)
wangxiyuan Jan 15, 2025
97eb97b
[Model]: Support internlm3 (#12037)
RunningLeon Jan 15, 2025
5ecf3e0
Misc: allow to use proxy in `HTTPConnection` (#12042)
zhouyuan Jan 15, 2025
de0526f
[Misc][Quark] Upstream Quark format to VLLM (#10765)
kewang-xlnx Jan 15, 2025
57e729e
[Doc]: Update `OpenAI-Compatible Server` documents (#12082)
maang-h Jan 15, 2025
edce722
[Bugfix] use right truncation for non-generative tasks (#12050)
joerunde Jan 15, 2025
70755e8
[V1][Core] Autotune encoder cache budget (#11895)
ywang96 Jan 15, 2025
ebd8c66
[Bugfix] Fix _get_lora_device for HQQ marlin (#12090)
varun-sundar-rabindranath Jan 15, 2025
cd9d06f
Allow hip sources to be directly included when compiling for rocm. (#…
tvirolai-amd Jan 15, 2025
fa0050d
[Core] Default to using per_token quantization for fp8 when cutlass i…
elfiegg Jan 16, 2025
f8ef146
[Doc] Add documentation for specifying model architecture (#12105)
DarkLight1337 Jan 16, 2025
9aa1519
Various cosmetic/comment fixes (#12089)
mgoin Jan 16, 2025
dd7c9ad
[Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (#12…
Isotr0py Jan 16, 2025
bf53e0c
Support torchrun and SPMD-style offline inference (#12071)
youkaichao Jan 16, 2025
92e793d
[core] LLM.collective_rpc interface and RLHF example (#12084)
youkaichao Jan 16, 2025
874f7c2
[Bugfix] Fix max image feature size for Llava-one-vision (#12104)
ywang96 Jan 16, 2025
8bd76fb
Enable user marker for vllm profiling (#357)
Lzy17 Jan 16, 2025
5fd24ec
[misc] Add LoRA kernel micro benchmarks (#11579)
varun-sundar-rabindranath Jan 16, 2025
62b06ba
[Model] Add support for deepseek-vl2-tiny model (#12068)
Isotr0py Jan 16, 2025
c5a9406
Deepseek V3 support (#364)
gshtras Jan 16, 2025
d06e824
[Bugfix] Set enforce_eager automatically for mllama (#12127)
heheda12345 Jan 16, 2025
ebc73f2
[Bugfix] Fix a path bug in disaggregated prefill example script. (#12…
KuntaiDu Jan 17, 2025
fead53b
[CI]add genai-perf benchmark in nightly benchmark (#10704)
jikunshang Jan 17, 2025
1475847
[Doc] Add instructions on using Podman when SELinux is active (#12136)
terrytangyuan Jan 17, 2025
b8bfa46
[Bugfix] Fix issues in CPU build Dockerfile (#12135)
terrytangyuan Jan 17, 2025
d1adb9b
[BugFix] add more `is not None` check in VllmConfig.__post_init__ (#1…
heheda12345 Jan 17, 2025
d75ab55
[Misc] Add deepseek_vl2 chat template (#12143)
Isotr0py Jan 17, 2025
8027a72
[ROCm][MoE] moe tuning support for rocm (#12049)
divakar-amd Jan 17, 2025
69d765f
[V1] Move more control of kv cache initialization from model_executor…
heheda12345 Jan 17, 2025
07934cc
[Misc][LoRA] Improve the readability of LoRA error messages (#12102)
jeejeelee Jan 17, 2025
d4e6194
[CI/Build][CPU][Bugfix] Fix CPU CI (#12150)
bigPYJ1151 Jan 17, 2025
87a0c07
[core] allow callable in collective_rpc (#12151)
youkaichao Jan 17, 2025
58fd57f
[Bugfix] Fix score api for missing max_model_len validation (#12119)
wallashss Jan 17, 2025
54cacf0
[Bugfix] Mistral tokenizer encode accept list of str (#12149)
jikunshang Jan 17, 2025
b5b57e3
[AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134)
gshtras Jan 17, 2025
7b98a65
[torch.compile] disable logging when cache is disabled (#12043)
youkaichao Jan 17, 2025
2b83503
[misc] fix cross-node TP (#12166)
youkaichao Jan 18, 2025
c09503d
[AMD][CI/Build][Bugfix] use pytorch stale wheel (#12172)
hongxiayang Jan 18, 2025
da02cb4
[core] further polish memory profiling (#12126)
youkaichao Jan 18, 2025
813f249
[Docs] Fix broken link in SECURITY.md (#12175)
russellb Jan 18, 2025
02798ec
[Model] Port deepseek-vl2 processor, remove dependency (#12169)
Isotr0py Jan 18, 2025
6d0e3d3
[core] clean up executor class hierarchy between v1 and v0 (#12171)
youkaichao Jan 18, 2025
32eb0da
[Misc] Support register quantization method out-of-tree (#11969)
ice-tong Jan 19, 2025
7a8a48d
[V1] Collect env var for usage stats (#12115)
simon-mo Jan 19, 2025
4e94951
[BUGFIX] Move scores to float32 in case of running xgrammar on cpu (#…
madamczykhabana Jan 19, 2025
630eb5b
[Bugfix] Fix multi-modal processors for transformers 4.48 (#12187)
DarkLight1337 Jan 19, 2025
e66faf4
[torch.compile] store inductor compiled Python file (#12182)
youkaichao Jan 19, 2025
936db11
benchmark_serving support --served-model-name param (#12109)
gujingit Jan 19, 2025
edaae19
[Misc] Add BNB support to GLM4-V model (#12184)
Isotr0py Jan 19, 2025
81763c5
[V1] Add V1 support of Qwen2-VL (#12128)
ywang96 Jan 19, 2025
bbe5f9d
[Model] Support for fairseq2 Llama (#11442)
MartinGleize Jan 19, 2025
df450aa
[Bugfix] Fix num_heads value for simple connector when tp enabled (#1…
ShangmingCai Jan 20, 2025
51ef828
[torch.compile] fix sym_tensor_indices (#12191)
youkaichao Jan 20, 2025
3ea7b94
Move linting to `pre-commit` (#11975)
hmellor Jan 20, 2025
c5c0620
[DOC] Fix typo in docstring and assert message (#12194)
terrytangyuan Jan 20, 2025
d264312
[DOC] Add missing docstring in LLMEngine.add_request() (#12195)
terrytangyuan Jan 20, 2025
0974c9b
[Bugfix] Fix incorrect types in LayerwiseProfileResults (#12196)
terrytangyuan Jan 20, 2025
8360979
[Model] Add Qwen2 PRM model support (#12202)
Isotr0py Jan 20, 2025
59a0192
[Core] Interface for accessing model from `VllmRunner` (#10353)
DarkLight1337 Jan 20, 2025
5c89a29
[misc] add placeholder format.sh (#12206)
youkaichao Jan 20, 2025
4001ea1
[CI/Build] Remove dummy CI steps (#12208)
DarkLight1337 Jan 20, 2025
3127e97
[CI/Build] Make pre-commit faster (#12212)
DarkLight1337 Jan 20, 2025
b37d827
[Model] Upgrade Aria to transformers 4.48 (#12203)
DarkLight1337 Jan 20, 2025
170eb35
[misc] print a message to suggest how to bypass commit hooks (#12217)
youkaichao Jan 20, 2025
c222f47
[core][bugfix] configure env var during import vllm (#12209)
youkaichao Jan 20, 2025
5f0ec39
[V1] Remove `_get_cache_block_size` (#12214)
heheda12345 Jan 20, 2025
86bfb6d
[Misc] Pass `attention` to impl backend (#12218)
wangxiyuan Jan 20, 2025
18572e3
[Bugfix] Fix `HfExampleModels.find_hf_info` (#12223)
DarkLight1337 Jan 20, 2025
9666369
[CI] Pass local python version explicitly to pre-commit mypy.sh (#12224)
heheda12345 Jan 20, 2025
031e6eb
Merge remote-tracking branch 'upstream/main'
gshtras Jan 20, 2025
3e1cadb
Merge pull request #368 from ROCm/upstream_merge_25_01_20
gshtras Jan 20, 2025
faa1815
Using ROCm6.3.1 base docker and building hipblas-common (#366)
gshtras Jan 20, 2025
7bd3630
[Misc] Update CODEOWNERS (#12229)
ywang96 Jan 20, 2025
af69a6a
fix: update platform detection for M-series arm based MacBook process…
isikhi Jan 20, 2025
da75122
[misc] add cuda runtime version to usage data (#12190)
youkaichao Jan 21, 2025
06a760d
[bugfix] catch xgrammar unsupported array constraints (#12210)
Jason-CKY Jan 21, 2025
750f4ca
[Kernel] optimize moe_align_block_size for cuda graph and large num_e…
jinzhen-lin Jan 21, 2025
ecf6781
Add quantization and guided decoding CODEOWNERS (#12228)
mgoin Jan 21, 2025
d4b62d4
[AMD][Build] Porting dockerfiles from the ROCm/vllm fork (#11777)
gshtras Jan 21, 2025
5fe6bf2
[BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (#12230)
NickLucche Jan 21, 2025
2fc6944
[ci/build] disable failed and flaky tests (#12240)
youkaichao Jan 21, 2025
9691255
[Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (#12244)
DarkLight1337 Jan 21, 2025
1f1542a
[Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (#1…
jeejeelee Jan 21, 2025
f2e9f2a
[Misc] Remove redundant TypeVar from base model (#12248)
DarkLight1337 Jan 21, 2025
a94eee4
[Bugfix] Fix mm_limits access for merged multi-modal processor (#12252)
DarkLight1337 Jan 21, 2025
c81081f
[torch.compile] transparent compilation with more logging (#12246)
youkaichao Jan 21, 2025
b197a5c
[V1][Bugfix] Fix data item ordering in mixed-modality inference (#12259)
ywang96 Jan 21, 2025
9a7c3a0
Remove pytorch comments for outlines + compressed-tensors (#12260)
tdoublep Jan 21, 2025
c646128
[Platform] improve platforms getattr (#12264)
MengqingCao Jan 21, 2025
3aec49e
[ci/build] update nightly torch for gh200 test (#12270)
youkaichao Jan 21, 2025
9705b90
[Bugfix] fix race condition that leads to wrong order of token return…
joennlae Jan 21, 2025
1e60f87
[Kernel] fix moe_align_block_size error condition (#12239)
jinzhen-lin Jan 21, 2025
132a132
[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907)
rickyyx Jan 21, 2025
18fd4a8
[Bugfix] Multi-sequence broken (#11898)
andylolu2 Jan 21, 2025
347eeeb
[Misc] Remove experimental dep from tracing.py (#12007)
codefromthecrypt Jan 21, 2025
fa9ee08
[Misc] Set default backend to SDPA for get_vit_attn_backend (#12235)
wangxiyuan Jan 21, 2025
9c485d9
[Core] Free CPU pinned memory on environment cleanup (#10477)
janimo Jan 21, 2025
78d7d30
Update pre-commit.yml (#374)
gshtras Jan 21, 2025
2acba47
[bugfix] moe tuning. rm is_navi() (#12273)
divakar-amd Jan 21, 2025
69196a9
[BUGFIX] When skip_tokenize_init and multistep are set, execution cra…
maleksan85 Jan 21, 2025
09ccc9c
[Documentation][AMD] Add information about prebuilt ROCm vLLM docker …
hongxiayang Jan 21, 2025
df76e5a
[VLM] Simplify post-processing of replacement info (#12269)
DarkLight1337 Jan 22, 2025
64ea24d
[ci/lint] Add back default arg for pre-commit (#12279)
khluu Jan 22, 2025
016e367
[CI] add docker volume prune to neuron CI (#12291)
liangfu Jan 22, 2025
cbdc4ad
[Ci/Build] Fix mypy errors on main (#12296)
DarkLight1337 Jan 22, 2025
222a9dc
[Benchmark] More accurate TPOT calc in `benchmark_serving.py` (#12288)
njhill Jan 22, 2025
66818e5
[core] separate builder init and builder prepare for each batch (#12253)
youkaichao Jan 22, 2025
4004f14
[Build] update requirements of no-device (#12299)
MengqingCao Jan 22, 2025
68ad4e3
[Core] Support fully transparent sleep mode (#11743)
youkaichao Jan 22, 2025
cd7b6f0
[VLM] Avoid unnecessary tokenization (#12310)
DarkLight1337 Jan 22, 2025
528dbca
[Model][Bugfix]: correct Aria model output (#12309)
xffxff Jan 22, 2025
16366ee
[Bugfix][VLM] Fix mixed-modality inference backward compatibility for…
ywang96 Jan 22, 2025
6609cdf
[Doc] Add docs for prompt replacement (#12318)
DarkLight1337 Jan 22, 2025
fc66dee
[Misc] Fix the error in the tip for the --lora-modules parameter (#12…
WangErXiao Jan 22, 2025
84bee4b
[Misc] Improve the readability of BNB error messages (#12320)
jeejeelee Jan 22, 2025
b5839a1
Skip tokenize/detokenize when it is disabled by arg --skip-tokenizer-…
maleksan85 Jan 22, 2025
96f6a75
[Bugfix] Fix HPU multiprocessing executor (#12167)
kzawora-intel Jan 22, 2025
7206ce4
[Core] Support `reset_prefix_cache` (#12284)
comaniac Jan 22, 2025
aea9436
[Frontend][V1] Online serving performance improvements (#12287)
njhill Jan 22, 2025
68c4421
[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is brok…
rasmith Jan 23, 2025
a600e9f
FP8 FA fixes (#381)
ilia-cher Jan 23, 2025
5f9b40b
Returning the use of the proper stream in allreduce (#382)
gshtras Jan 23, 2025
8d7aa9d
[Bugfix] Fixing AMD LoRA CI test. (#12329)
Alexei-V-Ivanov-AMD Jan 23, 2025
01a5594
[Docs] Update FP8 KV Cache documentation (#12238)
mgoin Jan 23, 2025
7551a34
[Docs] Document vulnerability disclosure process (#12326)
russellb Jan 23, 2025
f0ef372
[V1] Add `uncache_blocks` (#12333)
comaniac Jan 23, 2025
5116274
[doc] explain common errors around torch.compile (#12340)
youkaichao Jan 23, 2025
8ae5ff2
[Hardware][Gaudi][BugFix] Fix dataclass error due to triton package u…
zhenwei-intel Jan 23, 2025
c5b4b11
[Bugfix] Fix k_proj's bias for whisper self attention (#12342)
Isotr0py Jan 23, 2025
978b45f
[Kernel] Flash Attention 3 Support (#12093)
LucasWilkinson Jan 23, 2025
d07efb3
[Doc] Troubleshooting errors during model inspection (#12351)
DarkLight1337 Jan 23, 2025
99d01a5
[V1] Simplify M-RoPE (#12352)
ywang96 Jan 23, 2025
8c01b80
[Bugfix] Fix broken internvl2 inference with v1 (#12360)
Isotr0py Jan 23, 2025
3f50c14
[core] add wake_up doc and some sanity check (#12361)
youkaichao Jan 23, 2025
6e650f5
[torch.compile] decouple compile sizes and cudagraph sizes (#12243)
youkaichao Jan 23, 2025
e97f802
[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
gshtras Jan 23, 2025
2c85529
[TPU] Update TPU CI to use torchxla nightly on 20250122 (#12334)
lsy323 Jan 23, 2025
2cbeeda
[Docs] Document Phi-4 support (#12362)
Isotr0py Jan 23, 2025
eb5cb5e
[BugFix] Fix parameter names and `process_after_weight_loading` for W…
dsikka Jan 23, 2025
9726ad6
[Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (#12357)
jsato8094 Jan 23, 2025
682b55b
[Docs] Add meetup slides (#12345)
WoosukKwon Jan 23, 2025
84f5d47
Using pytorch commit past the point when rowwise PR (https://github.c…
gshtras Jan 23, 2025
f09aa4a
Integrated ater: kvcache pa gemm rmsnorm
amd-ruitang3 Jan 21, 2025
f641276
fix pa
amd-ruitang3 Jan 21, 2025
c0fbbc7
fix
amd-ruitang3 Jan 21, 2025
05d7a19
replace topk softmax
charlifu Jan 22, 2025
c5cffcd
[Docs] Update spec decode + structured output in compat matrix (#12373)
russellb Jan 24, 2025
dae7f32
replace fp moe kernel with aiter kernel
charlifu Jan 22, 2025
24b0205
[V1][Frontend] Coalesce bunched `RequestOutput`s (#12298)
njhill Jan 24, 2025
d3d6bb1
Set weights_only=True when using torch.load() (#12366)
russellb Jan 24, 2025
5e5630a
[Bugfix] Path join when building local path for S3 clone (#12353)
omer-dayan Jan 24, 2025
c4430d2
change ater to aiter
charlifu Jan 24, 2025
55ef66e
Update compressed-tensors version (#12367)
dsikka Jan 24, 2025
0e74d79
[V1] Increase default batch size for H100/H200 (#12369)
WoosukKwon Jan 24, 2025
6dd94db
[perf] fix perf regression from #12253 (#12380)
youkaichao Jan 24, 2025
3c818bd
[Misc] Use VisionArena Dataset for VLM Benchmarking (#12389)
ywang96 Jan 24, 2025
c7c9851
[ci/build] fix wheel size check (#12396)
youkaichao Jan 24, 2025
9a0f3bd
[Hardware][Gaudi][Doc] Add missing step in setup instructions (#12382)
MohitIntel Jan 24, 2025
e784c6b
[ci/build] sync default value for wheel size (#12398)
youkaichao Jan 24, 2025
3bb8e2c
[Misc] Enable proxy support in benchmark script (#12356)
jsato8094 Jan 24, 2025
ab5bbf5
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
LucasWilkinson Jan 24, 2025
a1923ec
Applying scales rename to fp8 config
gshtras Jan 24, 2025
8e87b08
Applying scales rename to fp8 config (#387)
gshtras Jan 24, 2025
1e08251
Update Dockerfile.rocm
gshtras Jan 24, 2025
df5dafa
[Misc] Remove deprecated code (#12383)
DarkLight1337 Jan 24, 2025
3132a93
[Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build o…
LucasWilkinson Jan 24, 2025
e5ad0fd
Using aiter moe kernel
gshtras Jan 24, 2025
28b1ad9
Dev-docker Documentation Updates (#378)
JArnoldAMD Jan 25, 2025
221d388
[Bugfix][Kernel] Fix moe align block issue for mixtral (#12413)
ElizaWszola Jan 25, 2025
fb30ee9
[Bugfix] Fix BLIP-2 processing (#12412)
DarkLight1337 Jan 25, 2025
bf21481
[ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408)
divakar-amd Jan 25, 2025
f1fc051
[Misc] Add FA2 support to ViT MHA layer (#12355)
Isotr0py Jan 25, 2025
324960a
[TPU][CI] Update torchxla version in requirement-tpu.txt (#12422)
lsy323 Jan 25, 2025
2a0309a
[Misc][Bugfix] FA3 support to ViT MHA layer (#12435)
ywang96 Jan 26, 2025
fa63e71
[V1][Perf] Reduce scheduling overhead in model runner after cuda sync…
youngkent Jan 26, 2025
0ee349b
[V1][Bugfix] Fix assertion when mm hashing is turned off (#12439)
ywang96 Jan 26, 2025
0cdd12a
fix pa copy
junhaha666 Jan 26, 2025
ec40f65
pa update
valarLip Jan 26, 2025
a525527
[Misc] Revert FA on ViT #12355 and #12435 (#12445)
ywang96 Jan 26, 2025
9ddc352
[Frontend] generation_config.json for maximum tokens(#12242)
mhendrey Jan 26, 2025
aa2cd2c
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
tlrmchlsmth Jan 26, 2025
bf18378
add fp16 pa support for aiter
valarLip Jan 26, 2025
72f4880
[Bugfix/CI] Fix broken kernels/test_mha.py (#12450)
tlrmchlsmth Jan 26, 2025
68f1114
[Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
LucasWilkinson Jan 26, 2025
72bac73
[Build/CI] Fix libcuda.so linkage (#12424)
tlrmchlsmth Jan 26, 2025
0034b09
[Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
K-Mistele Jan 27, 2025
582cf78
[DOC] Add link to vLLM blog (#12460)
terrytangyuan Jan 27, 2025
28e0750
[V1] Avoid list creation in input preparation (#12457)
WoosukKwon Jan 27, 2025
0cc6b38
[Frontend] Support scores endpoint in run_batch (#12430)
pooyadavoodi Jan 27, 2025
5204ff5
[Bugfix] Fix Granite 3.0 MoE model loading (#12446)
DarkLight1337 Jan 27, 2025
372bf08
[Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464)
Isotr0py Jan 27, 2025
624a1e4
[V1][Minor] Minor optimizations for update_from_output (#12454)
WoosukKwon Jan 27, 2025
ce69f7f
[Bugfix] Fix gpt2 GGUF inference (#12467)
Isotr0py Jan 27, 2025
6a1ba42
aiter build instructions
gshtras Jan 27, 2025
103bd17
[Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
LucasWilkinson Jan 27, 2025
2ef325b
Copy to the right path
gshtras Jan 27, 2025
01ba927
[V1][Metrics] Add initial Prometheus logger (#12416)
markmc Jan 27, 2025
3f1fc74
[V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
WoosukKwon Jan 27, 2025
2bc3fbb
[FlashInfer] Upgrade to 0.2.0 (#11194)
abmfy Jan 27, 2025
8e6d987
Merge remote-tracking branch 'upstream/main'
gshtras Jan 27, 2025
6b2147f
Support FP8 FA from Quark format (#388)
BowenBao Jan 28, 2025
a892ecc
Merge remote-tracking branch 'origin/main' into upstream_merge_25_01_27
gshtras Jan 28, 2025
c8b8654
Direct call on ROCm
gshtras Jan 28, 2025
b2c3b22
Merge pull request #391 from ROCm/upstream_merge_25_01_27
gshtras Jan 28, 2025
7a292f9
20250127 docs update (#392)
arakowsk-amd Jan 29, 2025
f3012f8
Add env var toggles to disable AITER MoE or PA (both by default on)
mawong-amd Jan 30, 2025
1a88ae1
Update accuracy benchmark for batch size > 1
mawong-amd Jan 30, 2025
89677d8
Add a few more AITER toggles for norm and linear layers
mawong-amd Jan 30, 2025
273c949
Faster Custom Paged Attention kernels (#372)
sanyalington Jan 30, 2025
22141e7
Using a more precise profiling on ROCm to properly account for weight…
gshtras Jan 30, 2025
76230f7
Merge remote-tracking branch 'origin/main' into aiter_intergration_final
gshtras Jan 30, 2025
c112787
Public aiter repo
gshtras Jan 30, 2025
f27785a
Fail if aiter build failed silently
gshtras Jan 30, 2025
54a4b47
Aiter can only be built on MI300x
gshtras Jan 30, 2025
6ecbea1
Typo fix
gshtras Jan 30, 2025
9dc3394
Aiter PA off by default
gshtras Jan 31, 2025
5e275a6
Changes to support updated aiter FP8 PA
mawong-amd Feb 6, 2025
de48bed
Support FP8 and INT8 KV cache according to https://github.com/ROCm/ai…
mawong-amd Feb 7, 2025
be2e940
add moe weight shuffle for dynamic quant and unquantized path
Feb 7, 2025
7428ffd
Use FP16-native PA after support in https://github.com/ROCm/aiter/pul…
mawong-amd Feb 7, 2025
08cc2e8
Fix: Use FP8 pertoken quantize if KV cache dtype is FP8
mawong-amd Feb 7, 2025
c619358
revert rocm_flash_attn.py line 883
amd-ruitang3 Feb 7, 2025
c24ea63
Don't enable by default to use an RC for main vllm-dev docker
gshtras Feb 7, 2025
0c3476b
use ck moe for bf16 and fp16 fused_moe
charlifu Feb 10, 2025
f07a88f
Merge remote-tracking branch 'origin/aiter_intergration_final' into m…
vllmellm Feb 11, 2025
5470522
Merge remote-tracking branch 'origin/aiter_intergration_final' into m…
vllmellm Feb 11, 2025
1567df6
[Bugfix] include moe shuffle env variable
vllmellm Feb 11, 2025
7 changes: 7 additions & 0 deletions Dockerfile.rocm
@@ -109,11 +109,18 @@ ARG COMMON_WORKDIR
COPY --from=export_vllm /benchmarks ${COMMON_WORKDIR}/vllm/benchmarks
COPY --from=export_vllm /examples ${COMMON_WORKDIR}/vllm/examples

RUN git clone --recursive https://github.com/ROCm/aiter.git
RUN cd /app/aiter && pip install -r requirements.txt && PREBUILD_KERNELS=1 GPU_ARCHS=gfx942 python3 setup.py develop && pip show aiter


ENV RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
ENV TOKENIZERS_PARALLELISM=false

# Performance environment variable.
ENV HIP_FORCE_DEV_KERNARG=1

# Enable Aiter. Make sure this only exists on the aiter branch.
# ENV VLLM_USE_AITER=1

CMD ["/bin/bash"]

113 changes: 75 additions & 38 deletions benchmarks/test_accuracy.py
@@ -1,45 +1,82 @@
import time
import argparse
import dataclasses

# from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.utils import FlexibleArgumentParser


def main():
llm = LLM(
'/data/AI-ModelScope/Mixtral-8x7B-Instruct-v0___1/',
tensor_parallel_size=1,
#quantization="serenity",
dtype='float16',
#swap_space=16,
#enforce_eager=True,
#kv_cache_dtype="fp8",
#quantization="fp8",
#quantized_weights_path="/quantized/quark/llama.safetensors",
#worker_use_ray=True,
#trust_remote_code=True,
#distributed_executor_backend="mp",
)
batch_size = 5
max_tokens = 256
prompt = """The sun is a"""
sampling_params = SamplingParams(temperature=0,
top_p=0.95,
max_tokens=max_tokens)

start_time = time.perf_counter()
outs = llm.generate([prompt] * batch_size, sampling_params=sampling_params)
end_time = time.perf_counter()
elapsed_time = end_time - start_time

out_lengths = [len(x.token_ids) for out in outs for x in out.outputs]
num_tokens = sum(out_lengths)

print(
f"{num_tokens} tokens. {num_tokens / batch_size} on average. {num_tokens / elapsed_time:.2f} tokens/s. {elapsed_time} seconds" # noqa: E501
def main(args: argparse.Namespace):
print(args)

engine_args = EngineArgs.from_cli_args(args)

# NOTE(woosuk): If the request cannot be processed in a single batch,
# the engine will automatically process the request in multiple batches.
llm = LLM(**dataclasses.asdict(engine_args))

sampling_params = SamplingParams(
n=args.n,
temperature=1.0,
top_p=1.0,
ignore_eos=True,
max_tokens=args.output_len,
)
for out in outs:
print("===========")
print(out.outputs[0].text)
print(sampling_params)

# tokenizer = AutoTokenizer.from_pretrained(engine_args.model)
# inputs = tokenizer('Hello, world!', return_tensors='pt').input_ids
inputs = [
'Where is the capital of China?',
'The capital of Russia is ',
'The CEO of DeepSeek is ',
'The future of AI is',
] * 32
outputs = llm.generate(inputs, sampling_params)
for i, output in enumerate(outputs):
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt {i}: {prompt!r}, Generated text: {generated_text!r}")
# print(tokenizer.decode(outputs[0]))


if __name__ == '__main__':
parser = FlexibleArgumentParser(
description='Benchmark the latency of processing a single batch of '
'requests till completion.')
parser.add_argument('--input-len', type=int, default=32)
parser.add_argument('--output-len', type=int, default=128)
parser.add_argument('--batch-size', type=int, default=8)
parser.add_argument('--n',
type=int,
default=1,
help='Number of generated sequences per prompt.')
parser.add_argument('--use-beam-search', action='store_true')
parser.add_argument('--num-iters-warmup',
type=int,
default=10,
help='Number of iterations to run for warmup.')
parser.add_argument('--num-iters',
type=int,
default=30,
help='Number of iterations to run.')
parser.add_argument(
'--profile',
action='store_true',
help='profile the generation process of a single batch')
parser.add_argument(
'--profile-result-dir',
type=str,
default=None,
help=('path to save the pytorch profiler output. Can be visualized '
'with ui.perfetto.dev or Tensorboard.'))
parser.add_argument(
'--output-json',
type=str,
default=None,
help='Path to save the latency results in JSON format.')

if __name__ == "__main__":
main()
parser = EngineArgs.add_cli_args(parser)
args = parser.parse_args()
main(args)
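
The rewritten benchmarks/test_accuracy.py takes its engine configuration from the command line through EngineArgs instead of a hard-coded Mixtral path. A condensed sketch of the same pattern, runnable standalone with a placeholder model:

```python
# Sketch of the EngineArgs-driven setup used above; the model and flag values are assumptions.
import dataclasses

from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.utils import FlexibleArgumentParser

parser = FlexibleArgumentParser(description="Accuracy smoke test")
parser.add_argument("--output-len", type=int, default=128)
parser = EngineArgs.add_cli_args(parser)  # adds --model, --dtype, --tensor-parallel-size, ...
args = parser.parse_args(["--model", "facebook/opt-125m", "--output-len", "32"])

llm = LLM(**dataclasses.asdict(EngineArgs.from_cli_args(args)))
outs = llm.generate(["The capital of Russia is "],
                    SamplingParams(temperature=1.0, top_p=1.0, max_tokens=args.output_len))
print(outs[0].outputs[0].text)
```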
2 changes: 0 additions & 2 deletions vllm/__init__.py
@@ -6,7 +6,6 @@
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.llm_engine import LLMEngine
from vllm.entrypoints.fast_sync_llm import FastSyncLLM
from vllm.entrypoints.llm import LLM
from vllm.executor.ray_utils import initialize_ray_cluster
from vllm.inputs import PromptType, TextPrompt, TokensPrompt
@@ -38,7 +37,6 @@
"__version__",
"__version_tuple__",
"LLM",
"FastSyncLLM",
"ModelRegistry",
"PromptType",
"TextPrompt",
55 changes: 53 additions & 2 deletions vllm/attention/backends/rocm_flash_attn.py
@@ -11,8 +11,14 @@
AttentionMetadata, AttentionType)
from vllm.attention.backends.utils import (CommonAttentionState,
CommonMetadataBuilder)
from vllm.attention.ops.paged_attn import (PagedAttention,
PagedAttentionMetadata)

if envs.VLLM_USE_AITER_PAGED_ATTN:
from vllm.attention.ops.paged_attn_ater import (PagedAttention,
PagedAttentionMetadata)
else:
from vllm.attention.ops.paged_attn import (PagedAttention,
PagedAttentionMetadata)

from vllm.logger import init_logger
from vllm.platforms import current_platform

@@ -460,6 +466,9 @@ def __init__(
logits_soft_cap: Optional[float] = None,
attn_type: str = AttentionType.DECODER,
) -> None:
self.k_scale = torch.tensor([1.0], dtype=torch.float32)
self.v_scale = torch.tensor([1.0], dtype=torch.float32)
self.init_kv_scales = False
if blocksparse_params is not None:
raise ValueError(
"ROCmFlashAttention does not support blocksparse attention.")
@@ -609,6 +618,25 @@ def forward(
else:
assert value is None

if (envs.VLLM_USE_AITER_PAGED_ATTN and kv_cache.dtype.itemsize == 1
and self.init_kv_scales is False
and kv_cache.shape != torch.Size([0])):
num_blocks = kv_cache.shape[1]
block_size = kv_cache.shape[2] // (self.num_kv_heads *
self.head_size)
self.k_scale = torch.ones(
(self.num_kv_heads, num_blocks * block_size),
dtype=torch.float32,
device=kv_cache.device)
self.v_scale = torch.ones(
(self.num_kv_heads, num_blocks * block_size),
dtype=torch.float32,
device=kv_cache.device)
self.init_kv_scales = True
# if self.init_kv_scales:
layer._k_scale = self.k_scale
layer._v_scale = self.v_scale

if self.attn_type != AttentionType.ENCODER and kv_cache.numel() > 0:
key_cache, value_cache = PagedAttention.split_kv_cache(
kv_cache, self.num_kv_heads, self.head_size)
@@ -780,6 +808,29 @@ def forward(
use_custom = _use_rocm_custom_paged_attention(
decode_query.dtype, head_size, block_size, gqa_ratio,
decode_meta.max_decode_seq_len)
if envs.VLLM_USE_AITER_PAGED_ATTN:
out = output[num_prefill_tokens:]
PagedAttention.forward_decode(
decode_query,
key_cache,
value_cache,
decode_meta.block_tables
if self.attn_type != AttentionType.ENCODER_DECODER else
decode_meta.cross_block_tables,
decode_meta.seq_lens_tensor
if self.attn_type != AttentionType.ENCODER_DECODER else
decode_meta.encoder_seq_lens_tensor,
decode_meta.max_decode_seq_len
if self.attn_type != AttentionType.ENCODER_DECODER else
decode_meta.max_encoder_seq_len,
self.kv_cache_dtype,
self.num_kv_heads,
self.scale,
self.alibi_slopes,
layer._k_scale,
layer._v_scale,
out=out)
return output.view(-1, self.num_heads * self.head_size)
if use_custom:
max_seq_len = (decode_meta.max_decode_seq_len if
self.attn_type != AttentionType.ENCODER_DECODER
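
The rocm_flash_attn.py change has two parts: the paged-attention implementation is chosen at import time based on VLLM_USE_AITER_PAGED_ATTN, and when the KV cache is stored in a 1-byte dtype (FP8/INT8) the backend allocates per-token k/v scale tensors of shape (num_kv_heads, num_blocks * block_size). A standalone sketch of that shape arithmetic, using illustrative sizes and the cache layout implied by the diff:

```python
# Illustration only: mirrors the scale-tensor initialization added in forward();
# the sizes are made up and the kv_cache layout is inferred from the diff.
import torch

num_kv_heads, head_size = 8, 128
num_blocks, block_size = 512, 16

# ROCm backend layout implied above: [2, num_blocks, block_size * num_kv_heads * head_size]
kv_cache = torch.empty(2, num_blocks, block_size * num_kv_heads * head_size,
                       dtype=torch.uint8)

if kv_cache.dtype.itemsize == 1 and kv_cache.shape != torch.Size([0]):
    blocks = kv_cache.shape[1]
    blk_size = kv_cache.shape[2] // (num_kv_heads * head_size)
    k_scale = torch.ones((num_kv_heads, blocks * blk_size), dtype=torch.float32)
    v_scale = torch.ones((num_kv_heads, blocks * blk_size), dtype=torch.float32)
    print(k_scale.shape)  # torch.Size([8, 8192]) == (num_kv_heads, num_blocks * block_size)
```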