[Bug]: Gemma model fails with GPTQ marlin #5088
Comments
@arunpatala Can you share the model checkpoint so we can take a look?
You can check with this public model. The engine throws an error when quantization is set to either gptq_marlin or marlin. The same model works when quantization is set to gptq.
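For anyone trying to reproduce the comparison described above, here is a minimal sketch, not the reporter's actual script. It assumes the checkpoint is loaded through vLLM's `LLM` class with its `quantization` argument; the helper names `try_quantization_methods` and `load_with_vllm` are hypothetical, and the model identifier is left as a parameter because the thread does not name it.

```python
from typing import Callable, Dict, Iterable

# Quantization settings mentioned in the report: gptq works, the other two fail.
METHODS = ("gptq", "gptq_marlin", "marlin")


def try_quantization_methods(
    model: str,
    load: Callable[[str, str], object],
    methods: Iterable[str] = METHODS,
) -> Dict[str, str]:
    """Try to load `model` once per quantization method and record the outcome."""
    results: Dict[str, str] = {}
    for method in methods:
        try:
            load(model, method)
            results[method] = "ok"
        except Exception as exc:  # record the failure instead of stopping
            results[method] = f"error: {type(exc).__name__}: {exc}"
    return results


def load_with_vllm(model: str, method: str) -> object:
    """Hypothetical loader: builds a vLLM engine with the given quantization."""
    # Imported lazily so the helper above can be exercised without vLLM or a GPU.
    from vllm import LLM

    return LLM(model=model, quantization=method)
```

Calling `try_quantization_methods(model_id, load_with_vllm)` with the checkpoint's path or hub ID returns one entry per method. Building several engines in one process may exhaust GPU memory, so running each method in a fresh process is probably safer in practice.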
Thanks @alexm-neuralmagic can you take a look?
@arunpatala Here is the fix: #5108 (will land soon)
thanks a lot.
no problem
Commit referencing this issue: add gptq_marlin test for bug report vllm-project#5088 (vllm-project#5145)
(#6659) 5a96ee52 [ci][build] add back vim in docker (#6661) 42c7f66a [Core] Support dynamically loading Lora adapter from HuggingFace (#6234) 69d5ae38 [ci] Use different sccache bucket for CUDA 11.8 wheel build (#6656) fea59c77 [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) 739b61a3 [Frontend] Refactor prompt processing (#4028) 89c1c6a1 [Bugfix] Fix `vocab_size` field access in `llava_next.py` (#6624) 42de2cef [Misc] Add a wrapper for torch.inference_mode (#6618) c9eef37f [Model] Initial Support for Chameleon (#5770) 396d92d5 [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) 25e778aa [Model] Refactor and decouple phi3v image embedding (#6621) b6df37f9 [Misc] Remove abused noqa (#6619) 14f91fe6 [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) d7f4178d [Frontend] Move chat utils (#6602) 082ecd80 [ Bugfix ] Fix AutoFP8 fp8 marlin (#6609) f952bbc8 [Misc] Fix input_scale typing in w8a8_utils.py (#6579) 9364f74e [ Kernel ] Enable `fp8-marlin` for `fbgemm-fp8` models (#6606) 06d6c5fe [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) 683e3cb9 [ Misc ] `fbgemm` checkpoints (#6559) 9042d683 [Misc] Consolidate and optimize logic for building padded tensors (#6541) 3f8d42c8 Pipeline Parallel: Guard for KeyErrors at request abort (#6587) 7bd82002 [Core] Allow specifying custom Executor (#6557) 2e265642 [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) e81522e8 [build] add ib in image for out-of-the-box infiniband support (#6599) 45ceb85a [Docs] Update PP docs (#6598) 4cc24f01 [ Kernel ] Enable Dynamic Per Token `fp8` (#6547) 07eb6f19 [bugfix][distributed] fix multi-node bug for shared memory (#6597) f0bbfaf9 [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection (#6578) 30efe415 [Docs] Update docs for wheel location (#6580) 9ed82e70 [Misc] Small perf improvements (#6520) 
51f8aa90 [Bugfix][Frontend] remove duplicate init logger (#6581) a5314e86 [Model] RowParallelLinear: pass bias to quant_method.apply (#6327) a921e863 [BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369) 6366efc6 [Bugfix][Frontend] Fix missing `/metrics` endpoint (#6463) dbe55885 [ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` (#6515) d4201e06 [Bugfix] Make spec. decode respect per-request seed. (#6034) b5672a11 [Core] Multiprocessing Pipeline Parallel support (#6130) c5df56f8 Add support for a rope extension method (#6553) 1689219e [CI/Build] Build on Ubuntu 20.04 instead of 22.04 (#6517) 4ffffccb [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm (#6552) f53b8f0d [ci][test] add correctness test for cpu offloading (#6549) 2d4733ba Fix PR comment bot (#6554) 15c6a079 [Model] Support Mistral-Nemo (#6548) ecdb462c [ci] Reword Github bot comment (#6534) 58ca6632 [ Misc ] Improve Min Capability Checking in `compressed-tensors` (#6522) 4634c872 [TPU] Refactor TPU worker & model runner (#6506) c8a7d51c [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (#6501) e2fbaee7 [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs (#6227) 8a74c68b [Misc] Minor patch for draft model runner (#6523) 61e59274 [Core] Introduce SPMD worker execution using Ray accelerated DAG (#6032) d25877dd [BugFix] Avoid secondary error in ShmRingBuffer destructor (#6530) 1c27d25f [core][model] yet another cpu offload implementation (#6496) 18fecc35 [ Kernel ] Fp8 Channelwise Weight Support (#6487) b5af8c22 [Model] Pipeline parallel support for Mixtral (#6516) b5241e41 [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) e76466dd [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) 5f0b9933 [Bugfix] Fix Ray Metrics API usage (#6354) a38524f3 [DOC] - Add docker image to Cerebrium Integration (#6510) 2fa4623d [Core] Refactor _prepare_model_input_tensors - take 2 (#6164) 
a9a2e74d [Misc] Use `torch.Tensor` for type annotation (#6505) e09ce759 [TPU] Remove multi-modal args in TPU backend (#6504) 5fa6e987 [Bugfix] Fix for multinode crash on 4 PP (#6495) 5bf35a91 [Doc][CI/Build] Update docs and tests to use `vllm serve` (#6431) a19e8d37 [Misc][Speculative decoding] Typos and typing fixes (#6467) 10383887 [ROCm] Cleanup Dockerfile and remove outdated patch (#6482) 1d094fd7 [Distributed][PP] only create embedding & lm head when necessary (#6455) ce37be7b [misc][distributed] add seed to dummy weights (#6491) 7f62077a [misc][distributed] improve tests (#6488) 09c2eb85 [ci][distributed] add pipeline parallel correctness test (#6410) 978aed53 [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) 160e1d8c [Misc] Log spec decode metrics (#6454) 94162beb [Doc] Fix the lora adapter path in server startup script (#6230) c467dff2 [Hardware][TPU] Support MoE with Pallas GMM kernel (#6457) 9f4ccec7 [doc][misc] remind to cancel debugging environment variables (#6481) 38ef9488 [CI/Build] Remove "boardwalk" image asset (#6460) 2bb0489c [Core] Use numpy to speed up padded token processing (#6442) 7508a3dc [Misc] Fix typos in spec. decode metrics logging. 
(#6470) 7a3d2a5b [Frontend] Support for chat completions input in the tokenize endpoint (#5923) d9701151 [CI/Build] vLLM cache directory for images (#6444) 37d77660 [Docs] Announce 5th meetup (#6458) d92b3c5c [Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests (#6419) 9ad32dac [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425) d6f3b3d5 Pin sphinx-argparse version (#6453) 4552e37b [CI/Build][TPU] Add TPU CI test (#6277) ec9933f4 [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod (#6289) 3dee97b0 [Docs] Add Google Cloud to sponsor list (#6450) 4cf256ae [misc][distributed] fix pp missing layer condition (#6446) 64fdc08c bump version to v0.5.2 (#6433) 4ef95b0f [Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF (#6409) eaec4b91 [Bugfix] Add custom Triton cache manager to resolve MoE MP issue (#6140) a63a4c63 [Misc] Use 0.0.9 version for flashinfer (#6447) c8fd97f2 [Kernel] Use CUTLASS kernels for the FP8 layers with Bias (#6270) 94b82e8c [doc][distributed] add suggestion for distributed inference (#6418) 6ae1597d [VLM] Minor space optimization for `ClipVisionModel` (#6436) 22e79ee8 [doc][misc] doc update (#6439) de199163 [Bugfix] Convert image to RGB by default (#6430) 69672f11 [core][distributed] simplify code to support pipeline parallel (#6406) 44874a0b [Doc] add env docs for flashinfer backend (#6437) b47008b4 [BugFix] BatchResponseData body should be optional (#6345) 9bfece89 Add FUNDING.yml (#6435) 32c9d7f7 Report usage for beam search (#6404) ccb20db8 [Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' (#6428) a754dc2c [CI/Build] Cross python wheel (#6394) 61e85dba [Doc] xpu backend requires running setvars.sh (#6393) dbfe254e [Feature] vLLM CLI (#5090) 73030b7d [ Misc ] Enable Quantizing All Layers of DeekSeekv2 (#6423) ccd3c045 [ci][build] fix commit id (#6420) 9dad5cc8 [Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace 
(#6384) 6ef3bf91 Remove unnecessary trailing period in spec_decode.rst (#6405) 540c0368 [Model] Initialize Fuyu-8B support (#3924) fb6af8bc [ Misc ] Apply MoE Refactor to Deepseekv2 To Support Fp8 (#6417) eeceadae [Misc] Add deprecation warning for beam search (#6402) babf52da [ Misc ] More Cleanup of Marlin (#6359) 9da4aad4 Updating LM Format Enforcer version to v10.3 (#6411) 41708e50 [ci] try to add multi-node tests (#6280) d80aef37 [Docs] Clean up latest news (#6401) e1684a76 [Bugfix] Fix hard-coded value of x in context_attention_fwd (#6373) a27f87da [Doc] Fix Typo in Doc (#6392) 16ff6bd5 [ci] Fix wording for GH bot (#6398) f8f9ff57 [Bugfix][TPU] Fix megacore setting for v5e-litepod (#6397) 6bc9710f Fix release pipeline's dir permission (#6391) 111fc6e7 [Misc] Add generated git commit hash as `vllm.__commit__` (#6386) 75f64d8b [Bugfix] Fix illegal memory access in FP8 MoE kernel (#6382) 21b2dced Fix release pipeline's -e flag (#6390) 07b35af8 Fix interpolation in release pipeline (#6389) bb1a784b Fix release-pipeline.yaml (#6388) d719ba24 Build some nightly wheels by default (#6380) aa48e502 [MISC] Upgrade dependency to PyTorch 2.3.1 (#5327) 4dbebd03 [ci] Add GHA workflows to enable full CI run (#6381) b75bce10 [ci] Add grouped tests & mark tests to run by default for fastcheck pipeline (#6365) b039cbbc [Misc] add fixture to guided processor tests (#6341) f9d25c25 [Build/CI] Checking/Waiting for the GPU's clean state (#6379) 024ad87c [Bugfix] Fix dtype mismatch in PaliGemma (#6367) aea19f09 [ Misc ] Support Models With Bias in `compressed-tensors` integration (#6356) f7160d94 [Misc][Bugfix] Update transformers for tokenizer issue (#6364) 6047187c [ Misc ] Remove separate bias add (#6353) b6c16cf8 [ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm (#6352) d26a8b3f [CI/Build] (2/2) Switching AMD CI to store images in Docker Hub (#6350) d59eb984 [Model][Phi3-Small] Remove scipy from blocksparse_attention (#6343) adf32e0a [Bugfix] Fix usage stats logging 
exception warning with OpenVINO (#6349) 2b0fb534 [distributed][misc] be consistent with pytorch for libcudart.so (#6346) d6ab5289 [Misc] Remove flashinfer warning, add flashinfer tests to CI (#6351) 7ed6a4f0 [ BugFix ] Prompt Logprobs Detokenization (#6223) a4feba92 [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362) 2d23b42d [doc] update pipeline parallel in readme (#6347) 1df43de9 [bug fix] Fix llava next feature size calculation. (#6339) 52b7fcb3 Benchmark: add H100 suite (#6047) b675069d [ Misc ] Refactor Marlin Python Utilities (#6082) 55f692b4 [BugFix] get_and_reset only when scheduler outputs are not empty (#6266) 8a1415cf [Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. (#6326) 546b101f [BugFix]: fix engine timeout due to request abort (#6255) 3963a533 [Misc] refactor(config): clean up unused code (#6320) c4774eb8 [Bugfix] Fix snapshot download in serving benchmark (#6318) fc17110b [BugFix]: set outlines pkg version (#6262) 439c8458 [Doc] Update description of vLLM support for CPUs (#6003) 99ded1e1 [Doc] Remove comments incorrectly copied from another project (#6286) 997df46a [Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor (#6313) ae151d73 [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models (#5765) 44cc7661 [Bugfix] Fix OpenVINOExecutor abstractmethod error (#6296) b422d496 [CI/Build] Enable mypy typing for remaining folders (#6268) c38eba30 [Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. 
(#6303) e72ae80b [Bugfix] Support 2D input shape in MoE layer (#6287) 8a924d22 [Doc] Guide for adding multi-modal plugins (#6205) 5ed3505d [Bugfix][TPU] Add prompt adapter methods to TPUExecutor (#6279) da78caec [core][distributed] zmq fallback for broadcasting large objects (#6183) 2416b26e [Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978) d3a24513 [Bugfix]fix and needs_scalar_to_array logic check (#6238) 673dd4ca [Docs] Docs update for Pipeline Parallel (#6222) 4d6ada94 [CORE] Adding support for insertion of soft-tuned prompts (#4645) a0550cbc Add support for multi-node on CI (#5955) 08c5bdec [Bugfix][TPU] Fix outlines installation in TPU Dockerfile (#6256) 5d5b4c5f [Bugfix][TPU] Add missing None to model input (#6245) 70c232f8 [core][distributed] fix ray worker rank assignment (#6235) a3c9435d [hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability (#6216) 4f0e0ea1 Add FlashInfer to default Dockerfile (#6172) ddc369fb [Bugfix] Mamba cache Cuda Graph padding (#6214) 185ad31f [Bugfix] use diskcache in outlines _get_guide #5436 (#6203) 543aa485 [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888) f7a8fa39 [Kernel] reloading fused_moe config on the last chunk (#6210) 717f4bce Feature/add benchmark testing (#5947) 16620f43 do not exclude `object` field in CompletionStreamResponse (#6196) 3b08fe2b [misc][frontend] log all available endpoints (#6195) abfe705a [ Misc ] Support Fp8 via `llm-compressor` (#6110) 333306a2 add benchmark for fix length input and output (#5857) 6206dcb2 [Model] Add PaliGemma (#5189) 93893800 [Doc] Move guide for multimodal model and other improvements (#6168) 175c43ec [Doc] Reorganize Supported Models by Type (#6167) bc96d5c3 Move release wheel env var to Dockerfile instead (#6163) f0250620 Fix release wheel build env var (#6162) 2de490d6 Update wheel builds to strip debug (#6161) 79d406e9 [Docs] Fix readthedocs for tag 
build (#6158) abad5746 bump version to v0.5.1 (#6157) e58294dd [Bugfix] Add verbose error if scipy is missing for blocksparse attention (#5695) f1e15da6 [Frontend] Continuous usage stats in OpenAI completion API (#5742) 0097bb18 [Bugfix] Use templated datasource in grafana.json to allow automatic imports (#6136) ea4b5704 [VLM] Cleanup validation and update docs (#6149) a41357e9 [VLM] Improve consistency between feature size calculation and dummy data for profiling (#6146) ae96ef8f [VLM] Calculate maximum number of multi-modal tokens by model (#6121) 69ec3ca1 [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051) 81d7a50f [Hardware][Intel CPU] Adding intel openmp tunings in Docker file (#6008) 27902d42 [misc][doc] try to add warning for latest html (#5979) 56b325e9 [ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention (#6043) 3dd50708 [CI/Build] Cleanup VLM tests (#6107) 0ed646b7 [Distributed][Core] Support Py39 and Py38 for PP (#6120) 1dab9bc8 [Bugfix] set OMP_NUM_THREADS to 1 by default for multiprocessing (#6109) 3de6e6a3 [core][distributed] support n layers % pp size != 0 (#6115) 966fe721 [doc][misc] bump up py version in installation doc (#6119) 62963d12 [ Misc ] Clean Up `CompressedTensorsW8A8` (#6113) d9e98f42 [vlm] Remove vision language config. 
(#6089) 3c6325f0 [core][distributed] custom allreduce when pp size > 1 (#6117) 47f0954a [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) 7cd2ebb0 [Bugfix] Fix `compute_logits` in Jamba (#6093) f1c78138 [Doc] Fix Mock Import (#6094) 3a86b54f [VLM][Frontend] Proper Image Prompt Formatting from OpenAI API (#6091) f6662071 [misc][distributed] error on invalid state (#6092) d830656a [BugFix] Avoid unnecessary Ray import warnings (#6079) d18bab35 [CI] Fix base url doesn't strip "/" (#6087) 9831aec4 [Core] Dynamic image size support for VLMs (#5276) 482045ee [hardware][misc] introduce platform abstraction (#6080) 9d6a8daa [Model] Jamba support (#4115) ee93f4f9 [CORE] Quantized lm-head Framework (#4442) 7c008c51 [ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970) 4d26d806 Update conftest.py (#6076) c5832d2a [Core] Pipeline Parallel Support (#4412) 15aba081 [Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#6050) 31354e56 [Doc] Reinstate doc dependencies (#6061) 98d6682c [VLM] Remove `image_input_type` from VLM config (#5852) 2c37540a [Frontend] Add template related params to request (#5709) 3476ed08 [Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602) 54600709 [Model] Changes to MLPSpeculator to support tie_weights and input_scale (#5965) e373853e [Frontend] Relax api url assertion for openai benchmarking (#6046) c87ebc3e [BugFix] Ensure worker model loop is always stopped at the right time (#5987) c4059ea5 [Bugfix] Add explicit `end_forward` calls to flashinfer (#6044) 8e0817c2 [Bugfix][Doc] Fix Doc Formatting (#6048) 83bdcb6a add FAQ doc under 'serving' (#5946) 12a59959 [Bugfix] adding chunking mechanism to fused_moe to handle large inputs (#6029) dec6fc6f [Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool (#6039) 8893130b [doc][misc] further lower visibility of simple api server (#6041) bb603268 [Misc] update benchmark backend for scalellm (#6018) 4050d646 [doc][misc] 
remove deprecated api server in doc (#6037) d76084c1 [ CI ] Re-enable Large Model LM Eval (#6031) 80ca1e6a [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348) 614aa512 [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) af9ad46f [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940) 7836fdcc [Misc] Fix `get_min_capability` (#5971) deacb7ec [ CI ] Temporarily Disable Large LM-Eval Tests (#6005) f5e73c9f [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909) c6c240aa [Frontend]: Support base64 embedding (#5935) 2be6955a [ci][distributed] fix device count call 9d47f64e [CI/Build] [3/3] Reorganize entrypoints tests (#5966) cff6a1fe [CI/Build] Reuse code for checking output consistency (#5988) bcc6a09b [CI/Build] Temporarily Remove Phi3-Vision from TP Test (#5989) 9def1066 [Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (#5949) 75aa1442 [ CI/Build ] LM Eval Harness Based CI Testing (#5838) 99397da5 [CI/Build] Add TP test for vision models (#5892) 8dbfcd35 [ CI/Build ] Added E2E Test For Compressed Tensors (#5839) f7dac83d [Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k (#5939) 7c01f706 [Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum (#5974) 51e971d3 [Bugfix] Support `eos_token_id` from `config.json` (#5954) 329df38f [Misc] Update Phi-3-Vision Example (#5981) 580353da [Bugfix] Fix precisions in Gemma 1 (#5913) ba499444 [Kernel] Add punica dimensions for Granite 3b and 8b (#5930) 906a19cd [Misc] Extend vLLM Metrics logging API (#5925) c4bca740 [Bugfix] fix missing last itl in openai completions benchmark (#5926) 7f83f40d [Bugfix][TPU] Fix pad slot id (#5977) 54814fd8 [Bugfix][TPU] Fix TPU sampler output (#5978) 7041de43 [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628) 6a62cb82 [Bugfix] Fix Engine Failing After 
Invalid Request - AsyncEngineDeadError (#5963) 5d2a1a9c Unmark more files as executable (#5962) 4bf35ed9 [Bugfix] Only add `Attention.kv_scale` if kv cache quantization is enabled (#5936) be0b3af9 Support Deepseek-V2 (#4650) 2cd402e1 [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (#5921) b1852307 [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) (#5928) 6a2d659d [Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931) b2c62023 [Spec Decode] Introduce DraftModelRunner (#5799) b90d8cd8 [Distributed] Make it clear that % should not be in tensor dict keys. (#5927) 3b752a65 [CI/Build] [2/3] Reorganize entrypoints tests (#5904) ec1ad004 [Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high (#5894) 57f09a41 [Hardware][Intel] OpenVINO vLLM backend (#5379) 59326344 Unmark fused_moe config json file as executable (#5960) 5cbe8d15 [Core] Registry for processing model inputs (#5214) 0d0e3a42 [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner (#5956) 74d55c06 [VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. 
(#5905) f136da15 [Hardware][TPU] Optimize KV cache swapping (#5878) c3dde367 [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932) 64e8d2a7 [core][misc] remove logical block (#5882) 79c92c7c [Model] Add Gemma 2 (#5908) 736ed388 [CI/Build] Fix Args for `_get_logits_warper` in Sampler Test (#5922) 365791ff [BugFix] Fix `min_tokens` behaviour for multiple eos tokens (#5849) 691e29ec [BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` (#5876) 3fd02bda [doc][misc] add note for Kubernetes users (#5916) 98cf2ed6 [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896) e9d32d07 [CI/Build] [1/3] Reorganize entrypoints tests (#5526) 2061f0b8 [Bugfix] Fix img_sizes Parsing in Phi3-Vision (#5888) 96354d6a [Model] Add base class for LoRA-supported models (#5018) d12af207 [VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly (#5880) 6eabc6cb [Doc] Add note about context length in Phi-3-Vision example (#5887) 2110557d [BugFix] Fix cuda graph for MLPSpeculator (#5875) b9e84259 [Misc] Add example for LLaVA-NeXT (#5879) 294104c3 [doc] update usage of env var to avoid conflict (#5873) 38a1674a Support CPU inference with VSX PowerPC ISA (#5652) f5c8628f [Bugfix][TPU] Fix CPU cache allocation (#5869) cbc53b6b [Hardware][TPU] Support parallel sampling & Swapping (#5855) c54269d9 [Frontend] Add tokenize/detokenize endpoints (#5054) 5bfd1bbc [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560) 6984c02a [CI/Build] Refactor image test assets (#5821) 3439c5a8 [Bugfix][TPU] Fix KV cache size calculation (#5860) 6806998b [Bugfix] Fix embedding to support 2D inputs (#5829) 515080ad [bugfix][distributed] fix shm broadcast when the queue size is full (#5801) 3aa7b6cf [Misc][Doc] Add Example of using OpenAI Server with VLM (#5832) dda48115 [Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408) 82079729 [Bugfix] Fix assertion in NeuronExecutor (#5841) c2a8ac75 [CI/Build] Add E2E tests 
for MLPSpeculator (#5791) f178e56c [Hardware][TPU] Raise errors for unsupported sampling params (#5850) dd793d1d [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422) bc34937d [Hardware][TPU] Refactor TPU backend (#5831) dd248f76 [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` (#5794) d9b34bae [CI/Build] Add unit testing for FlexibleArgumentParser (#5798) c18ebfdd [doc][distributed] add both gloo and nccl tests (#5834) 67882dbb [Core] Add fault tolerance for `RayTokenizerGroupPool` (#5748) 7b993143 [Misc] Remove useless code in cpu_worker (#5824) 2ce5d668 [Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414) f23871e9 [Doc] Add notice about breaking changes to VLMs (#5818) e9de9dd5 [ci] Remove aws template (#5757) ba991d5c [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args (#5795) 1744cc99 [Doc] Add Phi-3-medium to list of supported models (#5788) e72dc6cb [Doc] Add "Suggest edit" button to doc pages (#5789) c2462129 [doc][faq] add warning to download models for every nodes (#5783) edd5fe5f [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (#5772) 5d4d9053 [Distributed] Add send and recv helpers (#5719) 6c916ac8 [BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744) 832ea88f [core][distributed] improve shared memory broadcast (#5754) 8c00f9c1 [Docs][TPU] Add installation tip for TPU (#5761) 0cbc1d2b [Bugfix] Fix pin_lora error in TPU executor (#5760) ff9ddbce [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py (#5756) 9c62db07 [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs (#5710) cf90ae01 [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (#5616) f5dda63e [LoRA] Add support for pinning lora adapters in the LRU cache (#5603) 71875073 [ci][test] fix ca test in main (#5746) f1e72cc1 [BugFix] exclude version 1.15.0 for modelscope (#5668) 5b15bde5 [Doc] 
Documentation on supported hardware for quantization methods (#5745) bd620b01 [Kernel][CPU] Add Quick `gelu` to CPU (#5717) d9a252bc [Core][Distributed] add shm broadcast (#5399) 67005a07 [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665) c35e4a3d [BugFix] Fix test_phi3v.py (#5725) 1f567421 [Kernel] Add punica dimension for Qwen2 LoRA (#5441) b12518d3 [Model] MLPSpeculator speculative decoding support (#4947) 6c5b7af1 [distributed][misc] use fork by default for mp (#5669) 8065a7e2 [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) 3f3b6b21 [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) a7dcc620 [Kernel] Update Cutlass int8 kernel configs for SM80 (#5275) ad137cd1 [Model] Port over CLIPVisionModel for VLMs (#5591) 111af1fa [Kernel] Update Cutlass int8 kernel configs for SM90 (#5514) 1b2eaac3 [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors (#5703) 3730a1c8 [Misc] Improve conftest (#5681) 949e49a6 [ci] Limit num gpus if specified for A100 (#5694) 4a30d7e3 [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650) e83db9e7 [Doc] Update docker references (#5614) 78687504 [Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654) d571ca01 [ci][distributed] add tests for custom allreduce (#5689) afed90a0 [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (#5688) 3ee5c4bc [ci] Add A100 queue into AWS CI template (#5648) e9c2732b [CI/Build] Add tqdm to dependencies (#5680) d8714530 [Misc]Add param max-model-len in benchmark_latency.py (#5629) 7d46c8d3 [Bugfix] Fix sampling_params passed incorrectly in Phi3v example (#5684) da971ec7 [Model] Add FP8 kv cache for Qwen2 (#5656) 3eea7488 [misc][distributed] use 127.0.0.1 for single-node (#5619) f758aed0 [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices (#5641) e5150f2c [Bugfix] Added test for 
sampling repetition penalty bug. (#5659) 59a1eb59 [Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628) 6820724e [Bugfix] Fix w8a8 benchmarks for int8 case (#5643) b23ce920 [Bugfix] Fix CUDA version check for mma warning suppression (#5642) 2bd231a7 [Doc] Added cerebrium as Integration option (#5553) 8a173382 [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties (#5639) 07feecde [Model] LoRA support added for command-r (#5178) 19091efc [ci] Setup Release pipeline and build release wheels with cache (#5610) 95db455e [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542) 7879f24d [Misc] Add OpenTelemetry support (#4687) 13db4369 [ci] Deprecate original CI template (#5624) 4ad7b53e [CI/Build][Misc] Update Pytest Marker for VLMs (#5623) f0cc0e68 [Misc] Remove import from transformers logging (#5625) db5ec52a [bugfix][distributed] improve p2p capability test (#5612) 114d7270 [CI] Avoid naming different metrics with the same name in performance benchmark (#5615) 32c86e49 [Misc] Fix typo (#5618) 8eadcf0b [misc][typo] fix typo (#5620) 5002175e [Kernel] Add punica dimensions for Granite 13b (#5559) daef218b [Model] Initialize Phi-3-vision support (#4986) fa9e3852 [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131) 26e1188e [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (#5606) a3e8a05d [Bugfix] Fix KV head calculation for MPT models when using GQA (#5142) e441bad6 [Optimization] use a pool to reuse LogicalTokenBlock.token_ids (#5584) 1b44aaf4 [bugfix][distributed] fix 16 gpus local rank arrangement (#5604) 9e4e6fe2 [CI] the readability of benchmarking and prepare for dashboard (#5571) ab66536d [CI/BUILD] Support non-AVX512 vLLM building and testing (#5574) 728c4c8a [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) 1f12122b [Misc] use AutoTokenizer for benchmark serving when vLLM not installed 
(#5588) 890d8d96 [Kernel] `compressed-tensors` marlin 24 support (#5435) 9e74d9d0 Correct alignment in the seq_len diagram. (#5592) 9333fb8e [Model] Rename Phi3 rope scaling type (#5595) e2b85cf8 Fix w8a8 benchmark and add Llama-3-8B (#5562) 845a3f26 [Doc] add debugging tips for crash and multi-node debugging (#5581) f07d5133 [build][misc] limit numpy version (#5582) 4a676905 [CI][BugFix] Flip is_quant_method_supported condition (#5577) f31c1f90 Add basic correctness 2 GPU tests to 4 GPU pipeline (#5518) 3ce2c050 [Fix] Correct OpenAI batch response format (#5554) 1c0afa13 [BugFix] Don't start a Ray cluster when not using Ray (#5570) d919ecc7 add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 (#5145) e691918e [misc] Do not allow to use lora with chunked prefill. (#5538) 81fbb365 [CI/Build] Test both text and token IDs in batched OpenAI Completions API (#5568) 0e9164b4 [mypy] Enable type checking for test directory (#5017) 1b8a0d71 [Core][Bugfix]: fix prefix caching for blockv2 (#5364) bd7efe95 Add ccache to amd (#5555) f5bb85b4 [Core][Distributed] improve p2p cache generation (#5528) 28c145eb [Bugfix] Fix typo in Pallas backend (#5558) e2afb03c [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460) 6e2527a7 [Doc] Update documentation on Tensorizer (#5471) cdab68dc [Docs] Add ZhenFund as a Sponsor (#5548) d1c3d7d1 [misc][distributed] fix benign error in `is_in_the_same_node` (#5512) 77490c6f [Core] Remove duplicate processing in async engine (#5525) 48f589e1 [mis] fix flaky test of test_cuda_device_count_stateless (#5546) 348616ac [Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401) 15985680 [ Misc ] Rs/compressed tensors cleanup (#5432) d74674bb [Misc] Fix arg names (#5524) 703475f6 [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) d47af2bc [CI/Build] Disable LLaVA-NeXT CPU test (#5529) 319ad7f1 [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label 
(#5073) 0f0d8bc0 bump version to v0.5.0.post1 (#5522) 55d6361b [Misc] Fix arg names in quantizer script (#5507) cd9c0d65 [Hardware][Intel] Support CPU inference with AVX2 ISA (#5452) 50eed24d Add `cuda_device_count_stateless` (#5473) e38042d4 [Kernel] Disable CUTLASS kernels for fp8 (#5505) 33e3b372 [CI/Build] Disable test_fp8.py (#5508) 1696efe6 [misc] fix format.sh (#5511) 6b0511a5 Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478) a8fda4f6 Seperate dev requirements into lint and test (#5474) 30299a41 [MISC] Remove FP8 warning (#5472) 85657b56 [Kernel] Factor out epilogues from cutlass kernels (#5391) 0ce7b952 [Doc] Update LLaVA docs (#5437) 39873476 [CI/Build] Simplify OpenAI server setup in tests (#5100) 03dccc88 [Misc] Add vLLM version getter to utils (#5098) a65634d3 [Docs] Add 4th meetup slides (#5509) 80aa7e91 [Hardware][Intel] Optimize CPU backend and add more performance tips (#4971) bd439735 [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497) 23ec72fa [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466) c2637a61 [Kernel] `w4a16` support for `compressed-tensors` (#5385) 88407532 [Bugfix]if the content is started with ":"(response of ping), client should i… (#5303) 916d219d [ci] Use sccache to build images (#5419) ea3890a5 [Core][Distributed] code deduplication in tp&pp with coordinator(#5293) 2135cacb [Bugfix] Fix wrong multi_modal_input format for CPU runner (#5451) 7d19de2e [Frontend] Add "input speed" to tqdm postfix alongside output speed (#5425) 94a07bbd [Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470) b8d4dfff [Doc] Update debug docs (#5438) 622d4512 [misc] add hint for AttributeError (#5462) 51602eef [Frontend] [Core] Support for sharded tensorized models (#4990) 5cc50a53 [Bugfix] TYPE_CHECKING for MultiModalData (#5444) 5985e342 [Kernel] Vectorized FP8 quantize kernel (#5396) 8b82a899 [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft 
fail for GPU tests (#5464) c3c2903e [Bugfix] Add device assertion to TorchSDPA (#5402) 1a8bfd92 [Hardware] Initial TPU integration (#5292) 847cdcca [CI] Upgrade codespell version. (#5381) e3c12bf6 Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" (#5463) 3dd6853b [CI/Build] Add `is_quant_method_supported` to control quantization test configurations (#5253) 8f89d720 [Doc] add common case for long waiting time (#5430) 99dac099 [Core][Doc] Default to multiprocessing for single-node distributed case (#5230) c4bd03c7 [Core][Distributed] add same-node detection (#5369) dcbf4286 [Frontend] Customizable RoPE theta (#5197) 00e6a2dc [Bugfix] fix lora_dtype value type in arg_utils.py (#5398) 2e02311a [Bugfix] Fix `MultiprocessingGPUExecutor.check_health` when world_size == 1 (#5254) 89ec06c3 [Docs] [Spec decode] Fix docs error in code example (#5427) 9fde251b [Doc] Add an automatic prefix caching section in vllm documentation (#5324) 4c2ffb28 [Speculative decoding] Initial spec decode docs (#5400) 246598a6 [CI] docfix (#5410) 8bab4959 [Misc] Remove VLLM_BUILD_WITH_NEURON env variable (#5389) 3c4cebf7 [Doc][Typo] Fixing Missing Comma (#5403) d8f31f2f [Doc] add debugging tips (#5409) 640052b0 [Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026) 351d5e7b [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312) a0086298 [Misc] Various simplifications and typing fixes (#5368) 76477a93 [ci] Fix Buildkite agent path (#5392) 77c87beb [Doc] Add documentation for FP8 W8A8 (#5388) 114332b8 Bump version to v0.5.0 (#5384) cb77ad83 [Docs] Alphabetically sort sponsors (#5386) 856c9900 [Docs] Add Docs on Limitations of VLM Support (#5383) c5602f0b [ci] Mount buildkite agent on Docker container to upload benchmark results (#5330) f7f9c5f9 [ci] Use small_cpu_queue for doc build (#5331) 2c0d9335 [Bugfix] Fix LLaVA-NeXT (#5380) 774d1035 [Feature][Frontend]: Continued `stream_options` implementation also in 
CompletionRequest (#5319) 6b29d6fe [Model] Initial support for LLaVA-NeXT (#4199) 0bfa1c4f [Misc] Improve error message when LoRA parsing fails (#5194) c81da5f5 [misc][typo] fix typo (#5372) 68bc8170 [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server (#5374) 5884c2b4 [Misc] Update to comply with the new `compressed-tensors` config (#5350) 45f92c00 [Bugfix] Fix KeyError: 1 When Using LoRA adapters (#5164) 5467ac31 [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) 5d7e3d01 [mis][ci/test] fix flaky test in test_sharded_state_loader.py (#5361) 0373e183 [Core][CUDA Graph] add output buffer for cudagraph (#5074) c09dade2 [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) 8ea5e44a [CI/Test] improve robustness of test (vllm_runner) (#5357) 9fb900f9 [CI/Test] improve robustness of test (hf_runner) (#5347) c96fc067 [ROCm][AMD] Use pytorch sdpa math backend to do naive attention (#4965) b3376e5c [Misc] Add args for selecting distributed executor to benchmarks (#5335) e69ded7d [Bug Fix] Fix the support check for FP8 CUTLASS (#5352) 767c727a fix DbrxFusedNormAttention missing cache_config (#5340) 6840a716 [Misc] Remove unused cuda_utils.h in CPU backend (#5345) 7a9cb294 [Frontend] Add OpenAI Vision API Support (#5237) ca3ea51b [Kernel] Dynamic Per-Token Activation Quantization (#5037) dc49fb89 Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#5296) 18a277b5 Remove Ray health check (#4693) 8d75fe48 [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) 388596c9 [Misc][Utils] allow get_open_port to be called for multiple times (#5333) baa15a9e [Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` (#5135) 15063741 [Misc] Missing error message for custom ops import (#5282) ccdc490d [Core] Change LoRA embedding sharding to support loading methods (#5038) a31cab75 [Core] Avoid copying prompt/output tokens if no penalties are used (#5289) 
828da0d4 [Frontend] enable passing multiple LoRA adapters at once to generate() (#5300) abe855d6 [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294) 4efff036 Bugfix: fix broken of download models from modelscope (#5233) 89c92078 [CI/Build] Update vision tests (#5307) 7b0a0dfb [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109) 3a6ae1d3 [CI] Disable flash_attn backend for spec decode (#5286) 8f1729b8 [Docs] Add Ray Summit CFP (#5295) 6a7c7711 [Misc] Skip for logits_scale == 1.0 (#5291) 0f83ddd4 [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. (#5290) 065aff6c [Bugfix] Make EngineArgs use named arguments for config construction (#5285) 3d33e372 [BugFix] Fix log message about default max model length (#5284) faf71bcd [Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252) f270a395 [Docs] Add Sequoia as sponsors (#5287) 51a08e7d [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238) eb8fcd26 [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM (#5207) 5563a4de [Model] Correct Mixtral FP8 checkpoint loading (#5231) ccd4f129 [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157) 02cc3b51 [misc] benchmark_serving.py -- add ITL results and tweak TPOT results (#5263) d5b1eb08 [CI] Add nightly benchmarks (#5260) f0a50054 [Frontend] OpenAI API server: Add `add_special_tokens` to ChatCompletionRequest (default False) (#5278) c65146e7 [Misc] Fix docstring of get_attn_backend (#5271) 41ca62cf [Misc] Add CustomOp interface for device portability (#5255) 974fc9b8 [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226) fee4dcc3 [Misc] update collect env (#5261) 650a4cc5 [Misc] Add transformers version to collect_env.py (#5259) 9ca62d86 [CI] mark AMD test as softfail to prevent blockage (#5256) 45c35f0d [CI/Build] Reducing CPU CI execution time (#5241) 9ba093b4 [CI/Build] Simplify model loading for 
`HfRunner` (#5251) 27208be6 [Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242) 87d5abef [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend (#5249) ec784b25 [CI/Build] Add inputs tests (#5215) a58f24e5 [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor (#5229) f42a006b [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (#5210) 3a434b07 [Kernel] Enhance MoE benchmarking & tuning script (#4921) bd0e7802 [Bugfix] Add warmup for prefix caching example (#5235) 06b2550c [Bugfix] Support `prompt_logprobs==0` (#5217) f775a07e [FRONTEND] OpenAI `tools` support named functions (#5032) 4f0d17c0 New CI template on AWS stack (#5110) 10c38e3e [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) cafb8e06 [CI/BUILD] enable intel queue for longer CPU tests (#4113) cbb2f59c [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) 0ab278ca [Core] Remove unnecessary copies in flash attn backend (#5138) 7a64d24a [Core] Support image processor (#4197) dfbe60dc [Misc] Simplify code and fix type annotations in `conftest.py` (#5118) a66cf40b [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) f790ad3c [Frontend][OpenAI] Support for returning max_model_len on /v1/models response (#4643) ed59a7ed Update test_ignore_eos (#4898) 044793d8 [BugFix] Prevent `LLM.encode` for non-generation Models (#5184) c2d6d2f9 [Bugfix]: Fix issues related to prefix caching example (#5177) (#5180) 8279078e [Bugfix] Remove deprecated @abstractproperty (#5174) b9c0605a [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) 37464a0f [Bugfix] Fix call to init_logger in openai server (#4765) c3540728 [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (#5151) f081c3ce [Kernel] Update Cutlass fp8 configs (#5144) 260d119e [Kernel] Refactor CUTLASS kernels to always take scales that reside on the 
GPU (#5137) a360ff80 [CI/Build] CMakeLists: build all extensions' cmake targets at the same time (#5034) 1197e021 [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (#5168) 65757911 [Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support (#5171) e9899fb7 [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) a377f0bd [Misc]: optimize eager mode host time (#4196) e9d3aa04 Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149) a22dea54 [Model] Support MAP-NEO model (#5081) 533c2177 Fix cutlass sm_90a vesrion in CMakeList 6d21fa1c [Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136) b35be540 [Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120) 45a1a69b [Build] Disable sm_90a in cu11 (#5141) 87a658c8 Bump version to v0.4.3 (#5046) 429d8972 add doc about serving option on dstack (#3074) a9bcc7af [Doc] Use intersphinx and update entrypoints docs (#5125) d79d9eaa [Misc] remove duplicate definition of `seq_lens_tensor` in model_runner.py (#5129) f758505c [CI/Build] increase wheel size limit to 200 MB (#5130) d910816c [Bugfix] Automatically Detect SparseML models (#5119) 87d41c84 [BUGFIX] [FRONTEND] Correct chat logprobs (#5029) e07aff9e [CI/Build] Docker cleanup functionality for amd servers (#5112) 5bf185a1 [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#5108) 4fbcb0f2 [Doc][Build] update after removing vllm-nccl (#5103) 7c3604fb [Bugfix] logprobs is not compatible with the OpenAI spec #4795 (#5031) b1c25563 [Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099) eb6c50cd [Bugfix][CI/Build] Fix codespell failing to skip files in `git diff` (#5097) eecd8643 [Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` (#5096) ae495c74 [Doc]Replace deprecated flag in readme (#4526) 4238bc82 [Core] Cross-attention KV 
caching and memory-management (towards eventual encoder/decoder model support) (#4837) 594392d2 [Core][Distributed] improve p2p access check (#4992) 18c1f16d [Bugfix] Fix arguments passed to `Sequence` in stop checker test (#5092) 5bd3c650 [Core][Optimization] remove vllm-nccl (#5091) 616e600e [Misc] add gpu_memory_utilization arg (#5079) dfba529b [Bugfix] Remove the last EOS token unless explicitly specified (#5077) 5ae5ed1e [Core] Consolidate prompt arguments to LLM engines (#4328) 290f4ada [Docs] Add Dropbox as sponsors (#5089) dd8de11f [Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951) 9ba41558 [BugFix] Fix Embedding Models with TP>1 (#5075) d4f39859 [Core] Sliding window for block manager v2 (#4545) 890aa93d [Model] Add support for falcon-11B (#5069) fbdb7b3e [Core] Allow AQLM on Pascal (#5058) 1102bef2 [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) git-subtree-dir: vllm git-subtree-split: c8a7e93273ff4338d6f89f8a63ff16426ac240b8
🐛 Describe the bug
Running vLLM in Docker with a fine-tuned Gemma model and these arguments:
--model /data/merged_model_GPTQ --max-model-len 8192 --max-num-seqs 1024 --served-model-name model --quantization gptq_marlin
fails with
RuntimeError: Some weights are not initialized from checkpoints: {'model.layers.3.mlp.gate_up_proj.g_idx_sort_indices', 'model.layers.8.self_attn.qkv_proj.g_idx_sort_indices', 'model.layers.9.mlp.gate_up_proj.g_idx_sort_indices .....
The same model loads and works with --quantization gptq.
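For readers trying to understand the error, here is a minimal sketch of how a "not initialized from checkpoints" check can flag a tensor that the checkpoint never contains. It is not vLLM's actual loader code: the function `find_uninitialized` and the weight names are hypothetical, modelled on the error message above, and it assumes that `g_idx_sort_indices` is derived at load time rather than stored in the checkpoint. The fix listed in the commit log (#5108) is titled "Ensure g_idx_sort_indices is not a Parameter", which is consistent with this reading.

```python
def find_uninitialized(expected_names, checkpoint_names):
    """Return the expected weight names that the checkpoint does not provide."""
    return set(expected_names) - set(checkpoint_names)


# Hypothetical weight names for one layer, modelled on the error message.
checkpoint = {
    "model.layers.3.mlp.gate_up_proj.qweight",
    "model.layers.3.mlp.gate_up_proj.scales",
    "model.layers.3.mlp.gate_up_proj.g_idx",
}

# Assumption: this tensor is computed after loading, so no checkpoint has it.
derived = "model.layers.3.mlp.gate_up_proj.g_idx_sort_indices"

# If the derived tensor is registered as a parameter, the loader expects it
# to come from the checkpoint and reports it as uninitialized.
expected_when_parameter = checkpoint | {derived}
missing = find_uninitialized(expected_when_parameter, checkpoint)
print(sorted(missing))

# If it is kept out of the parameter list, nothing is reported as missing.
expected_when_not_parameter = set(checkpoint)
print(sorted(find_uninitialized(expected_when_not_parameter, checkpoint)))
```

Under this assumption, the plain gptq path would not hit the check because it does not register the extra derived tensor.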