Skip to content

Conversation

@renovate
Copy link
Contributor

@renovate renovate bot commented May 8, 2025

This PR contains the following updates:

Package Change Age Adoption Passing Confidence
vllm ==0.6.3.post1 -> ==0.8.5 age adoption passing confidence

GitHub Vulnerability Alerts

CVE-2025-24357

Description

The vllm/model_executor/weight_utils.py implements hf_model_weights_iterator to load the model checkpoint, which is downloaded from huggingface. It use torch.load function and weights_only parameter is default value False. There is a security warning on https://pytorch.org/docs/stable/generated/torch.load.html, when torch.load load a malicious pickle data it will execute arbitrary code during unpickling.

Impact

This vulnerability can be exploited to execute arbitrary codes and OS commands in the victim machine who fetch the pretrained repo remotely.

Note that most models now use the safetensors format, which is not vulnerable to this issue.

References

CVE-2025-25183

Summary

Maliciously constructed prompts can lead to hash collisions, resulting in prefix cache reuse, which can interfere with subsequent responses and cause unintended behavior.

Details

vLLM's prefix caching makes use of Python's built-in hash() function. As of Python 3.12, the behavior of hash(None) has changed to be a predictable constant value. This makes it more feasible that someone could try exploit hash collisions.

Impact

The impact of a collision would be using cache that was generated using different content. Given knowledge of prompts in use and predictable hashing behavior, someone could intentionally populate the cache using a prompt known to collide with another prompt in use.

Solution

We address this problem by initializing hashes in vllm with a value that is no longer constant and predictable. It will be different each time vllm runs. This restores behavior we got in Python versions prior to 3.12.

Using a hashing algorithm that is less prone to collision (like sha256, for example) would be the best way to avoid the possibility of a collision. However, it would have an impact to both performance and memory footprint. Hash collisions may still occur, though they are no longer straight forward to predict.

To give an idea of the likelihood of a collision, for randomly generated hash values (assuming the hash generation built into Python is uniformly distributed), with a cache capacity of 50,000 messages and an average prompt length of 300, a collision will occur on average once every 1 trillion requests.

References

CVE-2025-29770

Impact

The outlines library is one of the backends used by vLLM to support structured output (a.k.a. guided decoding). Outlines provides an optional cache for its compiled grammars on the local filesystem. This cache has been on by default in vLLM. Outlines is also available by default through the OpenAI compatible API server.

The affected code in vLLM is vllm/model_executor/guided_decoding/outlines_logits_processors.py, which unconditionally uses the cache from outlines. vLLM should have this off by default and allow administrators to opt-in due to the potential for abuse.

A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a Denial of Service if the filesystem runs out of space.

Note that even if vLLM was configured to use a different backend by default, it is still possible to choose outlines on a per-request basis using the guided_decoding_backend key of the extra_body field of the request.

This issue applies to the V0 engine only. The V1 engine is not affected.

Patches

The fix is to disable this cache by default since it does not provide an option to limit its size. If you want to use this cache anyway, you may set the VLLM_V0_USE_OUTLINES_CACHE environment variable to 1.

Workarounds

There is no way to workaround this issue in existing versions of vLLM other than preventing untrusted access to the OpenAI compatible API server.

References

GHSA-ggpf-24jw-3fcw

Description

GHSA-rh4j-5rhw-hr54 reported a vulnerability where loading a malicious model could result in code execution on the vllm host. The fix applied to specify weights_only=True to calls to torch.load() did not solve the problem prior to PyTorch 2.6.0.

PyTorch has issued a new CVE about this problem: GHSA-53q9-r3pm-6pq6

This means that versions of vLLM using PyTorch before 2.6.0 are vulnerable to this problem.

Background Knowledge

When users install VLLM according to the official manual
image

But the version of PyTorch is specified in the requirements. txt file
image

So by default when the user install VLLM, it will install the PyTorch with version 2.5.1
image

In CVE-2025-24357, weights_only=True was used for patching, but we know this is not secure.
Because we found that using Weights_only=True in pyTorch before 2.5.1 was unsafe

Here, we use this interface to prove that it is not safe.
image

Fix

update PyTorch version to 2.6.0

Credit

This vulnerability was found By Ji'an Zhou and Li'shuo Song

CVE-2025-30202

Impact

In a multi-node vLLM deployment, vLLM uses ZeroMQ for some multi-node communication purposes. The primary vLLM host opens an XPUB ZeroMQ socket and binds it to ALL interfaces. While the socket is always opened for a multi-node deployment, it is only used when doing tensor parallelism across multiple hosts.

Any client with network access to this host can connect to this XPUB socket unless its port is blocked by a firewall. Once connected, these arbitrary clients will receive all of the same data broadcasted to all of the secondary vLLM hosts. This data is internal vLLM state information that is not useful to an attacker.

By potentially connecting to this socket many times and not reading data published to them, an attacker can also cause a denial of service by slowing down or potentially blocking the publisher.

Detailed Analysis

The XPUB socket in question is created here:

https://github.com/vllm-project/vllm/blob/c21b99b91241409c2fdf9f3f8c542e8748b317be/vllm/distributed/device_communicators/shm_broadcast.py#L236-L237

Data is published over this socket via MessageQueue.enqueue() which is called by MessageQueue.broadcast_object():

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L452-L453

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L475-L478

The MessageQueue.broadcast_object() method is called by the GroupCoordinator.broadcast_object() method in parallel_state.py:

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L364-L366

The broadcast over ZeroMQ is only done if the GroupCoordinator was created with use_message_queue_broadcaster set to True:

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L216-L219

The only case where GroupCoordinator is created with use_message_queue_broadcaster is the coordinator for the tensor parallelism group:

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L931-L936

To determine what data is broadcasted to the tensor parallism group, we must continue tracing. GroupCoordinator.broadcast_object() is called by GroupCoordinator.broadcoast_tensor_dict():

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L489

which is called by broadcast_tensor_dict() in communication_op.py:

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/communication_op.py#L29-L34

If we look at _get_driver_input_and_broadcast() in the V0 worker_base.py, we'll see how this tensor dict is formed:

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/worker/worker_base.py#L332-L352

but the data actually sent over ZeroMQ is the metadata_list portion that is split from this tensor_dict. The tensor parts are sent via torch.distributed and only metadata about those tensors is sent via ZeroMQ.

https://github.com/vllm-project/vllm/blob/54a66e5fee4a1ea62f1e4c79a078b20668e408c6/vllm/distributed/parallel_state.py#L61-L83

Patches

Workarounds

Prior to the fix, your options include:

  1. Do not expose the vLLM host to a network where any untrusted connections may reach the host.
  2. Ensure that only the other vLLM hosts are able to connect to the TCP port used for the XPUB socket. Note that port used is random.

References


Release Notes

vllm-project/vllm (vllm)

v0.8.5

Compare Source

This release contains 310 commits from 143 contributors (55 new contributors!).

Highlights

This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structure tag feature for tool calling.

Model Support
V1 Engine
  • Add structural_tag support using xgrammar (#​17085)
  • Disaggregated serving:
  • Clean up: Remove Sampler from Model Code (#​17084)
  • MLA: Simplification to batch P/D reordering (#​16673)
  • Move usage stats to worker and start logging TPU hardware (#​16211)
  • Support FlashInfer Attention (#​16684)
  • Faster incremental detokenization (#​15137)
  • EAGLE-3 Support (#​16937)
Features
  • Validate urls object for multimodal content parts (#​16990)
  • Prototype support sequence parallelism using compilation pass (#​16155)
  • Add sampling params to v1/audio/transcriptions endpoint (#​16591)
  • Enable vLLM to Dynamically Load LoRA from a Remote Server (#​10546)
  • Add vllm bench [latency, throughput] CLI commands (#​16508)
Performance
  • Attention:
    • FA3 decode perf improvement - single mma warp group support for head dim 128 (#​16864)
    • Update to lastest FA3 code (#​13111)
    • Support Cutlass MLA for Blackwell GPUs (#​16032)
  • MoE:
    • Add expert_map support to Cutlass FP8 MOE (#​16861)
    • Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 (#​16753)
  • Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#​6036)
  • Optimize rotary_emb implementation to use Triton operator for improved performance (#​16457)
Hardwares
  • TPU:
    • Enable structured decoding on TPU V1 (#​16499)
    • Capture multimodal encoder during model compilation (#​15051)
    • Enable Top-P (#​16843)
  • AMD:
    • AITER Fused MOE V1 Support (#​16752)
    • Integrate Paged Attention Kernel from AITER (#​15001)
    • Support AITER MLA (#​15893)
    • Upstream prefix prefill speed up for vLLM V1 (#​13305)
    • Adding fp8 and variable length sequence support to Triton FAv2 kernel (#​12591)
    • Add skinny gemms for unquantized linear on ROCm (#​15830)
    • Follow-ups for Skinny Gemms on ROCm. (#​17011)
Documentation
  • Add open-webui example (#​16747)
  • Document Matryoshka Representation Learning support (#​16770)
  • Add a security guide (#​17230)
  • Add example to run DeepSeek with Ray Serve LLM (#​17134)
  • Benchmarks for audio models (#​16505)
Security and Dependency Updates
  • Don't bind tcp zmq socket to all interfaces (#​17197)
  • Use safe serialization and fix zmq setup for mooncake pipe (#​17192)
  • Bump Transformers to 4.51.3 (#​17116)
Build and testing
  • Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#​16721)
Breaking changes 🚨

What's Changed


Configuration

📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

@murthyrudra murthyrudra merged commit 96fbdcd into main May 8, 2025
1 check passed
@murthyrudra murthyrudra deleted the renovate/pypi-vllm-vulnerability branch May 8, 2025 11:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants