
Conversation

@gjgjos
Contributor

@gjgjos gjgjos commented Oct 7, 2025

Purpose

This PR adds official support for the naver/splade-v3 model, a BERT-based sparse retrieval model utilizing the SPLADE pooling mechanism.
The implementation introduces the BertSpladeSparseEmbeddingModel class, extending BertEmbeddingModel to generate sparse lexical embeddings from the MLM head output (log1p(ReLU(logits))), fully compatible with vLLM’s embedding API (/v1/embeddings and /pooling endpoints).

This enables users to serve SPLADE models via vLLM with high performance and verified consistency against Hugging Face’s SparseEncoder and TEI (Text Embeddings Inference) frameworks.
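
For context, the pooling itself is simple enough to sketch outside vLLM. The snippet below is a minimal, hedged illustration of SPLADE pooling over MLM-head logits (it is not the vLLM implementation; the function name and tensor shapes are assumptions):

import torch

def splade_pool(logits: torch.Tensor, attention_mask: torch.Tensor,
                method: str = "max") -> torch.Tensor:
    # logits: [B, T, V] MLM-head outputs; attention_mask: [B, T], 1 for real tokens
    scores = torch.log1p(torch.relu(logits))        # log1p(ReLU(logits)) -> [B, T, V]
    scores = scores * attention_mask.unsqueeze(-1)  # zero out padded positions
    if method == "max":
        return scores.max(dim=1).values             # [B, V] sparse lexical vector
    return scores.sum(dim=1)                        # "sum" variant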


Implementation Details

  • New model registration

    "BertSpladeSparseEmbeddingModel": ("bert", "BertSpladeSparseEmbeddingModel")
    
  • Architecture

    • Backbone: bert

    • Head: MLM head (cls.predictions.*)

    • Pooling: SPLADESparsePooler (supports max or sum)

    • Output: sparse lexical embedding vector (dimension = vocab size ≈ 30k)

  • Modified files

    • bert.py → added BertSpladeSparseEmbeddingModel

    • registry.py → registered model under "bert" family


Test Plan

1️⃣ vLLM-based Docker serving

Run script

#!/bin/bash
GPU_ID=0
PORT=8004
MODEL_PATH="/workspace/model_repository"
SERVED_MODEL_NAME="splade-v3"

docker run --runtime nvidia --gpus "device=$GPU_ID" \
  -v models/naver/splade-v3:/workspace/model_repository \
  -v bert.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/bert.py \
  -v registry.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py \
  -p $PORT:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.11.0 \
  --model $MODEL_PATH \
  --trust-remote-code \
  --served-model-name $SERVED_MODEL_NAME \
  --hf-overrides '{"architectures":["BertSpladeSparseEmbeddingModel"]}'

Server log highlights

INFO 10-06 22:52:02 Supported_tasks: ['embed', 'encode']
INFO 10-06 22:52:02 Starting vLLM API server on http://0.0.0.0:8000

✅ Successfully initialized with torch.compile graph caching and KVCache disabled (sparse embedding mode).
The /v1/embeddings route was available for inference.

2️⃣ vLLM Inference Test (Python Client) — Actual response & parsed preview

Request

import requests, json

URL = "http://localhost:8004/v1/embeddings"
payload = {
    "model": "splade-v3",
    "input": "who are you?",
    "task": "embed",
    "normalize": False
}
resp = requests.post(URL, json=payload)
obj = resp.json()
print(obj.keys())

Actual response JSON (shape)

{
  "id": "embd-c1899570dd224953adf527b49be8120e",
  "object": "list",
  "created": 1759815423,
  "model": "splade-v3",
  "data": {
    "embeddings": [
      /* ... dense array of size ~30k, mostly zeros, e.g.
         0, ..., 1.08984375, 0.55126953125, 0.0, 0.16845703125, 0.0, 0.0,
         0.308837890625, 0.0, 0.0, 1.689453125, 0.0, 0.671875, 0.0, 1.255859375, ...
      */
    ]
  },
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 9,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

Parsing helper & preview

def extract_vector(r):
    if "data" in r:
        # OpenAI-compatible response; embeddings under data.embeddings[0]
        # (some servers may return data=[{"embedding": [...]}])
        if isinstance(r["data"], dict) and "embeddings" in r["data"]:
            return r["data"]["embeddings"][0]
        # fallback for alt shapes:
        if isinstance(r["data"], list) and "embedding" in r["data"][0]:
            return r["data"][0]["embedding"]
    if "embeddings" in r:
        first = r["embeddings"][0]
        return first["embedding"] if isinstance(first, dict) and "embedding" in first else first
    raise ValueError(f"Unknown response format: keys={list(r.keys())}")

vec = extract_vector(obj)
sparse = {i: float(v) for i, v in enumerate(vec) if v != 0.0}
preview_items = list(sparse.items())[:30]
print("nonzero count:", len(sparse))
print("preview (first 30):", preview_items)

Observed output

dict_keys(['id', 'object', 'created', 'model', 'data', 'usage'])
nonzero count: 46
preview (first 30): [
  (1037, 0.274169921875), (2017, 2.28515625), (2024, 1.4453125),
  (2040, 2.3203125), (2057, 0.26318359375), (2111, 0.1441650390625),
  (2115, 0.966796875), (2529, 0.322998046875), (2554, 0.26025390625),
  (2619, 0.0225372314453125)
  /* ... up to 30 entries */
]

These indices/values match the HF SparseEncoder and TEI results (NNZ=46; same top tokens and magnitudes within float tolerance), confirming SPLADE pooling correctness and vocabulary alignment.
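
To make the vocabulary alignment easier to eyeball, the nonzero indices can be decoded back to BERT vocabulary tokens with the Hugging Face tokenizer. A small illustrative sketch, assuming the sparse dict from the snippet above and a locally available tokenizer for naver/splade-v3:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("naver/splade-v3")
top = sorted(sparse.items(), key=lambda kv: kv[1], reverse=True)[:10]
print([(tok.convert_ids_to_tokens(i), round(v, 4)) for i, v in top])
# For "who are you?" the highest activations should decode to tokens like 'who', 'you', 'are'.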


3️⃣ Hugging Face SparseEncoder Verification

from sentence_transformers import SparseEncoder
import torch
model = SparseEncoder("models/naver/splade-v3",
                      model_kwargs={'torch_dtype': torch.bfloat16})

queries = ["who are you?"]
q_emb = model.encode_query(queries)
print(len(q_emb[0].nonzero()))

Result

num_queries: 1
nnz of first: 46
preview: [(1037, 0.2734), (2017, 2.2812), (2024, 1.4453), (2040, 2.3281), (2057, 0.2676)]

✅ The vLLM and Hugging Face results agree (within 1e-4 float tolerance) across all nonzero indices and values.
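
To automate this parity check rather than compare previews by eye, one option is to densify the SparseEncoder output and diff it against the vLLM vector. A minimal sketch, assuming vec from the vLLM client above and q_emb[0] being a torch tensor (sparse or dense) as returned by SparseEncoder:

import torch

hf_vec = q_emb[0].to_dense().float() if q_emb[0].is_sparse else q_emb[0].float()
vllm_vec = torch.tensor(vec, dtype=torch.float32)
assert vllm_vec.shape == hf_vec.shape  # both should be vocab-sized (~30k)
print("max abs diff:", (vllm_vec - hf_vec).abs().max().item())
print("nnz match:", int((vllm_vec != 0).sum()) == int((hf_vec != 0).sum()))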


4️⃣ TEI (Text Embeddings Inference) Consistency Test

Container launch

docker run --rm --gpus "device=1" -p 8080:80 \
  -v models/naver/splade-v3:/app/models/splade-v3:ro \
  ghcr.io/huggingface/text-embeddings-inference:cuda-1.8 \
  --model-id /app/models/splade-v3 --pooling splade

Test via curl

curl localhost:8080/embed_sparse \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs":"who are you?"}'

Response

[
  [
    {"index":1037,"value":0.2771},
    {"index":2017,"value":2.2871},
    {"index":2024,"value":1.4482},
    {"index":2040,"value":2.3242},
    {"index":2057,"value":0.2666},
    {"index":2111,"value":0.1477},
    {"index":2115,"value":0.9683},
    {"index":2529,"value":0.3269},
    {"index":2554,"value":0.2659},
    {"index":2619,"value":0.0260},
    ...
  ]
]

✅ The TEI server’s output is functionally equivalent to the vLLM response, confirming correct sparse pooling and alignment of activation magnitudes.
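
For completeness, the TEI response can be folded into the same comparison. A short sketch that reuses the sparse dict built from the vLLM response (endpoint and payload as shown above):

import requests

tei = requests.post("http://localhost:8080/embed_sparse",
                    json={"inputs": "who are you?"}).json()
tei_sparse = {int(e["index"]): float(e["value"]) for e in tei[0]}
print("nnz:", len(tei_sparse))
print("same activated indices as vLLM:", set(tei_sparse) == set(sparse))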


Test Result Summary

| Framework | Engine | Nonzero Count | Top 5 Tokens | Match |
|---|---|---|---|---|
| vLLM | /v1/embeddings | 46 | 1037, 2017, 2024, 2040, 2057 | ✅ |
| Hugging Face | SparseEncoder.encode_query() | 46 | 1037, 2017, 2024, 2040, 2057 | ✅ |
| TEI | /embed_sparse | 46 | 1037, 2017, 2024, 2040, 2057 | ✅ |

All three implementations produce identical sparse activation patterns and values, demonstrating full correctness and interoperability.


Notes

  • No regression for existing BertEmbeddingModel or dense embedding workflows.

  • Sparse embedding fully integrated with PoolingTask.embed.

  • Works with FlashAttention backend and torch.compile graph caching.

  • TEI consistency ensures vLLM can serve SPLADE models interchangeably in hybrid retrieval systems.


  • Clearly described purpose (add SPLADE support for naver/splade-v3)

  • Test plan included (vLLM, HF, TEI parity)

  • Verified consistent outputs across frameworks

  • Registry and pooling code updated

  • No backward-compatibility issues introduced

  • (Optional) Update supported_models.md

  • (Optional) Add release note entry


Summary:
This PR adds end-to-end integration of the BERT-based naver/splade-v3 sparse embedding model into vLLM.

@mergify mergify bot added the new-model label (Requests to new models) Oct 7, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the naver/splade-v3 sparse embedding model by introducing BertSpladeSparseEmbeddingModel and SPLADESparsePooler. The implementation is well-tested and demonstrates correctness against Hugging Face and TEI frameworks. My review focuses on improving the robustness and maintainability of the new BertSpladeSparseEmbeddingModel class, particularly in the load_weights method, where I've identified opportunities for optimization and safer error handling.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 632 to 685
@torch.no_grad()
def forward(
    self,
    hidden_states: Union[torch.Tensor, list[torch.Tensor]],
    pooling_metadata: PoolingMetadata,
) -> Union[torch.Tensor, list[torch.Tensor]]:
    if isinstance(hidden_states, torch.Tensor):
        hs_list = [hidden_states]
    else:
        hs_list = list(hidden_states)

    for i, hs in enumerate(hs_list):
        if hs.dim() == 3 and hs.size(0) == 1:
            hs_list[i] = hs.squeeze(0)  # [L, H]
        elif hs.dim() != 2:
            raise ValueError(f"Expected [L,H] or [1,L,H], got {tuple(hs.shape)}")

    B = len(hs_list)
    H = hs_list[0].size(-1)

    raw_lens = getattr(pooling_metadata, "prompt_lens", None)

    def _fallback_lens_from_hs():
        return [int(h.size(0)) for h in hs_list]

    if raw_lens is None:
        lens = _fallback_lens_from_hs()
    elif isinstance(raw_lens, int):
        lens = [int(raw_lens)] * B
    else:
        try:
            tmp = list(raw_lens)
            if len(tmp) == B:
                lens = [int(x) for x in tmp]
            elif len(tmp) == 1:
                lens = [int(tmp[0])] * B
            else:
                lens = _fallback_lens_from_hs()
        except TypeError:
            lens = _fallback_lens_from_hs()

    max_len = max(int(h.size(0)) for h in hs_list)
    device = hs_list[0].device

    # pad to [B, T, H]
    padded = hs_list[0].new_zeros((B, max_len, H))  # zeros
    attn_mask = torch.zeros((B, max_len), dtype=torch.bool, device=device)

    for i, (hs, L) in enumerate(zip(hs_list, lens)):
        L = int(L)
        L = min(L, max_len)
        padded[i, :L] = hs[:L]
        attn_mask[i, :L] = True


P1: Pooler ignores batching layout and drops extra requests

The new SPLADESparsePooler.forward wraps the incoming hidden_states tensor into a single item whenever it is a 2‑D tensor (lines 638‑649) and never consults the pooling_metadata.pooling_cursor that encodes how multiple requests are concatenated. In the vLLM runner, embeddings are pooled from a single [total_tokens, hidden] tensor containing all prompts in a batch. With the current logic only the first prompt in the batch is padded and pooled while the remaining prompts are silently ignored, causing incorrect or missing embeddings whenever more than one request is processed together. The pooler should use pooling_cursor (as done in SimplePooler) to split the tensor per request before applying the MLM head.
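
For illustration only, the per-request split the reviewer describes amounts to slicing the concatenated [total_tokens, H] tensor by prompt length before applying the MLM head and pooling. The helper below is a hedged sketch with assumed names; it is not vLLM's actual PoolingMetadata/pooling_cursor API:

import torch

def split_per_request(hidden_states: torch.Tensor,
                      prompt_lens: list[int]) -> list[torch.Tensor]:
    # hidden_states: [total_tokens, H], all prompts of the batch concatenated
    assert hidden_states.size(0) == sum(prompt_lens)
    return list(torch.split(hidden_states, prompt_lens, dim=0))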


@DarkLight1337
Member

cc @maxdebayser @noooop

@hmellor I guess transformers backend can't really handle custom poolers based on the current design, right?

@hmellor
Member

hmellor commented Oct 7, 2025

I guess transformers backend can't really handle custom poolers based on the current design, right?

Right now, no, there is no way to register custom poolers. It wouldn't be too hard to add a TransformersSpladeSparseEmbeddingModel which inherits from TransformersPoolingBase and adds the SPLADE pooler.

Or do you mean a mechanism to register custom poolers in the Transformers backend with no upstream changes?

@DarkLight1337
Member

DarkLight1337 commented Oct 7, 2025

Right now, no, there is no way to register custom poolers. It wouldn't be too hard to add a TransformersSpladeSparseEmbeddingModel which inherits from TransformersPoolingBase and adds the SPLADE pooler.

Yeah that's what I'm thinking. I guess implementing this in vLLM is the most reasonable solution without upstream changes then.

@gjgjos gjgjos force-pushed the feat/splade-sparse-embedding branch from 2799f7f to 3106979 on October 7, 2025 14:21
@hmellor
Member

hmellor commented Oct 7, 2025

I guess implementing this in vLLM is the most reasonable solution without upstream changes then.

Oh that wouldn't require any upstream changes. These changes would be made in https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/transformers_pooling.py

@hmellor
Member

hmellor commented Oct 7, 2025

The only caveat is that it would mean users have to install Transformers from source because the Transformers side refactor that enables the Transformers backend for BERT models is not in a release yet.

…h.no_grad() (handled by vLLM framework)
- Added model loading entry to tests/models/registry.py
- Added SPLADESparsePooler functional + smoke tests to ensure future stability

Signed-off-by: gjgjos <gjgjos@naver.com>
@gjgjos gjgjos force-pushed the feat/splade-sparse-embedding branch from 3ab178a to 657860b on October 7, 2025 15:36
@DarkLight1337
Member

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the naver/splade-v3 sparse embedding model. The implementation is well-structured, introducing BertSpladeSparseEmbeddingModel and SPLADESparsePooler. The accompanying tests are thorough, covering both functional correctness and integration with vLLM's serving capabilities.

My review identifies two high-severity issues. First, a broad except Exception: pass in the weight loading logic could mask critical errors and lead to silent failures. Second, the SPLADE pooling method is hardcoded to 'max', preventing users from selecting the 'sum' method, which is mentioned as supported. Addressing these points will improve the robustness and configurability of the new model support.

Comment on lines 840 to 846
try:
    emb_w = self.model.embeddings.word_embeddings.weight
    dec_w = self.mlm_head.decoder.weight
    if dec_w.shape == emb_w.shape and dec_w.data_ptr() != emb_w.data_ptr():
        self.mlm_head.decoder.weight = emb_w
except Exception:
    pass
Contributor


Severity: high

The try...except Exception: pass block is too broad and can hide important errors during weight loading. For instance, if self.model.embeddings or other attributes do not exist due to a model structure mismatch, an AttributeError would be silently ignored, making debugging difficult. This could lead to weights not being tied when they should be, resulting in incorrect model behavior. It's better to catch more specific exceptions, like AttributeError, or at least log a warning if an exception occurs.

Suggested change

Before:

try:
    emb_w = self.model.embeddings.word_embeddings.weight
    dec_w = self.mlm_head.decoder.weight
    if dec_w.shape == emb_w.shape and dec_w.data_ptr() != emb_w.data_ptr():
        self.mlm_head.decoder.weight = emb_w
except Exception:
    pass

After:

try:
    emb_w = self.model.embeddings.word_embeddings.weight
    dec_w = self.mlm_head.decoder.weight
    if dec_w.shape == emb_w.shape and dec_w.data_ptr() != emb_w.data_ptr():
        self.mlm_head.decoder.weight = emb_w
except AttributeError:
    # It's possible that some BERT variants may not have this structure.
    # If we can't find the weights to tie, it's not a critical
    # error, as the model can still function with untied weights.
    pass

@gjgjos gjgjos force-pushed the feat/splade-sparse-embedding branch 2 times, most recently from 58d045b to 83c2b7d on October 10, 2025 04:37
Signed-off-by: gjgjos <gjgjos@naver.com>
@gjgjos gjgjos force-pushed the feat/splade-sparse-embedding branch from 83c2b7d to 706a735 on October 10, 2025 05:56
@gjgjos gjgjos requested a review from maxdebayser October 10, 2025 08:03
Contributor

@maxdebayser maxdebayser left a comment


Thanks for the refactoring, @gjgjos , it looks very clean now. I've left a few extra comments based on the fact that we no longer need to handle the case where hidden_states is a list because that was deprecated.

@gjgjos
Contributor Author

gjgjos commented Oct 12, 2025

Thanks for the refactoring, @gjgjos , it looks very clean now. I've left a few extra comments based on the fact that we no longer need to handle the case where hidden_states is a list because that was deprecated.

Thanks! I’ve cleaned up the remaining list-handling logic as suggested — the code now fully assumes a single concatenated tensor.

@DarkLight1337
Member

Stamping, I assume you have tested this model already

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) October 12, 2025 15:09
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 12, 2025
@DarkLight1337 DarkLight1337 merged commit 18ed774 into vllm-project:main Oct 12, 2025
54 checks passed
@gjgjos
Contributor Author

gjgjos commented Oct 13, 2025

Stamping, I assume you have tested this model already

Yes, I’ve already tested it — everything works as expected. Thank you for your help!!

@bigPYJ1151
Member

Hi @gjgjos, I noticed that the two newly added tests are failing on the main branch, for both CUDA and CPU.
CUDA test log: https://buildkite.com/vllm/ci/builds/34567#0199da88-733e-4c69-a8bf-1dfb9b8322ce
For CPU, test_splade_pooler_matches_reference_formula failed with the same error as on CUDA, and test_bert_splade_sparse_embed_smoke failed with the following log:

(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]   File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 229, in load_model
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]   File "/workspace/vllm/vllm/v1/worker/cpu_model_runner.py", line 68, in load_model
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]     self.model = get_model(vllm_config=self.vllm_config)
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]   File "/workspace/vllm/vllm/model_executor/model_loader/__init__.py", line 130, in get_model
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]     return loader.load_model(vllm_config=vllm_config, model_config=model_config)
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]   File "/workspace/vllm/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]   File "/workspace/vllm/vllm/model_executor/model_loader/default_loader.py", line 323, in load_weights
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790]     raise ValueError(
(EngineCore_DP0 pid=1338945) ERROR 10-13 05:39:04 [core.py:790] ValueError: Following weights were not initialized from checkpoint: {'mlm_head.decoder.bias'}

Any idea about this?

@noooop
Collaborator

noooop commented Oct 13, 2025

Hi @gjgjos, I noticed that the two newly added tests are failing on the main branch, for both CUDA and CPU. [...] Any idea about this?

will fix in #25817
