
Conversation

@maxdebayser
Contributor

@maxdebayser maxdebayser commented Mar 9, 2025

FIX #13609
FIX #15384
FIX #18469

Here I'm loading the extra sparse_linear.pt file using the secondary_weights loading mechanism introduced in the ultravox model when I detect that the model name is BAAI/bge-m3. It's a bit ugly, but I don't know if there is a more generic way to do this.

Currently, since the only permissible pooling return type is torch.Tensor, I'm just returning the token weights tensor directly. If the user wants to match tokens to the weights, they have to call tokenize and remove the BOS and EOS tokens; the indices of both vectors will then match.
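For illustration, the client-side alignment could look roughly like this (a sketch; match_tokens_to_weights is a hypothetical helper, and it assumes the server tokenizes with the model's standard tokenizer and returns one weight per non-special token, in order):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

def match_tokens_to_weights(text: str, weights: list[float]) -> dict[str, float]:
    # Tokenize the same way the server does, then drop the BOS and EOS
    # tokens so the remaining positions line up with the returned weights.
    token_ids = tokenizer(text)["input_ids"][1:-1]
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    return dict(zip(tokens, weights))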

To request sparse vectors, the user has to pass "additional_data": {"sparse_embeddings": true} in the request. This means that all sequences in that request will be treated as sparse. If the user wants to mix embedding types, separate calls have to be made for each type.

The FlagEmbedding API allows returning more than one type of embedding at the same time, but currently, due to the limitation of the pooling return type, we can only return a single tensor per sequence.

To show that this PoC is already returning the correct results, consider the code below:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3',  use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BGE M3?", "Defination of BM25"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
print(model.convert_id_to_token(output_1['lexical_weights']))

This code prints

[{'What': 0.08344, 'is': 0.08136, 'B': 0.1295, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04086}, {'De': 0.05023, 'fin': 0.1368, 'ation': 0.0452, 'of': 0.0635, 'BM': 0.2515, '25': 0.3337}]

With vLLM we get the following:

$ curl -s http://localhost:8000/v1/embeddings    -H "Content-Type: application/json"    -d '{
     "model": "BAAI/bge-m3",
     "input": ["What is BGE M3?", "Defination of BM25"],
     "additional_data": {"sparse_embeddings": true}
}' | jq
{
  "id": "embd-38ce076880b94d41b206ae99caae7b19",
  "object": "list",
  "created": 1741555561,
  "model": "BAAI/bge-m3",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.0836181640625,
        0.08148193359375,
        0.1295166015625,
        0.251708984375,
        0.1700439453125,
        0.269775390625,
        0.040924072265625
      ]
    },
    {
      "index": 1,
      "object": "embedding",
      "embedding": [
        0.050201416015625,
        0.136962890625,
        0.04510498046875,
        0.0633544921875,
        0.25146484375,
        0.333740234375
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 17,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}
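The same request can also be issued from Python; this is just a sketch using the requests library against the PoC endpoint described above:

import requests

# "additional_data" is the PR-specific extension described above for
# requesting sparse embeddings.
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "BAAI/bge-m3",
        "input": ["What is BGE M3?", "Defination of BM25"],
        "additional_data": {"sparse_embeddings": True},
    },
)
for item in response.json()["data"]:
    print(item["index"], item["embedding"])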

@github-actions

github-actions bot commented Mar 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@DarkLight1337
Member

DarkLight1337 commented Mar 10, 2025

To support sparse+dense together, we need to actually implement #12249. I still don't have time to implement this though.

@maxdebayser
Contributor Author

I've changed the implementation so that the user now has to add --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}' to the command line to activate this mode. But I agree that we need to implement #12249 to properly support this and other models like ibm-granite/granite-embedding-30m-sparse. Let's keep this PR in draft state for now.
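For reference, the full command is the same serve invocation shown later in the thread:

vllm serve BAAI/bge-m3 --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'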

@243006306

This is great; looking forward to the launch of this feature. How long will it take to become available?

@IllyaPysarchuk

+1, waiting for this feature.

@mergify

mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2025
@arjunasuresh300

+1

@Sam120204

any update?

@maxdebayser
Contributor Author

The V1 embedding PR is already approved but is now blocked by other unrelated test failures: #16188. The next step will be to add support for encoder models, as they were left out of the embedding model PR to keep it simpler.

@fufenghua

Is this still not supported?

@mergify mergify bot added the new-model (Requests to new models) label Jul 11, 2025
@DarkLight1337
Member

I think this should be possible now that we support multiple poolers

@maxdebayser
Contributor Author

I think this should be possible now that we support multiple poolers
We can select the embedding type per request, right? But can we have multiple pooling strategies applied to the same request?
Anyway, I'll revive this PR to support one pooling type per request for now.

@DarkLight1337
Member

DarkLight1337 commented Jul 24, 2025

We can support a different task per request in the model runner, but this isn't exposed in the API server yet

Now with the pooling task framework

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@maxdebayser
Contributor Author

@DarkLight1337 , I've updated the PR now that we have V1 embeddings and the new task refactoring. The new request form is:

curl -s http://localhost:8000/pooling    -H "Content-Type: application/json"    -d '{
     "model": "BAAI/bge-m3",
     "task": "embed-sparse",
     "input": ["What is BGE M3?", "Defination of BM25"]
}' | jq
{
  "id": "pool-f3ea25d3e28d4b40b686092badd99f91",
  "object": "list",
  "created": 1755018267,
  "model": "BAAI/bge-m3",
  "data": [
    {
      "index": 0,
      "object": "pooling",
      "data": [
        0.08349609375,
        0.0814208984375,
        0.1295166015625,
        0.251708984375,
        0.1700439453125,
        0.26953125,
        0.04083251953125
      ]
    },
    {
      "index": 1,
      "object": "pooling",
      "data": [
        0.05010986328125,
        0.136962890625,
        0.045013427734375,
        0.06341552734375,
        0.25146484375,
        0.33349609375
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 17,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

As a PoC, I created a new task "embed-sparse", but I'm not 100% happy with it: I don't think it will scale if we have to add many different new tasks. Maybe we should add model-defined sub-tasks that the dispatcher can use to route the requests.

Another point is that the output is not very expressive. To get the tokens, the user would have to call tokenize and match the tokens with the embeddings by position. I think we should make the PoolingResponse more generic to support task-specific outputs. This is related to the discussion in #21621.

Finally, I'm not sure what the best way to test this model is. We could test it against the outputs of the FlagEmbedding library, but that means we would have to add yet another dependency, and I think we already have too many. Maybe we could just test a request against a known output.

@DarkLight1337
Member

I'm not 100% happy with it, I don't think it will scale if we have to add many different new tasks

Agreed. Currently we allow the Pooler to define its own list of supported tasks, but in order for those tasks to work, we also have to update the PoolingParams checking and request dispatching, which could be quite complicated. Having subtasks would allow us to keep using the existing logic for the base task.

@DarkLight1337
Member

Another point is that the output is not very expressive. To get the tokens the user would have to have to call tokenize and match the tokens with the embeddings by position. I think we should make the PoolingResponse more generic to add task-specific outputs.

Yeah, I now see the need for a registry for each task to override how the response is transformed. This would greatly improve the user experience when using the encode method.

@DarkLight1337
Member

Finally, I'm not sure what the best way to test this model is.

We can generate the ground truth locally using FlagEmbedding (set up a helper function so it is easy for us to update the result in case of version changes), and then inside the CI we compare our impl to those generated results.
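A minimal sketch of such a helper (names are hypothetical; it regenerates the reference lexical weights with FlagEmbedding so the stored file can be refreshed when versions change):

import json

def regenerate_ground_truth(path="bge_m3_lexical_weights.json"):
    # Run FlagEmbedding locally to produce the reference lexical weights;
    # CI then compares vLLM's output against the stored file.
    from FlagEmbedding import BGEM3FlagModel

    model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
    sentences = ["What is BGE M3?", "Defination of BM25"]
    output = model.encode(sentences, return_dense=False, return_sparse=True)
    weights = [
        {token: float(w) for token, w in entry.items()}
        for entry in model.convert_id_to_token(output["lexical_weights"])
    ]
    with open(path, "w") as f:
        json.dump({"sentences": sentences, "lexical_weights": weights}, f, indent=2)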

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@maxdebayser maxdebayser marked this pull request as ready for review October 16, 2025 20:09

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 1666 to 1680
class PoolingCompletionRequest(EmbeddingCompletionRequest):
    task: str | None = None

    def to_pooling_params(self):
        return PoolingParams(
            dimensions=self.dimensions, normalize=self.normalize, task=self.task
        )


class PoolingChatRequest(EmbeddingChatRequest):
    task: str | None = None

    def to_pooling_params(self):
        return PoolingParams(
            dimensions=self.dimensions, normalize=self.normalize, task=self.task
        )

P1: Preserve truncate_prompt_tokens in pooling requests

The new PoolingCompletionRequest.to_pooling_params() and PoolingChatRequest.to_pooling_params() no longer pass truncate_prompt_tokens to PoolingParams. Prior to this change, callers could limit the prompt length by setting truncate_prompt_tokens and the value was forwarded in Embedding*Request.to_pooling_params. After the refactor, any truncate_prompt_tokens sent with a pooling request is silently ignored, so long prompts will no longer be truncated even though the API accepts the parameter. This can lead to unexpectedly long contexts or failure when inputs exceed the model’s max length.

Useful? React with 👍 / 👎.
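A minimal sketch of one way to address this, delegating to the parent class so existing parameters such as truncate_prompt_tokens keep flowing through:

class PoolingCompletionRequest(EmbeddingCompletionRequest):
    task: str | None = None

    def to_pooling_params(self):
        # Reuse the parent implementation so truncate_prompt_tokens and
        # the other fields are still forwarded, then attach the new task.
        params = super().to_pooling_params()
        params.task = self.task
        return params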

Comment on lines 181 to 197
try:
    pooling_params = request.to_pooling_params()

    if "token_embed" in self.supported_tasks:
        pooling_task = "token_embed"
    elif "token_classify" in self.supported_tasks:
        pooling_task = "token_classify"
    else:
        return self.create_error_response(
            f"pooling_task must be one of {self.supported_tasks}."
        )
    if pooling_params.task is None:
        if "token_embed" in self.supported_tasks:
            pooling_task = "token_embed"
        elif "token_classify" in self.supported_tasks:
            pooling_task = "token_classify"
        else:
            return self.create_error_response(
                f"pooling_task must be one of {self.supported_tasks}."
            )

        try:
            pooling_params.verify(pooling_task, self.model_config)
        except ValueError as e:
            return self.create_error_response(str(e))
    else:
        if pooling_params.task not in self.supported_tasks:
            raise ValueError(f"Task {pooling_params.task} is not supported")


P1 Badge Validate pooling params even when task provided

When the client now supplies task in the pooling request, the server only checks membership in supported_tasks and skips pooling_params.verify. That verification step previously filled in default values (e.g. normalize embeddings, apply classification activations) and rejected incompatible parameters. With the new branch, normalize/activation stay None and no validation runs, so explicit task requests return un‑normalized embeddings and token classifications without the configured activation (e.g. the ReLU for sparse weights), and invalid parameter combinations are never rejected. PoolingParams.verify(pooling_params.task, …) still needs to run in this path.

Useful? React with 👍 / 👎.
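A sketch of the missing step the bot points out, assuming the verify call keeps its current signature:

else:
    if pooling_params.task not in self.supported_tasks:
        raise ValueError(f"Task {pooling_params.task} is not supported")
    # Run verification in this path too, so defaults (normalization,
    # activations) are applied and invalid combinations are rejected.
    pooling_params.verify(pooling_params.task, self.model_config)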

@maxdebayser
Contributor Author

Now that @noooop has added support for multi-vector retrieval with the token_embed and token_classify tasks, I've refactored this PR in terms of these tasks.

To start the server, the architecture has to be overridden, because otherwise the extra weight file for sparse embeddings (lexical weights) won't be loaded:

vllm serve BAAI/bge-m3 --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'

With this setting, the server supports regular dense embedding, token_embed and token_classify:

# "token_classify" returns the lexical weights
curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
     "model": "BAAI/bge-m3",
     "task": "token_classify",
     "input": ["What is BGE M3?", "Defination of BM25"]
}'
curl -s http://localhost:8000/pooling    -H "Content-Type: application/json"    -d '{
     "model": "BAAI/bge-m3",
     "task": "token_embed",
     "input": ["What is BGE M3?", "Defination of BM25"]
}'

Please note that the token_classify request will return an array of scores, not a dict mapping decoded tokens to their scores. The API currently doesn't support rich formats like that.
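For client code that wants the FlagEmbedding-style token-to-weight mapping back, a sketch like the following should work. It assumes the scores arrive in token order with BOS/EOS already filtered out server-side, as described above:

import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
prompts = ["What is BGE M3?", "Defination of BM25"]

resp = requests.post(
    "http://localhost:8000/pooling",
    json={"model": "BAAI/bge-m3", "task": "token_classify", "input": prompts},
)
for prompt, item in zip(prompts, resp.json()["data"]):
    # Drop BOS/EOS locally so token strings line up with the scores.
    ids = tokenizer(prompt)["input_ids"][1:-1]
    tokens = tokenizer.convert_ids_to_tokens(ids)
    print(dict(zip(tokens, item["data"])))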

The lexical weights can also be retrieved with the offline API:

from vllm import LLM

# Using the same example prompts as above.
prompts = ["What is BGE M3?", "Defination of BM25"]

llm = LLM(
    model="BAAI/bge-m3",
    runner="pooling",
    enforce_eager=True,
    hf_overrides={"architectures": ["BgeM3EmbeddingModel"]},
)

outputs = llm.encode(prompts, pooling_task="token_classify")
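Reading the results might then look like this (a sketch; the outputs.data field name is assumed from the pooling output API):

for prompt, output in zip(prompts, outputs):
    # One score per non-special token; BOS/EOS are filtered server-side.
    print(prompt, output.outputs.data)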

cc: @DarkLight1337

@maxdebayser maxdebayser changed the title First working PoC for bge-m3 sparse embeddings Support bge-m3 sparse embeddings (lexical weights) Oct 16, 2025
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Comment on lines +1666 to +1681
class PoolingCompletionRequest(EmbeddingCompletionRequest):
    task: str | None = None

    def to_pooling_params(self):
        params = super().to_pooling_params()
        params.task = self.task
        return params


class PoolingChatRequest(EmbeddingChatRequest):
    task: str | None = None

    def to_pooling_params(self):
        params = super().to_pooling_params()
        params.task = self.task
        return params
Collaborator


I plan to add the task parameter in #25524 and make it required. Thanks for adding it now.

Comment on lines -184 to 196

if "token_embed" in self.supported_tasks:
pooling_task = "token_embed"
elif "token_classify" in self.supported_tasks:
pooling_task = "token_classify"
if pooling_params.task is None:
if "token_embed" in self.supported_tasks:
pooling_task = "token_embed"
elif "token_classify" in self.supported_tasks:
pooling_task = "token_classify"
else:
pooling_task = pooling_params.task

if pooling_task not in self.supported_tasks:
return self.create_error_response(
f"pooling_task must be one of {self.supported_tasks}."
f"Task {pooling_task} is not supported, it"
f" must be one of {self.supported_tasks}."
)
Collaborator


I plan to make the task parameter required in #25524, which can simplify this logic.

Comment on lines +210 to +226
return DispatchPooler(
    {
        "embed": Pooler.for_embed(pooler_config),
        "token_embed": BOSEOSFilter(
            Pooler.for_token_embed(pooler_config),
            self.bos_token_id,
            self.eos_token_id,
        ),
        "token_classify": BOSEOSFilter(
            Pooler.for_token_classify(
                pooler_config, classifier=self.sparse_linear, act_fn=torch.relu
            ),
            self.bos_token_id,
            self.eos_token_id,
        ),
    }
)
Collaborator

@noooop noooop Oct 17, 2025


Cool !

BGE-M3 Multi-Functionality:

  • embed for dense retrieval
  • token_embed for multi-vector retrieval
  • token_classify for sparse retrieval

Nothing stops us from using a plugin task to output everything at once (after #26973 lands).

This way, BGE-M3 will be the best demonstration of the flexibility of our new pooler API.

@DarkLight1337 You must come and see this


Please add examples to demonstrate how users can use it, as well as tests to guard this feature.

Collaborator


I think the best approach is to use a plugin task to output everything at once. This is more efficient.

This may need to coordinate with #26973

I think a separate PR is still needed to inform everyone that the plugin pooling task has been added, although this PR makes few code changes.

Please feel free to modify anything in #26973, as well as any PR of mine.


Labels

frontend, new-model (Requests to new models)

Projects

None yet

8 participants