Support bge-m3 sparse embeddings (lexical weights) #14526
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
|
To support sparse+dense together, we need to actually implement #12249. I still don't have time to implement this though. |
|
I've changed the implementation so that now the user has to add |
|
This is great; looking forward to the launch of this feature. How long will it take for it to be available?
|
+1, waiting for this feature. |
|
This pull request has merge conflicts that must be resolved before it can be merged.
|
+1 |
|
any update? |
|
The V1 embedding PR is already approved but is now blocked by other unrelated test failures: #16188. The next step will be to add support for encoder models, as they have been left out of the embedding model PR to make it simpler.
|
It's still not supported yet?
|
I think this should be possible now that we support multiple poolers |
|
|
We can support a different task per request in the model runner, but this isn't exposed in the API server yet.
Now with the pooling task framework Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Force-pushed from c73dbcd to 3d443c6
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
|
@DarkLight1337, I've updated the PR now that we have V1 embeddings and the new task refactoring. The new request form is:

As a PoC, I created a new task "embed-sparse", but I'm not 100% happy with it; I don't think it will scale if we have to add many different new tasks. Maybe we should add sub-tasks that are model-defined and that the dispatcher can use to route the requests.

Another point is that the output is not very expressive. To get the tokens, the user would have to call tokenize and match the tokens with the embeddings by position. I think we should make the PoolingResponse more generic to add task-specific outputs. This is related to the discussion in #21621.

Finally, I'm not sure what the best way to test this model is. We could test it against the outputs of the
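For illustration, a rough sketch of the client-side matching described above, assuming the response carries one weight per non-special token and using the Hugging Face tokenizer for BAAI/bge-m3 (the helper name is hypothetical, and whether special tokens must be stripped on the client depends on the server-side filtering):

```python
# Hypothetical helper: align per-token weights with token strings by position.
# Assumes `weights` is the per-token output returned for a single prompt.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

def align_lexical_weights(prompt: str, weights: list[float]) -> list[tuple[str, float]]:
    # Tokenize without special tokens so indices line up with the returned weights.
    token_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    assert len(tokens) == len(weights), "expected one weight per non-special token"
    return list(zip(tokens, weights))
```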
Agreed. Currently we allow the Pooler to define its own list of supported tasks, but in order for those tasks to work, we also have to update the PoolingParams checking and request dispatching, which could be quite complicated. Having subtasks would allow us to keep using the existing logic for the base task.
Yeah, I see now the need for having a registry for each task to override how to transform the response. This would greatly improve the user experience when using the encode method.
We can generate the ground truth locally using FlagEmbedding (set up a helper function so it is easy for us to update the result in case of version changes), and then inside the CI we compare our impl to those generated results. |
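Something along these lines could work as the helper; this is only a sketch assuming the FlagEmbedding package's BGEM3FlagModel API, with illustrative prompts:

```python
# Sketch of a ground-truth generator for the CI comparison, assuming the
# FlagEmbedding package (https://github.com/FlagOpen/FlagEmbedding).
from FlagEmbedding import BGEM3FlagModel

EXAMPLE_PROMPTS = [
    "What is BGE M3?",
    "BM25 is a bag-of-words retrieval function.",
]

def generate_ground_truth(prompts=EXAMPLE_PROMPTS):
    model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)
    out = model.encode(prompts, return_dense=True, return_sparse=True)
    # "dense_vecs" holds the dense embeddings, "lexical_weights" the per-token
    # sparse weights (one dict of token id -> weight per prompt).
    return out["dense_vecs"], out["lexical_weights"]
```

The generated values could be checked into the test as constants and regenerated whenever the FlagEmbedding version changes.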
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/entrypoints/openai/protocol.py
Outdated
```python
class PoolingCompletionRequest(EmbeddingCompletionRequest):
    task: str | None = None

    def to_pooling_params(self):
        return PoolingParams(
            dimensions=self.dimensions, normalize=self.normalize, task=self.task
        )


class PoolingChatRequest(EmbeddingChatRequest):
    task: str | None = None

    def to_pooling_params(self):
        return PoolingParams(
            dimensions=self.dimensions, normalize=self.normalize, task=self.task
        )
```
Preserve truncate_prompt_tokens in pooling requests
The new PoolingCompletionRequest.to_pooling_params() and PoolingChatRequest.to_pooling_params() no longer pass truncate_prompt_tokens to PoolingParams. Prior to this change, callers could limit the prompt length by setting truncate_prompt_tokens and the value was forwarded in Embedding*Request.to_pooling_params. After the refactor, any truncate_prompt_tokens sent with a pooling request is silently ignored, so long prompts will no longer be truncated even though the API accepts the parameter. This can lead to unexpectedly long contexts or failure when inputs exceed the model’s max length.
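A minimal sketch of the suggested fix, assuming PoolingParams keeps accepting truncate_prompt_tokens the way the Embedding*Request methods forwarded it (a later revision below sidesteps this by delegating to super().to_pooling_params()):

```python
class PoolingCompletionRequest(EmbeddingCompletionRequest):
    task: str | None = None

    def to_pooling_params(self):
        # Keep forwarding truncate_prompt_tokens so long prompts are still truncated.
        return PoolingParams(
            dimensions=self.dimensions,
            normalize=self.normalize,
            task=self.task,
            truncate_prompt_tokens=self.truncate_prompt_tokens,
        )
```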
```diff
         try:
             pooling_params = request.to_pooling_params()

-            if "token_embed" in self.supported_tasks:
-                pooling_task = "token_embed"
-            elif "token_classify" in self.supported_tasks:
-                pooling_task = "token_classify"
-            else:
-                return self.create_error_response(
-                    f"pooling_task must be one of {self.supported_tasks}."
-                )
-
-            try:
-                pooling_params.verify(pooling_task, self.model_config)
-            except ValueError as e:
-                return self.create_error_response(str(e))
+            if pooling_params.task is None:
+                if "token_embed" in self.supported_tasks:
+                    pooling_task = "token_embed"
+                elif "token_classify" in self.supported_tasks:
+                    pooling_task = "token_classify"
+                else:
+                    return self.create_error_response(
+                        f"pooling_task must be one of {self.supported_tasks}."
+                    )
+
+                try:
+                    pooling_params.verify(pooling_task, self.model_config)
+                except ValueError as e:
+                    return self.create_error_response(str(e))
+            else:
+                if pooling_params.task not in self.supported_tasks:
+                    raise ValueError(f"Task {pooling_params.task} is not supported")
```
Validate pooling params even when task provided
When the client now supplies task in the pooling request, the server only checks membership in supported_tasks and skips pooling_params.verify. That verification step previously filled in default values (e.g. normalize embeddings, apply classification activations) and rejected incompatible parameters. With the new branch, normalize/activation stay None and no validation runs, so explicit task requests return un‑normalized embeddings and token classifications without the configured activation (e.g. the ReLU for sparse weights), and invalid parameter combinations are never rejected. PoolingParams.verify(pooling_params.task, …) still needs to run in this path.
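A sketch of the control flow the review is asking for, where verify runs regardless of whether the client supplied a task (names follow the diff above; treat the exact structure as an assumption):

```python
if pooling_params.task is None:
    # Fall back to a supported token-level task.
    if "token_embed" in self.supported_tasks:
        pooling_task = "token_embed"
    elif "token_classify" in self.supported_tasks:
        pooling_task = "token_classify"
    else:
        return self.create_error_response(
            f"pooling_task must be one of {self.supported_tasks}."
        )
else:
    pooling_task = pooling_params.task

if pooling_task not in self.supported_tasks:
    return self.create_error_response(
        f"Task {pooling_task} is not supported, it"
        f" must be one of {self.supported_tasks}."
    )

try:
    # Verification fills in defaults (normalize, activation) and rejects
    # incompatible parameters, so it must run in both branches.
    pooling_params.verify(pooling_task, self.model_config)
except ValueError as e:
    return self.create_error_response(str(e))
```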
|
Now that @noooop has added support for multi-vector retrieval with the

To start the server, the architecture has to be overridden, because otherwise the extra weight file won't be loaded for the sparse embeddings (lexical weights). With this setting, the server supports regular dense embedding, token_embed and token_classify:

Please note that the

The lexical weights can also be retrieved with the offline API:

cc: @DarkLight1337
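As a usage illustration, a request for the lexical weights could look like the sketch below; it assumes a server on localhost:8000 serving BAAI/bge-m3 with the task field from this PR, and the exact response layout of the /pooling endpoint may differ:

```python
# Sketch only: server address, model name and response layout are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/pooling",
    json={
        "model": "BAAI/bge-m3",
        "input": "What is BGE M3?",
        # "embed" -> dense embedding, "token_embed" -> multi-vector output,
        # "token_classify" -> sparse lexical weights (ReLU-activated).
        "task": "token_classify",
    },
)
resp.raise_for_status()
lexical_weights = resp.json()["data"][0]["data"]  # one weight per input token
print(lexical_weights)
```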
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
```python
class PoolingCompletionRequest(EmbeddingCompletionRequest):
    task: str | None = None

    def to_pooling_params(self):
        params = super().to_pooling_params()
        params.task = self.task
        return params


class PoolingChatRequest(EmbeddingChatRequest):
    task: str | None = None

    def to_pooling_params(self):
        params = super().to_pooling_params()
        params.task = self.task
        return params
```
I plan to add the task parameter in #25524 and make it required. Thanks for adding it now.
|
```diff
-            if "token_embed" in self.supported_tasks:
-                pooling_task = "token_embed"
-            elif "token_classify" in self.supported_tasks:
-                pooling_task = "token_classify"
+            if pooling_params.task is None:
+                if "token_embed" in self.supported_tasks:
+                    pooling_task = "token_embed"
+                elif "token_classify" in self.supported_tasks:
+                    pooling_task = "token_classify"
+            else:
+                pooling_task = pooling_params.task

             if pooling_task not in self.supported_tasks:
                 return self.create_error_response(
-                    f"pooling_task must be one of {self.supported_tasks}."
+                    f"Task {pooling_task} is not supported, it"
+                    f" must be one of {self.supported_tasks}."
                 )
```
I plan to make the task parameter required in #25524, which can simplify this logic.
```python
        return DispatchPooler(
            {
                "embed": Pooler.for_embed(pooler_config),
                "token_embed": BOSEOSFilter(
                    Pooler.for_token_embed(pooler_config),
                    self.bos_token_id,
                    self.eos_token_id,
                ),
                "token_classify": BOSEOSFilter(
                    Pooler.for_token_classify(
                        pooler_config, classifier=self.sparse_linear, act_fn=torch.relu
                    ),
                    self.bos_token_id,
                    self.eos_token_id,
                ),
            }
        )
```
Cool!
BGE-M3 Multi-Functionality:
- embed for dense retrieval
- token_embed for multi-vector retrieval
- token_classify for sparse retrieval
Nothing stops us from using a plugin task to output everything at once (after #26973 lands).
This way, BGE-M3 will be the best demonstration of the flexibility of our new pooler API.
@DarkLight1337 You must come and see this
Please add examples to demonstrate how users can use it, as well as tests to guard this feature.
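For example, an offline demonstration of all three retrieval modes could look roughly like this; the pooling_task argument to LLM.encode and the output attribute names are assumptions based on the pooler framework discussed above, not a confirmed API:

```python
# Sketch only: pooling_task values and output attributes are assumptions.
from vllm import LLM

llm = LLM(model="BAAI/bge-m3")
prompts = ["What is BGE M3?"]

dense = llm.encode(prompts, pooling_task="embed")            # dense retrieval
multi = llm.encode(prompts, pooling_task="token_embed")      # multi-vector retrieval
sparse = llm.encode(prompts, pooling_task="token_classify")  # sparse retrieval

print(dense[0].outputs.data.shape)   # single vector per prompt
print(multi[0].outputs.data.shape)   # one vector per token
print(sparse[0].outputs.data)        # per-token lexical weights
```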
I think the best approach is to use a plugin task to output everything all at once. This is more efficient.
This may need to coordinate with #26973
I think a separate PR is still needed to inform everyone that the plugin pooling task has been added, although this PR makes few code changes
Please feel free to modify anything in #26973, as well as any PR of mine.
FIX #13609
FIX #15384
FIX #18469
Here I'm loading the extra sparse_linear.pt file using the secondary_weights loading introduced in the ultravox model when I detect that the model name is BAAI/bge-m3. It's a bit ugly, but I don't know if there is a more generic way to do this.

Currently, since the only permissible pooling return type is torch.Tensor, I'm just returning the token weights tensor directly. If the user wants to match tokens to the weights, they have to call tokenize, remove the BOS and EOS tokens, and then the indices of both vectors should match.

To request sparse vectors, the user has to pass "additional_data": {"sparse_embeddings": true} in the request. This means that all sequences in that request will be treated as sparse. If the user wants to mix, separate calls have to be made for each type of embedding.

The FlagEmbedding API allows returning more than one type of embedding at the same time, but currently, due to the limitation of the pooling return type, we can only return a single tensor per sequence.
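For reference, a request in this earlier PoC form could look roughly like the sketch below; only the additional_data flag comes from the description, while the endpoint and other fields are assumptions, and the later revisions in the conversation above replace this with the task field:

```python
# Sketch of the earlier PoC request form; superseded by the "task" field above.
import requests

resp = requests.post(
    "http://localhost:8000/pooling",
    json={
        "model": "BAAI/bge-m3",
        "input": ["What is BGE M3?"],
        # All sequences in this request are returned as sparse lexical weights.
        "additional_data": {"sparse_embeddings": True},
    },
)
print(resp.json())
```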
To show that this PoC is already returning the correct results, consider the code below:
This code prints
With vLLM we get the following: