
Conversation

@noooop (Collaborator) commented Sep 22, 2025

Breaking change

  • Split the encode task into two tasks: token_embed and token_classify (see the sketch below)
    • token_embed is the same as embed, using normalize as the activation
    • token_classify is the same as classify, using softmax as the default activation (classify and token_classify can actually use any activation function by setting act_fn)
  • Completely remove softmax from PoolingParams in favor of activation, since classify and token_classify can actually use any activation function
  • pooling_task is now required for llm.encode and /pooling <- this will take full effect in the next PR ([Frontend][Doc][Last/N] Improve all pooling task | Polish encode (pooling) api & Document. #25524)
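
A minimal sketch of the split, assuming a model that supports token_embed (the model name is borrowed from the example further down; token_classify works analogously on a token-classification model):

from vllm import LLM

llm = LLM(model="jinaai/jina-embeddings-v4-vllm-text-matching",
          runner="pooling")

# Per-token embeddings, normalized like `embed`.
outputs = llm.encode(["The capital of France is"],
                     pooling_task="token_embed")

# Per-token classification (softmax by default, like `classify`) would use
# pooling_task="token_classify" instead.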

Improve all pooling task

These PRs mostly conflict with each other, so combining them into a series better informs reviewers about what has happened and what still needs to be done afterwards.

Purpose

After #21227 landed, we hope that pooling models can always use ALL pooling, without users having to enable it manually.

The current encode API (/pooling API) mainly targets the classify-each-token scenario (e.g. TokenClassification #24872 & reward models) and overlooks the embed-each-token scenario.

Let's support the embed-each-token scenario (multi-vector retrieval).

Partially fixes #25165

We are stepping closer to supporting ColBERT & ColPali.
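
For context, a minimal sketch of ColBERT-style MaxSim scoring over per-token embeddings, which is what multi-vector retrieval ultimately computes (illustrative only, not part of this PR):

import torch

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> float:
    # query_vecs: (num_query_tokens, dim); doc_vecs: (num_doc_tokens, dim);
    # both are assumed L2-normalized, so dot products are cosine similarities.
    sim = query_vecs @ doc_vecs.T
    # For each query token, take its best-matching document token, then sum.
    return sim.max(dim=1).values.sum().item()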

cc @DarkLight1337 @maxdebayser

Test Plan

tests/models/language/pooling/test_multi_vector_retrieval.py
tests/test_pooling_params.py

Test Result

pass

Known Issues

  • Maybe we should find a way to support chunked prefill + ALL pooling (and MEAN pooling)
  • Support ColBERT & ColPali


@mergify mergify bot added documentation Improvements or additions to documentation qwen Related to Qwen models labels Sep 22, 2025
@noooop (Collaborator, Author) commented Sep 22, 2025

@jupyterjazz

Try this:

from vllm import LLM

llm = LLM(
    model="jinaai/jina-embeddings-v4-vllm-text-matching",
    enforce_eager=True,
    max_model_len=1024,
    enable_chunked_prefill=False,  # <- In order to use the encode api
    runner="pooling")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.embed(prompts)

for prompt, output in zip(prompts, outputs):
    embeds = output.outputs.embedding
    print(len(embeds))

outputs = llm.encode(prompts, pooling_task="encode")

for prompt, output in zip(prompts, outputs):
    multi_vector = output.outputs.data
    print(multi_vector.shape)
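
(For reference: embed should print one fixed-size vector length per prompt, while the encode output should have shape (num_tokens, hidden_size), i.e. one vector per token; the exact sizes depend on the model.)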

Are you OK with this API and these outputs?


There are still some broken features that need to be fixed, but the multi-vector feature is now testable.

@noooop (Collaborator, Author) commented Sep 23, 2025

@DarkLight1337

Ready for review

(Slight) breaking change

  • Split the encode task into two tasks: token_embed and token_classify
    • token_embed is the same as embed, using normalize as the activation
    • token_classify is the same as classify, using softmax as the default activation (classify and token_classify can actually use any activation function by setting act_fn)
    • Use the following code for compatibility (a usage sketch follows this list):
def encode2pooling_task(supported_tasks):
    # Currently no model supports both token_embed and token_classify.
    if "token_embed" in supported_tasks:
        return "token_embed"
    elif "token_classify" in supported_tasks:
        return "token_classify"
    else:
        raise ValueError(f"pooling_task must be one of {supported_tasks}.")
  • Completely remove softmax from PoolingParams in favor of activation, since classify and token_classify can actually use any activation function
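
A minimal usage sketch of the shim above, assuming llm and prompts from the earlier example and an illustrative supported_tasks value:

supported_tasks = ("token_embed",)  # illustrative; use the model's actual supported tasks
task = encode2pooling_task(supported_tasks)
outputs = llm.encode(prompts, pooling_task=task)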

@jupyterjazz commented Sep 23, 2025

Hi @noooop,

I just tested it and it works fine. The only thing missing was override_pooler_config=PoolerConfig(normalize=False). I don't have a strong opinion on this, but setting normalization to False by default during encode could make sense, because you often want the actual last hidden layer at this step. Other than that, the API looks good to me. Thank you for such a quick fix!
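
A minimal sketch of the workaround described above, assuming PoolerConfig is importable from the top-level vllm package:

from vllm import LLM, PoolerConfig

llm = LLM(
    model="jinaai/jina-embeddings-v4-vllm-text-matching",
    enforce_eager=True,
    max_model_len=1024,
    enable_chunked_prefill=False,
    runner="pooling",
    # Skip L2 normalization so encode returns the raw last hidden states.
    override_pooler_config=PoolerConfig(normalize=False),
)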

@DarkLight1337 (Member) left a review comment:

Please update this page https://docs.vllm.ai/en/latest/models/pooling_models.html#model-conversion to no longer use the encode task

@DarkLight1337 (Member) commented:

I'll delay merging this PR until after the release so we don't have to worry about backward-compatibility issues, which would further complicate future PRs

@maxdebayser (Contributor) left a review comment:

Overall the changes in this PR look good to me, but I would prefer to keep the generic "encode" task around for use cases that don't cleanly fit the old and new tasks introduced in this PR. As an example, the use case introduced by #22820 can't really be described as "token_embed", "token_classify", "embed", "classify" or "score".

@mergify bot commented Sep 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @noooop.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 26, 2025
@noooop (Collaborator, Author) commented Sep 27, 2025

Overall the changes in this PR look good to me, but I would prefer to keep the generic "encode" task around for use cases that don't cleanly fit the old and new tasks introduced in this PR. As an example, the use case introduced by #22820 can't really be described as "token_embed", "token_classify", "embed", "classify" or "score".

Now only mypy checks whether pooling_task belongs to PoolingTask, so the pooling_task in the encode API can accept any str at runtime.

In fact, we basically already support custom pooling tasks and poolers: a user just has to implement an out-of-tree (OOT) model with a pooler and use the corresponding pooling_task in encode. LOL
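
For reference, out-of-tree registration looks roughly like this (the architecture name and import path are hypothetical):

from vllm import ModelRegistry

# "MyPoolerModel" must match the architecture name in the model's HF config;
# the "module:Class" string lazily points at your own implementation.
ModelRegistry.register_model("MyPoolerModel", "my_package.models:MyPoolerModel")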


Do we need to move towards allowing users to use pooling task plugins?

@DarkLight1337 (Member) commented:

cc @christian-pinto

@noooop (Collaborator, Author) commented Oct 15, 2025

Can I temporarily skip this test?
(I think there is a dependency conflict between terratorch and vllm.)

No, we cannot skip it

It seems this PR can pass the test, but I don't know how that happened.

@DarkLight1337 (Member) commented:
Maybe you can try to pip install the specific version of terratorch

@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Oct 15, 2025
@noooop (Collaborator, Author) commented Oct 15, 2025

Maybe you can try to pip install the specific version of terratorch

terratorch @ git+https://github.com/IBM/terratorch.git@1.1.rc3 # required for PrithviMAE test

pip install git+https://github.com/IBM/terratorch.git@1.1.rc3

It seems I can only rely on CI.

@christian-pinto (Contributor) commented:

Probably this is happening because on CI we use the compiled list of dependencies rather than manually pip-installing the various packages?

Have you tried installing all the dependencies from here: https://github.com/vllm-project/vllm/blob/main/requirements/test.in
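
For example (requirements/test.txt is the compiled counterpart that CI installs):

pip install -r requirements/test.txt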

We are in the process of releasing a new version of TerraTorch, and then we will be able to pip install from PyPI rather than from a specific git tag. I will post a PR to fix that in vLLM soon.

@noooop (Collaborator, Author) commented Oct 15, 2025

By the way, this PR splits the encode task into two tasks: token_embed and token_classify.

  • token_embed is the same as embed, using normalize as the activation.
  • token_classify is the same as classify, using softmax as the default activation (classify and token_classify can actually use any activation function by setting act_fn).

prithvi_mae and test_io_processor_plugins don't fit any of "token_embed", "token_classify", "embed", "classify" or "score".

We no longer have a catch-all encode task that absorbs everything else.

@christian-pinto What do you think about a plugin pooling task?

@noooop (Collaborator, Author) commented Oct 15, 2025

@DarkLight1337

Are there any more modifications needed for this PR?

@DarkLight1337 (Member) commented:

As long as the test passes, I'm fine with it

@noooop noooop enabled auto-merge (squash) October 15, 2025 09:41
@christian-pinto (Contributor) commented:

When it comes to the IO processor plugins, the type of pooling activation function depends on the combination of model and plugin. As an example, for PrithviMAE we instantiate an All pooler and disable softmax, because the plugin expects to receive the raw output from the pooler. Other models in the future might behave differently and require a different pooling strategy and activation function, which is fine and should still work.
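
A minimal sketch of requesting raw (un-activated) outputs via the activation field this PR adds to PoolingParams; the model name is a placeholder:

from vllm import LLM, PoolingParams

llm = LLM(model="my-org/my-token-classifier",  # hypothetical pooling model
          runner="pooling")
outputs = llm.encode(["some input"],
                     pooling_task="token_classify",
                     pooling_params=PoolingParams(activation=False))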

However, I agree with @maxdebayser about keeping the encode task to support more generic cases like this one, without having to resort to the remaining tasks, which can be confusing for people.

@noooop (Collaborator, Author) commented Oct 15, 2025

As long as the test passes, I'm fine with it

(You forgot to approve it.)

@noooop noooop merged commit f54f851 into vllm-project:main Oct 15, 2025
60 checks passed
@noooop noooop deleted the multi_vector_retrieval branch October 15, 2025 11:20
Commits referencing this pull request were later pushed to bbartels/vllm, mandy-li/vllm, albertoperdomo2/vllm, lywa1998/vllm, alhridoy/vllm, xuebwang-amd/vllm, and 0xrushi/vllm (Oct 16 to Oct 26, 2025).

Labels

documentation, frontend, multi-modality (#4194), qwen, ready, v1
