Merged
30 commits
b6b2e12
token_embed & token_classify
noooop Sep 23, 2025
fb5fdfa
Merge branch 'main' into update_pooling_docs
noooop Oct 28, 2025
dd06fe1
/pooling endpoint support all pooling tasks
noooop Oct 28, 2025
0643461
update
noooop Oct 28, 2025
ce69d7b
update examples
noooop Oct 28, 2025
f9d85cf
update examples
noooop Oct 28, 2025
1ea309d
Update vllm/entrypoints/openai/api_server.py
noooop Oct 28, 2025
cdabfc0
fix
noooop Oct 28, 2025
986de1a
Deprecated Feature
noooop Oct 28, 2025
06b1915
Update docs/models/pooling_models.md
noooop Oct 28, 2025
267d037
Update docs/models/pooling_models.md
noooop Oct 28, 2025
3b13620
Update examples/offline_inference/pooling/README.md
noooop Oct 28, 2025
bb3a6f8
Update examples/offline_inference/pooling/README.md
noooop Oct 28, 2025
351d526
Update examples/online_serving/pooling/README.md
noooop Oct 28, 2025
12db9e3
Update examples/online_serving/pooling/README.md
noooop Oct 28, 2025
4938636
Pooling Tasks
noooop Oct 28, 2025
a7ba610
+ runner="pooling"
noooop Oct 28, 2025
4188194
Openai -> OpenAI
noooop Oct 28, 2025
86ce4c4
activation -> use_activation
noooop Oct 28, 2025
44c7d8a
fix
noooop Oct 28, 2025
d46428a
fix
noooop Oct 28, 2025
90df794
activation -> use_activation
noooop Oct 28, 2025
90746ca
fix
noooop Oct 28, 2025
2cf3132
fix
noooop Oct 28, 2025
794669d
fix
noooop Oct 28, 2025
f43249e
Merge branch 'main' into update_pooling_docs
noooop Oct 28, 2025
37137cf
Merge branch 'main' into update_pooling_docs
noooop Oct 29, 2025
0124f4f
Merge branch 'main' into update_pooling_docs
noooop Oct 29, 2025
4c2a98e
add deprecated waring
noooop Oct 30, 2025
95e014b
Merge branch 'main' into update_pooling_docs
noooop Oct 30, 2025
2 changes: 1 addition & 1 deletion docs/design/io_processor_plugins.md
@@ -79,7 +79,7 @@ The `post_process*` methods take `PoolingRequestOutput` objects as input and gen
The `validate_or_generate_params` method is used for validating with the plugin any `SamplingParameters`/`PoolingParameters` received with the user request, or to generate new ones if none are specified. The function always returns the validated/generated parameters.
The `output_to_response` method is used only for online serving and converts the plugin output to the `IOProcessorResponse` type that is then returned by the API Server. The implementation of the `/pooling` serving endpoint is available here [vllm/entrypoints/openai/serving_pooling.py](../../vllm/entrypoints/openai/serving_pooling.py).

An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/IBM/terratorch/tree/main/terratorch/vllm/plugins/segmentation). Please, also refer to our online ([examples/online_serving/prithvi_geospatial_mae.py](../../examples/online_serving/prithvi_geospatial_mae.py)) and offline ([examples/offline_inference/prithvi_geospatial_mae_io_processor.py](../../examples/offline_inference/prithvi_geospatial_mae_io_processor.py)) inference examples.
An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/IBM/terratorch/tree/main/terratorch/vllm/plugins/segmentation). Please, also refer to our online ([examples/online_serving/pooling/prithvi_geospatial_mae.py](../../examples/online_serving/pooling/prithvi_geospatial_mae.py)) and offline ([examples/offline_inference/pooling/prithvi_geospatial_mae_io_processor.py](../../examples/offline_inference/pooling/prithvi_geospatial_mae_io_processor.py)) inference examples.

## Using an IO Processor plugin

83 changes: 68 additions & 15 deletions docs/models/pooling_models.md
@@ -30,11 +30,11 @@ If `--runner pooling` has been set (manually or automatically) but the model doe
vLLM will attempt to automatically convert the model according to the architecture names
shown in the table below.

| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|-------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `encode`, `embed` |
| `*For*Classification`, `*ClassificationModel` | `classify` | `encode`, `classify`, `score` |
| `*ForRewardModeling`, `*RewardModel` | `reward` | `encode` |
| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `token_embed`, `embed` |
| `*For*Classification`, `*ClassificationModel` | `classify` | `token_classify`, `classify`, `score` |
| `*ForRewardModeling`, `*RewardModel` | `reward` | `token_classify` |
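
To make the table concrete, here is a minimal sketch of how an architecture name could be matched against these wildcard patterns. It uses Python's standard `fnmatch` module. The helper name `guess_convert_type`, the rule ordering (the generic `*Model` pattern is checked last), and the `SUPPORTED_TASKS` lookup are assumptions made for illustration; this is not vLLM's actual implementation.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the table above. The order is an assumption:
# "*Model" would also match "*ClassificationModel", so it is checked last.
CONVERT_RULES = [
    ("*ForTextEncoding", "embed"),
    ("*EmbeddingModel", "embed"),
    ("*For*Classification", "classify"),
    ("*ClassificationModel", "classify"),
    ("*ForRewardModeling", "reward"),
    ("*RewardModel", "reward"),
    ("*Model", "embed"),
]

# Supported pooling tasks per convert type, as listed in the updated table.
SUPPORTED_TASKS = {
    "embed": ["token_embed", "embed"],
    "classify": ["token_classify", "classify", "score"],
    "reward": ["token_classify"],
}


def guess_convert_type(architecture: str) -> Optional[str]:
    """Return the convert type for an architecture name, or None if no rule matches."""
    for pattern, convert in CONVERT_RULES:
        if fnmatchcase(architecture, pattern):
            return convert
    return None


convert = guess_convert_type("Qwen2ForSequenceClassification")
print(convert, SUPPORTED_TASKS[convert])
```

Setting `--convert <type>` explicitly, as the tip below describes, sidesteps any guessing of this kind.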

!!! tip
You can explicitly set `--convert <type>` to specify how to convert the model.
@@ -45,12 +45,14 @@ Each pooling model in vLLM supports one or more of these tasks according to
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
enabling the corresponding APIs:

| Task | APIs |
|------------|--------------------------------------|
| `encode` | `LLM.reward(...)` |
| `embed` | `LLM.embed(...)`, `LLM.score(...)`\* |
| `classify` | `LLM.classify(...)` |
| `score` | `LLM.score(...)` |
| Task | APIs |
|------------------|-------------------------------------------------------------------------------|
| `embed` | `LLM.embed(...)`, `LLM.score(...)`\*, `LLM.encode(..., pooling_task="embed")` |
| `classify` | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")` |
| `score` | `LLM.score(...)` |
| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")` |
| `token_embed` | `LLM.encode(..., pooling_task="token_embed")` |
| `plugin` | `LLM.encode(..., pooling_task="plugin")` |

\* The `LLM.score(...)` API falls back to the `embed` task if the model does not support the `score` task.
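
The fallback in the footnote can be sketched as a small task-selection function. The name `resolve_score_task` and the error raised when neither task is available are hypothetical; the sketch only illustrates the order of preference described above, not how vLLM implements it.

```python
def resolve_score_task(supported_tasks):
    """Pick the pooling task a score request would run under (hypothetical helper)."""
    if "score" in supported_tasks:
        return "score"
    if "embed" in supported_tasks:
        # Fallback described in the footnote: use embeddings when the
        # model has no dedicated score task.
        return "embed"
    raise ValueError("model supports neither 'score' nor 'embed'")


print(resolve_score_task(["token_classify", "classify", "score"]))
print(resolve_score_task(["token_embed", "embed"]))
```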

@@ -144,7 +146,6 @@ A code example can be found here: [examples/offline_inference/basic/score.py](..
### `LLM.reward`

The [reward][vllm.LLM.reward] method is available to all reward models in vLLM.
It returns the extracted hidden states directly.

```python
from vllm import LLM
@@ -161,15 +162,17 @@ A code example can be found here: [examples/offline_inference/basic/reward.py](.
### `LLM.encode`

The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
It returns the extracted hidden states directly.

!!! note
Please use one of the more specific methods or set the task directly when using `LLM.encode`:

- For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`.
- For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`.
- For rewards, use `LLM.reward(...)` or `pooling_task="reward"`.
- For similarity scores, use `LLM.score(...)`.
- For rewards, use `LLM.reward(...)` or `pooling_task="token_classify"`.
- For token classification, use `pooling_task="token_classify"`.
- For multi-vector retrieval, use `pooling_task="token_embed"`.
- For IO Processor Plugins, use `pooling_task="plugin"`.

```python
from vllm import LLM
@@ -185,10 +188,47 @@ print(f"Data: {data!r}")

Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:

- [Pooling API](../serving/openai_compatible_server.md#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API](../serving/openai_compatible_server.md#classification-api) is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API](../serving/openai_compatible_server.md#score-api) is similar to `LLM.score` for cross-encoder models.
- [Pooling API](../serving/openai_compatible_server.md#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.

!!! note
Please use one of the more specific endpoints or set the task directly when using the [Pooling API](../serving/openai_compatible_server.md#pooling-api):

- For embeddings, use the [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) or `"task":"embed"`.
- For classification logits, use the [Classification API](../serving/openai_compatible_server.md#classification-api) or `"task":"classify"`.
- For similarity scores, use the [Score API](../serving/openai_compatible_server.md#score-api).
- For rewards, use `"task":"token_classify"`.
- For token classification, use `"task":"token_classify"`.
- For multi-vector retrieval, use `"task":"token_embed"`.
- For IO Processor Plugins, use `"task":"plugin"`.
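
As a rough illustration of the list above, the following sketch builds a `/pooling` request body and rejects task names that the note routes elsewhere. The helper `build_pooling_request` and the `POOLING_TASKS` tuple are hypothetical; the `model`, `input`, and `task` keys mirror the request used in the embedding example in this section. The `plugin` task is left out on the assumption that plugin requests may need a different body.

```python
# Task names taken from the note above. "score" is omitted because the note
# points similarity scoring at the Score API; "plugin" is omitted because its
# request body may differ (assumption).
POOLING_TASKS = ("embed", "classify", "token_embed", "token_classify")


def build_pooling_request(model, inputs, task):
    """Build a JSON-serializable body for POST /pooling (hypothetical helper)."""
    if task not in POOLING_TASKS:
        raise ValueError(f"unsupported pooling task for this sketch: {task!r}")
    return {"model": model, "input": inputs, "task": task}


payload = build_pooling_request("intfloat/e5-small", ["Hello, my name is"], "embed")
print(payload)
```

The resulting dictionary could be passed as `json=payload` to `requests.post`, as in the example that follows.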

```python
# start a supported embeddings model server with `vllm serve`, e.g.
# vllm serve intfloat/e5-small
import requests

host = "localhost"
port = "8000"
model_name = "intfloat/e5-small"

api_url = f"http://{host}:{port}/pooling"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
prompt = {"model": model_name, "input": prompts, "task": "embed"}

response = requests.post(api_url, json=prompt)

for output in response.json()["data"]:
    data = output["data"]
    print(f"Data: {data!r} (size={len(data)})")
```

## Matryoshka Embeddings

@@ -265,3 +305,16 @@ Expected output:
```

An OpenAI client example can be found here: [examples/online_serving/pooling/openai_embedding_matryoshka_fy.py](../../examples/online_serving/pooling/openai_embedding_matryoshka_fy.py)

## Deprecated Features

### Encode task

We have split the `encode` task into two more specific token-wise tasks, `token_embed` and `token_classify`:

- `token_embed` is the token-wise counterpart of `embed` and uses normalization as its activation.
- `token_classify` is the token-wise counterpart of `classify` and uses softmax as its activation by default.
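
The two activations named above can be sketched in plain Python. This is only an illustration of L2 normalization and softmax applied per token; the function names and the toy inputs are assumptions, and vLLM's own pooler code is not shown here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (the activation assumed for token_embed)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(logits):
    """Turn logits into probabilities (the default assumed for token_classify)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


# One row per token: hidden states for token_embed, logits for token_classify.
token_hidden_states = [[3.0, 4.0], [0.0, 2.0]]
token_logits = [[1.0, 1.0], [2.0, 0.0]]

token_embeddings = [l2_normalize(v) for v in token_hidden_states]
token_probs = [softmax(row) for row in token_logits]
print(token_embeddings)
print(token_probs)
```

In both cases the output keeps one row per token, which is what distinguishes these tasks from the sequence-level `embed` and `classify`.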

### Remove softmax from PoolingParams

We are going to remove `softmax` and `activation` from `PoolingParams`. Instead, you should set `use_activation`, since we actually allow `classify` and `token_classify` to use any activation function.
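
For callers that still pass the deprecated names, a migration shim along the following lines may help. The helper `migrate_pooling_kwargs` is hypothetical and works on a plain dictionary of keyword arguments; it assumes the old boolean value carries over unchanged to `use_activation`, which matches how the updated tests in this PR swap `activation=...` for `use_activation=...` but is not confirmed for `softmax`.

```python
import warnings


def migrate_pooling_kwargs(kwargs):
    """Return a copy of kwargs with deprecated keys mapped to use_activation.

    Hypothetical helper: an explicit use_activation value always wins.
    """
    migrated = dict(kwargs)
    for old_name in ("softmax", "activation"):
        if old_name in migrated:
            value = migrated.pop(old_name)
            warnings.warn(
                f"`{old_name}` is deprecated; set `use_activation` instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            migrated.setdefault("use_activation", value)
    return migrated


print(migrate_pooling_kwargs({"activation": False}))
```

The returned dictionary could then be unpacked into `PoolingParams(**kwargs)`.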
4 changes: 2 additions & 2 deletions docs/serving/openai_compatible_server.md
@@ -638,7 +638,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent

You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

Code example: [examples/online_serving/openai_cross_encoder_score.py](../../examples/online_serving/openai_cross_encoder_score.py)
Code example: [examples/online_serving/pooling/openai_cross_encoder_score.py](../../examples/online_serving/pooling/openai_cross_encoder_score.py)

#### Single inference

@@ -819,7 +819,7 @@ You can pass multi-modal inputs to scoring models by passing `content` including
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])
```
Full example: [examples/online_serving/openai_cross_encoder_score_for_multimodal.py](../../examples/online_serving/openai_cross_encoder_score_for_multimodal.py)
Full example: [examples/online_serving/pooling/openai_cross_encoder_score_for_multimodal.py](../../examples/online_serving/pooling/openai_cross_encoder_score_for_multimodal.py)

#### Extra parameters

12 changes: 12 additions & 0 deletions examples/offline_inference/pooling/README.md
@@ -38,6 +38,18 @@ python examples/offline_inference/pooling/multi_vector_retrieval.py
python examples/offline_inference/pooling/ner.py
```

## Prithvi Geospatial MAE usage

```bash
python examples/offline_inference/pooling/prithvi_geospatial_mae.py
```

## IO Processor Plugins for Prithvi Geospatial MAE

```bash
python examples/offline_inference/pooling/prithvi_geospatial_mae_io_processor.py
```

## Qwen3 reranker usage

```bash
2 changes: 1 addition & 1 deletion examples/offline_inference/pooling/ner.py
@@ -33,7 +33,7 @@ def main(args: Namespace):
     label_map = llm.llm_engine.vllm_config.model_config.hf_config.id2label

     # Run inference
-    outputs = llm.encode(prompts)
+    outputs = llm.encode(prompts, pooling_task="token_classify")

     for prompt, output in zip(prompts, outputs):
         logits = output.outputs.data
40 changes: 35 additions & 5 deletions examples/online_serving/pooling/README.md
@@ -3,65 +3,95 @@
## Cohere rerank usage

```bash
# vllm serve BAAI/bge-reranker-base
python examples/online_serving/pooling/cohere_rerank_client.py
```

## Embedding requests base64 encoding_format usage

```bash
# vllm serve intfloat/e5-small
python examples/online_serving/pooling/embedding_requests_base64_client.py
```

## Embedding requests bytes encoding_format usage

```bash
# vllm serve intfloat/e5-small
python examples/online_serving/pooling/embedding_requests_bytes_client.py
```

## Jinaai rerank usage

```bash
# vllm serve BAAI/bge-reranker-base
python examples/online_serving/pooling/jinaai_rerank_client.py
```

## Multi vector retrieval usage

```bash
# vllm serve BAAI/bge-m3
python examples/online_serving/pooling/multi_vector_retrieval_client.py
```

## Named Entity Recognition (NER) usage

```bash
# vllm serve boltuix/NeuroBERT-NER
python examples/online_serving/pooling/ner_client.py
```

## Openai chat embedding for multimodal usage
## OpenAI chat embedding for multimodal usage

```bash
python examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py
```

## Openai classification usage
## OpenAI classification usage

```bash
# vllm serve jason9693/Qwen2.5-1.5B-apeach
python examples/online_serving/pooling/openai_classification_client.py
```

## Openai embedding usage
## OpenAI cross_encoder score usage

```bash
# vllm serve BAAI/bge-reranker-v2-m3
python examples/online_serving/pooling/openai_cross_encoder_score.py
```

## OpenAI cross_encoder score for multimodal usage

```bash
# vllm serve jinaai/jina-reranker-m0
python examples/online_serving/pooling/openai_cross_encoder_score_for_multimodal.py
```

## OpenAI embedding usage

```bash
# vllm serve intfloat/e5-small
python examples/online_serving/pooling/openai_embedding_client.py
```

## Openai embedding matryoshka dimensions usage
## OpenAI embedding matryoshka dimensions usage

```bash
# vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
python examples/online_serving/pooling/openai_embedding_matryoshka_fy.py
```

## Openai pooling usage
## OpenAI pooling usage

```bash
# vllm serve internlm/internlm2-1_8b-reward --trust-remote-code
python examples/online_serving/pooling/openai_pooling_client.py
```

## Online Prithvi Geospatial MAE usage

```bash
python examples/online_serving/pooling/prithvi_geospatial_mae.py
```
12 changes: 7 additions & 5 deletions tests/entrypoints/pooling/llm/test_classify.py
@@ -37,15 +37,17 @@ def llm():

 @pytest.mark.skip_global_cleanup
 def test_pooling_params(llm: LLM):
-    def get_outputs(activation):
+    def get_outputs(use_activation):
         outputs = llm.classify(
-            prompts, pooling_params=PoolingParams(activation=activation), use_tqdm=False
+            prompts,
+            pooling_params=PoolingParams(use_activation=use_activation),
+            use_tqdm=False,
         )
         return torch.tensor([x.outputs.probs for x in outputs])

-    default = get_outputs(activation=None)
-    w_activation = get_outputs(activation=True)
-    wo_activation = get_outputs(activation=False)
+    default = get_outputs(use_activation=None)
+    w_activation = get_outputs(use_activation=True)
+    wo_activation = get_outputs(use_activation=False)

     assert torch.allclose(default, w_activation, atol=1e-2), (
         "Default should use activation."
12 changes: 7 additions & 5 deletions tests/entrypoints/pooling/llm/test_reward.py
@@ -37,15 +37,17 @@ def llm():


 def test_pooling_params(llm: LLM):
-    def get_outputs(activation):
+    def get_outputs(use_activation):
         outputs = llm.reward(
-            prompts, pooling_params=PoolingParams(activation=activation), use_tqdm=False
+            prompts,
+            pooling_params=PoolingParams(use_activation=use_activation),
+            use_tqdm=False,
         )
         return torch.cat([x.outputs.data for x in outputs])

-    default = get_outputs(activation=None)
-    w_activation = get_outputs(activation=True)
-    wo_activation = get_outputs(activation=False)
+    default = get_outputs(use_activation=None)
+    w_activation = get_outputs(use_activation=True)
+    wo_activation = get_outputs(use_activation=False)

     assert torch.allclose(default, w_activation, atol=1e-2), (
         "Default should use activation."
10 changes: 5 additions & 5 deletions tests/entrypoints/pooling/llm/test_score.py
@@ -34,21 +34,21 @@ def llm():


 def test_pooling_params(llm: LLM):
-    def get_outputs(activation):
+    def get_outputs(use_activation):
         text_1 = "What is the capital of France?"
         text_2 = "The capital of France is Paris."

         outputs = llm.score(
             text_1,
             text_2,
-            pooling_params=PoolingParams(activation=activation),
+            pooling_params=PoolingParams(use_activation=use_activation),
             use_tqdm=False,
         )
         return torch.tensor([x.outputs.score for x in outputs])

-    default = get_outputs(activation=None)
-    w_activation = get_outputs(activation=True)
-    wo_activation = get_outputs(activation=False)
+    default = get_outputs(use_activation=None)
+    w_activation = get_outputs(use_activation=True)
+    wo_activation = get_outputs(use_activation=False)

     assert torch.allclose(default, w_activation, atol=1e-2), (
         "Default should use activation."