
Commit 37b033e

noooop, gemini-code-assist[bot], DarkLight1337
authored and committed
[Frontend][Doc][5/N] Improve all pooling task | Polish encode (pooling) api & Document. (vllm-project#25524)
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
1 parent 1484d6a commit 37b033e

27 files changed

+499 -131 lines changed

docs/design/io_processor_plugins.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -79,7 +79,7 @@ The `post_process*` methods take `PoolingRequestOutput` objects as input and gen
7979
The `validate_or_generate_params` method is used for validating with the plugin any `SamplingParameters`/`PoolingParameters` received with the user request, or to generate new ones if none are specified. The function always returns the validated/generated parameters.
8080
The `output_to_response` method is used only for online serving and converts the plugin output to the `IOProcessorResponse` type that is then returned by the API Server. The implementation of the `/pooling` serving endpoint is available here [vllm/entrypoints/openai/serving_pooling.py](../../vllm/entrypoints/openai/serving_pooling.py).
8181

82-
An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/IBM/terratorch/tree/main/terratorch/vllm/plugins/segmentation). Please, also refer to our online ([examples/online_serving/prithvi_geospatial_mae.py](../../examples/online_serving/prithvi_geospatial_mae.py)) and offline ([examples/offline_inference/prithvi_geospatial_mae_io_processor.py](../../examples/offline_inference/prithvi_geospatial_mae_io_processor.py)) inference examples.
82+
An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/IBM/terratorch/tree/main/terratorch/vllm/plugins/segmentation). Please, also refer to our online ([examples/online_serving/pooling/prithvi_geospatial_mae.py](../../examples/online_serving/pooling/prithvi_geospatial_mae.py)) and offline ([examples/offline_inference/pooling/prithvi_geospatial_mae_io_processor.py](../../examples/offline_inference/pooling/prithvi_geospatial_mae_io_processor.py)) inference examples.
8383

8484
## Using an IO Processor plugin
8585

docs/models/pooling_models.md

Lines changed: 68 additions & 15 deletions
@@ -30,11 +30,11 @@ If `--runner pooling` has been set (manually or automatically) but the model doe
3030
vLLM will attempt to automatically convert the model according to the architecture names
3131
shown in the table below.
3232

33-
| Architecture | `--convert` | Supported pooling tasks |
34-
|-------------------------------------------------|-------------|-------------------------------|
35-
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `encode`, `embed` |
36-
| `*For*Classification`, `*ClassificationModel` | `classify` | `encode`, `classify`, `score` |
37-
| `*ForRewardModeling`, `*RewardModel` | `reward` | `encode` |
33+
| Architecture | `--convert` | Supported pooling tasks |
34+
|-------------------------------------------------|-------------|---------------------------------------|
35+
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `token_embed`, `embed` |
36+
| `*For*Classification`, `*ClassificationModel` | `classify` | `token_classify`, `classify`, `score` |
37+
| `*ForRewardModeling`, `*RewardModel` | `reward` | `token_classify` |
3838

3939
!!! tip
4040
You can explicitly set `--convert <type>` to specify how to convert the model.
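
The conversion table above can be read as an ordered pattern match on the architecture name. The following is a minimal sketch of that reading, not vLLM's actual resolution logic: the helper `guess_convert_type` and the rule ordering are assumptions made for illustration. The more specific suffixes are checked first because the catch-all `*Model` pattern would otherwise also match names ending in `RewardModel` or `ClassificationModel`.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns and task lists copied from the table above.
# The ordering (specific suffixes before the catch-all "*Model") is an
# assumption of this sketch, not something the documentation states.
CONVERT_RULES = [
    ("classify", ["*For*Classification", "*ClassificationModel"]),
    ("reward", ["*ForRewardModeling", "*RewardModel"]),
    ("embed", ["*ForTextEncoding", "*EmbeddingModel", "*Model"]),
]

SUPPORTED_TASKS = {
    "embed": ["token_embed", "embed"],
    "classify": ["token_classify", "classify", "score"],
    "reward": ["token_classify"],
}


def guess_convert_type(architecture: str) -> Optional[str]:
    """Hypothetical helper: return the `--convert` value the table suggests."""
    for convert_type, patterns in CONVERT_RULES:
        if any(fnmatchcase(architecture, pattern) for pattern in patterns):
            return convert_type
    return None  # no row of the table applies


for name in ["BertForSequenceClassification", "BertModel", "LlamaForCausalLM"]:
    convert = guess_convert_type(name)
    tasks = SUPPORTED_TASKS.get(convert, [])
    print(f"{name}: --convert {convert}, tasks={tasks}")
```

If the guess is wrong for a given checkpoint, setting `--convert <type>` explicitly, as the tip above describes, remains the reliable route.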
@@ -45,12 +45,14 @@ Each pooling model in vLLM supports one or more of these tasks according to
4545
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
4646
enabling the corresponding APIs:
4747

48-
| Task | APIs |
49-
|------------|--------------------------------------|
50-
| `encode` | `LLM.reward(...)` |
51-
| `embed` | `LLM.embed(...)`, `LLM.score(...)`\* |
52-
| `classify` | `LLM.classify(...)` |
53-
| `score` | `LLM.score(...)` |
48+
| Task | APIs |
49+
|------------------|-------------------------------------------------------------------------------|
50+
| `embed` | `LLM.embed(...)`, `LLM.score(...)`\*, `LLM.encode(..., pooling_task="embed")` |
51+
| `classify` | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")` |
52+
| `score` | `LLM.score(...)` |
53+
| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")` |
54+
| `token_embed` | `LLM.encode(..., pooling_task="token_embed")` |
55+
| `plugin` | `LLM.encode(..., pooling_task="plugin")` |
5456

5557
\* The `LLM.score(...)` API falls back to `embed` task if the model does not support `score` task.
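
The footnote's fallback rule can be sketched in a few lines. This is an illustration only: `resolve_score_task` is a hypothetical helper, not a vLLM function, and it assumes the list of supported tasks has already been obtained (for example from `Pooler.get_supported_tasks`).

```python
from typing import Iterable


def resolve_score_task(supported_tasks: Iterable[str]) -> str:
    """Hypothetical helper: pick the task that `LLM.score(...)` would use.

    Mirrors the footnote above: prefer `score`, fall back to `embed`,
    and fail if the model supports neither.
    """
    tasks = set(supported_tasks)
    if "score" in tasks:
        return "score"
    if "embed" in tasks:
        return "embed"
    raise ValueError(
        f"Scoring needs the 'score' or 'embed' task, got: {sorted(tasks)}"
    )


# A classification model exposes `score` directly.
print(resolve_score_task(["token_classify", "classify", "score"]))
# An embedding model falls back to `embed`.
print(resolve_score_task(["token_embed", "embed"]))
```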
5658

@@ -144,7 +146,6 @@ A code example can be found here: [examples/offline_inference/basic/score.py](..
144146
### `LLM.reward`
145147

146148
The [reward][vllm.LLM.reward] method is available to all reward models in vLLM.
147-
It returns the extracted hidden states directly.
148149

149150
```python
150151
from vllm import LLM
@@ -161,15 +162,17 @@ A code example can be found here: [examples/offline_inference/basic/reward.py](.
161162
### `LLM.encode`
162163

163164
The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
164-
It returns the extracted hidden states directly.
165165

166166
!!! note
167167
Please use one of the more specific methods or set the task directly when using `LLM.encode`:
168168

169169
- For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`.
170170
- For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`.
171-
- For rewards, use `LLM.reward(...)` or `pooling_task="reward"`.
172171
- For similarity scores, use `LLM.score(...)`.
172+
- For rewards, use `LLM.reward(...)` or `pooling_task="token_classify"`.
173+
- For token classification, use `pooling_task="token_classify"`.
174+
- For multi-vector retrieval, use `pooling_task="token_embed"`.
175+
- For IO Processor Plugins, use `pooling_task="plugin"`.
173176

174177
```python
175178
from vllm import LLM
@@ -185,10 +188,47 @@ print(f"Data: {data!r}")
185188

186189
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
187190

188-
- [Pooling API](../serving/openai_compatible_server.md#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
189191
- [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
190192
- [Classification API](../serving/openai_compatible_server.md#classification-api) is similar to `LLM.classify` and is applicable to sequence classification models.
191193
- [Score API](../serving/openai_compatible_server.md#score-api) is similar to `LLM.score` for cross-encoder models.
194+
- [Pooling API](../serving/openai_compatible_server.md#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
195+
196+
!!! note
197+
Please use one of the more specific methods or set the task directly when using the [Pooling API](../serving/openai_compatible_server.md#pooling-api):
198+
199+
- For embeddings, use [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) or `"task":"embed"`.
200+
- For classification logits, use [Classification API](../serving/openai_compatible_server.md#classification-api) or `"task":"classify"`.
201+
- For similarity scores, use [Score API](../serving/openai_compatible_server.md#score-api).
202+
- For rewards, use `"task":"token_classify"`.
203+
- For token classification, use `"task":"token_classify"`.
204+
- For multi-vector retrieval, use `"task":"token_embed"`.
205+
- For IO Processor Plugins, use `"task":"plugin"`.
206+
207+
```python
208+
# start a supported embeddings model server with `vllm serve`, e.g.
209+
# vllm serve intfloat/e5-small
210+
import requests
211+
212+
host = "localhost"
213+
port = "8000"
214+
model_name = "intfloat/e5-small"
215+
216+
api_url = f"http://{host}:{port}/pooling"
217+
218+
prompts = [
219+
"Hello, my name is",
220+
"The president of the United States is",
221+
"The capital of France is",
222+
"The future of AI is",
223+
]
224+
prompt = {"model": model_name, "input": prompts, "task": "embed"}
225+
226+
response = requests.post(api_url, json=prompt)
227+
228+
for output in response.json()["data"]:
229+
data = output["data"]
230+
print(f"Data: {data!r} (size={len(data)})")
231+
```
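
Because the `task` field is a plain string, a typo is easy to miss until the server responds. As a rough sketch, the request body from the example above could be built through a small client-side check first. `build_pooling_request` and `POOLING_TASKS` are hypothetical names introduced here; the task set is taken from the note above, and requests for the `plugin` task may need fields other than `input`.

```python
import json
from typing import Dict, List

# Task names listed in the note above for the Pooling API.
# Assumption: the server may accept a different set; this is a local check only.
POOLING_TASKS = {"embed", "classify", "token_classify", "token_embed", "plugin"}


def build_pooling_request(model: str, inputs: List[str], task: str) -> Dict:
    """Hypothetical helper: build a `/pooling` JSON body for text inputs."""
    if task not in POOLING_TASKS:
        raise ValueError(
            f"Unknown pooling task {task!r}; expected one of {sorted(POOLING_TASKS)}"
        )
    return {"model": model, "input": list(inputs), "task": task}


payload = build_pooling_request(
    "intfloat/e5-small", ["Hello, my name is"], "embed"
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as `json=payload` to `requests.post`, as in the example above.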
192232

193233
## Matryoshka Embeddings
194234

@@ -265,3 +305,16 @@ Expected output:
265305
```
266306

267307
An OpenAI client example can be found here: [examples/online_serving/pooling/openai_embedding_matryoshka_fy.py](../../examples/online_serving/pooling/openai_embedding_matryoshka_fy.py)
308+
309+
## Deprecated Features
310+
311+
### Encode task
312+
313+
We have split the `encode` task into two more specific token-wise tasks: `token_embed` and `token_classify`:
314+
315+
- `token_embed` is the same as `embed`, using normalization as the activation.
316+
- `token_classify` is the same as `classify`, using softmax as the default activation.
317+
318+
### Remove softmax from PoolingParams
319+
320+
We are going to remove `softmax` and `activation` from `PoolingParams`. Instead, you should set `use_activation`, since we actually allow `classify` and `token_classify` to use any activation function.

docs/serving/openai_compatible_server.md

Lines changed: 2 additions & 2 deletions
@@ -638,7 +638,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent
638638

639639
You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
640640

641-
Code example: [examples/online_serving/openai_cross_encoder_score.py](../../examples/online_serving/openai_cross_encoder_score.py)
641+
Code example: [examples/online_serving/pooling/openai_cross_encoder_score.py](../../examples/online_serving/pooling/openai_cross_encoder_score.py)
642642

643643
#### Single inference
644644

@@ -819,7 +819,7 @@ You can pass multi-modal inputs to scoring models by passing `content` including
819819
print("Scoring output:", response_json["data"][0]["score"])
820820
print("Scoring output:", response_json["data"][1]["score"])
821821
```
822-
Full example: [examples/online_serving/openai_cross_encoder_score_for_multimodal.py](../../examples/online_serving/openai_cross_encoder_score_for_multimodal.py)
822+
Full example: [examples/online_serving/pooling/openai_cross_encoder_score_for_multimodal.py](../../examples/online_serving/pooling/openai_cross_encoder_score_for_multimodal.py)
823823

824824
#### Extra parameters
825825

examples/offline_inference/pooling/README.md

Lines changed: 12 additions & 0 deletions
@@ -38,6 +38,18 @@ python examples/offline_inference/pooling/multi_vector_retrieval.py
3838
python examples/offline_inference/pooling/ner.py
3939
```
4040

41+
## Prithvi Geospatial MAE usage
42+
43+
```bash
44+
python examples/offline_inference/pooling/prithvi_geospatial_mae.py
45+
```
46+
47+
## IO Processor Plugins for Prithvi Geospatial MAE
48+
49+
```bash
50+
python examples/offline_inference/pooling/prithvi_geospatial_mae_io_processor.py
51+
```
52+
4153
## Qwen3 reranker usage
4254

4355
```bash

examples/offline_inference/pooling/ner.py

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ def main(args: Namespace):
3333
label_map = llm.llm_engine.vllm_config.model_config.hf_config.id2label
3434

3535
# Run inference
36-
outputs = llm.encode(prompts)
36+
outputs = llm.encode(prompts, pooling_task="token_classify")
3737

3838
for prompt, output in zip(prompts, outputs):
3939
logits = output.outputs.data

examples/online_serving/pooling/README.md

Lines changed: 35 additions & 5 deletions
@@ -3,65 +3,95 @@
33
## Cohere rerank usage
44

55
```bash
6+
# vllm serve BAAI/bge-reranker-base
67
python examples/online_serving/pooling/cohere_rerank_client.py
78
```
89

910
## Embedding requests base64 encoding_format usage
1011

1112
```bash
13+
# vllm serve intfloat/e5-small
1214
python examples/online_serving/pooling/embedding_requests_base64_client.py
1315
```
1416

1517
## Embedding requests bytes encoding_format usage
1618

1719
```bash
20+
# vllm serve intfloat/e5-small
1821
python examples/online_serving/pooling/embedding_requests_bytes_client.py
1922
```
2023

2124
## Jinaai rerank usage
2225

2326
```bash
27+
# vllm serve BAAI/bge-reranker-base
2428
python examples/online_serving/pooling/jinaai_rerank_client.py
2529
```
2630

2731
## Multi vector retrieval usage
2832

2933
```bash
34+
# vllm serve BAAI/bge-m3
3035
python examples/online_serving/pooling/multi_vector_retrieval_client.py
3136
```
3237

3338
## Named Entity Recognition (NER) usage
3439

3540
```bash
41+
# vllm serve boltuix/NeuroBERT-NER
3642
python examples/online_serving/pooling/ner_client.py
3743
```
3844

39-
## Openai chat embedding for multimodal usage
45+
## OpenAI chat embedding for multimodal usage
4046

4147
```bash
4248
python examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py
4349
```
4450

45-
## Openai classification usage
51+
## OpenAI classification usage
4652

4753
```bash
54+
# vllm serve jason9693/Qwen2.5-1.5B-apeach
4855
python examples/online_serving/pooling/openai_classification_client.py
4956
```
5057

51-
## Openai embedding usage
58+
## OpenAI cross_encoder score usage
5259

5360
```bash
61+
# vllm serve BAAI/bge-reranker-v2-m3
62+
python examples/online_serving/pooling/openai_cross_encoder_score.py
63+
```
64+
65+
## OpenAI cross_encoder score for multimodal usage
66+
67+
```bash
68+
# vllm serve jinaai/jina-reranker-m0
69+
python examples/online_serving/pooling/openai_cross_encoder_score_for_multimodal.py
70+
```
71+
72+
## OpenAI embedding usage
73+
74+
```bash
75+
# vllm serve intfloat/e5-small
5476
python examples/online_serving/pooling/openai_embedding_client.py
5577
```
5678

57-
## Openai embedding matryoshka dimensions usage
79+
## OpenAI embedding matryoshka dimensions usage
5880

5981
```bash
82+
# vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
6083
python examples/online_serving/pooling/openai_embedding_matryoshka_fy.py
6184
```
6285

63-
## Openai pooling usage
86+
## OpenAI pooling usage
6487

6588
```bash
89+
# vllm serve internlm/internlm2-1_8b-reward --trust-remote-code
6690
python examples/online_serving/pooling/openai_pooling_client.py
6791
```
92+
93+
## Online Prithvi Geospatial MAE usage
94+
95+
```bash
96+
python examples/online_serving/pooling/prithvi_geospatial_mae.py
97+
```
