
Commit 3c1c5b4

Update all images
Signed-off-by: mgoin <mgoin64@gmail.com>
1 parent 23c3ec0 commit 3c1c5b4

8 files changed, +32 -23 lines

_posts/2025-10-16-vllm-tpu.md

Lines changed: 4 additions & 3 deletions
@@ -1,7 +1,8 @@
---
-layout: post
-title: "vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU "
-author: "Google Team"
+layout: post
+title: "vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU"
+author: "Google Team"
+image: /assets/figures/vllm-tpu/vllm-tpu.png
---

<p align="center">

_posts/2025-10-22-agent-lightning.md

Lines changed: 4 additions & 3 deletions
@@ -1,7 +1,8 @@
---
-layout: post
-title: "No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL"
-author: "The Agent Lightning (AGL) Team"
+layout: post
+title: "No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL"
+author: "The Agent Lightning (AGL) Team"
+image: /assets/figures/agent-lightning/1_rewards.png
---

**TL;DR.** Agents often call LLMs via OpenAI‑compatible endpoints, which previously returned only string-based inputs and outputs. In **agent RL**, this can lead to inconsistencies between training and inference, a phenomenon we call **Retokenization Drift**: tokens are detokenized during inference and retokenized during training, and the two sets of tokens may differ even though their corresponding strings are identical. Now you can ask vLLM’s OpenAI‑compatible endpoints to return the **exact token IDs** for both prompts and generated responses. Pass `"return_token_ids": true` to `/v1/chat/completions` or `/v1/completions` and you’ll receive `prompt_token_ids` and `token_ids` alongside the regular text output. This makes **agent RL** robust, since the drift can no longer occur. It pairs perfectly with Agent Lightning, where each model call is treated as a separate update sample without stitching; just log the token IDs returned when `return_token_ids` is enabled.
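As a quick illustration, here is a minimal request sketch against a locally running vLLM server (the base URL and model name are placeholders, and the exact placement of the ID fields may vary by vLLM version):

```python
import requests

# Hypothetical local vLLM OpenAI-compatible server; model name is a placeholder.
BASE_URL = "http://localhost:8000/v1"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
        "return_token_ids": True,  # ask vLLM to echo the exact token IDs
    },
    timeout=60,
).json()

choice = resp["choices"][0]
print(choice["message"]["content"])  # regular text output
# Exact IDs for RL training -- no retokenization needed. Depending on the vLLM
# version, prompt_token_ids may sit on the response and token_ids on each choice.
print(resp.get("prompt_token_ids"), choice.get("token_ids"))
```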

_posts/2025-10-23-now_serving_nvidia_nemotron_with_vllm.md

Lines changed: 4 additions & 3 deletions
@@ -1,7 +1,8 @@
---
-layout: post
-title: "Now Serving NVIDIA Nemotron with vLLM"
-author: "NVIDIA Nemotron Team"
+layout: post
+title: "Now Serving NVIDIA Nemotron with vLLM"
+author: "NVIDIA Nemotron Team"
+image: /assets/figures/2025-vllm-nvidia-nemotron/figure1.png
---

Agentic AI systems, capable of reasoning, planning, and taking autonomous actions, are powering the next leap in developer applications. To build these systems, developers need tools that are open, efficient, and ready to scale. As demand for agents grows, open, performant models are key: they provide transparency, adaptability, and cost control.

_posts/2025-10-27-semantic-router-modular.md

Lines changed: 3 additions & 2 deletions
@@ -1,7 +1,8 @@
---
layout: post
-title: "From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA"
-author: "Ivar Flakstad (Hugging Face), OneZero-Y, Huamin Chen (Red Hat), Xunzhuo Liu (Tencent)"
+title: "From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA"
+author: "Ivar Flakstad (Hugging Face), OneZero-Y, Huamin Chen (Red Hat), Xunzhuo Liu (Tencent)"
+image: /assets/figures/semantic-router/modular.png
---

Semantic routing systems face a scaling challenge. When each classification request requires running multiple fine-tuned models independently, the computational cost grows linearly with the number of models. This post examines how a recent refactoring of the vLLM Semantic Router's Rust-based classification layer addresses this problem through architectural modularity, Low-Rank Adaptation (LoRA), and concurrency optimization.
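To see why LoRA changes the cost profile, here is a small hand-rolled sketch (NumPy, not the router's actual Rust classification layer; the task names, sizes, and rank are made up): each task-specific classifier is a low-rank delta applied on top of one shared base weight, so adding a task no longer means loading and running another full model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 768, 8                      # assumed hidden size and LoRA rank

# One shared base weight, loaded once.
W_base = rng.standard_normal((d_model, d_model))

def new_adapter():
    """A LoRA adapter: two small matrices instead of a full d_model x d_model copy."""
    A = 0.01 * rng.standard_normal((rank, d_model))
    B = 0.01 * rng.standard_normal((d_model, rank))
    return A, B

# Hypothetical classification tasks sharing the same base model.
adapters = {name: new_adapter() for name in ("intent", "pii", "jailbreak")}

def forward(x, task):
    """y = (W_base + B @ A) @ x, without materializing a separate model per task."""
    A, B = adapters[task]
    return W_base @ x + B @ (A @ x)

x = rng.standard_normal(d_model)
for task in adapters:
    print(task, forward(x, task)[:2])

# Parameter cost: N full copies ~ N * d_model**2, versus
# one shared base + N adapters ~ d_model**2 + N * 2 * rank * d_model.
```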

_posts/2025-10-28-Kimi-K2-Accuracy.md

Lines changed: 3 additions & 0 deletions
@@ -2,6 +2,7 @@
layout: post
title: "Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM"
author: "Linian Wang (Peking University)"
+image: /assets/figures/kimi-k2-accuracy/k2-vendor-verifier.jpeg
---

78
**TL;DR:** For best compatibility with vLLM, use Kimi K2 models whose chat templates were updated after commit 94a4053eb8863059dd8afc00937f054e1365abbd ([Kimi-K2-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)) or commit 0102674b179db4ca5a28cd9a4fb446f87f0c1454 ([Kimi-K2](https://huggingface.co/moonshotai/Kimi-K2-Instruct)). The updates are committed per model.
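One way to act on this advice is to pin the Hugging Face revision when loading the model. The sketch below is illustrative only (the tensor-parallel size is a placeholder, and Kimi K2 requires a large multi-GPU deployment in practice):

```python
from vllm import LLM

# Illustrative sketch: pin a revision whose chat template includes the fix
# referenced above ("main" once the update has landed, or the exact commit
# hash quoted in the TL;DR).
llm = LLM(
    model="moonshotai/Kimi-K2-Instruct-0905",
    revision="main",             # or the chat-template commit hash from the post
    trust_remote_code=True,      # may be needed for Kimi's custom tokenizer code
    tensor_parallel_size=16,     # placeholder; size this to your hardware
)
```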
@@ -152,6 +153,8 @@ Through systematic and collaborative debugging, we successfully resolved the cri

I hope this detailed account serves as a useful roadmap for other developers integrating complex models into vLLM and beyond. As the open-source community continues to mature, we look forward to an even more seamless model integration experience and more powerful agentic capabilities for everyone.

+![](/assets/figures/kimi-k2-accuracy/k2-vendor-verifier.jpeg)
+
### Acknowledgements

I'd like to extend my sincere gratitude to the engineers at the Kimi team. Their deep technical expertise was crucial in pinpointing the root causes, and they swiftly implemented the necessary fixes on the Hugging Face Hub once the issues were identified. This journey and its successful outcome would not have been possible without their active collaboration and support.

_posts/2025-10-31-run-multimodal-reasoning-agents-nvidia-nemotron.md

Lines changed: 3 additions & 2 deletions
@@ -1,7 +1,8 @@
---
layout: post
-title: "Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM"
-author: "NVIDIA Nemotron Team"
+title: "Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM"
+author: "NVIDIA Nemotron Team"
+image: /assets/figures/2025-multimodal-nvidia-nemotron/figure1.png
---

We are excited to release [NVIDIA Nemotron Nano 2 VL](https://huggingface.co/nvidia/Nemotron-Nano-12B-v2-VL-BF16), supported by vLLM. This open vision language model ([VLM](https://www.nvidia.com/en-us/glossary/vision-language-models/)) is built for video understanding and document intelligence.

_posts/2025-11-11-intel-arc-pro-b.md

Lines changed: 11 additions & 10 deletions
@@ -2,6 +2,7 @@
layout: post
title: "Fast and Affordable LLMs serving on Intel Arc Pro B-Series GPUs with vLLM"
author: "Intel vLLM Team"
+image: /assets/figures/2025-vllm-on-intel-arc/perf-figure1.png
---

78
[Intel® Arc™ Pro B-Series GPU Family](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/workstations/b-series/overview.html) GPUs deliver powerful AI capabilities with a focus on accessibility and exceptional price-to-performance ratios. Their large memory capacity and scalability in multi-GPU setups make it possible to run the latest large, capable AI models locally, bringing advanced AI inference to professionals who want to deploy Large Language Models (LLMs) without the premium costs typically associated with AI hardware.
@@ -51,8 +52,8 @@ Intel® Arc™ Pro B60 GPU has 20 XeCores, each with identical resources that ca
One observation is that each group runs a different amount of work due to the imbalance of expert routing. If each group loops over a fixed stride of work, there is always one group that takes the largest amount of work and another the smallest, and the gap between them can accumulate to as much as 15% of the total MoE GEMM time. A better alternative is for whichever group finishes a task in one loop to start the next available task immediately.
For a concrete example, with 40 groups crunching 200 GEMM blocks, a static stride means group 0 loops through blocks 0, 40, 80, ..., group 1 through blocks 1, 41, 81, and so on. A caveat is that, due to the nature of MoE, GEMM blocks do not all have the same compute intensity, and randomized access patterns let certain groups finish their work faster than others. This limits efficiency: the groups that always finish early cannot help the ones that always hit heavy loads.

-| Before | After |
-|---|---|
+| Before | After |
+| ----------------------------------------------------------------------- | ----------------------------------------------------------------------- |
| ![thread load](/assets/figures/2025-vllm-on-intel-arc/thread-load1.png) | ![thread load](/assets/figures/2025-vllm-on-intel-arc/thread-load2.png) |

We mitigate this effect by letting the groups compete for the next job through an atomic counter. Whichever group finishes computing a GEMM block gets a ticket from the atomic counter that decides which block it takes next. This eliminates the small gaps in the kernel loop and achieves perfect scheduling across all expert-routing scenarios.
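As a toy illustration of the two policies (a CPU-side simulation in Python, not the actual GPU kernel; the per-block costs are made up), the sketch below compares a static stride against handing out block indices from a shared counter to whichever group finishes first:

```python
import heapq

NUM_GROUPS, NUM_BLOCKS = 40, 200
# Hypothetical per-block costs; in real MoE GEMMs they vary with expert routing.
cost = [1 + (7 * i) % 13 for i in range(NUM_BLOCKS)]

# Static stride: group g always takes blocks g, g + 40, g + 80, ...
static_load = [sum(cost[g::NUM_GROUPS]) for g in range(NUM_GROUPS)]

# Atomic-counter scheduling: blocks are handed out in order, each one going to
# whichever group becomes free first (a min-heap of free times models this).
free_at = [(0, g) for g in range(NUM_GROUPS)]
heapq.heapify(free_at)
dynamic_load = [0] * NUM_GROUPS
for block in range(NUM_BLOCKS):
    t, g = heapq.heappop(free_at)        # the group that just finished
    dynamic_load[g] += cost[block]       # it grabs the next block index
    heapq.heappush(free_at, (t + cost[block], g))

print("static stride : max load", max(static_load), "min load", min(static_load))
print("atomic counter: max load", max(dynamic_load), "min load", min(dynamic_load))
```

The spread between the heaviest and lightest group is exactly the idle time the atomic counter removes.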
@@ -85,14 +86,14 @@ Figure 3: TTFT/TPOT for llama-70B single batch with long context input from 1K t

GPT-OSS: Intel® Arc™ Pro B60 GPU also demonstrates exceptional performance with OpenAI's recently launched GPT-OSS model, providing developers and enterprises with a powerful, cost-effective solution for large-scale AI inference as shown in the table below.

-| Model | Data type | TP | Input/output seq length | Concurrency | TTFT (s) | TPOT (ms) | Output Token Throughput (toks/s) |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| GPT-OSS-20b |MXFP4 |1 |1024/1024 |75 |7.614 |53.96 |1210.74|
-| GPT-OSS-20b |MXFP4 |1 |2048/2048 |38 |7.823 |42.35 |818.92 |
-| GPT-OSS-20b |MXFP4 |1 |5120/5120 |15 |8.36 |34.27 |416.94 |
-| GPT-OSS-120b |MXFP4 |4 |1024/1024 |100|8.04 |58.78 |1495.12|
-| GPT-OSS-120b |MXFP4 |4 |2048/2048 |50 |8.11 |41.98 |1085.58|
-| GPT-OSS-120b |MXFP4 |4 |5120/5120 |20 |8.60 |30.60 |619.10 |
+| Model        | Data type | TP  | Input/output seq length | Concurrency | TTFT (s) | TPOT (ms) | Output Token Throughput (toks/s) |
+| ------------ | --------- | --- | ----------------------- | ----------- | -------- | --------- | -------------------------------- |
+| GPT-OSS-20b  | MXFP4     | 1   | 1024/1024               | 75          | 7.614    | 53.96     | 1210.74                          |
+| GPT-OSS-20b  | MXFP4     | 1   | 2048/2048               | 38          | 7.823    | 42.35     | 818.92                           |
+| GPT-OSS-20b  | MXFP4     | 1   | 5120/5120               | 15          | 8.36     | 34.27     | 416.94                           |
+| GPT-OSS-120b | MXFP4     | 4   | 1024/1024               | 100         | 8.04     | 58.78     | 1495.12                          |
+| GPT-OSS-120b | MXFP4     | 4   | 2048/2048               | 50          | 8.11     | 41.98     | 1085.58                          |
+| GPT-OSS-120b | MXFP4     | 4   | 5120/5120               | 20          | 8.60     | 30.60     | 619.10                           |

Table 1: GPT-OSS vLLM inference throughput using 1-4 GPUs on x8 Intel® Arc™ Pro B-series System.

(binary file, 410 KB)
