
Commit 3c1c5b4

Update all images
Signed-off-by: mgoin <mgoin64@gmail.com>
1 parent 23c3ec0 commit 3c1c5b4

8 files changed, +32 -23 lines

_posts/2025-10-16-vllm-tpu.md

Lines changed: 4 additions & 3 deletions
@@ -1,7 +1,8 @@
---
-layout: post
-title: "vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU "
-author: "Google Team"
+layout: post
+title: "vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU"
+author: "Google Team"
+image: /assets/figures/vllm-tpu/vllm-tpu.png
---

<p align="center">

_posts/2025-10-22-agent-lightning.md

Lines changed: 4 additions & 3 deletions
@@ -1,7 +1,8 @@
---
-layout: post
-title: "No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL"
-author: "The Agent Lightning (AGL) Team"
+layout: post
+title: "No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL"
+author: "The Agent Lightning (AGL) Team"
+image: /assets/figures/agent-lightning/1_rewards.png
---

**TL;DR.** Agents often call LLMs via OpenAI‑compatible endpoints, which previously returned only string-based inputs and outputs. In **agent RL**, this can lead to inconsistencies between training and inference, a phenomenon we call **Retokenization Drift**: tokens are detokenized during inference and retokenized during training, and the two sets of tokens may differ even though their corresponding strings are identical. Now you can ask vLLM’s OpenAI‑compatible endpoints to return the **exact token IDs** for both prompts and generated responses. Pass `"return_token_ids": true` to `/v1/chat/completions` or `/v1/completions` and you’ll receive `prompt_token_ids` and `token_ids` alongside the regular text output. This makes **agent RL** robust, since the drift can no longer occur. It pairs perfectly with Agent Lightning, where each model call is treated as a separate update sample without stitching; just log the token IDs returned when `return_token_ids` is enabled.
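As a quick illustration, here is a minimal request sketch against a locally running vLLM server (the base URL and model name are placeholders, and the exact placement of the ID fields may vary by vLLM version):

```python
import requests

# Hypothetical local vLLM OpenAI-compatible server; model name is a placeholder.
BASE_URL = "http://localhost:8000/v1"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
        "return_token_ids": True,  # ask vLLM to echo the exact token IDs
    },
    timeout=60,
).json()

choice = resp["choices"][0]
print(choice["message"]["content"])  # regular text output
# Exact IDs for RL training -- no retokenization needed. Depending on the vLLM
# version, prompt_token_ids may sit on the response and token_ids on each choice.
print(resp.get("prompt_token_ids"), choice.get("token_ids"))
```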

_posts/2025-10-23-now_serving_nvidia_nemotron_with_vllm.md

Lines changed: 4 additions & 3 deletions
@@ -1,7 +1,8 @@
---
-layout: post
-title: "Now Serving NVIDIA Nemotron with vLLM"
-author: "NVIDIA Nemotron Team"
+layout: post
+title: "Now Serving NVIDIA Nemotron with vLLM"
+author: "NVIDIA Nemotron Team"
+image: /assets/figures/2025-vllm-nvidia-nemotron/figure1.png
---

Agentic AI systems, capable of reasoning, planning, and taking autonomous actions, are powering the next leap in developer applications. To build these systems, developers need tools that are open, efficient, and ready to scale. As demand for agents grows, open, performant models are key: they provide transparency, adaptability, and cost control.

_posts/2025-10-27-semantic-router-modular.md

Lines changed: 3 additions & 2 deletions
@@ -1,7 +1,8 @@
---
layout: post
-title: "From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA"
-author: "Ivar Flakstad (Hugging Face), OneZero-Y, Huamin Chen (Red Hat), Xunzhuo Liu (Tencent)"
+title: "From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA"
+author: "Ivar Flakstad (Hugging Face), OneZero-Y, Huamin Chen (Red Hat), Xunzhuo Liu (Tencent)"
+image: /assets/figures/semantic-router/modular.png
---

Semantic routing systems face a scaling challenge. When each classification request requires running multiple fine-tuned models independently, the computational cost grows linearly with the number of models. This post examines how a recent refactoring of the vLLM Semantic Router's Rust-based classification layer addresses this problem through architectural modularity, Low-Rank Adaptation (LoRA), and concurrency optimization.
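To see why LoRA changes the cost profile, here is a small hand-rolled sketch (NumPy, not the router's actual Rust classification layer; the task names, sizes, and rank are made up): each task-specific classifier is a low-rank delta applied on top of one shared base weight, so adding a task no longer means loading and running another full model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 768, 8                      # assumed hidden size and LoRA rank

# One shared base weight, loaded once.
W_base = rng.standard_normal((d_model, d_model))

def new_adapter():
    """A LoRA adapter: two small matrices instead of a full d_model x d_model copy."""
    A = 0.01 * rng.standard_normal((rank, d_model))
    B = 0.01 * rng.standard_normal((d_model, rank))
    return A, B

# Hypothetical classification tasks sharing the same base model.
adapters = {name: new_adapter() for name in ("intent", "pii", "jailbreak")}

def forward(x, task):
    """y = (W_base + B @ A) @ x, without materializing a separate model per task."""
    A, B = adapters[task]
    return W_base @ x + B @ (A @ x)

x = rng.standard_normal(d_model)
for task in adapters:
    print(task, forward(x, task)[:2])

# Parameter cost: N full copies ~ N * d_model**2, versus
# one shared base + N adapters ~ d_model**2 + N * 2 * rank * d_model.
```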

_posts/2025-10-28-Kimi-K2-Accuracy.md

Lines changed: 3 additions & 0 deletions
@@ -2,6 +2,7 @@
layout: post
title: "Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM"
author: "Linian Wang (Peking University)"
+image: /assets/figures/kimi-k2-accuracy/k2-vendor-verifier.jpeg
---

78
**TL;DR:** For best compatibility with vLLM, use Kimi K2 models whose chat templates were updated after commit 94a4053eb8863059dd8afc00937f054e1365abbd ([Kimi-K2-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)) or commit 0102674b179db4ca5a28cd9a4fb446f87f0c1454 ([Kimi-K2](https://huggingface.co/moonshotai/Kimi-K2-Instruct)). The updates are committed per model.
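One way to act on this advice is to pin the Hugging Face revision when loading the model. The sketch below is illustrative only (the tensor-parallel size is a placeholder, and Kimi K2 requires a large multi-GPU deployment in practice):

```python
from vllm import LLM

# Illustrative sketch: pin a revision whose chat template includes the fix
# referenced above ("main" once the update has landed, or the exact commit
# hash quoted in the TL;DR).
llm = LLM(
    model="moonshotai/Kimi-K2-Instruct-0905",
    revision="main",             # or the chat-template commit hash from the post
    trust_remote_code=True,      # may be needed for Kimi's custom tokenizer code
    tensor_parallel_size=16,     # placeholder; size this to your hardware
)
```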
@@ -152,6 +153,8 @@ Through systematic and collaborative debugging, we successfully resolved the cri

I hope this detailed account serves as a useful roadmap for other developers integrating complex models into vLLM and beyond. As the open-source community continues to mature, we look forward to an even more seamless model integration experience and more powerful agentic capabilities for everyone.

+![](/assets/figures/kimi-k2-accuracy/k2-vendor-verifier.jpeg)
+
### Acknowledgements

I'd like to extend my sincere gratitude to the engineers at the Kimi team. Their deep technical expertise was crucial in pinpointing the root causes, and they swiftly implemented the necessary fixes on the Hugging Face Hub once the issues were identified. This journey and its successful outcome would not have been possible without their active collaboration and support.

_posts/2025-10-31-run-multimodal-reasoning-agents-nvidia-nemotron.md

Lines changed: 3 additions & 2 deletions
@@ -1,7 +1,8 @@
---
layout: post
-title: "Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM"
-author: "NVIDIA Nemotron Team"
+title: "Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM"
+author: "NVIDIA Nemotron Team"
+image: /assets/figures/2025-multimodal-nvidia-nemotron/figure1.png
---

We are excited to release [NVIDIA Nemotron Nano 2 VL](https://huggingface.co/nvidia/Nemotron-Nano-12B-v2-VL-BF16), supported by vLLM. This open vision language model ([VLM](https://www.nvidia.com/en-us/glossary/vision-language-models/)) is built for video understanding and document intelligence.

_posts/2025-11-11-intel-arc-pro-b.md

Lines changed: 11 additions & 10 deletions
@@ -2,6 +2,7 @@
layout: post
title: "Fast and Affordable LLMs serving on Intel Arc Pro B-Series GPUs with vLLM"
author: "Intel vLLM Team"
+image: /assets/figures/2025-vllm-on-intel-arc/perf-figure1.png
---

78
[Intel® Arc™ Pro B-Series GPU Family](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/workstations/b-series/overview.html) GPUs deliver powerful AI capabilities with a focus on accessibility and exceptional price-to-performance ratios. Their large memory capacity and scalability in multi-GPU setups make it possible to run the latest large, capable AI models locally, bringing advanced AI inference to professionals who want to deploy Large Language Models (LLMs) without the premium costs typically associated with AI hardware.
@@ -51,8 +52,8 @@ Intel® Arc™ Pro B60 GPU has 20 XeCores, each with identical resources that ca
One observation is that each group runs a different amount of work due to the imbalance of expert routing. If each group loops over a fixed stride of work, there is always one group that takes the largest amount of work and another the smallest, and the gap between them can accumulate to as much as 15% of the total MoE GEMM time. A better alternative is for whichever group finishes a task in one loop to start the next available task immediately.
For a concrete example, with 40 groups crunching 200 GEMM blocks, a static stride means group 0 loops through blocks 0, 40, 80, ..., group 1 through blocks 1, 41, 81, and so on. A caveat is that, due to the nature of MoE, GEMM blocks do not all have the same compute intensity, and randomized access patterns let certain groups finish their work faster than others. This limits efficiency: the groups that always finish early cannot help the ones that always hit heavy loads.

-| Before | After |
-|---|---|
+| Before | After |
+| ----------------------------------------------------------------------- | ----------------------------------------------------------------------- |
| ![thread load](/assets/figures/2025-vllm-on-intel-arc/thread-load1.png) | ![thread load](/assets/figures/2025-vllm-on-intel-arc/thread-load2.png) |

We mitigate this effect by letting the groups compete for the next job through an atomic counter. Whichever group finishes computing a GEMM block gets a ticket from the atomic counter that decides which block it takes next. This eliminates the small gaps in the kernel loop and achieves perfect scheduling across all expert-routing scenarios.
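As a toy illustration of the two policies (a CPU-side simulation in Python, not the actual GPU kernel; the per-block costs are made up), the sketch below compares a static stride against handing out block indices from a shared counter to whichever group finishes first:

```python
import heapq

NUM_GROUPS, NUM_BLOCKS = 40, 200
# Hypothetical per-block costs; in real MoE GEMMs they vary with expert routing.
cost = [1 + (7 * i) % 13 for i in range(NUM_BLOCKS)]

# Static stride: group g always takes blocks g, g + 40, g + 80, ...
static_load = [sum(cost[g::NUM_GROUPS]) for g in range(NUM_GROUPS)]

# Atomic-counter scheduling: blocks are handed out in order, each one going to
# whichever group becomes free first (a min-heap of free times models this).
free_at = [(0, g) for g in range(NUM_GROUPS)]
heapq.heapify(free_at)
dynamic_load = [0] * NUM_GROUPS
for block in range(NUM_BLOCKS):
    t, g = heapq.heappop(free_at)        # the group that just finished
    dynamic_load[g] += cost[block]       # it grabs the next block index
    heapq.heappush(free_at, (t + cost[block], g))

print("static stride : max load", max(static_load), "min load", min(static_load))
print("atomic counter: max load", max(dynamic_load), "min load", min(dynamic_load))
```

The spread between the heaviest and lightest group is exactly the idle time the atomic counter removes.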
@@ -85,14 +86,14 @@ Figure 3: TTFT/TPOT for llama-70B single batch with long context input from 1K t

GPT-OSS: Intel® Arc™ Pro B60 GPU also demonstrates exceptional performance with OpenAI's recently launched GPT-OSS model, providing developers and enterprises with a powerful, cost-effective solution for large-scale AI inference as shown in the table below.

-| Model | Data type | TP | Input/output seq length | Concurrency | TTFT (s) | TPOT (ms) | Output Token Throughput (toks/s) |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| GPT-OSS-20b |MXFP4 |1 |1024/1024 |75 |7.614 |53.96 |1210.74|
-| GPT-OSS-20b |MXFP4 |1 |2048/2048 |38 |7.823 |42.35 |818.92 |
-| GPT-OSS-20b |MXFP4 |1 |5120/5120 |15 |8.36 |34.27 |416.94 |
-| GPT-OSS-120b |MXFP4 |4 |1024/1024 |100|8.04 |58.78 |1495.12|
-| GPT-OSS-120b |MXFP4 |4 |2048/2048 |50 |8.11 |41.98 |1085.58|
-| GPT-OSS-120b |MXFP4 |4 |5120/5120 |20 |8.60 |30.60 |619.10 |
+| Model        | Data type | TP  | Input/output seq length | Concurrency | TTFT (s) | TPOT (ms) | Output Token Throughput (toks/s) |
+| ------------ | --------- | --- | ----------------------- | ----------- | -------- | --------- | -------------------------------- |
+| GPT-OSS-20b  | MXFP4     | 1   | 1024/1024               | 75          | 7.614    | 53.96     | 1210.74                          |
+| GPT-OSS-20b  | MXFP4     | 1   | 2048/2048               | 38          | 7.823    | 42.35     | 818.92                           |
+| GPT-OSS-20b  | MXFP4     | 1   | 5120/5120               | 15          | 8.36     | 34.27     | 416.94                           |
+| GPT-OSS-120b | MXFP4     | 4   | 1024/1024               | 100         | 8.04     | 58.78     | 1495.12                          |
+| GPT-OSS-120b | MXFP4     | 4   | 2048/2048               | 50          | 8.11     | 41.98     | 1085.58                          |
+| GPT-OSS-120b | MXFP4     | 4   | 5120/5120               | 20          | 8.60     | 30.60     | 619.10                           |

Table 1: GPT-OSS vLLM inference throughput using 1-4 GPUs on x8 Intel® Arc™ Pro B-series System.

(binary file, 410 KB)
