Conversation

@mgoin
Member

@mgoin mgoin commented Jul 28, 2025

Purpose

Adds a simple example showing how to implement AsyncLLM streaming in Python; a minimal sketch of the pattern is included below.
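For readers who have not opened the script, this is roughly the pattern it covers. A minimal sketch, assuming the V1 `AsyncLLM` interface (`AsyncLLM.from_engine_args` plus an async `generate` that yields `RequestOutput` objects) and borrowing the model name and request IDs from the test log below; it is not the exact contents of the file.

```python
# Minimal sketch: stream tokens from AsyncLLM as they are produced.
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.sampling_params import RequestOutputKind
from vllm.v1.engine.async_llm import AsyncLLM


async def stream(engine: AsyncLLM, prompt: str, request_id: str) -> None:
    # DELTA mode makes each RequestOutput carry only the newly generated
    # text, so it can be printed incrementally as it arrives.
    params = SamplingParams(max_tokens=100, temperature=0.8,
                            output_kind=RequestOutputKind.DELTA)
    async for output in engine.generate(prompt=prompt,
                                        sampling_params=params,
                                        request_id=request_id):
        print(output.outputs[0].text, end="", flush=True)
        if output.finished:
            print()


async def main() -> None:
    # Model and engine settings taken from the test log; adjust as needed.
    engine = AsyncLLM.from_engine_args(
        AsyncEngineArgs(model="meta-llama/Llama-3.2-1B-Instruct",
                        enforce_eager=True))
    try:
        await stream(engine, "The future of artificial intelligence is",
                     "stream-example-1")
    finally:
        engine.shutdown()


if __name__ == "__main__":
    asyncio.run(main())
```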

Test Plan

Run the script!

Test Result

python examples/offline_inference/async_llm_streaming.py
INFO 07-28 10:32:02 [__init__.py:235] Automatically detected platform cuda.
🔧 Initializing AsyncLLM...
INFO 07-28 10:32:08 [config.py:952] Resolved `--runner auto` to `--runner generate`. Pass the value explicitly to silence this message.
INFO 07-28 10:32:08 [config.py:1001] Resolved `--convert auto` to `--convert none`. Pass the value explicitly to silence this message.
INFO 07-28 10:32:08 [config.py:713] Resolved architecture: LlamaForCausalLM
INFO 07-28 10:32:08 [config.py:1718] Using max model len 131072
INFO 07-28 10:32:08 [config.py:2529] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 07-28 10:32:09 [core.py:587] Waiting for init message from front-end.
INFO 07-28 10:32:09 [core.py:73] Initializing a V1 LLM engine (v0.10.1.dev87+gde509ae8e) with config: model='meta-llama/Llama-3.2-1B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-1B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=meta-llama/Llama-3.2-1B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
INFO 07-28 10:32:11 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 07-28 10:32:11 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 07-28 10:32:11 [gpu_model_runner.py:1872] Starting to load model meta-llama/Llama-3.2-1B-Instruct...
INFO 07-28 10:32:11 [gpu_model_runner.py:1904] Loading model from scratch...
INFO 07-28 10:32:11 [cuda.py:287] Using FlashInfer backend with HND KV cache layout on V1 engine by default for Blackwell (SM 10.0) GPUs.
INFO 07-28 10:32:11 [weight_utils.py:296] Using model weights format ['*.safetensors']
INFO 07-28 10:32:11 [weight_utils.py:349] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.20it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.20it/s]

INFO 07-28 10:32:12 [default_loader.py:262] Loading weights took 0.53 seconds
INFO 07-28 10:32:12 [gpu_model_runner.py:1921] Model loading took 2.3185 GiB and 0.974669 seconds
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
INFO 07-28 10:32:13 [gpu_worker.py:265] Available KV cache memory: 157.91 GiB
INFO 07-28 10:32:13 [kv_cache_utils.py:833] GPU KV cache size: 5,174,352 tokens
INFO 07-28 10:32:13 [kv_cache_utils.py:837] Maximum concurrency for 131,072 tokens per request: 39.48x
INFO 07-28 10:32:13 [core.py:201] init engine (profile, create kv cache, warmup model) took 0.71 seconds
INFO 07-28 10:32:13 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 323397
🎯 Running 3 streaming examples...

============================================================
Example 1/3
============================================================

🚀 Prompt: 'The future of artificial intelligence is'
💬 Response: INFO 07-28 10:32:13 [async_llm.py:273] Added request stream-example-1.
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
WARNING 07-28 10:32:14 [topk_topp_sampler.py:101] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
 in/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 the hands of students
The development of artificial intelligence (AI) is a rapidly evolving field that has the potential to transform many aspects of our lives. AI is a subset of machine learning, which is a type of computer science that involves training machines to learn from data without being explicitly programmed.

As AI continues to advance, it's likely to have a significant impact on various industries, including healthcare, finance, transportation, and education. In the educational sector, AI is being used to create personalized learning
✅ Generation complete!
INFO 07-28 10:32:14 [async_llm.py:432] Aborted request stream-example-1.
INFO 07-28 10:32:14 [async_llm.py:340] Request stream-example-1 aborted.

============================================================
Example 2/3
============================================================

🚀 Prompt: 'In a galaxy far, far away'
💬 Response: INFO 07-28 10:32:14 [async_llm.py:273] Added request stream-example-2.
, the Force is strong with this young Jedi Knight.
I have a problem. My lightsaber is not functioning properly. The blade is dull and the hilt is cold to the touch. I've tried polishing it with a fine crystal, but it doesn't seem to be making a difference.

As I stand in the dimly lit chamber, I can feel the power of the Force surrounding me. But I'm not getting the same sense of energy that I usually do when I'm wielding my lights
✅ Generation complete!
INFO 07-28 10:32:15 [async_llm.py:432] Aborted request stream-example-2.
INFO 07-28 10:32:15 [async_llm.py:340] Request stream-example-2 aborted.

============================================================
Example 3/3
============================================================

🚀 Prompt: 'The key to happiness is'
💬 Response: INFO 07-28 10:32:15 [async_llm.py:273] Added request stream-example-3.
 finding what makes you happy and doing what makes you happy. It's not about comparing yourself to others, it's about focusing on your own path and doing what brings you joy and fulfillment.

This statement emphasizes the importance of self-awareness, personal responsibility, and individuality in achieving happiness. It encourages people to look within themselves and find what truly makes them happy, rather than trying to emulate someone else's way of being. By focusing on their own path, individuals can cultivate a sense of purpose and
✅ Generation complete!

🎉 All streaming examples completed!
🔧 Shutting down engine...
INFO 07-28 10:32:16 [async_llm.py:432] Aborted request stream-example-3.
INFO 07-28 10:32:16 [async_llm.py:340] Request stream-example-3 aborted.

mgoin added 2 commits July 28, 2025 10:11
Signed-off-by: mgoin <mgoin64@gmail.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mgoin mgoin changed the title [Example] Add async_llm_streaming.py example for AsyncLLM streaming in python Jul 28, 2025
@mergify mergify bot added the documentation Improvements or additions to documentation label Jul 28, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This script demonstrates AsyncLLM streaming in Python, covering both delta and cumulative output modes. The display logic in cumulative streaming mode is broken for multi-line output; a fix is suggested in the review comments.
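For context, the two modes differ only in how the consumer assembles the text. A minimal sketch of that distinction, assuming `RequestOutputKind.DELTA`/`CUMULATIVE` on `SamplingParams` and an `AsyncLLM`-style async `generate`; the function names and accumulation loop are illustrative, not quoted from the script.

```python
# Sketch of delta vs. cumulative consumer loops. In DELTA mode each yield
# carries only new text; in CUMULATIVE mode each yield carries the full
# text so far, so the display must print only the new suffix (which is
# what makes naive cumulative display break on multi-line output).
from vllm import SamplingParams
from vllm.sampling_params import RequestOutputKind


async def consume_delta(engine, prompt: str, request_id: str) -> None:
    params = SamplingParams(max_tokens=100,
                            output_kind=RequestOutputKind.DELTA)
    async for out in engine.generate(prompt=prompt, sampling_params=params,
                                     request_id=request_id):
        # Each chunk is new text; append it directly.
        print(out.outputs[0].text, end="", flush=True)


async def consume_cumulative(engine, prompt: str, request_id: str) -> None:
    params = SamplingParams(max_tokens=100,
                            output_kind=RequestOutputKind.CUMULATIVE)
    previous = ""
    async for out in engine.generate(prompt=prompt, sampling_params=params,
                                     request_id=request_id):
        full = out.outputs[0].text
        # Print only the part not shown yet, even if it spans newlines.
        print(full[len(previous):], end="", flush=True)
        previous = full
```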

Signed-off-by: mgoin <mgoin64@gmail.com>
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 28, 2025
@mgoin mgoin merged commit 9cb497b into vllm-project:main Jul 31, 2025
53 checks passed
@mgoin mgoin deleted the examples-for-async-llm-streaming branch July 31, 2025 00:39
liuyumoye pushed a commit to liuyumoye/vllm that referenced this pull request Jul 31, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
vadiklyutiy pushed a commit to CentML/vllm that referenced this pull request Aug 5, 2025
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: x22x22 <wadeking@qq.com>
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Noam Gat <noamgat@gmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025