Conversation

@mgoin
Member

@mgoin mgoin commented Jul 28, 2025

Purpose

Adds a simple example showing how to implement AsyncLLM streaming in Python; a minimal sketch of the pattern is included below.
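For readers who have not opened the script, this is roughly the pattern it covers. A minimal sketch, assuming the V1 `AsyncLLM` interface (`AsyncLLM.from_engine_args` plus an async `generate` that yields `RequestOutput` objects) and borrowing the model name and request IDs from the test log below; it is not the exact contents of the file.

```python
# Minimal sketch: stream tokens from AsyncLLM as they are produced.
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.sampling_params import RequestOutputKind
from vllm.v1.engine.async_llm import AsyncLLM


async def stream(engine: AsyncLLM, prompt: str, request_id: str) -> None:
    # DELTA mode makes each RequestOutput carry only the newly generated
    # text, so it can be printed incrementally as it arrives.
    params = SamplingParams(max_tokens=100, temperature=0.8,
                            output_kind=RequestOutputKind.DELTA)
    async for output in engine.generate(prompt=prompt,
                                        sampling_params=params,
                                        request_id=request_id):
        print(output.outputs[0].text, end="", flush=True)
        if output.finished:
            print()


async def main() -> None:
    # Model and engine settings taken from the test log; adjust as needed.
    engine = AsyncLLM.from_engine_args(
        AsyncEngineArgs(model="meta-llama/Llama-3.2-1B-Instruct",
                        enforce_eager=True))
    try:
        await stream(engine, "The future of artificial intelligence is",
                     "stream-example-1")
    finally:
        engine.shutdown()


if __name__ == "__main__":
    asyncio.run(main())
```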

Test Plan

Run the script!

Test Result

python examples/offline_inference/async_llm_streaming.py
INFO 07-28 10:32:02 [__init__.py:235] Automatically detected platform cuda.
🔧 Initializing AsyncLLM...
INFO 07-28 10:32:08 [config.py:952] Resolved `--runner auto` to `--runner generate`. Pass the value explicitly to silence this message.
INFO 07-28 10:32:08 [config.py:1001] Resolved `--convert auto` to `--convert none`. Pass the value explicitly to silence this message.
INFO 07-28 10:32:08 [config.py:713] Resolved architecture: LlamaForCausalLM
INFO 07-28 10:32:08 [config.py:1718] Using max model len 131072
INFO 07-28 10:32:08 [config.py:2529] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 07-28 10:32:09 [core.py:587] Waiting for init message from front-end.
INFO 07-28 10:32:09 [core.py:73] Initializing a V1 LLM engine (v0.10.1.dev87+gde509ae8e) with config: model='meta-llama/Llama-3.2-1B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-1B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=meta-llama/Llama-3.2-1B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
INFO 07-28 10:32:11 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 07-28 10:32:11 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 07-28 10:32:11 [gpu_model_runner.py:1872] Starting to load model meta-llama/Llama-3.2-1B-Instruct...
INFO 07-28 10:32:11 [gpu_model_runner.py:1904] Loading model from scratch...
INFO 07-28 10:32:11 [cuda.py:287] Using FlashInfer backend with HND KV cache layout on V1 engine by default for Blackwell (SM 10.0) GPUs.
INFO 07-28 10:32:11 [weight_utils.py:296] Using model weights format ['*.safetensors']
INFO 07-28 10:32:11 [weight_utils.py:349] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.20it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.20it/s]

INFO 07-28 10:32:12 [default_loader.py:262] Loading weights took 0.53 seconds
INFO 07-28 10:32:12 [gpu_model_runner.py:1921] Model loading took 2.3185 GiB and 0.974669 seconds
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
INFO 07-28 10:32:13 [gpu_worker.py:265] Available KV cache memory: 157.91 GiB
INFO 07-28 10:32:13 [kv_cache_utils.py:833] GPU KV cache size: 5,174,352 tokens
INFO 07-28 10:32:13 [kv_cache_utils.py:837] Maximum concurrency for 131,072 tokens per request: 39.48x
INFO 07-28 10:32:13 [core.py:201] init engine (profile, create kv cache, warmup model) took 0.71 seconds
INFO 07-28 10:32:13 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 323397
🎯 Running 3 streaming examples...

============================================================
Example 1/3
============================================================

🚀 Prompt: 'The future of artificial intelligence is'
💬 Response: INFO 07-28 10:32:13 [async_llm.py:273] Added request stream-example-1.
/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
WARNING 07-28 10:32:14 [topk_topp_sampler.py:101] FlashInfer 0.2.3+ does not support per-request generators. Falling back to PyTorch-native implementation.
 in/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 the hands of students
The development of artificial intelligence (AI) is a rapidly evolving field that has the potential to transform many aspects of our lives. AI is a subset of machine learning, which is a type of computer science that involves training machines to learn from data without being explicitly programmed.

As AI continues to advance, it's likely to have a significant impact on various industries, including healthcare, finance, transportation, and education. In the educational sector, AI is being used to create personalized learning
✅ Generation complete!
INFO 07-28 10:32:14 [async_llm.py:432] Aborted request stream-example-1.
INFO 07-28 10:32:14 [async_llm.py:340] Request stream-example-1 aborted.

============================================================
Example 2/3
============================================================

🚀 Prompt: 'In a galaxy far, far away'
💬 Response: INFO 07-28 10:32:14 [async_llm.py:273] Added request stream-example-2.
, the Force is strong with this young Jedi Knight.
I have a problem. My lightsaber is not functioning properly. The blade is dull and the hilt is cold to the touch. I've tried polishing it with a fine crystal, but it doesn't seem to be making a difference.

As I stand in the dimly lit chamber, I can feel the power of the Force surrounding me. But I'm not getting the same sense of energy that I usually do when I'm wielding my lights
✅ Generation complete!
INFO 07-28 10:32:15 [async_llm.py:432] Aborted request stream-example-2.
INFO 07-28 10:32:15 [async_llm.py:340] Request stream-example-2 aborted.

============================================================
Example 3/3
============================================================

🚀 Prompt: 'The key to happiness is'
💬 Response: INFO 07-28 10:32:15 [async_llm.py:273] Added request stream-example-3.
 finding what makes you happy and doing what makes you happy. It's not about comparing yourself to others, it's about focusing on your own path and doing what brings you joy and fulfillment.

This statement emphasizes the importance of self-awareness, personal responsibility, and individuality in achieving happiness. It encourages people to look within themselves and find what truly makes them happy, rather than trying to emulate someone else's way of being. By focusing on their own path, individuals can cultivate a sense of purpose and
✅ Generation complete!

🎉 All streaming examples completed!
🔧 Shutting down engine...
INFO 07-28 10:32:16 [async_llm.py:432] Aborted request stream-example-3.
INFO 07-28 10:32:16 [async_llm.py:340] Request stream-example-3 aborted.

mgoin added 2 commits July 28, 2025 10:11
Signed-off-by: mgoin <mgoin64@gmail.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mgoin mgoin changed the title [Example] Add async_llm_streaming.py example for AsyncLLM streaming in python Jul 28, 2025
@mergify mergify bot added the documentation Improvements or additions to documentation label Jul 28, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This script demonstrates AsyncLLM streaming in Python, covering both delta and cumulative output modes. The display logic in cumulative streaming mode is broken for multi-line output; a fix is suggested in the review comments.
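For context, the two modes differ only in how the consumer assembles the text. A minimal sketch of that distinction, assuming `RequestOutputKind.DELTA`/`CUMULATIVE` on `SamplingParams` and an `AsyncLLM`-style async `generate`; the function names and accumulation loop are illustrative, not quoted from the script.

```python
# Sketch of delta vs. cumulative consumer loops. In DELTA mode each yield
# carries only new text; in CUMULATIVE mode each yield carries the full
# text so far, so the display must print only the new suffix (which is
# what makes naive cumulative display break on multi-line output).
from vllm import SamplingParams
from vllm.sampling_params import RequestOutputKind


async def consume_delta(engine, prompt: str, request_id: str) -> None:
    params = SamplingParams(max_tokens=100,
                            output_kind=RequestOutputKind.DELTA)
    async for out in engine.generate(prompt=prompt, sampling_params=params,
                                     request_id=request_id):
        # Each chunk is new text; append it directly.
        print(out.outputs[0].text, end="", flush=True)


async def consume_cumulative(engine, prompt: str, request_id: str) -> None:
    params = SamplingParams(max_tokens=100,
                            output_kind=RequestOutputKind.CUMULATIVE)
    previous = ""
    async for out in engine.generate(prompt=prompt, sampling_params=params,
                                     request_id=request_id):
        full = out.outputs[0].text
        # Print only the part not shown yet, even if it spans newlines.
        print(full[len(previous):], end="", flush=True)
        previous = full
```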

Signed-off-by: mgoin <mgoin64@gmail.com>
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 28, 2025
@mgoin mgoin merged commit 9cb497b into vllm-project:main Jul 31, 2025
53 checks passed
@mgoin mgoin deleted the examples-for-async-llm-streaming branch July 31, 2025 00:39
liuyumoye pushed a commit to liuyumoye/vllm that referenced this pull request Jul 31, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
vadiklyutiy pushed a commit to CentML/vllm that referenced this pull request Aug 5, 2025
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: x22x22 <wadeking@qq.com>
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Noam Gat <noamgat@gmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
… in python (vllm-project#21763)

Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025