
Conversation

@hmellor (Member) commented May 7, 2025

Before:

$ vllm serve meta-llama/Llama-3.2-1B-Instruct --enforce-eager --gpu-memory-utilization 0.3
INFO 05-07 17:14:34 [__init__.py:248] Automatically detected platform cuda.
INFO 05-07 17:14:39 [api_server.py:1042] vLLM API server version 0.8.5.dev495+ge89b3f03a
INFO 05-07 17:14:39 [api_server.py:1043] args: Namespace(subparser='serve', model_tag='meta-llama/Llama-3.2-1B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.2-1B-Instruct', task='auto', tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='auto', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=None, quantization=None, enforce_eager=True, max_seq_len_to_capture=8192, max_logprobs=20, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, enable_prompt_embeds=False, served_model_name=None, disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_neuron_config={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', load_format='auto', download_dir=None, model_loader_extra_config={}, ignore_patterns=None, use_tqdm_on_load=True, qlora_adapter_name_or_path=None, pt_load_map_location='cpu', guided_decoding_backend='auto', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, enable_reasoning=None, reasoning_parser='', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, worker_cls='auto', worker_extension_cls='', block_size=None, gpu_memory_utilization=0.3, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, cuda_graph_sizes=[512], long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', compilation_config=None, kv_transfer_config=None, kv_events_config=None, 
additional_config=None, use_v2_block_manager=True, disable_log_stats=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7fd7d7374f40>)

After:

$ vllm serve meta-llama/Llama-3.2-1B-Instruct --enforce-eager --gpu-memory-utilization 0.3
INFO 05-07 17:03:11 [__init__.py:248] Automatically detected platform cuda.
INFO 05-07 17:03:16 [api_server.py:1043] vLLM API server version 0.8.5.dev495+ge89b3f03a
INFO 05-07 17:03:16 [cli_args.py:295] non-default args: {'model': 'meta-llama/Llama-3.2-1B-Instruct', 'enforce_eager': True, 'gpu_memory_utilization': 0.3}
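The gist of the change is easy to sketch: compare each parsed value against the parser's registered default and log only the differences. The snippet below is an illustrative approximation using plain argparse; the toy parser and the non_default_args helper are invented for the example and are not the actual cli_args.py implementation.

import argparse


def non_default_args(parser: argparse.ArgumentParser,
                     args: argparse.Namespace) -> dict:
    """Return only the values that differ from the parser's defaults."""
    return {
        name: value
        for name, value in vars(args).items()
        if parser.get_default(name) != value
    }


# Toy parser standing in for vLLM's much larger CLI (names/defaults assumed).
parser = argparse.ArgumentParser()
parser.add_argument("model")
parser.add_argument("--enforce-eager", action="store_true")
parser.add_argument("--gpu-memory-utilization", type=float, default=0.9)
parser.add_argument("--swap-space", type=int, default=4)

args = parser.parse_args(
    ["meta-llama/Llama-3.2-1B-Instruct",
     "--enforce-eager", "--gpu-memory-utilization", "0.3"]
)
print("non-default args:", non_default_args(parser, args))
# -> non-default args: {'model': 'meta-llama/Llama-3.2-1B-Instruct',
#                       'enforce_eager': True, 'gpu_memory_utilization': 0.3}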

cc @mgoin

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@github-actions bot commented May 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which exercises a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the frontend label May 7, 2025
@mgoin (Member) commented May 7, 2025

Can you add an example before and after for a command?

@hmellor (Member, Author) commented May 7, 2025

GitHub adding a scrollbar doesn't really do it justice, but I've added it to the PR description. The long line becomes a short line.

hmellor added 2 commits May 7, 2025 17:23
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@mgoin (Member) left a comment

LGTM this is exactly what I'd like to see, curious to get acceptance from others before landing

@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 7, 2025
@vllm-bot merged commit 998eea4 into vllm-project:main May 8, 2025
68 of 70 checks passed
@hmellor deleted the only-log-non-default-args branch May 8, 2025 06:59
@chaunceyjiang (Collaborator) commented:

Personally, I don't think this PR is very good. Non-default parameters are provided by the user, so the user is obviously aware of them and there's no need to print them. On the other hand, default parameters are what the user might not know, and printing them could actually be more helpful.

For example, parameters like swap_space=4, enable_prefix_caching=None, and prefix_caching_hash_algo='builtin' might give users a clearer understanding of whether the current vLLM configuration is reasonable.

@hmellor @mgoin @DarkLight1337

@chaunceyjiang (Collaborator) commented:

[screenshot: kube-controller-manager log output]
For example, in Kubernetes the core component kube-controller-manager prints all of its default parameters when the log level is set to --v=4.

This is very important in production systems, as it quickly shows the values of default parameters, which helps in analyzing whether those values are reasonable.

@hmellor (Member, Author) commented May 8, 2025

Thank you for raising your concern. However, I still think this is a positive change:

  • If users are examining logs, they may not know which arguments produced them, so isolating the arguments they explicitly passed is helpful.
  • The full engine configuration is logged a few lines later (and it includes all three of the example arguments from your first comment).
  • Given that the full engine arguments are logged anyway, printing the argparse namespace as well just creates noise, making the logs harder to parse.
INFO 05-08 10:47:08 [__init__.py:248] Automatically detected platform cuda.
INFO 05-08 10:47:13 [api_server.py:1044] vLLM API server version 0.8.5.dev495+ge89b3f03a
INFO 05-08 10:47:14 [cli_args.py:297] non-default args: {'model': 'meta-llama/Llama-3.2-1B-Instruct', 'gpu_memory_utilization': 0.4}
INFO 05-08 10:47:23 [config.py:752] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 05-08 10:47:23 [config.py:2057] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-08 10:47:28 [__init__.py:248] Automatically detected platform cuda.
INFO 05-08 10:47:32 [core.py:61] Initializing a V1 LLM engine (v0.8.5.dev495+ge89b3f03a) with config: model='meta-llama/Llama-3.2-1B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-1B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=meta-llama/Llama-3.2-1B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}

If you still believe that the full argparse namespace should be logged, we could add it back as a debug log?
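For concreteness, restoring the old dump behind the debug level could look roughly like the sketch below; it uses a toy parser and the standard logging module, so the names and setup are assumptions rather than the merged code.

import argparse
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cli_args")

# Toy parser standing in for the real CLI (names/defaults assumed).
parser = argparse.ArgumentParser()
parser.add_argument("--gpu-memory-utilization", type=float, default=0.9)
parser.add_argument("--swap-space", type=int, default=4)
args = parser.parse_args(["--gpu-memory-utilization", "0.3"])

non_default = {name: value for name, value in vars(args).items()
               if parser.get_default(name) != value}

# Kept at INFO: the concise summary this PR introduces.
logger.info("non-default args: %s", non_default)
# Added at DEBUG: the full namespace, visible only when the logging level
# is raised (e.g. logging.basicConfig(level=logging.DEBUG)).
logger.debug("full args: %s", vars(args))

That would keep the default startup output short while still letting anyone who raises the log level recover the complete namespace.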

@chaunceyjiang (Collaborator) commented:

@hmellor Thank you for the clarification. It's great to be able to see all the parameters.

@chaunceyjiang (Collaborator) commented:

If you still believe that the full argparse namespace should be logged, we could add it back as a debug log?

I have no further comments. Being able to see all the parameters is sufficient.

@mgoin (Member) commented May 8, 2025

LGTM this is exactly what I'd like to see, curious to get acceptance from others before landing

I didn't want to land this so hastily, since this is a log that has been present for a long time. Please respect requests to get sign-off from other committers.
I don't think the PR was even mentioned in the Slack, so this change might come as a surprise to many.

@DarkLight1337 (Member) commented:

Oh sorry, I forgot to comment. This LGTM as well, which is why I merged it.

princepride pushed a commit to princepride/vllm that referenced this pull request May 10, 2025
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>

Labels

frontend, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

None yet

5 participants