
Conversation

@hmellor (Member) commented May 7, 2025

Before:

$ vllm serve meta-llama/Llama-3.2-1B-Instruct --enforce-eager --gpu-memory-utilization 0.3
INFO 05-07 17:14:34 [__init__.py:248] Automatically detected platform cuda.
INFO 05-07 17:14:39 [api_server.py:1042] vLLM API server version 0.8.5.dev495+ge89b3f03a
INFO 05-07 17:14:39 [api_server.py:1043] args: Namespace(subparser='serve', model_tag='meta-llama/Llama-3.2-1B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.2-1B-Instruct', task='auto', tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='auto', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=None, quantization=None, enforce_eager=True, max_seq_len_to_capture=8192, max_logprobs=20, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, enable_prompt_embeds=False, served_model_name=None, disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_neuron_config={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', load_format='auto', download_dir=None, model_loader_extra_config={}, ignore_patterns=None, use_tqdm_on_load=True, qlora_adapter_name_or_path=None, pt_load_map_location='cpu', guided_decoding_backend='auto', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, enable_reasoning=None, reasoning_parser='', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, worker_cls='auto', worker_extension_cls='', block_size=None, gpu_memory_utilization=0.3, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, cuda_graph_sizes=[512], long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', compilation_config=None, kv_transfer_config=None, kv_events_config=None, 
additional_config=None, use_v2_block_manager=True, disable_log_stats=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7fd7d7374f40>)

After:

$ vllm serve meta-llama/Llama-3.2-1B-Instruct --enforce-eager --gpu-memory-utilization 0.3
INFO 05-07 17:03:11 [__init__.py:248] Automatically detected platform cuda.
INFO 05-07 17:03:16 [api_server.py:1043] vLLM API server version 0.8.5.dev495+ge89b3f03a
INFO 05-07 17:03:16 [cli_args.py:295] non-default args: {'model': 'meta-llama/Llama-3.2-1B-Instruct', 'enforce_eager': True, 'gpu_memory_utilization': 0.3}
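The gist of the change is easy to sketch: compare each parsed value against the parser's registered default and log only the differences. The snippet below is an illustrative approximation using plain argparse; the toy parser and the non_default_args helper are invented for the example and are not the actual cli_args.py implementation.

import argparse


def non_default_args(parser: argparse.ArgumentParser,
                     args: argparse.Namespace) -> dict:
    """Return only the values that differ from the parser's defaults."""
    return {
        name: value
        for name, value in vars(args).items()
        if parser.get_default(name) != value
    }


# Toy parser standing in for vLLM's much larger CLI (names/defaults assumed).
parser = argparse.ArgumentParser()
parser.add_argument("model")
parser.add_argument("--enforce-eager", action="store_true")
parser.add_argument("--gpu-memory-utilization", type=float, default=0.9)
parser.add_argument("--swap-space", type=int, default=4)

args = parser.parse_args(
    ["meta-llama/Llama-3.2-1B-Instruct",
     "--enforce-eager", "--gpu-memory-utilization", "0.3"]
)
print("non-default args:", non_default_args(parser, args))
# -> non-default args: {'model': 'meta-llama/Llama-3.2-1B-Instruct',
#                       'enforce_eager': True, 'gpu_memory_utilization': 0.3}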

cc @mgoin

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@github-actions bot commented May 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which exercises a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the frontend label May 7, 2025
@mgoin (Member) commented May 7, 2025

Can you add an example before and after for a command?

@hmellor (Member, Author) commented May 7, 2025

GitHub adding a scrollbar doesn't really do it justice, but I've added it to the PR description. The long line becomes a short line.

hmellor added 2 commits May 7, 2025 17:23
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@mgoin (Member) left a comment

LGTM this is exactly what I'd like to see, curious to get acceptance from others before landing

@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 7, 2025
@vllm-bot merged commit 998eea4 into vllm-project:main May 8, 2025
68 of 70 checks passed
@hmellor deleted the only-log-non-default-args branch May 8, 2025 06:59
@chaunceyjiang (Collaborator) commented:

Personally, I don't think this PR is very good. Non-default parameters are provided by the user, so the user is obviously aware of them and there's no need to print them. On the other hand, default parameters are what the user might not know, and printing them could actually be more helpful.

For example, parameters like swap_space=4, enable_prefix_caching=None, and prefix_caching_hash_algo='builtin' might give users a clearer understanding of whether the current vLLM configuration is reasonable.

@hmellor @mgoin @DarkLight1337

@chaunceyjiang (Collaborator) commented:

[screenshot: kube-controller-manager log output]
For example, in Kubernetes the core component kube-controller-manager prints all of its default parameters when the log level is set to --v=4.

This is very important in production systems, as it quickly shows the values of default parameters, which helps in analyzing whether those values are reasonable.

@hmellor (Member, Author) commented May 8, 2025

Thank you for raising your concern. However, I still think this is a positive change:

  • If users are examining logs, they may not know which arguments produced them, so isolating the arguments they explicitly passed is helpful.
  • The full engine configuration is logged a few lines later (and it includes all three of the example arguments from your first comment).
  • Given that the full engine arguments are logged anyway, printing the argparse namespace as well just creates noise, making the logs harder to parse.
INFO 05-08 10:47:08 [__init__.py:248] Automatically detected platform cuda.
INFO 05-08 10:47:13 [api_server.py:1044] vLLM API server version 0.8.5.dev495+ge89b3f03a
INFO 05-08 10:47:14 [cli_args.py:297] non-default args: {'model': 'meta-llama/Llama-3.2-1B-Instruct', 'gpu_memory_utilization': 0.4}
INFO 05-08 10:47:23 [config.py:752] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 05-08 10:47:23 [config.py:2057] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-08 10:47:28 [__init__.py:248] Automatically detected platform cuda.
INFO 05-08 10:47:32 [core.py:61] Initializing a V1 LLM engine (v0.8.5.dev495+ge89b3f03a) with config: model='meta-llama/Llama-3.2-1B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-1B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=meta-llama/Llama-3.2-1B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}

If you still believe that the full argparse namespace should be logged, we could add it back as a debug log?
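For concreteness, restoring the old dump behind the debug level could look roughly like the sketch below; it uses a toy parser and the standard logging module, so the names and setup are assumptions rather than the merged code.

import argparse
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cli_args")

# Toy parser standing in for the real CLI (names/defaults assumed).
parser = argparse.ArgumentParser()
parser.add_argument("--gpu-memory-utilization", type=float, default=0.9)
parser.add_argument("--swap-space", type=int, default=4)
args = parser.parse_args(["--gpu-memory-utilization", "0.3"])

non_default = {name: value for name, value in vars(args).items()
               if parser.get_default(name) != value}

# Kept at INFO: the concise summary this PR introduces.
logger.info("non-default args: %s", non_default)
# Added at DEBUG: the full namespace, visible only when the logging level
# is raised (e.g. logging.basicConfig(level=logging.DEBUG)).
logger.debug("full args: %s", vars(args))

That would keep the default startup output short while still letting anyone who raises the log level recover the complete namespace.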

@chaunceyjiang (Collaborator) commented:

@hmellor Thank you for the clarification. It's great to be able to see all the parameters.

@chaunceyjiang (Collaborator) commented:

If you still believe that the full argparse namespace should be logged, we could add it back as a debug log?

I have no further comments. Being able to see all the parameters is sufficient.

@mgoin (Member) commented May 8, 2025

LGTM this is exactly what I'd like to see, curious to get acceptance from others before landing

I didn't want to land this so hastily, since this is a log that has been present for a long time. Please respect requests to get sign-off from other committers.
I don't think the PR was even mentioned in the Slack, so this change might come as a surprise to many.

@DarkLight1337 (Member) commented:

Oh sorry, I forgot to comment. This LGTM as well, which is why I merged it.

princepride pushed a commit to princepride/vllm that referenced this pull request May 10, 2025
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>

Labels

frontend, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

None yet

5 participants