
Conversation


DragonFive commented Feb 17, 2025

FIX #13370

In vllm/config.py, chunked prefill and prefix caching are forced off when MLA is enabled, but that check runs too late: if the user passes --enable-chunked-prefill for an MLA attention model, max_num_batched_tokens has already been defaulted to 2048 by the time the override happens.
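For illustration, here is a minimal, self-contained sketch of the ordering problem described above (the class and method names are invented for the example and do not mirror vllm/config.py): the chunked-prefill path picks the 2048-token budget first, and the later MLA check only flips the chunked-prefill flag, so the small budget survives.

```python
# Hypothetical sketch of the ordering issue; names are illustrative only
# and are not the actual vLLM code paths.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchSchedulerConfig:
    enable_chunked_prefill: bool
    max_num_batched_tokens: Optional[int] = None

    def apply_defaults(self) -> None:
        # Step 1 (engine-arg creation): chunked prefill chooses a small
        # token budget before any model-specific checks have run.
        if self.max_num_batched_tokens is None:
            self.max_num_batched_tokens = (
                2048 if self.enable_chunked_prefill else 8192
            )

    def apply_mla_override(self, uses_mla: bool) -> None:
        # Step 2 (later, during config validation): MLA forces chunked
        # prefill off, but the 2048 budget from step 1 is never revisited.
        if uses_mla and self.enable_chunked_prefill:
            print("MLA is enabled; forcing chunked prefill and prefix "
                  "caching to be disabled.")
            self.enable_chunked_prefill = False


cfg = SketchSchedulerConfig(enable_chunked_prefill=True)
cfg.apply_defaults()
cfg.apply_mla_override(uses_mla=True)
# Still 2048, which is what produces errors like
# "Input prompt (2501 tokens) is too long and exceeds limit of 2048".
print(cfg.max_num_batched_tokens)
```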

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

tlrmchlsmth (Member) left a comment

Thanks for the contribution!

Chunked prefill support for MLA #12639 will likely land today, so I don't think we need to do this. Otherwise I think this would be a good move. (Not closing this PR though, in case some last minute issue pops up with #12639)

mgoin (Member) commented Feb 25, 2025

Resolved by chunked prefill support

mgoin closed this Feb 25, 2025
dshwei commented Mar 13, 2025

vllm 0.7.1
torch 2.5.1

When using this vLLM version with VLLM_TORCH_PROFILER_DIR=./traces/ set, the serve command is:

VLLM_TORCH_PROFILER_DIR=./traces/ vllm serve /workspace/models/DeepSeek-V2-Lite-Chat \
    --gpu-memory-utilization 0.80 \
    --max-model-len 8000 \
    --max-num-batched-tokens 32000 \
    --max-num-seqs 1024 \
    --trust-remote-code \
    > deepseek-v2_triton$(date +%Y%m%d-%H%M).log &

MLA is enabled; forcing chunked prefill and prefix caching to be disabled.

INFO 03-13 03:06:05 init.py:183] Automatically detected platform cuda.
WARNING 03-13 03:06:06 api_server.py:610] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
async_args_only: False
parser: FlexibleArgumentParser(prog='vllm serve', usage='vllm serve <model_tag> [options]', description=None, formatter_class=<class 'vllm.utils.SortedHelpFormatter'>, conflict_handler='error', add_help=True)
INFO 03-13 03:06:06 api_server.py:838] vLLM API server version 0.7.1
INFO 03-13 03:06:06 api_server.py:839] args: Namespace(subparser='serve', model_tag='/workspace/models/DeepSeek-V2-Lite-Chat', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/workspace/models/DeepSeek-V2-Lite-Chat', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=32000, max_num_seqs=1024, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f968b7083a0>)
INFO 03-13 03:06:06 api_server.py:204] Started engine process with PID 372585
INFO 03-13 03:06:06 config.py:135] Replacing legacy 'type' key with 'rope_type'
INFO 03-13 03:06:11 init.py:183] Automatically detected platform cuda.
WARNING 03-13 03:06:12 api_server.py:610] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
INFO 03-13 03:06:12 config.py:135] Replacing legacy 'type' key with 'rope_type'
INFO 03-13 03:06:13 config.py:526] This model supports multiple tasks: {'reward', 'score', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 03-13 03:06:13 config.py:3257] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 03-13 03:06:18 config.py:526] This model supports multiple tasks: {'reward', 'embed', 'generate', 'score', 'classify'}. Defaulting to 'generate'.
INFO 03-13 03:06:18 config.py:3257] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 03-13 03:06:18 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='/workspace/models/DeepSeek-V2-Lite-Chat', speculative_config=None, tokenizer='/workspace/models/DeepSeek-V2-Lite-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/workspace/models/DeepSeek-V2-Lite-Chat, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[1024,1016,1008,1000,992,984,976,968,960,952,944,936,928,920,912,904,896,888,880,872,864,856,848,840,832,824,816,808,800,792,784,776,768,760,752,744,736,728,720,712,704,696,688,680,672,664,656,648,640,632,624,616,608,600,592,584,576,568,560,552,544,536,528,520,512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":1024}, use_cached_outputs=True,
INFO 03-13 03:06:19 cuda.py:166] Using Triton MLA backend.
WARNING 03-13 03:06:20 triton_decode_attention.py:42] The following error message 'operation scheduled before its operands' can be ignored.
INFO 03-13 03:06:20 worker.py:101] Profiling enabled. Traces will be saved to: ./traces/
INFO 03-13 03:06:20 model_runner.py:1111] Starting to load model /workspace/models/DeepSeek-V2-Lite-Chat...
INFO 03-13 03:06:20 cuda.py:166] Using Triton MLA backend.
INFO 03-13 03:06:35 model_runner.py:1116] Loading model weights took 31.1253 GB
WARNING 03-13 03:06:35 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /workspace/miniconda/envs/llama/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1408,device_name=NVIDIA_H100_80GB_HBM3.json
INFO 03-13 03:06:36 worker.py:266] Memory profiling takes 1.51 seconds
INFO 03-13 03:06:36 worker.py:266] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.80) = 63.46GiB
INFO 03-13 03:06:36 worker.py:266] model weights take 31.13GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 3.83GiB; the rest of the memory reserved for KV Cache is 28.34GiB.
INFO 03-13 03:06:36 executor_base.py:108] # CUDA blocks: 61150, # CPU blocks: 8630
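As a quick sanity check of which limits actually took effect, the scheduler config can be inspected offline. A rough sketch, assuming the LLM wrapper exposes llm_engine.scheduler_config as it does in vLLM 0.7.x (attribute names may differ between versions):

```python
# Inspect the effective scheduler limits; assumes llm_engine.scheduler_config
# is reachable as in vLLM ~0.7.x (treat attribute names as version-dependent).
from vllm import LLM

llm = LLM(
    model="/workspace/models/DeepSeek-V2-Lite-Chat",
    trust_remote_code=True,
    max_model_len=8000,
    max_num_batched_tokens=32000,
    max_num_seqs=1024,
    gpu_memory_utilization=0.80,
)
sched = llm.llm_engine.scheduler_config
print("chunked_prefill_enabled:", sched.chunked_prefill_enabled)
print("max_num_batched_tokens:", sched.max_num_batched_tokens)
```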



Development

Successfully merging this pull request may close these issues.

[Usage]: Input prompt (2501 tokens) is too long and exceeds limit of 2048
