
Conversation


DragonFive commented Feb 17, 2025

FIX #13370

In vllm/config.py, chunked prefill and prefix caching are forced off when MLA is enabled, but that check runs too late: if the user passes --enable-chunked-prefill for an MLA attention model, max_num_batched_tokens has already been defaulted to 2048 by the time the override happens.
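For illustration, here is a minimal, self-contained sketch of the ordering problem described above (the class and method names are invented for the example and do not mirror vllm/config.py): the chunked-prefill path picks the 2048-token budget first, and the later MLA check only flips the chunked-prefill flag, so the small budget survives.

```python
# Hypothetical sketch of the ordering issue; names are illustrative only
# and are not the actual vLLM code paths.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchSchedulerConfig:
    enable_chunked_prefill: bool
    max_num_batched_tokens: Optional[int] = None

    def apply_defaults(self) -> None:
        # Step 1 (engine-arg creation): chunked prefill chooses a small
        # token budget before any model-specific checks have run.
        if self.max_num_batched_tokens is None:
            self.max_num_batched_tokens = (
                2048 if self.enable_chunked_prefill else 8192
            )

    def apply_mla_override(self, uses_mla: bool) -> None:
        # Step 2 (later, during config validation): MLA forces chunked
        # prefill off, but the 2048 budget from step 1 is never revisited.
        if uses_mla and self.enable_chunked_prefill:
            print("MLA is enabled; forcing chunked prefill and prefix "
                  "caching to be disabled.")
            self.enable_chunked_prefill = False


cfg = SketchSchedulerConfig(enable_chunked_prefill=True)
cfg.apply_defaults()
cfg.apply_mla_override(uses_mla=True)
# Still 2048, which is what produces errors like
# "Input prompt (2501 tokens) is too long and exceeds limit of 2048".
print(cfg.max_num_batched_tokens)
```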

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

tlrmchlsmth (Member) left a comment

Thanks for the contribution!

Chunked prefill support for MLA #12639 will likely land today, so I don't think we need to do this. Otherwise I think this would be a good move. (Not closing this PR though, in case some last minute issue pops up with #12639)

mgoin (Member) commented Feb 25, 2025

Resolved by chunked prefill support

mgoin closed this Feb 25, 2025
dshwei commented Mar 13, 2025

vllm 0.7.1
torch 2.5.1

When using this vLLM version with VLLM_TORCH_PROFILER_DIR=./traces/ set, the serve command is:

VLLM_TORCH_PROFILER_DIR=./traces/ vllm serve /workspace/models/DeepSeek-V2-Lite-Chat \
    --gpu-memory-utilization 0.80 \
    --max-model-len 8000 \
    --max-num-batched-tokens 32000 \
    --max-num-seqs 1024 \
    --trust-remote-code \
    > deepseek-v2_triton$(date +%Y%m%d-%H%M).log &

MLA is enabled; forcing chunked prefill and prefix caching to be disabled.

INFO 03-13 03:06:05 init.py:183] Automatically detected platform cuda.
WARNING 03-13 03:06:06 api_server.py:610] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
async_args_only: False
parser: FlexibleArgumentParser(prog='vllm serve', usage='vllm serve <model_tag> [options]', description=None, formatter_class=<class 'vllm.utils.SortedHelpFormatter'>, conflict_handler='error', add_help=True)
INFO 03-13 03:06:06 api_server.py:838] vLLM API server version 0.7.1
INFO 03-13 03:06:06 api_server.py:839] args: Namespace(subparser='serve', model_tag='/workspace/models/DeepSeek-V2-Lite-Chat', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/workspace/models/DeepSeek-V2-Lite-Chat', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=32000, max_num_seqs=1024, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f968b7083a0>)
INFO 03-13 03:06:06 api_server.py:204] Started engine process with PID 372585
INFO 03-13 03:06:06 config.py:135] Replacing legacy 'type' key with 'rope_type'
INFO 03-13 03:06:11 init.py:183] Automatically detected platform cuda.
WARNING 03-13 03:06:12 api_server.py:610] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
INFO 03-13 03:06:12 config.py:135] Replacing legacy 'type' key with 'rope_type'
INFO 03-13 03:06:13 config.py:526] This model supports multiple tasks: {'reward', 'score', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 03-13 03:06:13 config.py:3257] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 03-13 03:06:18 config.py:526] This model supports multiple tasks: {'reward', 'embed', 'generate', 'score', 'classify'}. Defaulting to 'generate'.
INFO 03-13 03:06:18 config.py:3257] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 03-13 03:06:18 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='/workspace/models/DeepSeek-V2-Lite-Chat', speculative_config=None, tokenizer='/workspace/models/DeepSeek-V2-Lite-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/workspace/models/DeepSeek-V2-Lite-Chat, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[1024,1016,1008,1000,992,984,976,968,960,952,944,936,928,920,912,904,896,888,880,872,864,856,848,840,832,824,816,808,800,792,784,776,768,760,752,744,736,728,720,712,704,696,688,680,672,664,656,648,640,632,624,616,608,600,592,584,576,568,560,552,544,536,528,520,512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":1024}, use_cached_outputs=True,
INFO 03-13 03:06:19 cuda.py:166] Using Triton MLA backend.
WARNING 03-13 03:06:20 triton_decode_attention.py:42] The following error message 'operation scheduled before its operands' can be ignored.
INFO 03-13 03:06:20 worker.py:101] Profiling enabled. Traces will be saved to: ./traces/
INFO 03-13 03:06:20 model_runner.py:1111] Starting to load model /workspace/models/DeepSeek-V2-Lite-Chat...
INFO 03-13 03:06:20 cuda.py:166] Using Triton MLA backend.
INFO 03-13 03:06:35 model_runner.py:1116] Loading model weights took 31.1253 GB
WARNING 03-13 03:06:35 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /workspace/miniconda/envs/llama/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1408,device_name=NVIDIA_H100_80GB_HBM3.json
INFO 03-13 03:06:36 worker.py:266] Memory profiling takes 1.51 seconds
INFO 03-13 03:06:36 worker.py:266] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.80) = 63.46GiB
INFO 03-13 03:06:36 worker.py:266] model weights take 31.13GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 3.83GiB; the rest of the memory reserved for KV Cache is 28.34GiB.
INFO 03-13 03:06:36 executor_base.py:108] # CUDA blocks: 61150, # CPU blocks: 8630
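As a quick sanity check of which limits actually took effect, the scheduler config can be inspected offline. A rough sketch, assuming the LLM wrapper exposes llm_engine.scheduler_config as it does in vLLM 0.7.x (attribute names may differ between versions):

```python
# Inspect the effective scheduler limits; assumes llm_engine.scheduler_config
# is reachable as in vLLM ~0.7.x (treat attribute names as version-dependent).
from vllm import LLM

llm = LLM(
    model="/workspace/models/DeepSeek-V2-Lite-Chat",
    trust_remote_code=True,
    max_model_len=8000,
    max_num_batched_tokens=32000,
    max_num_seqs=1024,
    gpu_memory_utilization=0.80,
)
sched = llm.llm_engine.scheduler_config
print("chunked_prefill_enabled:", sched.chunked_prefill_enabled)
print("max_num_batched_tokens:", sched.max_num_batched_tokens)
```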



Development

Successfully merging this pull request may close these issues.

[Usage]: Input prompt (2501 tokens) is too long and exceeds limit of 2048
