Conversation

@princepride
Contributor

@princepride princepride commented Jul 3, 2025

Add an ignore pattern for the consolidated.safetensors file because that weights file is never used

Purpose

As described in #20025, the example code currently downloads the full set of weight files (about 100 GB), even though the consolidated.safetensors weights file is never used.
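
For reference, the fix amounts to passing an ignore pattern when the engine arguments are constructed in the example scripts. Below is a minimal sketch of the relevant change; other EngineArgs fields used by the example are abbreviated here, and only ignore_patterns is new in this PR:

from vllm import EngineArgs

engine_args = EngineArgs(
    model="mistralai/Mistral-Small-3.1-24B-Instruct-2503",
    max_model_len=8192,
    # Skip downloading the monolithic ~100 GB checkpoint; vLLM only
    # loads the sharded *.safetensors files for this model.
    ignore_patterns=["consolidated.safetensors"],
)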

Test Plan

python examples/offline_inference/vision_language.py -m mistral3
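
Optionally, to confirm locally that the large file was not fetched, the Hugging Face cache can be inspected after the run. This is not part of the PR, just a small sketch using huggingface_hub's cache scanner:

from huggingface_hub import scan_cache_dir

repo_id = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
for repo in scan_cache_dir().repos:
    if repo.repo_id == repo_id:
        cached = {f.file_name for rev in repo.revisions for f in rev.files}
        # Only the sharded *.safetensors files should be present.
        print("consolidated.safetensors cached?",
              "consolidated.safetensors" in cached)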

Test Result

root@5b4b5bc32e0b:/vllm# python examples/offline_inference/vision_language.py -m mistral3
INFO 07-03 07:12:13 [__init__.py:244] Automatically detected platform cuda.
INFO 07-03 07:12:22 [config.py:843] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 07-03 07:12:22 [config.py:1464] Using max model len 8192
INFO 07-03 07:12:22 [config.py:2277] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-03 07:12:22 [config.py:1750] Ignoring the following patterns when downloading weights: ['consolidated.safetensors']
/vllm/vllm/transformers_utils/tokenizer_group.py:24: FutureWarning: It is strongly recommended to run mistral models with `--tokenizer-mode "mistral"` to ensure correct encoding and decoding.
  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 07-03 07:12:24 [core.py:526] Waiting for init message from front-end.
INFO 07-03 07:12:24 [core.py:69] Initializing a V1 LLM engine (v0.9.2.dev391+g9965c47d0) with config: model='mistralai/Mistral-Small-3.1-24B-Instruct-2503', speculative_config=None, tokenizer='mistralai/Mistral-Small-3.1-24B-Instruct-2503', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=mistralai/Mistral-Small-3.1-24B-Instruct-2503, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-03 07:12:24 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-03 07:12:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_a2e883ee'), local_subscribe_addr='ipc:///tmp/704c4374-2070-47d0-8cc5-3b8a0bf1b535', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 07-03 07:12:25 [__init__.py:2782] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7201bd271330>
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:25 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_3ec674be'), local_subscribe_addr='ipc:///tmp/c5fd0039-4b18-4dbc-85d6-17477bd2e5e4', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 07-03 07:12:25 [__init__.py:2782] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7201bd2713f0>
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:25 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_36e64688'), local_subscribe_addr='ipc:///tmp/ec7cb39d-b95c-4050-aa03-3f09caa2cca9', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:25 [__init__.py:1132] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:25 [__init__.py:1132] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:26 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:26 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:26 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_6d9abb5e'), local_subscribe_addr='ipc:///tmp/5fe5dcf6-6324-46ed-ba31-0d0d5652a10c', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:26 [parallel_state.py:1076] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:26 [parallel_state.py:1076] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=1 pid=51615) /vllm/vllm/transformers_utils/tokenizer.py:277: FutureWarning: It is strongly recommended to run mistral models with `--tokenizer-mode "mistral"` to ensure correct encoding and decoding.
(VllmWorker rank=1 pid=51615)   return cached_get_tokenizer(
(VllmWorker rank=0 pid=51614) /vllm/vllm/transformers_utils/tokenizer.py:277: FutureWarning: It is strongly recommended to run mistral models with `--tokenizer-mode "mistral"` to ensure correct encoding and decoding.
(VllmWorker rank=0 pid=51614)   return cached_get_tokenizer(
(VllmWorker rank=1 pid=51615) image_token_id:  10
(VllmWorker rank=1 pid=51615) WARNING 07-03 07:12:28 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:28 [gpu_model_runner.py:1766] Starting to load model mistralai/Mistral-Small-3.1-24B-Instruct-2503...
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:28 [gpu_model_runner.py:1771] Loading model from scratch...
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:29 [cuda.py:270] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=51614) image_token_id:  10
(VllmWorker rank=0 pid=51614) WARNING 07-03 07:12:29 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:29 [gpu_model_runner.py:1766] Starting to load model mistralai/Mistral-Small-3.1-24B-Instruct-2503...
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:29 [weight_utils.py:292] Using model weights format ['*.safetensors']
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:29 [gpu_model_runner.py:1771] Loading model from scratch...
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:29 [cuda.py:270] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:29 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/10 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  10% Completed | 1/10 [00:00<00:07,  1.24it/s]
Loading safetensors checkpoint shards:  20% Completed | 2/10 [00:01<00:07,  1.13it/s]
Loading safetensors checkpoint shards:  30% Completed | 3/10 [00:02<00:06,  1.10it/s]
Loading safetensors checkpoint shards:  40% Completed | 4/10 [00:03<00:05,  1.15it/s]
Loading safetensors checkpoint shards:  50% Completed | 5/10 [00:04<00:04,  1.20it/s]
Loading safetensors checkpoint shards:  60% Completed | 6/10 [00:05<00:03,  1.24it/s]
Loading safetensors checkpoint shards:  70% Completed | 7/10 [00:05<00:02,  1.31it/s]
Loading safetensors checkpoint shards:  80% Completed | 8/10 [00:06<00:01,  1.30it/s]
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:36 [default_loader.py:272] Loading weights took 7.66 seconds
Loading safetensors checkpoint shards:  90% Completed | 9/10 [00:07<00:00,  1.27it/s]
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:37 [gpu_model_runner.py:1797] Model loading took 22.4151 GiB and 7.985083 seconds
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:07<00:00,  1.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:07<00:00,  1.27it/s]
(VllmWorker rank=0 pid=51614) 
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:37 [default_loader.py:272] Loading weights took 7.92 seconds
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:37 [gpu_model_runner.py:1797] Model loading took 22.4151 GiB and 8.259663 seconds
(VllmWorker rank=0 pid=51614) image_token_id:  10
(VllmWorker rank=1 pid=51615) image_token_id:  10
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:38 [gpu_model_runner.py:2234] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 2 image items of the maximum feature size.
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:38 [gpu_model_runner.py:2234] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 2 image items of the maximum feature size.
(VllmWorker rank=0 pid=51614) image_token_id:  10
(VllmWorker rank=1 pid=51615) image_token_id:  10
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:46 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/f06e7371c2/rank_1_0/backbone for vLLM's torch.compile
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:46 [backends.py:519] Dynamo bytecode transform time: 7.82 s
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:46 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/f06e7371c2/rank_0_0/backbone for vLLM's torch.compile
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:46 [backends.py:519] Dynamo bytecode transform time: 7.82 s
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:53 [backends.py:155] Directly load the compiled graph(s) for shape None from the cache, took 6.123 s
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:53 [backends.py:155] Directly load the compiled graph(s) for shape None from the cache, took 6.141 s
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:54 [monitor.py:34] torch.compile takes 7.82 s in total
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:54 [monitor.py:34] torch.compile takes 7.82 s in total
(VllmWorker rank=1 pid=51615) INFO 07-03 07:12:56 [gpu_worker.py:232] Available KV cache memory: 16.20 GiB
(VllmWorker rank=0 pid=51614) INFO 07-03 07:12:56 [gpu_worker.py:232] Available KV cache memory: 16.20 GiB
INFO 07-03 07:12:57 [kv_cache_utils.py:716] GPU KV cache size: 212,352 tokens
INFO 07-03 07:12:57 [kv_cache_utils.py:720] Maximum concurrency for 8,192 tokens per request: 25.92x
INFO 07-03 07:12:57 [kv_cache_utils.py:716] GPU KV cache size: 212,352 tokens
INFO 07-03 07:12:57 [kv_cache_utils.py:720] Maximum concurrency for 8,192 tokens per request: 25.92x
(VllmWorker rank=1 pid=51615) WARNING 07-03 07:12:57 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
(VllmWorker rank=0 pid=51614) WARNING 07-03 07:12:57 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████| 67/67 [00:22<00:00,  2.93it/s]
(VllmWorker rank=0 pid=51614) INFO 07-03 07:13:20 [custom_all_reduce.py:196] Registering 5360 cuda graph addresses
(VllmWorker rank=1 pid=51615) INFO 07-03 07:13:20 [custom_all_reduce.py:196] Registering 5360 cuda graph addresses
(VllmWorker rank=1 pid=51615) INFO 07-03 07:13:20 [gpu_model_runner.py:2322] Graph capturing finished in 23 secs, took 0.67 GiB
(VllmWorker rank=0 pid=51614) INFO 07-03 07:13:20 [gpu_model_runner.py:2322] Graph capturing finished in 23 secs, took 0.67 GiB
INFO 07-03 07:13:20 [core.py:172] init engine (profile, create kv cache, warmup model) took 42.11 seconds
/vllm/vllm/transformers_utils/tokenizer.py:277: FutureWarning: It is strongly recommended to run mistral models with `--tokenizer-mode "mistral"` to ensure correct encoding and decoding.
  return cached_get_tokenizer(
image_token_id:  10
Adding requests:   0%|                                                                              | 0/4 [00:00<?, ?it/s]image_token_id:  10
Adding requests:  25%|█████████████████▌                                                    | 1/4 [00:02<00:06,  2.18s/it]image_token_id:  10
image_token_id:  10
image_token_id:  10
Adding requests: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.80it/s]
Processed prompts: 100%|████████████| 4/4 [00:07<00:00,  1.89s/it, est. speed input: 1102.32 toks/s, output: 33.85 toks/s]
--------------------------------------------------
The image depicts a scenic view of cherry blossoms in full bloom with a clear blue sky in the background. In the distance, there is a tall, white, lattice-structured tower that resembles the Tokyo Tower in Japan. The cherry blossoms are in the foreground, partially framing the tower, creating a beautiful contrast between
--------------------------------------------------
The image depicts a scenic view of a cherry blossom tree in full bloom. The delicate pink flowers are in the foreground, creating a beautiful frame around the background. Through the branches and blossoms, a tall, white, lattice-structured tower is visible. This tower is the Tokyo Skytree, a broadcasting and
--------------------------------------------------
The image depicts a scenic view of a tall tower partially obscured by blooming cherry blossom trees in the foreground. The cherry blossoms are in full bloom, with delicate pink flowers covering the branches, creating a beautiful contrast against the clear blue sky. The tower in the background appears to be a communications or observation tower
--------------------------------------------------
The image depicts a beautiful scene of cherry blossoms in full bloom, with a clear blue sky in the background. The cherry blossoms are in the foreground, with their delicate pink and white flowers framing the image. Through the branches and blossoms, a tall, white, lattice-structured tower is visible. This tower is
--------------------------------------------------

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @princepride, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an efficiency issue in the vLLM example code for Mistral-3 models by ensuring that a large, unused model weights file is no longer downloaded. By specifying an ignore pattern during model initialization, the examples now operate with a reduced footprint, improving the initial setup experience for users.

Highlights

  • Resource Optimization: I've added a configuration to prevent the download of the consolidated.safetensors file for Mistral-3 models in the example code. This file, approximately 100 GB in size, was being downloaded unnecessarily as it is not used by the example, leading to significant savings in download time and disk space.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the documentation Improvements or additions to documentation label Jul 3, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a significant efficiency issue by preventing the unnecessary download of the large consolidated.safetensors file for the Mistral model in the example scripts. The addition of ignore_patterns=["consolidated.safetensors"] to the EngineArgs constructor in both vision_language.py and vision_language_multi_image.py is a direct and correct solution. The changes are clear, concise, and improve the user experience by reducing download size and time without introducing new complexities or correctness concerns. The test results confirm the successful implementation of this ignore pattern.

@github-actions

github-actions bot commented Jul 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@DarkLight1337
Member

cc @patrickvonplaten do you have any problems with this change?

@DarkLight1337 DarkLight1337 requested a review from mgoin July 3, 2025 14:13
Member

@mgoin mgoin left a comment


This is true and fair enough, thanks!

@mgoin mgoin enabled auto-merge (squash) July 3, 2025 21:52
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 3, 2025
@mgoin mgoin merged commit 25950dc into vllm-project:main Jul 4, 2025
62 checks passed
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>