
add tqdm when loading checkpoint shards #6569

Merged
merged 5 commits into vllm-project:main on Jul 23, 2024

Conversation

zhaotyer
Contributor

When we load a model with relatively large parameters, loading takes a long time due to I/O limitations, but there is no way to see the model loading progress. In addition, sometimes some of the model's weight files are missing; vLLM still loads successfully, but the answers come out garbled. I therefore added tqdm to display the current weight loading progress.
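
As a rough illustration of the idea (a minimal sketch, not the actual change in this PR; shard_files and load_shard are hypothetical placeholders), wrapping the list of shard files in tqdm produces a per-shard progress bar like the ones in the log further down:

from tqdm.auto import tqdm

def load_all_shards(shard_files, load_shard):
    # shard_files: checkpoint file paths; load_shard: reads one file and
    # copies its weights into the model. Both are hypothetical placeholders.
    for shard in tqdm(shard_files, desc="Loading checkpoint shards"):
        load_shard(shard)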


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@zhaotyer
Contributor Author

@youkaichao @WoosukKwon Could you take a look and see whether this feature can be added?

@youkaichao
Member

I think this is a good idea. Two things to note:

@zhaotyer
Contributor Author

I think this is a good idea. Two things to note:

Thank you for your reply.
1. Adding tqdm in loader.py takes effect for the various weight formats.
2. I tested with --tensor-parallel-size=2; the log is as follows:

INFO 07-20 14:57:28 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/models/atom/1/local_model/base_model/', speculative_config=None, tokenizer='/models/atom/1/local_model/base_model/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/models/atom/1/local_model/base_model/)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=72804) INFO 07-20 14:57:37 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 07-20 14:57:37 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=72804) INFO 07-20 14:57:37 utils.py:637] Found nccl from library libnccl.so.2
INFO 07-20 14:57:37 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=72804) INFO 07-20 14:57:37 pynccl.py:63] vLLM is using nccl==2.20.5
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/resource_tracker.py", line 201, in main
    cache[rtype].remove(name)
KeyError: '/psm_70d67c33'
INFO 07-20 14:57:38 custom_all_reduce_utils.py:170] generating GPU P2P access cache in /root/.config/vllm/gpu_p2p_access_cache_for_4,6.json
(VllmWorkerProcess pid=72804) INFO 07-20 14:58:32 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_4,6.json
INFO 07-20 14:58:32 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_4,6.json
Loading checkpoint shards:   0%|                                                                                                                                                  | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.49s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.50s/it]
(VllmWorkerProcess pid=72804) INFO 07-20 14:58:38 model_runner.py:160] Loading model weights took 7.1438 GB
INFO 07-20 14:58:38 model_runner.py:160] Loading model weights took 7.1438 GB
INFO 07-20 14:58:46 distributed_gpu_executor.py:56] # GPU blocks: 46622, # CPU blocks: 9362
(VllmWorkerProcess pid=72804) INFO 07-20 14:58:49 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=72804) INFO 07-20 14:58:49 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-20 14:58:50 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-20 14:58:50 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=72804) INFO 07-20 14:59:07 custom_all_reduce.py:267] Registering 1995 cuda graph addresses
INFO 07-20 14:59:07 custom_all_reduce.py:267] Registering 1995 cuda graph addresses
INFO 07-20 14:59:07 model_runner.py:965] Graph capturing finished in 18 secs.
(VllmWorkerProcess pid=72804) INFO 07-20 14:59:07 model_runner.py:965] Graph capturing finished in 18 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

@youkaichao
Member

The code in this PR is too hacky to maintain in the future. I suggest just adding tqdm to each loading format.

@zhaotyer
Contributor Author

The code in this PR is too hacky to maintain in the future. I suggest just adding tqdm to each loading format.

I made the changes as you suggested.

Comment on lines 359 to 360
hf_weights_files = tqdm(hf_weights_files,desc="Loading safetensors checkpoint shards")
for st_file in hf_weights_files:
Member

The common practice would be:

for st_file in tqdm(hf_weights_files, desc="Loading safetensors checkpoint shards"):
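
For reference, a self-contained sketch of that pattern (assuming the tqdm and safetensors packages are installed; iterate_shards is an illustrative helper name, not vLLM's actual loader API):

from safetensors.torch import load_file
from tqdm.auto import tqdm

def iterate_shards(hf_weights_files):
    # Advance the progress bar once per safetensors shard file.
    for st_file in tqdm(hf_weights_files,
                        desc="Loading safetensors checkpoint shards"):
        state_dict = load_file(st_file)  # load one shard into CPU memory
        for name, tensor in state_dict.items():
            yield name, tensor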

@youkaichao
Member

Please check https://github.com/vllm-project/vllm/blob/main/CONTRIBUTING.md for how to format the code locally.

@youkaichao
Member

@zhaotyer I pushed to this branch so that it can be merged quickly. Thanks for your initial contribution!

@youkaichao
Member

This should help people understand the weight loading time and avoid users thinking vLLM hangs, just like #6636.

@youkaichao
Member

Locally tested; it works.

@youkaichao merged commit e519ae0 into vllm-project:main on Jul 23, 2024
17 of 19 checks passed
@zhaotyer
Contributor Author

Please check https://github.com/vllm-project/vllm/blob/main/CONTRIBUTING.md for how to format the code locally.

OK, thanks.

xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
gnpinkert pushed a commit to gnpinkert/vllm that referenced this pull request Jul 26, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
cduk pushed a commit to cduk/vllm-pascal that referenced this pull request Aug 6, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>