
add tqdm when loading checkpoint shards #6569

Merged
merged 5 commits into vllm-project:main on Jul 23, 2024

Conversation

zhaotyer
Contributor

When we load a model with relatively large parameters, loading takes a long time due to I/O limitations, but there is no way to see the model loading progress. In addition, sometimes some of the model's weight files are missing; vLLM still loads successfully, but the answers come out garbled. I therefore added tqdm to display the current weight loading progress.
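
As a rough illustration of the idea (a minimal sketch, not the actual change in this PR; shard_files and load_shard are hypothetical placeholders), wrapping the list of shard files in tqdm produces a per-shard progress bar like the ones in the log further down:

from tqdm.auto import tqdm

def load_all_shards(shard_files, load_shard):
    # shard_files: checkpoint file paths; load_shard: reads one file and
    # copies its weights into the model. Both are hypothetical placeholders.
    for shard in tqdm(shard_files, desc="Loading checkpoint shards"):
        load_shard(shard)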


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@zhaotyer
Contributor Author

@youkaichao @WoosukKwon Could you take a look and see whether this feature can be added?

@youkaichao
Member

I think this is a good idea. Two things to note:

@zhaotyer
Contributor Author

I think this is a good idea. Two things to note:

Thank you for your reply.
1. Adding tqdm in loader.py takes effect for the various weight formats.
2. I tested with --tensor-parallel-size=2; the log is as follows:

INFO 07-20 14:57:28 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/models/atom/1/local_model/base_model/', speculative_config=None, tokenizer='/models/atom/1/local_model/base_model/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/models/atom/1/local_model/base_model/)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=72804) INFO 07-20 14:57:37 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 07-20 14:57:37 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=72804) INFO 07-20 14:57:37 utils.py:637] Found nccl from library libnccl.so.2
INFO 07-20 14:57:37 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=72804) INFO 07-20 14:57:37 pynccl.py:63] vLLM is using nccl==2.20.5
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/resource_tracker.py", line 201, in main
    cache[rtype].remove(name)
KeyError: '/psm_70d67c33'
INFO 07-20 14:57:38 custom_all_reduce_utils.py:170] generating GPU P2P access cache in /root/.config/vllm/gpu_p2p_access_cache_for_4,6.json
(VllmWorkerProcess pid=72804) INFO 07-20 14:58:32 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_4,6.json
INFO 07-20 14:58:32 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_4,6.json
Loading checkpoint shards:   0%|                                                                                                                                                  | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.49s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.50s/it]
(VllmWorkerProcess pid=72804) INFO 07-20 14:58:38 model_runner.py:160] Loading model weights took 7.1438 GB
INFO 07-20 14:58:38 model_runner.py:160] Loading model weights took 7.1438 GB
INFO 07-20 14:58:46 distributed_gpu_executor.py:56] # GPU blocks: 46622, # CPU blocks: 9362
(VllmWorkerProcess pid=72804) INFO 07-20 14:58:49 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=72804) INFO 07-20 14:58:49 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-20 14:58:50 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-20 14:58:50 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=72804) INFO 07-20 14:59:07 custom_all_reduce.py:267] Registering 1995 cuda graph addresses
INFO 07-20 14:59:07 custom_all_reduce.py:267] Registering 1995 cuda graph addresses
INFO 07-20 14:59:07 model_runner.py:965] Graph capturing finished in 18 secs.
(VllmWorkerProcess pid=72804) INFO 07-20 14:59:07 model_runner.py:965] Graph capturing finished in 18 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

@youkaichao
Member

The code in this PR is too hacky to maintain in the future. I suggest just adding tqdm to each loading format.

@zhaotyer
Contributor Author

The code in this PR is too hacky to maintain in the future. I suggest just adding tqdm to each loading format.

I made the changes as you suggested.

Comment on lines 359 to 360
hf_weights_files = tqdm(hf_weights_files,desc="Loading safetensors checkpoint shards")
for st_file in hf_weights_files:
Member

The common practice would be:

for st_file in tqdm(hf_weights_files, desc="Loading safetensors checkpoint shards"):
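
For reference, a self-contained sketch of that pattern (assuming the tqdm and safetensors packages are installed; iterate_shards is an illustrative helper name, not vLLM's actual loader API):

from safetensors.torch import load_file
from tqdm.auto import tqdm

def iterate_shards(hf_weights_files):
    # Advance the progress bar once per safetensors shard file.
    for st_file in tqdm(hf_weights_files,
                        desc="Loading safetensors checkpoint shards"):
        state_dict = load_file(st_file)  # load one shard into CPU memory
        for name, tensor in state_dict.items():
            yield name, tensor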

@youkaichao
Member

Please check https://github.com/vllm-project/vllm/blob/main/CONTRIBUTING.md for how to format the code locally.

@youkaichao
Member

@zhaotyer I pushed to this branch so that it can be merged quickly. Thanks for your initial contribution!

@youkaichao
Member

This should help people understand the weight loading time and avoid users thinking vLLM hangs, just like #6636.

@youkaichao
Member

Locally tested; it works.

@youkaichao merged commit e519ae0 into vllm-project:main on Jul 23, 2024
17 of 19 checks passed
@zhaotyer
Contributor Author

Please check https://github.com/vllm-project/vllm/blob/main/CONTRIBUTING.md for how to format the code locally.

OK, thanks.

xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
gnpinkert pushed a commit to gnpinkert/vllm that referenced this pull request Jul 26, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
cduk pushed a commit to cduk/vllm-pascal that referenced this pull request Aug 6, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>