
[executor] init local_rank as device index #13027

Merged: 1 commit merged into vllm-project:main on Feb 11, 2025

Conversation

@MengqingCao (Contributor) commented on Feb 10, 2025

What does this PR do

This PR initializes local_rank from the device index given in the device argument, if one is specified.
FIX #12967
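
Roughly, the change amounts to deriving the worker's local_rank from the device index when an explicit device such as "cuda:1" is configured, instead of always defaulting to 0. A minimal sketch of the idea (not the exact code touched by this PR; the helper name is hypothetical):

import torch

def local_rank_from_device(device: str, default: int = 0) -> int:
    # torch.device("cuda:1").index == 1, while torch.device("cuda").index is None
    idx = torch.device(device).index
    return idx if idx is not None else default

assert local_rank_from_device("cuda:1") == 1
assert local_rank_from_device("cuda") == 0

With local_rank set this way, per-worker tensors end up on the same card as the configured device.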

Bug description

When the device is set to a card other than card 0, as in the following code, a device conflict occurs because tensors such as attn_bias are placed on card 0 by default.

from vllm import LLM
llm = LLM("facebook/opt-125m", device="cuda:1")
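
For context, a minimal standalone illustration of why the conflict happens (assuming a machine with at least two CUDA devices; this snippet is an illustration, not part of the repro above): query tensors follow the configured cuda:1, while a helper tensor allocated on the bare "cuda" device lands on the default card 0.

import torch

query = torch.randn(4, 8, device="cuda:1")    # model/query tensors live on card 1
attn_bias = torch.zeros(4, 4, device="cuda")  # bare "cuda" resolves to the default card 0
print(query.device, attn_bias.device)         # cuda:1 cuda:0
# Any kernel that combines the two raises a device-mismatch error,
# analogous to the xformers ValueError in the traceback below.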

Before this PR

INFO 02-10 17:10:10 __init__.py:190] Automatically detected platform cuda.
INFO 02-10 17:10:17 config.py:542] This model supports multiple tasks: {'embed', 'score', 'classify', 'generate', 'reward'}. Defaulting to 'generate'.
WARNING 02-10 17:10:17 cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-10 17:10:17 config.py:678] Async output processing is not supported on the current platform type cuda.
INFO 02-10 17:10:17 llm_engine.py:234] Initializing a V0 LLM engine (v0.6.4.post2.dev395+g02222a02.d20241217) with config: model='/home/cmq/.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6', speculative_config=None, tokenizer='/home/cmq/.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda:1, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/cmq/.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
INFO 02-10 17:10:18 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-10 17:10:18 cuda.py:227] Using XFormers backend.
INFO 02-10 17:10:19 model_runner.py:1109] Starting to load model /home/cmq/.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6...
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.71it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.71it/s]

INFO 02-10 17:10:20 model_runner.py:1114] Loading model weights took 0.0000 GB
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/cmq/code/vllm/pipeline.py", line 48, in <module>
[rank0]:     llm = LLM("/home/cmq/.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6",
[rank0]:   File "/home/cmq/code/vllm/vllm/utils.py", line 1051, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/cmq/code/vllm/vllm/entrypoints/llm.py", line 242, in __init__
[rank0]:     self.llm_engine = self.engine_class.from_engine_args(
[rank0]:   File "/home/cmq/code/vllm/vllm/engine/llm_engine.py", line 484, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/cmq/code/vllm/vllm/engine/llm_engine.py", line 276, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/cmq/code/vllm/vllm/engine/llm_engine.py", line 416, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/home/cmq/code/vllm/vllm/executor/executor_base.py", line 101, in determine_num_available_blocks
[rank0]:     results = self.collective_rpc("determine_num_available_blocks")
[rank0]:   File "/home/cmq/code/vllm/vllm/executor/uniproc_executor.py", line 55, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/home/cmq/code/vllm/vllm/utils.py", line 2220, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/cmq/code/vllm/vllm/worker/worker.py", line 229, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/cmq/code/vllm/vllm/worker/model_runner.py", line 1234, in profile_run
[rank0]:     self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]:   File "/home/cmq/code/vllm/vllm/worker/model_runner.py", line 1345, in _dummy_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/cmq/code/vllm/vllm/worker/model_runner.py", line 1718, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/cmq/code/vllm/vllm/model_executor/models/opt.py", line 370, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/home/cmq/code/vllm/vllm/compilation/decorators.py", line 172, in __call__
[rank0]:     return self.forward(*args, **kwargs)
[rank0]:   File "/home/cmq/code/vllm/vllm/model_executor/models/opt.py", line 325, in forward
[rank0]:     return self.decoder(input_ids,
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/cmq/code/vllm/vllm/model_executor/models/opt.py", line 282, in forward
[rank0]:     hidden_states = layer(hidden_states,
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/cmq/code/vllm/vllm/model_executor/models/opt.py", line 175, in forward
[rank0]:     hidden_states = self.self_attn(hidden_states=hidden_states,
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/cmq/code/vllm/vllm/model_executor/models/opt.py", line 115, in forward
[rank0]:     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/cmq/code/vllm/vllm/attention/layer.py", line 201, in forward
[rank0]:     return torch.ops.vllm.unified_attention(
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]:   File "/home/cmq/code/vllm/vllm/attention/layer.py", line 307, in unified_attention
[rank0]:     return self.impl.forward(self, query, key, value, kv_cache, attn_metadata)
[rank0]:   File "/home/cmq/code/vllm/vllm/attention/backends/xformers.py", line 558, in forward
[rank0]:     out = self._run_memory_efficient_xformers_forward(
[rank0]:   File "/home/cmq/code/vllm/vllm/attention/backends/xformers.py", line 730, in _run_memory_efficient_xformers_forward
[rank0]:     out = xops.memory_efficient_attention_forward(
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 376, in memory_efficient_attention_forward
[rank0]:     return _memory_efficient_attention_forward(
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 483, in _memory_efficient_attention_forward
[rank0]:     inp.validate_inputs()
[rank0]:   File "/home/cmq/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/common.py", line 145, in validate_inputs
[rank0]:     raise ValueError(
[rank0]: ValueError: Attention bias and Query/Key/Value should be on the same device
[rank0]:   query.device: cuda:1
[rank0]:   attn_bias   : cuda:0

[rank0]:[W210 17:10:21.389456654 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

After this PR

INFO 02-10 17:12:24 __init__.py:190] Automatically detected platform cuda.
INFO 02-10 17:12:31 config.py:542] This model supports multiple tasks: {'score', 'generate', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 02-10 17:12:31 cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-10 17:12:31 config.py:678] Async output processing is not supported on the current platform type cuda.
INFO 02-10 17:12:31 llm_engine.py:234] Initializing a V0 LLM engine (v0.6.4.post2.dev395+g02222a02.d20241217) with config: model='/home/cmq/.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6', speculative_config=None, tokenizer='/home/cmq/.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda:1, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/cmq/.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
INFO 02-10 17:12:32 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-10 17:12:32 cuda.py:227] Using XFormers backend.
INFO 02-10 17:12:32 model_runner.py:1109] Starting to load model /home/cmq/.cache/huggingface/hub/models--facebook--opt-125m/snapshots/27dcfa74d334bc871f3234de431e71c6eeba5dd6...
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.86it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.86it/s]

INFO 02-10 17:12:33 model_runner.py:1114] Loading model weights took 0.2389 GB
INFO 02-10 17:12:34 worker.py:267] Memory profiling takes 0.63 seconds
INFO 02-10 17:12:34 worker.py:267] the current vLLM instance can use total_gpu_memory (14.57GiB) x gpu_memory_utilization (0.90) = 13.11GiB
INFO 02-10 17:12:34 worker.py:267] model weights take 0.24GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.47GiB; the rest of the memory reserved for KV Cache is 12.37GiB.
INFO 02-10 17:12:34 executor_base.py:110] # CUDA blocks: 22526, # CPU blocks: 7281
INFO 02-10 17:12:34 executor_base.py:115] Maximum concurrency for 2048 tokens per request: 175.98x
INFO 02-10 17:12:38 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 4.57 seconds
Processed prompts: 100%|████████████████████████████████████████████████| 6/6 [00:05<00:00,  1.13it/s, est. speed input: 7.18 toks/s, output: 290.34 toks/s]
Prompt: 'The president of the United States is', Generated text: " not a racist. He is a racist.\nHe's a racist because he's a racist.                                                                                                                                                                                                                                            "
Prompt: 'Hello, my name is', Generated text: ' J.C. and I am a student at the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California, Berkeley, and a graduate of the University of California, Berkeley. I am a graduate of the University of California'
Prompt: 'The future of AI is', Generated text: ' in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of'
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the French Republic.\n\nThe capital of France is the capital of the'
Prompt: 'The future of AI is', Generated text: ' in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of the people.\n\nThe future of AI is in the hands of'
Prompt: 'Hello, I come from', Generated text: ' a family of 4 and I am a single mom. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom of 3. I am a single mom'
[rank0]:[W210 17:12:45.595748139 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

Signed-off-by: Mengqing Cao <cmq0113@163.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@jeejeelee added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Feb 11, 2025

@tlrmchlsmth (Collaborator) left a comment

Thanks for the fix!

(cc @youkaichao in case you see any gotchas)

@youkaichao (Member) left a comment

I think the device argument is not really intended to be a way to specify the device index, but I have no objections to it.
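
For reference, the more conventional way to pin a single-GPU run to a specific card is to restrict device visibility through the environment rather than the device argument. A brief illustration, not something prescribed by this PR (the variable must be set before CUDA is initialized):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # set before any CUDA initialization

from vllm import LLM
llm = LLM("facebook/opt-125m")  # the process now sees physical GPU 1 as cuda:0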

@youkaichao merged commit 9cf4759 into vllm-project:main on Feb 11, 2025
49 checks passed
SzymonOzog pushed a commit to SzymonOzog/vllm that referenced this pull request Feb 12, 2025
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
kwang1012 pushed a commit to kwang1012/vllm that referenced this pull request Feb 12, 2025
Signed-off-by: Mengqing Cao <cmq0113@163.com>
panf2333 pushed a commit to yottalabsai/vllm that referenced this pull request Feb 18, 2025
Signed-off-by: Mengqing Cao <cmq0113@163.com>
kerthcet pushed a commit to kerthcet/vllm that referenced this pull request Feb 21, 2025
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Triton error when initializing LLM(...) when enable_lora=True and cuda device != cuda:0
4 participants