-
-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: call for stack trace for "Watchdog caught collective operation timeout" #6042
Comments
|
I have the same issue, do you manage to solve it?
|
Here is the log i get when i run llama model
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): [rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): [rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): /mnt/harddisk/miniconda3/envs/vllm_w4a8/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown |
How to reproduceJust execute code below. from time import time
from vllm import LLM, SamplingParams
import contextlib
import gc
import ray
import torch
from vllm.distributed import destroy_model_parallel
def cleanup():
destroy_model_parallel()
with contextlib.suppress(AssertionError):
torch.distributed.destroy_process_group()
gc.collect()
torch.cuda.empty_cache()
ray.shutdown()
# Common prefix.
prefix = (
"You are an expert school principal, skilled in effectively managing "
"faculty and staff. Draft 10-15 questions for a potential first grade "
"Head Teacher for my K-12, all-girls', independent school that emphasizes "
"community, joyful discovery, and life-long learning. The candidate is "
"coming in for a first-round panel interview for a 8th grade Math "
"teaching role. They have 5 years of previous teaching experience "
"as an assistant teacher at a co-ed, public school with experience "
"in middle school math teaching. Based on these information, fulfill "
"the following paragraph: ")
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
generating_prompts = [prefix + prompt for prompt in prompts]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0)
# Create an LLM.
regular_llm = LLM(model="facebook/opt-125m",
tensor_parallel_size=2,
gpu_memory_utilization=0.4)
print("Results without `enable_prefix_caching`")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
start_time_regular = time()
outputs = regular_llm.generate(generating_prompts, sampling_params)
duration_regular = time() - start_time_regular
regular_generated_texts = []
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
regular_generated_texts.append(generated_text)
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
print("-" * 80)
del regular_llm
cleanup()
prefix_cached_llm = LLM(model="facebook/opt-125m",
tensor_parallel_size=2,
enable_prefix_caching=True,
gpu_memory_utilization=0.4,
disable_custom_all_reduce=True)
# Warmup so that the shared prompt's KV cache is computed.
prefix_cached_llm.generate(generating_prompts[0], sampling_params)
# Generate with prefix caching.
start_time_cached = time()
outputs = prefix_cached_llm.generate(generating_prompts, sampling_params)
duration_cached = time() - start_time_cached
print("Results with `enable_prefix_caching`")
cached_generated_texts = []
# Print the outputs. You should see the same outputs as before.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
cached_generated_texts.append(generated_text)
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
print("-" * 80)
# Compare the results and display the speedup
generated_same = all([
regular_generated_texts[i] == cached_generated_texts[i]
for i in range(len(prompts))
])
print(f"Generated answers are the same: {generated_same}")
speedup = round(duration_regular / duration_cached, 2)
print(f"Speed up of cached generation compared to the regular is: {speedup}") Error log
|
+1 This happens to me also some times when running Llama3.170b on tp=2 Is there a good way to recover from this? |
@youkaichao did you find the source / a way to work around this? |
not yet :( |
This comment was marked as abuse.
This comment was marked as abuse.
+1 This happens to me also many times when running miniCPM 2.6 on vllm=0.5.5 |
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you! |
Your current environment
🐛 Describe the bug
We received quite a lot report about "Watchdog caught collective operation timeout", which is flaky and difficult to reproduce. It happens after running for some time.
To analyze the error, we need to collect enough stack trace. If you encounter a similar problem, please paste enough stack trace for us to debug.
Example: https://buildkite.com/vllm/ci-aws/builds/3548#01906e81-54c6-4713-beb7-d08f3c873200 caught one such error.
Please include the first line of error, together with the Python stack trace.
In the following example, it seems one process has illegal memory access. It dies, but the rest process is still in allreduce, and is waiting for it, causing the timeout problem. From the python level stack trace, it happens when we profile the run, and it seems to be related with moe layer.
The text was updated successfully, but these errors were encountered: