[Bug]: Segmentation fault (core dumped) #8321
Comments
vLLM 0.6.0
Does this segmentation fault occur when disabling tensor parallel?
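For reference, a minimal sketch (not from the original report) for checking this on a single GPU, reusing the model and tokenizer paths from the report below:

```python
# Sketch: check whether the crash still reproduces with tensor parallelism disabled.
# Paths are the ones from the report; adjust to your environment.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf",
    tokenizer="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/",
    tensor_parallel_size=1,      # single GPU, no tensor parallel
    gpu_memory_utilization=0.8,
    enforce_eager=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```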
cc @Isotr0py since it may be related to GGUF loading
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 13.75 MiB is free. Process 2509972 has 23.63 GiB memory in use. Of the allocated memory 23.05 GiB is allocated by PyTorch, and 145.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
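As a side note, the allocator setting suggested in that traceback can be applied like this (a sketch; the variable has to be set before the first CUDA allocation, e.g. at the top of the script or in the shell):

```python
# Sketch: enable expandable segments, as suggested by the PyTorch OOM message.
# This only reduces fragmentation; it does not create more GPU memory.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```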
Looks like the model is too big to load on a single GPU. Is there a smaller version that is easier to test with?
Mistral-Large-Instruct-2407-IQ1_M.gguf is 1-bit; it is already the smallest version.
It seems that the model has been loaded onto the GPU successfully. Perhaps it is related to problematic model forwarding due to GGUF config extraction instead (maybe caused by calling a kernel like …).
Oh, it's because the GGUF kernel we ported is out of date and didn't include …
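For anyone who wants to confirm which quantization types the file actually contains, here is a hedged sketch using the `gguf` Python package (an assumption; it is not part of the original report) to list them, so they can be compared against what the ported vLLM GGUF kernels support:

```python
# Sketch: list the quantization types used by the GGUF file.
from collections import Counter

from gguf import GGUFReader  # pip install gguf

path = "/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf"
reader = GGUFReader(path)

quant_types = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, count in quant_types.items():
    print(f"{qtype}: {count} tensors")
```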
Your current environment
import os
os.environ["CUDA_VISIBLE_DEVICES"]="4,5,6,7" # 4090*4
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import time
import uvicorn
from fastapi import FastAPI,Body
from pydantic import BaseModel
import asyncio
apps = FastAPI()
path = "/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf"
sampling_params = SamplingParams(temperature=1.0,repetition_penalty=1.0,max_tokens=512)
# Create an LLM.
llm = LLM(model=path,tokenizer="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/",trust_remote_code=True,gpu_memory_utilization=0.8,tensor_parallel_size=4,enforce_eager=True,disable_custom_all_reduce=True)
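The script above only constructs the LLM; a minimal follow-up call to actually run generation (a sketch reusing the `llm` and `sampling_params` defined above) would look like this:

```python
# Sketch: issue one request so the model runs a forward pass.
outputs = llm.generate(["[INST] Hello, who are you? [/INST]"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```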
🐛 Describe the bug
WARNING 09-10 15:07:35 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 72 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-10 15:07:35 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 utils.py:977] Found nccl from library libnccl.so.2
INFO 09-10 15:07:35 utils.py:977] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-10 15:07:35 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-10 15:07:35 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fed8f24fee0>, local_subscribe_port=38065, remote_subscribe_port=None)
INFO 09-10 15:07:36 model_runner.py:915] Starting to load model /workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf...
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:36 model_runner.py:915] Starting to load model /workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf...
INFO 09-10 15:07:58 model_runner.py:926] Loading model weights took 15.6715 GB
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:59 model_runner.py:926] Loading model weights took 15.6715 GB
/root/miniconda3/envs/vllm/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
段错误 (核心已转储) [Segmentation fault (core dumped)]