
[Bug]: Segmentation fault (core dumped) #8321

Closed
1 task done
LIUKAI0815 opened this issue Sep 10, 2024 · 8 comments · Fixed by #8357
Labels
bug Something isn't working

Comments

@LIUKAI0815

Your current environment

import os
os.environ["CUDA_VISIBLE_DEVICES"]="4,5,6,7" # 4090*4

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import time
import uvicorn
from fastapi import FastAPI, Body
from pydantic import BaseModel
import asyncio

apps = FastAPI()

path = "/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf"
sampling_params = SamplingParams(temperature=1.0, repetition_penalty=1.0, max_tokens=512)

# Create an LLM.
llm = LLM(
    model=path,
    tokenizer="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/",
    trust_remote_code=True,
    gpu_memory_utilization=0.8,
    tensor_parallel_size=4,
    enforce_eager=True,
    disable_custom_all_reduce=True,
)

🐛 Describe the bug

WARNING 09-10 15:07:35 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 72 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-10 15:07:35 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 utils.py:977] Found nccl from library libnccl.so.2
INFO 09-10 15:07:35 utils.py:977] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-10 15:07:35 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-10 15:07:35 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fed8f24fee0>, local_subscribe_port=38065, remote_subscribe_port=None)
INFO 09-10 15:07:36 model_runner.py:915] Starting to load model /workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf...
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:36 model_runner.py:915] Starting to load model /workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf...
INFO 09-10 15:07:58 model_runner.py:926] Loading model weights took 15.6715 GB
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:59 model_runner.py:926] Loading model weights took 15.6715 GB
/root/miniconda3/envs/vllm/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
段错误 (核心已转储)  [Segmentation fault (core dumped)]

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
LIUKAI0815 added the bug label on Sep 10, 2024
@LIUKAI0815 (Author)

vllm 0.6.0

@DarkLight1337 (Member)

Does this segmentation fault occur when disabling tensor parallel?
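For reference, a minimal sketch of such a single-GPU run (not part of the original thread), reusing the model and tokenizer paths from the report and exposing only one of the four 4090s:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4"  # assumption: any single GPU from the original set

from vllm import LLM, SamplingParams

# tensor_parallel_size=1 disables tensor parallelism entirely.
llm = LLM(
    model="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf",
    tokenizer="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/",
    trust_remote_code=True,
    gpu_memory_utilization=0.8,
    tensor_parallel_size=1,
    enforce_eager=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))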

@DarkLight1337 (Member)

cc @Isotr0py since it may be related to GGUF loading

@LIUKAI0815 (Author)

Does this segmentation fault occur when disabling tensor parallel?

[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 13.75 MiB is free. Process 2509972 has 23.63 GiB memory in use. Of the allocated memory 23.05 GiB is allocated by PyTorch, and 145.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
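For reference, the allocator setting mentioned in that traceback is a PyTorch environment variable and has to be in place before CUDA is initialized; a sketch of setting it (noting that it only mitigates fragmentation and cannot make a model fit that is simply larger than a single 24 GiB GPU):

# Either in the shell before launching:
#   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python your_script.py
# or at the very top of the script, before torch/vllm are imported:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"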

@DarkLight1337 (Member)

Looks like the model is too big to load inside 1 GPU. Is there a smaller version that is easier to test with?

@LIUKAI0815 (Author)

Mistral-Large-Instruct-2407-IQ1_M.gguf is the 1-bit quantization; it is already the smallest version.

@Isotr0py (Collaborator) commented Sep 10, 2024

It seems the model was loaded onto the GPUs successfully:

INFO 09-10 15:07:58 model_runner.py:926] Loading model weights took 15.6715 GB
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:59 model_runner.py:926] Loading model weights took 15.6715 GB

Perhaps it's instead related to problematic model forwarding caused by the GGUF config extraction (maybe from a kernel call like rotary_embedding or paged_attention).

@Isotr0py (Collaborator)

Oh, it's because the GGUF kernel we ported is out of date and doesn't include the IQ1_M implementation. I will add it back.
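As a side note, a quick way to confirm which quantization types a GGUF file actually contains is the gguf Python package (a sketch, assuming the package is installed and using the path from the report):

from collections import Counter
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf")
# Tally tensors per quantization type; an IQ1_M file will list IQ1_M entries here.
print(Counter(t.tensor_type.name for t in reader.tensors))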
