
[Bug]: Segmentation fault (core dumped) #8321

Closed
1 task done
LIUKAI0815 opened this issue Sep 10, 2024 · 8 comments · Fixed by #8357
Labels
bug Something isn't working

Comments

@LIUKAI0815

Your current environment

import os
os.environ["CUDA_VISIBLE_DEVICES"]="4,5,6,7" # 4090*4

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
import time
import uvicorn
from fastapi import FastAPI, Body
from pydantic import BaseModel
import asyncio

apps = FastAPI()

path = "/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf"
sampling_params = SamplingParams(temperature=1.0, repetition_penalty=1.0, max_tokens=512)

# Create an LLM.
llm = LLM(
    model=path,
    tokenizer="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/",
    trust_remote_code=True,
    gpu_memory_utilization=0.8,
    tensor_parallel_size=4,
    enforce_eager=True,
    disable_custom_all_reduce=True,
)

🐛 Describe the bug

WARNING 09-10 15:07:35 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 72 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-10 15:07:35 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 utils.py:977] Found nccl from library libnccl.so.2
INFO 09-10 15:07:35 utils.py:977] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:35 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-10 15:07:35 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 09-10 15:07:35 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fed8f24fee0>, local_subscribe_port=38065, remote_subscribe_port=None)
INFO 09-10 15:07:36 model_runner.py:915] Starting to load model /workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf...
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:36 model_runner.py:915] Starting to load model /workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf...
INFO 09-10 15:07:58 model_runner.py:926] Loading model weights took 15.6715 GB
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:59 model_runner.py:926] Loading model weights took 15.6715 GB
/root/miniconda3/envs/vllm/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
段错误 (核心已转储)  [Segmentation fault (core dumped)]

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
LIUKAI0815 added the bug label on Sep 10, 2024
@LIUKAI0815 (Author)

vllm 0.6.0

@DarkLight1337 (Member)

Does this segmentation fault occur when disabling tensor parallel?
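For reference, a minimal sketch of such a single-GPU run (not part of the original thread), reusing the model and tokenizer paths from the report and exposing only one of the four 4090s:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4"  # assumption: any single GPU from the original set

from vllm import LLM, SamplingParams

# tensor_parallel_size=1 disables tensor parallelism entirely.
llm = LLM(
    model="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf",
    tokenizer="/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/",
    trust_remote_code=True,
    gpu_memory_utilization=0.8,
    tensor_parallel_size=1,
    enforce_eager=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))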

@DarkLight1337 (Member)

cc @Isotr0py since it may be related to GGUF loading

@LIUKAI0815 (Author)

Does this segmentation fault occur when disabling tensor parallel?

[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 13.75 MiB is free. Process 2509972 has 23.63 GiB memory in use. Of the allocated memory 23.05 GiB is allocated by PyTorch, and 145.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
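For reference, the allocator setting mentioned in that traceback is a PyTorch environment variable and has to be in place before CUDA is initialized; a sketch of setting it (noting that it only mitigates fragmentation and cannot make a model fit that is simply larger than a single 24 GiB GPU):

# Either in the shell before launching:
#   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python your_script.py
# or at the very top of the script, before torch/vllm are imported:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"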

@DarkLight1337 (Member)

Looks like the model is too big to load inside 1 GPU. Is there a smaller version that is easier to test with?

@LIUKAI0815 (Author)

Mistral-Large-Instruct-2407-IQ1_M.gguf is the 1-bit quantization; it is already the smallest version.

@Isotr0py (Collaborator) commented Sep 10, 2024

It seems the model was loaded onto the GPUs successfully:

INFO 09-10 15:07:58 model_runner.py:926] Loading model weights took 15.6715 GB
(VllmWorkerProcess pid=3532163) INFO 09-10 15:07:59 model_runner.py:926] Loading model weights took 15.6715 GB

Perhaps it's instead related to problematic model forwarding caused by the GGUF config extraction (maybe from a kernel call like rotary_embedding or paged_attention).

@Isotr0py (Collaborator)

Oh, it's because the GGUF kernel we ported is out of date and doesn't include the IQ1_M implementation. I will add it back.
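As a side note, a quick way to confirm which quantization types a GGUF file actually contains is the gguf Python package (a sketch, assuming the package is installed and using the path from the report):

from collections import Counter
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("/workspace/model/llm/Mistral/Mistral-Large-Instruct-2407/Mistral-Large-Instruct-2407-IQ1_M.gguf")
# Tally tensors per quantization type; an IQ1_M file will list IQ1_M entries here.
print(Counter(t.tensor_type.name for t in reader.tensors))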
