cuBLAS fails on same model - terminate called after throwing an instance of 'std::runtime_error' what(): unexpectedly reached end of file #1620

dynamite9999 · 2023-05-28T05:25:42Z

Hello, I looked to see whether anyone else had this issue, but the closest was #1596, and that seems different from my situation.

Crashes only for GPU cuBLAS

make LLAMA_CUBLAS=1
Makefile edited NVCCFLAGS = --forward-unknown-to-host-compiler --gpu-architecture=sm_86
./main -m ./models/ggml-vicuna-13b-1.1-q4_1.bin -p "Building a website can be done in 10 simple steps:" -n 512
main: build = 0 (unknown)
main: seed = 1685251133
llama.cpp: loading model from ./models/ggml-vicuna-13b-1.1-q4_1.bin
terminate called after throwing an instance of 'std::runtime_error'
what(): unexpectedly reached end of file
Aborted (core dumped)
root@netaisyslog:~/app/app/netai_llm#

With everything else the same, after make clean and a plain make (no GPU), it works fine.

Any guidance on how to figure out what is going on here?

thanks

Background

nvidia-smi

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@netaisyslog:~/app/app/netai_llm#

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
root@netaisyslog:~/app/app/netai_llm#

Makefile

cat Makefile

# Define the default target now so that it is always the first target

BUILD_TARGETS = main quantize quantize-stats perplexity embedding vdot

ifdef LLAMA_BUILD_SERVER
BUILD_TARGETS += server
endif

default: $(BUILD_TARGETS)

ifndef UNAME_S
UNAME_S := $(shell uname -s)
endif

ifndef UNAME_P
UNAME_P := $(shell uname -p)
endif

ifndef UNAME_M
UNAME_M := $(shell uname -m)
endif

CCV := $(shell $(CC) --version | head -n 1)
CXXV := $(shell $(CXX) --version | head -n 1)

# Mac OS + Arm can report x86_64
# ref: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789

ifeq ($(UNAME_S),Darwin)
ifneq ($(UNAME_P),arm)
SYSCTL_M := $(shell sysctl -n hw.optional.arm64 2>/dev/null)
ifeq ($(SYSCTL_M),1)
# UNAME_P := arm
# UNAME_M := arm64
warn := $(warning Your arch is announced as x86_64, but it seems to actually be ARM64. Not fixing that can lead to bad performance. For more info see: https://github.com/ggerganov/whisper.cpp/issues/66\#issuecomment-1282546789)
endif
endif
endif

# Compile flags

# keep standard at C11 and C++11

CFLAGS = -I. -O3 -std=c11 -fPIC
CXXFLAGS = -I. -I./examples -O3 -std=c++11 -fPIC
LDFLAGS =

ifndef LLAMA_DEBUG
CFLAGS += -DNDEBUG
CXXFLAGS += -DNDEBUG
endif

# warnings

CFLAGS += -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith
CXXFLAGS += -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar

# OS specific
# TODO: support Windows

ifeq ($(UNAME_S),Linux)
CFLAGS += -pthread
CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),Darwin)
CFLAGS += -pthread
CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),FreeBSD)
CFLAGS += -pthread
CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),NetBSD)
CFLAGS += -pthread
CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),OpenBSD)
CFLAGS += -pthread
CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),Haiku)
CFLAGS += -pthread
CXXFLAGS += -pthread
endif

ifdef LLAMA_GPROF
CFLAGS += -pg
CXXFLAGS += -pg
endif
ifdef LLAMA_PERF
CFLAGS += -DGGML_PERF
CXXFLAGS += -DGGML_PERF
endif

# Architecture specific
# TODO: probably these flags need to be tweaked on some architectures
#       feel free to update the Makefile for your architecture and send a pull request or issue

ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
# Use all CPU extensions that are available:
CFLAGS += -march=native -mtune=native
CXXFLAGS += -march=native -mtune=native

# Usage AVX-only
#CFLAGS   += -mfma -mf16c -mavx
#CXXFLAGS += -mfma -mf16c -mavx

endif
ifneq ($(filter ppc64%,$(UNAME_M)),)
POWER9_M := $(shell grep "POWER9" /proc/cpuinfo)
ifneq (,$(findstring POWER9,$(POWER9_M)))
CFLAGS += -mcpu=power9
CXXFLAGS += -mcpu=power9
endif
# Require c++23's std::byteswap for big-endian support.
ifeq ($(UNAME_M),ppc64)
CXXFLAGS += -std=c++23 -DGGML_BIG_ENDIAN
endif
endif
ifndef LLAMA_NO_ACCELERATE
# Mac M1 - include Accelerate framework.
# -framework Accelerate works on Mac Intel as well, with negligible performance boost (as of the predict time).
ifeq ($(UNAME_S),Darwin)
CFLAGS += -DGGML_USE_ACCELERATE
LDFLAGS += -framework Accelerate
endif
endif
ifdef LLAMA_OPENBLAS
CFLAGS += -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -I/usr/include/openblas
ifneq ($(shell grep -e "Arch Linux" -e "ID_LIKE=arch" /etc/os-release 2>/dev/null),)
LDFLAGS += -lopenblas -lcblas
else
LDFLAGS += -lopenblas
endif
endif
ifdef LLAMA_BLIS
CFLAGS += -DGGML_USE_OPENBLAS -I/usr/local/include/blis -I/usr/include/blis
LDFLAGS += -lblis -L/usr/local/lib
endif
ifdef LLAMA_CUBLAS
CFLAGS += -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I$(CUDA_PATH)/targets/x86_64-linux/include
CXXFLAGS += -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I$(CUDA_PATH)/targets/x86_64-linux/include
LDFLAGS += -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L$(CUDA_PATH)/targets/x86_64-linux/lib
OBJS += ggml-cuda.o
NVCC = nvcc

# NVCCFLAGS = --forward-unknown-to-host-compiler -arch=native

NVCCFLAGS = --forward-unknown-to-host-compiler  --gpu-architecture=sm_86

ifdef LLAMA_CUDA_DMMV_X
NVCCFLAGS += -DGGML_CUDA_DMMV_X=$(LLAMA_CUDA_DMMV_X)
else
NVCCFLAGS += -DGGML_CUDA_DMMV_X=32
endif # LLAMA_CUDA_DMMV_X
ifdef LLAMA_CUDA_DMMV_Y
NVCCFLAGS += -DGGML_CUDA_DMMV_Y=$(LLAMA_CUDA_DMMV_Y)
else
NVCCFLAGS += -DGGML_CUDA_DMMV_Y=1
endif # LLAMA_CUDA_DMMV_Y
ggml-cuda.o: ggml-cuda.cu ggml-cuda.h
$(NVCC) $(NVCCFLAGS) $(CXXFLAGS) -Wno-pedantic -c $< -o $@
endif # LLAMA_CUBLAS
ifdef LLAMA_CLBLAST
CFLAGS += -DGGML_USE_CLBLAST
CXXFLAGS += -DGGML_USE_CLBLAST
# Mac provides OpenCL as a framework
ifeq ($(UNAME_S),Darwin)
LDFLAGS += -lclblast -framework OpenCL
else
LDFLAGS += -lclblast -lOpenCL
endif
OBJS += ggml-opencl.o
ggml-opencl.o: ggml-opencl.cpp ggml-opencl.h
$(CXX) $(CXXFLAGS) -c $< -o $@
endif
ifneq ($(filter aarch64%,$(UNAME_M)),)
# Apple M1, M2, etc.
# Raspberry Pi 3, 4, Zero 2 (64-bit)
CFLAGS += -mcpu=native
CXXFLAGS += -mcpu=native
endif
ifneq ($(filter armv6%,$(UNAME_M)),)
# Raspberry Pi 1, Zero
CFLAGS += -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access
endif
ifneq ($(filter armv7%,$(UNAME_M)),)
# Raspberry Pi 2
CFLAGS += -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations
endif
ifneq ($(filter armv8%,$(UNAME_M)),)
# Raspberry Pi 3, 4, Zero 2 (32-bit)
CFLAGS += -mfp16-format=ieee -mno-unaligned-access
endif

# Print build information

$(info I llama.cpp build info: )
$(info I UNAME_S: $(UNAME_S))
$(info I UNAME_P: $(UNAME_P))
$(info I UNAME_M: $(UNAME_M))
$(info I CFLAGS: $(CFLAGS))
$(info I CXXFLAGS: $(CXXFLAGS))
$(info I LDFLAGS: $(LDFLAGS))
$(info I CC: $(CCV))
$(info I CXX: $(CXXV))
$(info )

# Build library

ggml.o: ggml.c ggml.h ggml-cuda.h
$(CC) $(CFLAGS) -c $< -o $@

llama.o: llama.cpp ggml.h ggml-cuda.h llama.h llama-util.h
$(CXX) $(CXXFLAGS) -c $< -o $@

common.o: examples/common.cpp examples/common.h
$(CXX) $(CXXFLAGS) -c $< -o $@

libllama.so: llama.o ggml.o $(OBJS)
$(CXX) $(CXXFLAGS) -shared -fPIC -o $@ $^ $(LDFLAGS)

clean:
rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server vdot build-info.h

# Examples

main: examples/main/main.cpp build-info.h ggml.o llama.o common.o $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)
@echo
@echo '==== Run ./main -h for help. ===='
@echo

quantize: examples/quantize/quantize.cpp build-info.h ggml.o llama.o $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

quantize-stats: examples/quantize-stats/quantize-stats.cpp build-info.h ggml.o llama.o $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

perplexity: examples/perplexity/perplexity.cpp build-info.h ggml.o llama.o common.o $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

embedding: examples/embedding/embedding.cpp build-info.h ggml.o llama.o common.o $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

save-load-state: examples/save-load-state/save-load-state.cpp build-info.h ggml.o llama.o common.o $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

server: examples/server/server.cpp examples/server/httplib.h examples/server/json.hpp build-info.h ggml.o llama.o common.o $(OBJS)
$(CXX) $(CXXFLAGS) -Iexamples/server $(filter-out %.h,$(filter-out %.hpp,$^)) -o $@ $(LDFLAGS)

build-info.h: $(wildcard .git/index) scripts/build-info.sh
@sh scripts/build-info.sh > $@.tmp
@if ! cmp -s $@.tmp $@; then \
mv $@.tmp $@; \
else \
rm $@.tmp; \
fi

# Tests

benchmark-matmult: examples/benchmark/benchmark-matmult.cpp build-info.h ggml.o $(OBJS)
$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)
./$@

vdot: pocs/vdot/vdot.cpp ggml.o $(OBJS)
$(CXX) $(CXXFLAGS) $^ -o $@ $(LDFLAGS)

.PHONY: tests clean
tests:
bash ./tests/run-tests.sh
root@netaisyslog:~/app/app/netai_llm#


KerfuffleV2 · 2023-05-28T13:42:11Z

If you got the model from a source (the one that starts with an H), the first thing to do is check whether the SHA256 matches the file you downloaded. The file you have may be truncated or corrupt.

llama.cpp mmaps models by default, which I think is probably more tolerant of something like an incomplete model. I bet if you run without GPU and with --no-mmap you'll get an error.
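For anyone who wants to script the checksum comparison suggested above, here is a minimal sketch in Python. The helper names `sha256_of_file` and `matches_published_checksum` are made up for illustration, and the expected digest is assumed to be whatever the download page publishes for the model file.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`.

    The file is read in chunks so that a multi-gigabyte model
    does not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected_hex):
    """Compare the file's digest with a published hex digest (hypothetical helper)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A mismatch would point to a truncated or corrupt download rather than a problem in the cuBLAS build.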

cmp-nct · 2023-05-28T14:33:06Z

I have my doubts about the error report. I'd like to see a freshly compiled CPU version that "works fine" with your model.
There have been format changes in 4_1 as well as in 8_0, so my guess is that you are using an old model binary that you tested with the old version. Now that you have recompiled for the GPU version, it no longer works because the model is not compatible, and the magic changes needed to warn or abort are not implemented.
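One way to check which format a model file claims to be is to read its header directly. The sketch below is not from llama.cpp itself. It assumes the file starts with a little-endian 32-bit magic, followed by a 32-bit version for the versioned formats, and the magic values in the table are recalled from the llama.cpp source of that period, so treat them as assumptions. `read_model_header` is a hypothetical helper name.

```python
import struct

# Assumed magic values (little-endian uint32 at offset 0), recalled from
# llama.cpp sources around May 2023; verify against your checkout.
KNOWN_MAGICS = {
    0x67676D6C: "ggml (unversioned)",
    0x67676D66: "ggmf",
    0x67676A74: "ggjt",
}


def read_model_header(path):
    """Return (format_name, version) read from the first 8 bytes of the file.

    `version` is None for unversioned or unrecognized files.
    """
    with open(path, "rb") as f:
        raw = f.read(8)
    if len(raw) < 4:
        raise ValueError("file too short to contain a magic number")
    (magic,) = struct.unpack("<I", raw[:4])
    name = KNOWN_MAGICS.get(magic, "unknown")
    version = None
    # Only the versioned formats carry a version field after the magic.
    if name in ("ggmf", "ggjt") and len(raw) == 8:
        (version,) = struct.unpack("<I", raw[4:8])
    return name, version
```

If the reported version is lower than what the freshly built binary expects, the model most likely predates the quantization format change and needs to be re-converted or re-downloaded.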

asctime · 2023-05-31T21:29:00Z

I also had this error and resolved it with the help of CRD716 on Discord. Essentially, the model is too old. Yes, I mean you from six weeks ago.

aiaicode · 2023-06-02T08:06:42Z

I get the same error even on a CPU. I checked out earlier code from April and it works fine, but the new code does not. @asctime which ggml model should we use then? Does q4_0 not work? In that case, the README front page needs to change.

asctime · 2023-06-02T10:40:58Z

@aiaicode:

CRD716, 05/31/2023 9:31 AM:
Yeah, there was a breaking change to models recently, try a new one.

The latest 7B from IlyaGusev seems to parse OK, but I haven't had time to test the training.

github-actions · 2024-04-09T01:08:47Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot added the stale label Mar 25, 2024

github-actions bot closed this as completed Apr 9, 2024

Bearsaerker mentioned this issue Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Open
