Getting SIGSEGV with llama backend #973

Closed
iamjackg opened this issue Aug 29, 2023 · 15 comments · Fixed by #2232
Labels
bug Something isn't working

Comments

@iamjackg

iamjackg commented Aug 29, 2023

LocalAI version:
v1.25.0

Environment, CPU architecture, OS, and Version:
Linux hostname 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  • 13th Gen Intel(R) Core(TM) i7-13700KF
  • 3070 Ti

Describe the bug
Trying to run any GGUF model with the llama backend results in a SIGSEGV as soon as the model starts to load (output in the Logs section below).

Note that the main binary of llama.cpp from LocalAI/go-llama/build/bin/ runs totally fine, e.g.

./main -t 6 --low-vram -m ~/gits/llama.cpp/models/phind-codellama-34b-v1.Q4_K_M.gguf --temp 0 -ngl 14 --color --rope-freq-base 1e6 -p $'# this python function determines whether an object is JSON-serializable or not, without using json.dumps\ndef is_json_serializable(thing):'

To Reproduce

Any request seems to trigger it. I tried with both codellama-13b-python.Q4_K_S.gguf and phind-codellama-34b-v1.Q4_K_M.gguf for good measure; both work when running llama.cpp directly. A sample request is sketched below.
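For completeness, this is roughly the kind of request I'm sending — a sketch assuming the default port 8080 and the OpenAI-compatible completions endpoint; any request that causes the model to load crashes the same way:

# sketch: any completion request against the llama backend segfaults as the model loads
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama-13b-python.Q4_K_S.gguf",
    "prompt": "def is_json_serializable(thing):",
    "temperature": 0
  }'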

Expected behavior

Logs

11:50PM DBG Loading GRPC Model llama: {backendString:llama model:codellama-13b-python.Q4_K_S.gguf threads:6 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0002dc000 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
11:50PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
11:50PM DBG GRPC Service for codellama-13b-python.Q4_K_S.gguf will be running at: '127.0.0.1:38915'
11:50PM DBG GRPC Service state dir: /tmp/go-processmanager2204529450
11:50PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38915: connect: connection refused"
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr 2023/08/28 23:50:18 gRPC Server listening at 127.0.0.1:38915
11:50PM DBG GRPC Service Ready
11:50PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:codellama-13b-python.Q4_K_S.gguf ContextSize:4096 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:14 MainGPU: TensorSplit: Threads:6 LibrarySearchPath: RopeFreqBase:1e+06 RopeFreqScale:1 RMSNormEps:0 NGQA:0 ModelFile:/home/jack/gits/llama.cpp/models/codellama-13b-python.Q4_K_S.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false}
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr create_gpt_params: loading model /home/jack/gits/llama.cpp/models/codellama-13b-python.Q4_K_S.gguf
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr SIGSEGV: segmentation violation
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr PC=0x7f5937be9fbd m=5 sigcode=1
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr signal arrived during cgo execution
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr 
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr goroutine 34 [syscall]:
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr runtime.cgocall(0x81cbd0, 0xc0001815f0)
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr 	/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc0001815c8 sp=0xc000181590 pc=0x4161cb
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr github.com/go-skynet/go-llama%2ecpp._Cfunc_load_model(0x7f58b4000b70, 0x1000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xe, 0x200, ...)
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr 	_cgo_gotypes.go:266 +0x4c fp=0xc0001815f0 sp=0xc0001815c8 pc=0x8131ac
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr github.com/go-skynet/go-llama%2ecpp.New({0xc0002a8000, 0x41}, {0xc000110700, 0x8, 0x926a60?})
11:50PM DBG GRPC(codellama-13b-python.Q4_K_S.gguf-127.0.0.1:38915): stderr 	/home/jack/gits/LocalAI/go-llama/llama.go:39 +0x3aa fp=0xc0001817f0 sp=0xc0001815f0 pc=0x813a6a
[...]
@iamjackg iamjackg added the bug Something isn't working label Aug 29, 2023
@coreywagehoft

I was going to submit an issue for this as well. I am getting the same error with similar CodeLlama models in GGUF format on version 1.25.0.

@racerxdl

Same here :(

@SVerkuil

I have similar problems with all GGUF files at the moment.

Below are some additional logs in case they are of help.

9:16AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:phind-codellama-34b-v2.Q5_K_M.gguf ContextSize:4096 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:true Embeddings:false NUMA:false NGPULayers:32 MainGPU: TensorSplit: Threads:14 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/phind-codellama-34b-v2.Q5_K_M.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false}
9:16AM DBG GRPC(phind-codellama-34b-v2.Q5_K_M.gguf-127.0.0.1:43215): stderr create_gpt_params: loading model /models/phind-codellama-34b-v2.Q5_K_M.gguf
9:16AM DBG GRPC(phind-codellama-34b-v2.Q5_K_M.gguf-127.0.0.1:43215): stderr SIGSEGV: segmentation violation
9:16AM DBG GRPC(phind-codellama-34b-v2.Q5_K_M.gguf-127.0.0.1:43215): stderr PC=0x7fc7452f8789 m=0 sigcode=128
9:16AM DBG GRPC(phind-codellama-34b-v2.Q5_K_M.gguf-127.0.0.1:43215): stderr signal arrived during cgo execution

@guidevops

Changing the backend from llama to llama-stable works for me.
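Roughly what I mean, as a sketch — the file name, model name, and exact fields here are just an example of my setup, not authoritative:

# sketch: point the model definition at the llama-stable backend instead of llama
cat > models/codellama.yaml <<'EOF'
name: codellama
backend: llama-stable
context_size: 4096
parameters:
  model: codellama-13b-python.Q4_K_S.gguf
EOF

Then request the model by its name field ("codellama") instead of the raw file name.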

@iamjackg
Author

The llama-stable backend doesn't support GGUF models though, does it?

@johndpope

I suspect this PR may fix things. llama.cpp needs a bump to work with GGUF, so the Go bindings would be behind. #977

@iamjackg
Author

iamjackg commented Aug 31, 2023

Everything had already been bumped for v1.25, which split the llama backend into llama (GGUF support) and llama-stable (GGML support). That PR is just an automated bump to the latest version, and is unrelated to this issue.

@kratosok

kratosok commented Sep 7, 2023

Your mileage may vary, but I ran into the SIGSEGV issue with the current (as of last night) Docker image. Building locally, or rebuilding the container image, has taken care of the problem in my case.
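Roughly the steps I used, as a sketch — this assumes a CUDA build; adjust BUILD_TYPE (or drop it for CPU-only) to match your hardware:

# sketch: rebuild LocalAI from source instead of using the prebuilt image
git clone https://github.com/go-skynet/LocalAI.git
cd LocalAI
make BUILD_TYPE=cublas build

# or rebuild the container image locally
docker build -t localai:local .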

@iamjackg
Author

iamjackg commented Sep 7, 2023

Unfortunately I'm also building locally.

@jadams

jadams commented Sep 8, 2023

Having the same problem here with wizardcoder-python-34b-v1.0.Q4_K_M.gguf.

I tried the prebuilt Docker tags v1.25.0 and master, as well as rebuilding locally.

Log snippet:

[...]
6:27PM DBG GRPC(wizardcoder-python-34b-v1.0.Q4_K_M.gguf-127.0.0.1:38501): stderr create_gpt_params: loading model /models/wizardcoder-python-34b-v1.0.Q4_K_M.gguf
6:27PM DBG GRPC(wizardcoder-python-34b-v1.0.Q4_K_M.gguf-127.0.0.1:38501): stderr SIGSEGV: segmentation violation
6:27PM DBG GRPC(wizardcoder-python-34b-v1.0.Q4_K_M.gguf-127.0.0.1:38501): stderr PC=0x7fc1bb667789 m=5 sigcode=1
6:27PM DBG GRPC(wizardcoder-python-34b-v1.0.Q4_K_M.gguf-127.0.0.1:38501): stderr signal arrived during cgo execution
6:27PM DBG GRPC(wizardcoder-python-34b-v1.0.Q4_K_M.gguf-127.0.0.1:38501): stderr
6:27PM DBG GRPC(wizardcoder-python-34b-v1.0.Q4_K_M.gguf-127.0.0.1:38501): stderr goroutine 35 [syscall]:
6:27PM DBG GRPC(wizardcoder-python-34b-v1.0.Q4_K_M.gguf-127.0.0.1:38501): stderr runtime.cgocall(0x820470, 0xc0002ad580)
6:27PM DBG GRPC(wizardcoder-python-34b-v1.0.Q4_K_M.gguf-127.0.0.1:38501): stderr 	/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc0002ad558 sp=0xc0002ad520 pc=0x417e6b
6:27PM DBG GRPC(wizardcoder-python-34b-v1.0.Q4_K_M.gguf-127.0.0.1:38501): stderr github.com/go-skynet/go-llama%2ecpp._Cfunc_load_model(0x7fc134000b60, 0x800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x200, ...)
[...]

llama.cpp works perfectly fine when run directly:

/build/go-llama/build/bin/main -t 10 -ngl 32 -m /models/wizardcoder-python-34b-v1.0.Q4_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"

@sandros94

sandros94 commented Sep 9, 2023

@jadams could you share the output of your nvidia-smi and nvcc --version?

I'm facing this issue too, but I'm getting a segmentation fault even when using llama.cpp directly.

While checking nvidia-smi and nvcc --version, I noticed that the former reports CUDA 12.2 while the latter is using 12.1. I'm almost sure this shouldn't be the problem, but I'm starting to consider everything.

I'm testing the v1.25.0 and master-cuda12 containerized images on a Windows machine with a GTX 1660 Ti.

For reference, after rebuilding I tested with:

/build/go-llama/build/bin/main -t 8 -ngl 1 -lv -m /models/orca_mini_v3_7b.Q6_K.gguf --color -c 512 --temp 0.7 -p "### Instruction: Write a story about llamas\n### Response:"

@jadams

jadams commented Sep 9, 2023

nvidia-smi:

user@machine:~$ kubectl exec -it localai-local-ai-5c745dfdbd-7n7ls -- nvidia-smi
Defaulted container "localai-local-ai" out of: localai-local-ai, download-model (init)
Sat Sep  9 15:33:00 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100S-PCIE-32GB          On  | 00000000:B0:00.0 Off |                    0 |
| N/A   24C    P0              22W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

nvcc --version

user@machine:~$ kubectl exec -it localai-local-ai-5c745dfdbd-7n7ls -- nvcc --version
Defaulted container "localai-local-ai" out of: localai-local-ai, download-model (init)
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Seems to be the same as you: CUDA 12.2 from nvidia-smi and CUDA 12.1 from nvcc.

@flotos

flotos commented Sep 12, 2023

I am getting the same issue. Orca-mini 3B works but other models don't; I'm doing inference on CPU only.

@Llamatron2112

My nvcc and nvidia-smi CUDA versions match, but I get similar SIGSEGV output when trying to load a GGUF model.

@Dbone29

Dbone29 commented Oct 17, 2023

I think the problem was related to the GGUF v2 format. Can you test it with LocalAI 1.30? Maybe it is fixed.
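If it helps, something along these lines should pull it — the image name and tag are my assumption of the usual scheme, so check the releases page for the exact tag:

# sketch: pull the v1.30.0 image to re-test (tag assumed, verify before use)
docker pull quay.io/go-skynet/local-ai:v1.30.0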
