
Eval bug: HIP gfx908 (MI100) cuBLAS error when the prompt is too long #15845

@narikm

Description

An error occurs when the prompt is longer than a few tokens. Launch arguments:

#!/bin/bash
export LD_LIBRARY_PATH=/home/tug/Desktop/bin/llama.cpp/build:$LD_LIBRARY_PATH
export HIP_VISIBLE_DEVICES=0

cd "/home/tug/Desktop/bin/llama.cpp/build"

numactl -N 0 -m 0 \
  ./llama-server \
  --n-gpu-layers 99 \
  --threads 40 \
  --threads-batch 40 \
  --ctx-size 35000 \
  --batch-size 2048 \
  -ub 510 \
  --override-tensor exps=CPU \
  --host 0.0.0.0 \
  --port 8080 \
  -fa on \
  --jinja \
  --model "/media/tug/AI NVMe/MODELS/DeepSeek-V3.1-Q4_0/DeepSeek-V3.1-Q4_0-00001-of-00008.gguf"

read -p "Press ENTER to close..."

Name and Version

llama-server b6399
version: 0 (unknown)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

HIP

Hardware

MI100 (gfx908), 2x Xeon 2300

Models

DeepSeek-V3.1-Q4_0

Problem description & steps to reproduce

Works with a small user prompt but crashes when the prompt is longer.
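
For reproduction, any request whose prompt is more than a handful of tokens appears to be enough to reach the crashing batched-GEMM path. The snippet below is only a sketch: it assumes the server's OpenAI-compatible /v1/chat/completions endpoint on the host/port from the launch script above, and the padded prompt text is a placeholder rather than the prompt from the original run.

#!/bin/bash
# Hypothetical reproduction request (not from the original report): pad the user
# message to a few hundred tokens so prompt processing goes through the batched
# cuBLAS/hipBLAS path that aborts on gfx908.
LONG_TEXT=$(printf 'Please summarize this sentence one more time. %.0s' {1..200})

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
        \"messages\": [{\"role\": \"user\", \"content\": \"${LONG_TEXT}\"}],
        \"max_tokens\": 32
      }"

With a one-line prompt the same request completes normally, which matches the behaviour described above.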

First Bad Commit

No response

Relevant log output

slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 35008, n_keep = 0, n_prompt_tokens = 417
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 417, n_tokens = 417, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 417, n_tokens = 417
/shared/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:87: ROCm error
ROCm error: CUBLAS_STATUS_NOT_SUPPORTED
current device: 0, in function ggml_cuda_op_mul_mat_cublas at /shared/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1302
hipblasGemmEx(ctx.cublas_handle(id), HIPBLAS_OP_T, HIPBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ptr, HIPBLAS_R_16F, ne00, src1_ptr, HIPBLAS_R_16F, ne10, &beta, dst_dd_i, HIPBLAS_R_32F, ldc, HIPBLAS_R_32F, HIPBLAS_GEMM_DEFAULT)
[New LWP 947059]
[New LWP 947058]
[... remaining "New LWP" lines for the other worker threads omitted ...]
[New LWP 945406]

This GDB supports auto-downloading debuginfo from the following URLs:
https://debuginfod.ubuntu.com
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/liblber.so.2
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libbrotlidec.so.1
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libbrotlicommon.so.1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000079a095f107e3 in __GI___wait4 (pid=950067, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x000079a095f107e3 in __GI___wait4 (pid=950067, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x000079a0965715f3 in ggml_print_backtrace () from libggml-base.so
#2 0x000079a09657179b in ggml_abort () from libggml-base.so
#3 0x000079a09153ad62 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from libggml-hip.so
#4 0x000079a091549b95 in ggml_cuda_op_mul_mat_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, ihipStream_t*) () from libggml-hip.so
#5 0x000079a091547fea in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, ihipStream_t*), void (*)(float const*, int const*, void*, ggml_type, long, long, long, long, long, long, long, long, ihipStream_t*)) () from libggml-hip.so
#6 0x000079a091542d86 in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) () from libggml-hip.so
#7 0x000079a091540bad in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from libggml-hip.so
#8 0x000079a09658be07 in ggml_backend_sched_graph_compute_async () from libggml-base.so
#9 0x000079a09669e591 in llama_context::graph_compute(ggml_cgraph*, bool) () from libllama.so
#10 0x000079a09669f994 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from libllama.so
#11 0x000079a0966a5c6d in llama_context::decode(llama_batch const&) () from libllama.so
#12 0x000079a0966a6baf in llama_decode () from libllama.so
#13 0x000059852106b2a2 in server_context::update_slots() ()
#14 0x00005985210317ac in server_queue::start_loop() ()
#15 0x0000598520ff545b in main ()
[Inferior 1 (process 945365) detached]
/home/tug/Desktop/R1V3HIP.sh: line 20: 945365 Aborted (core dumped) numactl -N 0 -m 0 ./llama-server --n-gpu-layers 99 --threads 40 --threads-batch 40 --ctx-size 35000 --batch-size 2048 -ub 510 --override-tensor exps=CPU --host 0.0.0.0 --port 8080 -fa off --jinja --model "/media/tug/AI NVMe/MODELS/DeepSeek-V3.1-Q4_0/DeepSeek-V3.1-Q4_0-00001-of-00008.gguf"

Compiled with:
cmake -S . -B build \
  -DGGML_HIP=ON \
  -DGPU_TARGETS="gfx908" \
  -DCMAKE_BUILD_TYPE=Release \
  -Dhipblas_DIR=$HIPBLAS_DIR \
  -DLLAMA_CURL=OFF

cmake --build build --config Release -- -j16
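
If more detail is needed to pin down which GEMM configuration rocBLAS rejects with CUBLAS_STATUS_NOT_SUPPORTED, the failing run can be repeated with rocBLAS and HIP logging turned on. This is only a diagnostic sketch using the standard rocBLAS/HIP environment variables (ROCBLAS_LAYER, ROCBLAS_LOG_TRACE_PATH, AMD_LOG_LEVEL); it is not taken from the original report, and the abbreviated server flags below stand in for the full launch command above.

#!/bin/bash
# Hedged diagnostic sketch: enable rocBLAS trace logging (ROCBLAS_LAYER is a
# bitmask; 1 = trace) and verbose HIP runtime logging before launching the
# server, so the GEMM call that fails is recorded with its sizes and types.
export ROCBLAS_LAYER=1
export ROCBLAS_LOG_TRACE_PATH=/tmp/rocblas_trace.log
export AMD_LOG_LEVEL=3   # verbose HIP runtime logging

cd /home/tug/Desktop/bin/llama.cpp/build
./llama-server --n-gpu-layers 99 --ctx-size 35000 -fa on --jinja \
  --model "/media/tug/AI NVMe/MODELS/DeepSeek-V3.1-Q4_0/DeepSeek-V3.1-Q4_0-00001-of-00008.gguf"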

Metadata

Labels

AMD GPU (Issues specific to AMD GPUs), bug (Something isn't working)
