Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from /bank/llama.cpp/libggml-cuda.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
load_backend: loaded ROCm backend from /bank/llama.cpp/libggml-hip.so
load_backend: loaded CPU backend from /bank/llama.cpp/libggml-cpu-haswell.so
version: 6853 (945501f)
built with cc (Debian 14.2.0-19) 14.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
HIP
Hardware
CPU: Threadripper 5975wx
GPU 1: RTX 5090
GPU 2: AMD Instinct MI50 32GB (gfx906)
Models
Unsloth's gpt-oss-120b-UD-Q6_K_XL
Problem description & steps to reproduce
Running the model with -sm row crashes at warmup. The crash does not reproduce with -sm layer, nor when the model runs on the gfx906 alone.
Steps to reproduce:
- Compiled with
cmake -B build \
    -DGGML_NATIVE=ON \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_USE_GRAPHS=ON \
    -DGGML_CUBLAS=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DGGML_CUDA_F16=ON \
    -DGGML_BACKEND_DL=ON \
    -DGGML_HIP=ON \
    -DGGML_HIP_GRAPHS=ON \
    -DGPU_TARGETS=gfx906 \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CPU_ALL_VARIANTS=ON
ROCm version is 7.0.1 in this case, using the gfx906 Tensile libraries from ROCm 6.4.3.
- Ran with
./llama-cli -m gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf -c 2048 -dev CUDA0,ROCm0 -sm row
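For comparison, these are sketches of the two configurations described above as not reproducing the crash; they assume the same model path and build, with only the split mode or device list changed:
# layer split across both GPUs - reported not to crash
./llama-cli -m gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf -c 2048 -dev CUDA0,ROCm0 -sm layer
# gfx906 (ROCm device) only - reported not to crash
./llama-cli -m gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf -c 2048 -dev ROCm0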
First Bad Commit
Relevant log output
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CUDA0_Split model buffer size = 485.98 MiB
load_tensors: CUDA0 model buffer size = 30740.94 MiB
load_tensors: ROCm0 model buffer size = 27505.06 MiB
load_tensors: ROCm0_Split model buffer size = 1021.64 MiB
load_tensors: CPU_Mapped model buffer size = 586.82 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 2048 cells
llama_kv_cache: ROCm0 KV buffer size = 36.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 36.00 MiB
llama_kv_cache: size = 72.00 MiB ( 2048 cells, 18 layers, 1/1 seqs), K (f16): 36.00 MiB, V (f16): 36.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 768 cells
llama_kv_cache: ROCm0 KV buffer size = 12.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 15.00 MiB
llama_kv_cache: size = 27.00 MiB ( 768 cells, 18 layers, 1/1 seqs), K (f16): 13.50 MiB, V (f16): 13.50 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 95.02 MiB
llama_context: ROCm0 compute buffer size = 398.38 MiB
llama_context: CUDA_Host compute buffer size = 11.15 MiB
llama_context: graph nodes = 2024
llama_context: graph splits = 3
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/bank/llama.cpp/source/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:88: ROCm error
ROCm error: an illegal memory access was encountered
current device: -1, in function ggml_backend_cuda_synchronize at /bank/llama.cpp/source/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2787
hipStreamSynchronize(cuda_ctx->stream())
[New LWP 49691]
[New LWP 49666]
[New LWP 49665]
[New LWP 49664]
[New LWP 49661]
[New LWP 49657]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x00007fd662899668 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 0x00007fd6628996ad in __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x00007fd662904787 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x00007fd662ee3d6b in ggml_print_backtrace () from libggml-base.so
#5 0x00007fd662ee3ebe in ggml_abort () from libggml-base.so
#6 0x00007fd423b1ea12 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /bank/llama.cpp/libggml-hip.so
#7 0x00007fd423b23af4 in ggml_backend_cuda_synchronize(ggml_backend*) () from /bank/llama.cpp/libggml-hip.so
#8 0x00007fd662efa4fe in ggml_backend_sched_synchronize () from libggml-base.so
#9 0x00007fd66300a270 in llama_context::synchronize() () from libllama.so
#10 0x000055c9a157c55a in common_init_from_params(common_params&) ()
#11 0x000055c9a147fc1c in main ()
[Inferior 1 (process 49643) detached]
[1] 49643 IOT instruction ./llama-cli -m /bank/models/openai/gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf