
Eval bug: ROCm illegal memory access with -sm row #16799

@MagoDelBlocco

Description

Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from /bank/llama.cpp/libggml-cuda.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
load_backend: loaded ROCm backend from /bank/llama.cpp/libggml-hip.so
load_backend: loaded CPU backend from /bank/llama.cpp/libggml-cpu-haswell.so
version: 6853 (945501f)
built with cc (Debian 14.2.0-19) 14.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

HIP

Hardware

CPU: Threadripper 5975wx
GPU 1: RTX 5090
GPU 2: AMD Instinct MI50 32GB (gfx906)

Models

Unsloth's gpt-oss-120b-UD-Q6_K_XL

Problem description & steps to reproduce

Running the model with -sm row crashes during warmup. The crash does not reproduce with -sm layer, and it does not reproduce when the model runs on the gfx906 alone (command variants for both cases are sketched after the steps below).

Steps to reproduce:

  • Compiled with cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_CUDA_USE_GRAPHS=ON -DGGML_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES="120" -DGGML_CUDA_F16=ON -DGGML_BACKEND_DL=ON -DGGML_HIP=ON -DGGML_HIP_GRAPHS=ON -DGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release -DGGML_CPU_ALL_VARIANTS=ON

ROCm version is 7.0.1 in this case, using the gfx906 Tensile libraries from ROCm 6.4.3.

  • Ran with ./llama-cli -m gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf -c 2048 -dev CUDA0,ROCm0 -sm row
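For comparison, the configurations that do not crash can be exercised with the same binary. The commands below are a sketch built only from flags already used in this report (plus --no-warmup, which the warmup log line below mentions as the way to skip the warmup run):

# does not reproduce: layer split across both devices
./llama-cli -m gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf -c 2048 -dev CUDA0,ROCm0 -sm layer

# does not reproduce: gfx906 (ROCm0) alone
./llama-cli -m gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf -c 2048 -dev ROCm0

# untested data point (assumption): the crashing configuration with the warmup run skipped
./llama-cli -m gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf -c 2048 -dev CUDA0,ROCm0 -sm row --no-warmup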

First Bad Commit

b6841
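A minimal git bisect sketch for confirming the first bad build locally, assuming an earlier known-good build tag (placeholder below) and reusing the cmake and llama-cli commands from the steps above:

git bisect start
git bisect bad b6841
git bisect good <last-known-good-tag>   # placeholder: an earlier tag where -sm row still works
# at each bisect step: rebuild with the cmake command above, rerun the -sm row command,
# then mark the outcome
git bisect good    # warmup completes
git bisect bad     # illegal memory access at warmup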

Relevant log output

: offloaded 37/37 layers to GPU
load_tensors:  CUDA0_Split model buffer size =   485.98 MiB
load_tensors:        CUDA0 model buffer size = 30740.94 MiB
load_tensors:        ROCm0 model buffer size = 27505.06 MiB
load_tensors:  ROCm0_Split model buffer size =  1021.64 MiB
load_tensors:   CPU_Mapped model buffer size =   586.82 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 2048 cells
llama_kv_cache:      ROCm0 KV buffer size =    36.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =    36.00 MiB
llama_kv_cache: size =   72.00 MiB (  2048 cells,  18 layers,  1/1 seqs), K (f16):   36.00 MiB, V (f16):   36.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 768 cells
llama_kv_cache:      ROCm0 KV buffer size =    12.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =    15.00 MiB
llama_kv_cache: size =   27.00 MiB (   768 cells,  18 layers,  1/1 seqs), K (f16):   13.50 MiB, V (f16):   13.50 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      CUDA0 compute buffer size =    95.02 MiB
llama_context:      ROCm0 compute buffer size =   398.38 MiB
llama_context:  CUDA_Host compute buffer size =    11.15 MiB
llama_context: graph nodes  = 2024
llama_context: graph splits = 3
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/bank/llama.cpp/source/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:88: ROCm error
ROCm error: an illegal memory access was encountered
  current device: -1, in function ggml_backend_cuda_synchronize at /bank/llama.cpp/source/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2787
  hipStreamSynchronize(cuda_ctx->stream())
[New LWP 49691]
[New LWP 49666]
[New LWP 49665]
[New LWP 49664]
[New LWP 49661]
[New LWP 49657]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56	../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56	in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007fd662899668 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49	./nptl/cancellation.c: No such file or directory
#2  0x00007fd6628996ad in __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75	in ./nptl/cancellation.c
#3  0x00007fd662904787 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007fd662ee3d6b in ggml_print_backtrace () from libggml-base.so
#5  0x00007fd662ee3ebe in ggml_abort () from libggml-base.so
#6  0x00007fd423b1ea12 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /bank/llama.cpp/libggml-hip.so
#7  0x00007fd423b23af4 in ggml_backend_cuda_synchronize(ggml_backend*) () from /bank/llama.cpp/libggml-hip.so
#8  0x00007fd662efa4fe in ggml_backend_sched_synchronize () from libggml-base.so
#9  0x00007fd66300a270 in llama_context::synchronize() () from libllama.so
#10 0x000055c9a157c55a in common_init_from_params(common_params&) ()
#11 0x000055c9a147fc1c in main ()
[Inferior 1 (process 49643) detached]
[1]    49643 IOT instruction  ./llama-cli -m /bank/models/openai/gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf
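The abort reports current device: -1, so the log does not identify which device faulted. A possible next step is to rerun the crashing command with kernel/copy serialization and verbose HIP runtime logging enabled; AMD_SERIALIZE_KERNEL, AMD_SERIALIZE_COPY and AMD_LOG_LEVEL are standard ROCm debugging environment variables and are an assumption here, not something taken from this log:

# serialize launches/copies so the faulting operation is the last one issued (assumption:
# these variables behave as documented for this ROCm version)
AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3 AMD_LOG_LEVEL=3 \
  ./llama-cli -m gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf -c 2048 -dev CUDA0,ROCm0 -sm row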

    Labels

    AMD GPU (Issues specific to AMD GPUs), Nvidia GPU (Issues specific to Nvidia GPUs), bug (Something isn't working), critical severity (Used to report critical severity bugs in llama.cpp, e.g. Crashing, Corrupted, Dataloss)
