Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from /bank/llama.cpp/libggml-cuda.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
load_backend: loaded ROCm backend from /bank/llama.cpp/libggml-hip.so
load_backend: loaded CPU backend from /bank/llama.cpp/libggml-cpu-haswell.so
version: 6853 (945501f)
built with cc (Debian 14.2.0-19) 14.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
HIP
Hardware
CPU: Threadripper 5975wx
GPU 1: RTX 5090
GPU 2: AMD Instinct MI50 32GB (gfx906)
Models
Unsloth's gpt-oss-120b-UD-Q6_K_XL
Problem description & steps to reproduce
Running the model with -sm row crashes at warmup. The crash does not reproduce with -sm layer, nor when the model runs on the gfx906 alone.
Steps to reproduce:
- Compiled with
cmake -B build \
    -DGGML_NATIVE=ON \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_USE_GRAPHS=ON \
    -DGGML_CUBLAS=ON \
    -DCMAKE_CUDA_ARCHITECTURES="120" \
    -DGGML_CUDA_F16=ON \
    -DGGML_BACKEND_DL=ON \
    -DGGML_HIP=ON \
    -DGGML_HIP_GRAPHS=ON \
    -DGPU_TARGETS=gfx906 \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CPU_ALL_VARIANTS=ON
ROCm version is 7.0.1 in this case, using the gfx906 Tensile libraries from ROCm 6.4.3.
- Ran with
./llama-cli -m gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf -c 2048 -dev CUDA0,ROCm0 -sm row
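For comparison, these are sketches of the two configurations described above as not reproducing the crash; they assume the same model path and build, with only the split mode or device list changed:
# layer split across both GPUs - reported not to crash
./llama-cli -m gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf -c 2048 -dev CUDA0,ROCm0 -sm layer
# gfx906 (ROCm device) only - reported not to crash
./llama-cli -m gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf -c 2048 -dev ROCm0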
First Bad Commit
Relevant log output
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CUDA0_Split model buffer size = 485.98 MiB
load_tensors: CUDA0 model buffer size = 30740.94 MiB
load_tensors: ROCm0 model buffer size = 27505.06 MiB
load_tensors: ROCm0_Split model buffer size = 1021.64 MiB
load_tensors: CPU_Mapped model buffer size = 586.82 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 2048 cells
llama_kv_cache: ROCm0 KV buffer size = 36.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 36.00 MiB
llama_kv_cache: size = 72.00 MiB ( 2048 cells, 18 layers, 1/1 seqs), K (f16): 36.00 MiB, V (f16): 36.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 768 cells
llama_kv_cache: ROCm0 KV buffer size = 12.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 15.00 MiB
llama_kv_cache: size = 27.00 MiB ( 768 cells, 18 layers, 1/1 seqs), K (f16): 13.50 MiB, V (f16): 13.50 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 95.02 MiB
llama_context: ROCm0 compute buffer size = 398.38 MiB
llama_context: CUDA_Host compute buffer size = 11.15 MiB
llama_context: graph nodes = 2024
llama_context: graph splits = 3
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/bank/llama.cpp/source/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:88: ROCm error
ROCm error: an illegal memory access was encountered
current device: -1, in function ggml_backend_cuda_synchronize at /bank/llama.cpp/source/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2787
hipStreamSynchronize(cuda_ctx->stream())
[New LWP 49691]
[New LWP 49666]
[New LWP 49665]
[New LWP 49664]
[New LWP 49661]
[New LWP 49657]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x00007fd662899668 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 0x00007fd6628996ad in __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x00007fd662904787 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x00007fd662ee3d6b in ggml_print_backtrace () from libggml-base.so
#5 0x00007fd662ee3ebe in ggml_abort () from libggml-base.so
#6 0x00007fd423b1ea12 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /bank/llama.cpp/libggml-hip.so
#7 0x00007fd423b23af4 in ggml_backend_cuda_synchronize(ggml_backend*) () from /bank/llama.cpp/libggml-hip.so
#8 0x00007fd662efa4fe in ggml_backend_sched_synchronize () from libggml-base.so
#9 0x00007fd66300a270 in llama_context::synchronize() () from libllama.so
#10 0x000055c9a157c55a in common_init_from_params(common_params&) ()
#11 0x000055c9a147fc1c in main ()
[Inferior 1 (process 49643) detached]
[1] 49643 IOT instruction ./llama-cli -m /bank/models/openai/gpt-oss-120b-UD-Q6_K_XL-00001-of-00002.gguf