ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched by slaren · Pull Request #17276 · ggml-org/llama.cpp

slaren · 2025-11-14T20:39:20Z

No description provided.

ggerganov · 2025-11-15T09:53:24Z

I tried this change after reverting #17143 but it doesn't trigger an error using the llama-batched-bench command there. I do see it going through this branch:

llama.cpp/ggml/src/ggml-alloc.c

Lines 1052 to 1055 in 6d90fe9

    
            GGML_LOG_DEBUG("%s: cannot reallocate multi buffer graph automatically, call reserve\n", __func__);
#endif
            return false;
        }
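For readers following the thread, here is a minimal sketch of the idea behind the option in the PR title. It assumes the option turns a would-be reallocation into a hard failure, so that an undersized worst-case reservation gets noticed in testing. This is not the ggml implementation: `toy_sched`, `toy_reserve`, `toy_alloc_graph`, and the `no_realloc` field are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical toy scheduler: tracks only the size of its reserved compute buffer.
struct toy_sched {
    std::size_t reserved   = 0;     // bytes reserved up front for the worst-case graph
    bool        no_realloc = false; // stand-in for a debug option such as GGML_SCHED_NO_REALLOC
    int         n_reallocs = 0;     // number of times the buffer had to grow
};

// Reserve a buffer sized for the assumed worst-case graph.
inline void toy_reserve(toy_sched & s, std::size_t worst_case) {
    assert(worst_case > 0); // a zero-size reservation would make every graph fail
    s.reserved = worst_case;
}

// Try to allocate a graph that needs `needed` bytes.
// Returns false when the buffer would have to grow but reallocation is disabled.
inline bool toy_alloc_graph(toy_sched & s, std::size_t needed) {
    if (needed <= s.reserved) {
        return true; // fits in the existing reservation
    }
    if (s.no_realloc) {
        std::fprintf(stderr, "toy_alloc_graph: graph needs %zu bytes, only %zu reserved\n",
                     needed, s.reserved);
        return false; // surface the undersized reservation instead of hiding it
    }
    s.reserved = needed; // default behavior in this toy: silently grow
    s.n_reallocs++;
    return true;
}
```

With `no_realloc` left off, a graph larger than the reservation succeeds and only bumps `n_reallocs`. With it on, the same request fails, which is the kind of signal the test in this comment was expected to produce.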

slaren · 2025-11-15T11:50:07Z

I was trying to reproduce this, but I get this assert when running llama-batched-bench (with the current version, without reverting #17143):

llama_kv_cache:      Metal KV buffer size =  5310.00 MiB
llama_kv_cache: size = 5310.00 MiB (151040 cells,  36 layers, 16/1 seqs), K (f16): 2655.00 MiB, V (f16): 2655.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 3480
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      Metal compute buffer size =   450.51 MiB
llama_context:        CPU compute buffer size =   299.01 MiB
llama_context: graph nodes  = 1231
llama_context: graph splits = 2
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4                     0x141a07b00 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mm_q8_0_f32', name = 'kernel_mul_mm_q8_0_f32_bci=0_bco=1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mm_q8_0_f32_bci=0_bco=1            0x141a08600 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1                      0x141a088c0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_neox_f32', name = 'kernel_rope_neox_f32_imrope=0'
ggml_metal_library_compile_pipeline: loaded kernel_rope_neox_f32_imrope=0                 0x141a08b80 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_set_rows_f16_i64', name = 'kernel_set_rows_f16_i64'
ggml_metal_library_compile_pipeline: loaded kernel_set_rows_f16_i64                       0x141a09340 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f16', name = 'kernel_cpy_f32_f16'
ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f16                            0x141a09600 | th_max = 1024 | th_width =   32
Assertion failed: (ggml_metal_op_flash_attn_ext_extra_pad(op) == 0), function ggml_metal_op_flash_attn_ext, file ggml-metal-ops.cpp, line 2367.

ggerganov · 2025-11-15T16:29:17Z

I think these asserts can be safely removed. Will take a look tomorrow.

ggerganov · 2025-11-16T07:53:50Z

The asserts are now removed on master.

slaren · 2025-11-24T11:07:22Z

I have verified that the Vulkan issue is indeed due to different graph orders depending on batch size. The code causing this seems to be this:

llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp

Lines 12954 to 12967 in fdbff91

    
    // This function tries to reorder the graph to allow nodes to run in parallel.
    // This helps with small batches, but for large batches its a slowdown, probably
    // due to cache contention. So only reorder if the majority of nodes have few rows.
    int num_small_nodes = 0;
    int num_counted_nodes = 0;
    for (int i = 0; i < graph->n_nodes; ++i) {
        if (!is_empty(graph->nodes[i]) &&
            graph->nodes[i]->op != GGML_OP_SET_ROWS) {
            if (ggml_nrows(graph->nodes[i]) <= 8) {
                num_small_nodes++;
            }
            num_counted_nodes++;
        }
    }

I think this should be fixed in the Vulkan backend so that changes in tensor sizes do not change the order of the graph. Meanwhile, we could run the tests with GGML_VK_DISABLE_GRAPH_OPTIMIZE.
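To illustrate why this heuristic makes the graph order depend on batch size, here is a small standalone sketch. It reuses the counting logic from the excerpt above on a toy graph. The types and helpers (`toy_node`, `toy_graph`, `toy_should_reorder`) are hypothetical, and the final "majority" comparison is an assumption, since the quoted excerpt ends before the decision is made.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a graph node: only the row count matters here.
struct toy_node {
    int64_t nrows;
};

// Same counting scheme as the quoted loop (empty-node and SET_ROWS filtering omitted).
// The majority test on the last line is assumed, not taken from the source.
inline bool toy_should_reorder(const std::vector<toy_node> & nodes) {
    int num_small_nodes   = 0;
    int num_counted_nodes = 0;
    for (const toy_node & n : nodes) {
        if (n.nrows <= 8) {
            num_small_nodes++;
        }
        num_counted_nodes++;
    }
    return num_small_nodes * 2 > num_counted_nodes;
}

// Build the same toy model graph for a given batch size: ten nodes with one
// row per token, plus two nodes whose size does not depend on the batch.
inline std::vector<toy_node> toy_graph(int64_t n_tokens) {
    assert(n_tokens > 0);
    std::vector<toy_node> nodes;
    for (int i = 0; i < 10; ++i) {
        nodes.push_back({n_tokens});
    }
    nodes.push_back({1});
    nodes.push_back({1});
    return nodes;
}
```

In this toy, `toy_should_reorder(toy_graph(1))` is true while `toy_should_reorder(toy_graph(512))` is false. The same model therefore produces a reordered graph at small batch sizes and an unmodified one at the worst-case size used for reservation, which is consistent with the mismatch described here.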


…gml_backend_sched (ggml-org#17276) * ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched Enabled in ggml-ci for testing. * llama : update worst-case graph for unified cache * ci : disable op offload in some tests * fix spelling --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

slaren requested a review from ggerganov as a code owner November 14, 2025 20:39

slaren marked this pull request as draft November 14, 2025 20:40

slaren force-pushed the sl/realloc-error branch from 3df2f6d to 6d90fe9 Compare November 14, 2025 21:04

github-actions bot added testing Everything test related devops improvements to build systems and github actions ggml changes relating to the ggml tensor library for machine learning labels Nov 14, 2025

DajanaV mentioned this pull request Nov 14, 2025

UPSTREAM PR #17276: ggml : add GGML_NO_REALLOC option to disable reallocations in ggml-alloc auroralabs-loci/llama.cpp#215

Open

slaren marked this pull request as ready for review November 14, 2025 22:04

ggerganov mentioned this pull request Nov 16, 2025

metal : remove obosolete asserts #17295

Merged

slaren force-pushed the sl/realloc-error branch from 6d90fe9 to 0710d5f Compare November 17, 2025 20:29

slaren changed the title ~~ggml : add GGML_NO_REALLOC option to disable reallocations in ggml-alloc~~ ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched Nov 17, 2025

ggerganov mentioned this pull request Nov 19, 2025

llama : update worst-case graph for unified cache #17379

Merged

github-actions bot added the examples label Nov 24, 2025

jeffbolznv mentioned this pull request Nov 24, 2025

vulkan: allow graph_optimize for prompt processing workloads #17475

Merged

loci-dev mentioned this pull request Nov 24, 2025

UPSTREAM PR #17475: vulkan: allow graph_optimize for prompt processing workloads auroralabs-loci/llama.cpp#308

Open

slaren and others added 4 commits November 27, 2025 15:00

ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in g…

99e0d87

…gml_backend_sched Enabled in ggml-ci for testing.

llama : update worst-case graph for unified cache

528d416

ci : disable op offload in some tests

741b69e

fix spelling

5a9485c

ggerganov force-pushed the sl/realloc-error branch from fdbff91 to 5a9485c Compare November 27, 2025 13:00

ggerganov approved these changes Nov 28, 2025


ggerganov merged commit e072b20 into master Nov 28, 2025
76 of 78 checks passed

ggerganov deleted the sl/realloc-error branch November 28, 2025 15:33

ggerganov mentioned this pull request Nov 30, 2025

ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler #17617

Merged

pwilkin mentioned this pull request Dec 20, 2025

ggml-blas: refactor BLAS backend #18027

Draft

ggerganov merged 4 commits into master from sl/realloc-error
