
Qwen3-Next --ubatch-size issue #17578

@IIIIIllllIIIIIlllll

Description


I have found another problem: when I increase the -ub (--ubatch-size) parameter above 512, I get the following memory error:

ggml_new_object: not enough space in the context's memory pool (needed 10711552, available 10711184)
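For context on what that message means: tensor metadata for the compute graph is carved out of a fixed-size arena inside a ggml_context, and ggml_new_object aborts when the next object would no longer fit. The sketch below is my own illustration against the public ggml.h API (the node budget is hypothetical, not taken from llama.cpp); it shows that even with no_alloc = true each tensor object still consumes ggml_tensor_overhead() bytes of that pool, so a graph whose node count grows with --ubatch-size can overrun a pool that was sized for a smaller micro-batch.

// Minimal sketch, assuming only the public ggml.h API; the node budget is hypothetical.
#include <stdio.h>
#include "ggml.h"

int main(void) {
    const int n_nodes = 8; // pretend the metadata pool was sized for this many graph nodes

    struct ggml_init_params params = {
        /*.mem_size   =*/ n_nodes * ggml_tensor_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // metadata only, as when reserving a compute graph
    };
    struct ggml_context * ctx = ggml_init(params);

    // Each call takes ggml_tensor_overhead() bytes from the fixed pool; these n_nodes
    // tensors fit, and one more would abort with
    // "ggml_new_object: not enough space in the context's memory pool".
    for (int i = 0; i < n_nodes; ++i) {
        ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 16);
    }

    printf("created %d tensor objects in a %zu byte pool\n", n_nodes, params.mem_size);
    ggml_free(ctx);
    return 0;
}

In the failing run below the pool comes up only 368 bytes short (needed 10711552, available 10711184), which is consistent with the graph built for -ub 4096 needing just a few more node objects than the reserved metadata buffer accounts for.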

Hardware and Software Environment

  • GPU: AMD Radeon Graphics
  • GPU Architecture: gfx1151 (Device ID: 0x1151)
  • Driver/Stack: ROCm (reported as NO_VMM = 1)
  • CPU: x86_64 with AVX512 support
  • OS: Linux (Ubuntu)
  • llama.cpp Build Info: Built with gcc 15.2.0, ROCm backend enabled.

mark@MarkPC:~/llama.cpp/llama.cpp-master$ ./llama-server -m /home/mark/Models/Q8/Qwen3-Next-80B-A3B-Instruct-Q8_0/Qwen3-Next-80B-A3B-Instruct-Q8_0.gguf -fa 1 -c 65536 --host 0.0.0.0 --port 8090 -ub 4096 --no-mmap
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 0 (unknown) with cc (Ubuntu 15.2.0-4ubuntu4) 15.2.0 for x86_64-linux-gnu
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

init: using 31 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/home/mark/Models/Q8/Qwen3-Next-80B-A3B-Instruct-Q8_0/Qwen3-Next-80B-A3B-Instruct-Q8_0.gguf'
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) (0000:c6:00.0) - 121499 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 807 tensors from /home/mark/Models/Q8/Qwen3-Next-80B-A3B-Instruct-Q8_0/Qwen3-Next-80B-A3B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3next
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.800000
llama_model_loader: - kv 4: general.sampling.temp f32 = 0.700000
llama_model_loader: - kv 5: general.name str = Qwen3 Next A3B Instruct
llama_model_loader: - kv 6: general.finetune str = Instruct
llama_model_loader: - kv 7: general.basename str = Qwen3-Next
llama_model_loader: - kv 8: general.size_label str = A3B
llama_model_loader: - kv 9: general.license str = apache-2.0
llama_model_loader: - kv 10: general.license.link str = https://huggingface.co/Qwen/Qwen3-Nex...
llama_model_loader: - kv 11: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 12: qwen3next.block_count u32 = 48
llama_model_loader: - kv 13: qwen3next.context_length u32 = 262144
llama_model_loader: - kv 14: qwen3next.embedding_length u32 = 2048
llama_model_loader: - kv 15: qwen3next.feed_forward_length u32 = 5120
llama_model_loader: - kv 16: qwen3next.attention.head_count u32 = 16
llama_model_loader: - kv 17: qwen3next.attention.head_count_kv u32 = 2
llama_model_loader: - kv 18: qwen3next.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: qwen3next.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: qwen3next.expert_used_count u32 = 10
llama_model_loader: - kv 21: qwen3next.attention.key_length u32 = 256
llama_model_loader: - kv 22: qwen3next.attention.value_length u32 = 256
llama_model_loader: - kv 23: general.file_type u32 = 7
llama_model_loader: - kv 24: qwen3next.expert_count u32 = 512
llama_model_loader: - kv 25: qwen3next.expert_feed_forward_length u32 = 512
llama_model_loader: - kv 26: qwen3next.expert_shared_feed_forward_length u32 = 512
llama_model_loader: - kv 27: qwen3next.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 28: qwen3next.ssm.state_size u32 = 128
llama_model_loader: - kv 29: qwen3next.ssm.group_count u32 = 16
llama_model_loader: - kv 30: qwen3next.ssm.time_step_rank u32 = 32
llama_model_loader: - kv 31: qwen3next.ssm.inner_size u32 = 4096
llama_model_loader: - kv 32: qwen3next.rope.dimension_count u32 = 64
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 40: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 41: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 43: tokenizer.chat_template str = {%- if tools %}\n {{-'<|im_start|>...
llama_model_loader: - type f32: 313 tensors
llama_model_loader: - type q8_0: 494 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 78.98 GiB (8.52 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3next
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_embd_inp = 2048
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 2
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5120
print_info: n_expert = 512
print_info: n_expert_used = 10
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 4096
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 32
print_info: ssm_n_group = 16
print_info: ssm_dt_b_c_rms = 0
print_info: model type = ?B
print_info: model params = 79.67 B
print_info: general.name = Qwen3 Next A3B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU model buffer size = 0.00 MiB
load_tensors: ROCm0 model buffer size = 80561.98 MiB
load_tensors: ROCm_Host model buffer size = 315.30 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 65536
llama_context: n_ctx_seq = 65536
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (65536) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 2.32 MiB
llama_kv_cache: ROCm0 KV buffer size = 1536.00 MiB
llama_kv_cache: size = 1536.00 MiB ( 65536 cells, 12 layers, 4/1 seqs), K (f16): 768.00 MiB, V (f16): 768.00 MiB
llama_memory_recurrent: ROCm0 RS buffer size = 301.50 MiB
llama_memory_recurrent: size = 301.50 MiB ( 4 cells, 48 layers, 4 seqs), R (f32): 13.50 MiB, S (f32): 288.00 MiB
ggml_new_object: not enough space in the context's memory pool (needed 10711552, available 10711184)
/home/mark/llama.cpp/compile/llama.cpp-master/ggml/src/ggml.c:1679: GGML_ASSERT(obj_new) failed
[New LWP 105107]
[New LWP 105104]
[New LWP 105103]
[New LWP 105102]
[New LWP 105101]
[New LWP 105100]
[New LWP 105099]
[New LWP 105098]
[New LWP 105097]
[New LWP 105096]
[New LWP 105095]
[New LWP 105094]
[New LWP 105093]
[New LWP 105092]
[New LWP 105091]
[New LWP 105090]
[New LWP 105089]
[New LWP 105088]
[New LWP 105087]
[New LWP 105086]
[New LWP 105085]
[New LWP 105084]
[New LWP 105083]
[New LWP 105082]
[New LWP 105081]
[New LWP 105080]
[New LWP 105079]
[New LWP 105078]
[New LWP 105077]
[New LWP 105076]
[New LWP 105075]
[New LWP 105074]
[New LWP 105073]
[New LWP 105072]
[New LWP 105069]

This GDB supports auto-downloading debuginfo from the following URLs:
https://debuginfod.ubuntu.com
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x000071d8e32a013c in __internal_syscall_cancel (a1=, a2=, a3=, a4=, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 __syscall_cancel (a1=, a2=, a3=, a4=, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x000071d8e331c98f in __GI___wait4 (pid=, stat_loc=, options=, usage=) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x000071d8e3cca9d3 in ggml_print_backtrace () from /home/mark/llama.cpp/compile/llama.cpp-master/build/bin/libggml-base.so.0
#5 0x000071d8e3ccab86 in ggml_abort () from /home/mark/llama.cpp/compile/llama.cpp-master/build/bin/libggml-base.so.0
#6 0x000071d8e3ccbae1 in ggml_new_tensor_impl.constprop () from /home/mark/llama.cpp/compile/llama.cpp-master/build/bin/libggml-base.so.0
#7 0x000071d8e3cd1b60 in ggml_view_4d () from /home/mark/llama.cpp/compile/llama.cpp-master/build/bin/libggml-base.so.0
#8 0x000071d8e3be145e in llm_build_qwen3next::build_delta_net_chunking(ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, int) () from /home/mark/llama.cpp/compile/llama.cpp-master/build/bin/libllama.so.0
#9 0x000071d8e3be4130 in llm_build_qwen3next::build_layer_attn_linear(llm_graph_input_rs*, ggml_tensor*, ggml_tensor*, ggml_tensor*, int) () from /home/mark/llama.cpp/compile/llama.cpp-master/build/bin/libllama.so.0
#10 0x000071d8e3be467e in llm_build_qwen3next::llm_build_qwen3next(llama_model const&, llm_graph_params const&) () from /home/mark/llama.cpp/compile/llama.cpp-master/build/bin/libllama.so.0
#11 0x000071d8e3b18459 in llama_model::build_graph(llm_graph_params const&) const () from /home/mark/llama.cpp/compile/llama.cpp-master/build/bin/libllama.so.0
#12 0x000071d8e3aa5478 in llama_context::graph_reserve(unsigned int, unsigned int, unsigned int, llama_memory_context_i const*, bool) () from /home/mark/llama.cpp/compile/llama.cpp-master/build/bin/libllama.so.0
#13 0x000071d8e3aa83e9 in llama_context::llama_context(llama_model const&, llama_context_params) () from /home/mark/llama.cpp/compile/llama.cpp-master/build/bin/libllama.so.0
#14 0x000071d8e3aa8ef4 in llama_init_from_model () from /home/mark/llama.cpp/compile/llama.cpp-master/build/bin/libllama.so.0
#15 0x00005ba47b54e43a in common_init_from_params(common_params&) ()
#16 0x00005ba47b3dadd2 in server_context::load_model(common_params const&) ()
#17 0x00005ba47b3b77e2 in main ()
[Inferior 1 (process 105067) detached]
Aborted (core dumped)

Originally posted by @2432896620-ctrl in #16095 (comment)

Metadata

Labels

CUDA (Related to the CUDA backend), bug (Something isn't working), model (Model specific), need feedback (Testing and feedback with results are needed), performance (Speed related topics)
