Conversation


@alundb (Contributor) commented Nov 1, 2025

In #9386, llama_sampler_sample was updated to call llama_sampler_accept internally. This change made the sampling example usage in the API documentation misleading, as it still showed a subsequent llama_sampler_accept call.

This PR removes the documentation reference to llama_sampler_accept, clarifying that the function should not be invoked in the decoding loop, because llama_sampler_sample already calls it internally (since #9386).

commit 5fb5e24 (llama : minor sampling refactor (2) (ggml-org#9386)) moved the
llama_sampler_accept call into llama_sampler_sample, but the sampling example
usage in llama.h was not updated accordingly.

@alundb (Contributor, Author) commented Nov 1, 2025

FYI, I discovered this the hard way. Using llama.h as the main reference while toying with some simple BNF grammar sampling/decoding, I had added the extra llama_sampler_accept(chain, id); call and kept running into:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Unexpected empty grammar stack after accepting piece: J
Aborted (core dumped)

I hope this PR helps save someone else from making the same mistake in the future.
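The failure mode above can be sketched with a toy model. To be clear, this is not the llama.cpp API: the struct, its grammar stack, and the hard-coded token below are invented for illustration. The point it models is that sample() already accepts the token internally (the post-#9386 behavior), so a second, manual accept() finds the grammar state already advanced and throws, mirroring the runtime_error in the log above.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for a sampler chain with grammar state (NOT the real API).
struct ToySampler {
    std::vector<char> grammar_stack{'S'};  // pretend grammar state: one expected symbol

    // Advance the grammar state by one accepted token.
    void accept(char token) {
        if (grammar_stack.empty()) {
            // Mirrors the error message from the log above.
            throw std::runtime_error(
                "Unexpected empty grammar stack after accepting piece: " +
                std::string(1, token));
        }
        grammar_stack.pop_back();  // consume the expected symbol
    }

    // Models llama_sampler_sample after #9386: accept happens internally.
    char sample() {
        char token = 'J';  // pretend the chain sampled this token
        accept(token);     // internal accept, so the caller must not repeat it
        return token;
    }
};
```

With this model, `char id = chain.sample();` alone is the correct usage; following it with `chain.accept(id);` (the mistake the comment describes) hits the empty grammar stack and throws.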

@ggerganov ggerganov merged commit 76af40a into ggml-org:master Nov 2, 2025
1 check passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Nov 3, 2025
* origin/master: (169 commits)
opencl: support imrope (ggml-org#16914)
fix: Viewing multiple PDF attachments (ggml-org#16974)
model-conversion : pass config to from_pretrained (ggml-org#16963)
server : add props.model_alias (ggml-org#16943)
ggml: CUDA: add head size 72 for flash-attn (ggml-org#16962)
mtmd: add --image-min/max-tokens (ggml-org#16921)
mtmd: pad mask for qwen2.5vl (ggml-org#16954)
ggml : LoongArch fixes (ggml-org#16958)
sync: minja (glm 4.6 & minmax m2 templates) (ggml-org#16949)
SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× faster) (ggml-org#16869)
feat(webui): improve LaTeX rendering with currency detection (ggml-org#16508)
test-backend-ops : fix segfault in moe-expert-reduce test in support mode and coverage (ggml-org#16936)
ci : disable failing riscv cross build (ggml-org#16952)
model: add Janus Pro for image understanding (ggml-org#16906)
clip : use FA (ggml-org#16837)
server : support unified cache across slots (ggml-org#16736)
common : move gpt-oss reasoning processing to init params (ggml-org#16937)
docs: remove llama_sampler_accept reference in sampling sample usage (ggml-org#16920)
CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (ggml-org#16917)
devops: fix failing s390x docker build (ggml-org#16918)
...