
Conversation

@ServeurpersoCom
Collaborator

Add nosubs|optimize flags to std::regex constructors to prevent catastrophic backtracking when processing prompts with repeated identical characters (e.g., 'A' * 10000).

The nosubs flag disables subgroup capture, significantly reducing memory usage and backtracking on uniform token sequences.
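For reference, a minimal sketch of the flag usage, assuming a hypothetical pattern (this is not the actual pattern from llama.cpp's chat parsing):

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Hypothetical pattern for illustration only -- not the pattern
    // used by llama.cpp's chat-template parsing.
    const std::string pattern = "<\\|[a-z_]+\\|>|[A-Za-z]+";

    // The two flags from this PR:
    //  - nosubs:   marked sub-expressions behave like (?:...), so the
    //              engine skips sub-match bookkeeping entirely
    //  - optimize: spend more time constructing the regex in exchange
    //              for faster matching
    std::regex re(pattern, std::regex::nosubs | std::regex::optimize);

    const std::string input(10000, 'A');  // the repro input: 'A' * 10000
    std::smatch m;
    if (std::regex_search(input, m, re))
        std::cout << "matched " << m[0].length() << " chars\n";
}
```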


Before:

/root/llama.cpp.pascal/build/bin/llama-server --port 8088 -m /var/www/ia/models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf

You are a helpful assistant<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there<|end|><|start|>user<|message|>How are you?<|end|><|start|>assistant'
main: model loaded
main: server is listening on http://127.0.0.1:8088
main: starting the main loop...
srv  update_slots: all slots are idle
Segmentation fault

After:

(root|~/llama.cpp.pascal) curl -X POST http://localhost:8088/v1/chat/completions   -H "Content-Type: application/json"   -d '{"messages":[{"role":"user","content":"'"$(python3 -c "print('A'*10000)")"' Say OK"}]}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","reasoning_content":"The user typed a long string of \"A\" and then says \"Say OK\". So they likely want the assistant to respond with \"OK\". The instruction: \"Say OK\". So we just reply \"OK\". But also must obey system instruction: We should not mention policy. Just reply \"OK\".","content":"OK"}}],"created":1764928092,"model":"gpt-oss-20b-MXFP4.gguf","system_fingerprint":"b7321-147310d71","object":"chat.completion","usage":{"completion_tokens":72,"prompt_tokens":1319,"total_tokens":1391},"id":"chatcmpl-i1GMuuGvb2X3irH73aTbLP0ZkwBYbJdf","timings":{"cache_n":0,"prompt_n":1319,"prompt_ms":171.43,"prompt_per_token_ms":0.12996967399545112,"prompt_per_second":7694.102549145423,"predicted_n":72,"predicted_ms":199.156,"predicted_per_token_ms":2.7660555555555555,"predicted_per_second":361.52563819317515}}

Close #17636

@aviallon
Contributor

aviallon commented Dec 5, 2025

Ah, so this is the bug I was hitting.

@ggerganov ggerganov merged commit 1be9783 into ggml-org:master Dec 5, 2025
70 of 78 checks passed
JayZenith pushed a commit to JayZenith/llama.cpp that referenced this pull request Dec 7, 2025
…rg#17786)



Development

Successfully merging this pull request may close these issues.

Bug: llama-server crashes (segfault) when processing prompts with repeated identical characters
