Eval bug: Excessive stack usage during tool calling #12234

Open

edmcman opened this issue Mar 6, 2025 · 11 comments

edmcman (Author) commented Mar 6, 2025

Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
version: 4840 (3ffbbd5)
built with Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1exp120240731145000.144) for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

i9-13900HX + NVIDIA GeForce RTX 4070

Models

bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M

Problem description & steps to reproduce

cc/@ochafik

I am attempting to run BFCL on llama-server, and so far I have triggered a crash twice. It does not appear to be deterministic, unfortunately. In one instance, I was able to catch the crash with gdb. Here is the end of the backtrace:

#87097 0x00005669dac2b7f9 in bool std::__detail::__regex_algo_impl<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> >(__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, __gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__cxx11::match_results<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&, std::regex_constants::match_flag_type, std::__detail::_RegexExecutorPolicy, bool) ()
#87098 0x00007116a7f3ac54 in llama_grammar_accept_impl(llama_grammar&, int) () from /home/ed/Projects/llama.cpp/build/bin/libllama.so
#87099 0x00005669dadb179a in common_sampler_accept(common_sampler*, int, bool) ()
#87100 0x00005669dac5c626 in server_context::update_slots() ()
#87101 0x00005669dabe4886 in server_queue::start_loop() ()
#87102 0x00005669dabb0bc8 in main ()

The remaining 87,096 stack frames were identical. While I have not yet been able to find the exact input that triggered the crash, I hope this is enough of a clue as to what is going on.
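
If it helps to narrow things down, my working assumption is that the grammar/sampling path recurses once per accepted token and eventually exhausts the default 8 MiB thread stack. A rough, purely hypothetical way to test that assumption is to raise the stack limit before launching the server and see whether the crash disappears or just happens later:

# assumption: the segfault is stack exhaustion from deep recursion;
# raising the soft stack limit is a diagnostic/workaround, not a fix
ulimit -s unlimited
~/Projects/llama.cpp/build/bin/llama-server --ctx-size 0 --jinja -fa \
    -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --host 0.0.0.0 -ngl 100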

Here is some more information about what I am doing:

  • /home/ed/Projects/llama.cpp/build/bin/llama-server --ctx-size 0 --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --host 0.0.0.0 -ngl 100
  • python /home/ed/Projects/gorilla/berkeley-function-call-leaderboard/venv/bin/bfcl generate --model gpt-4-turbo-2024-04-09-FC --test-category all --include-input-log
  • I added this patch:
diff --git a/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py b/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
index fbf7c0f..fc0da1f 100644
--- a/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
+++ b/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
@@ -22,7 +22,7 @@ class OpenAIHandler(BaseHandler):
     def __init__(self, model_name, temperature) -> None:
         super().__init__(model_name, temperature)
         self.model_style = ModelStyle.OpenAI
-        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"), base_url="http://localhost:8080")
 
     def decode_ast(self, result, language="Python"):
         if "FC" in self.model_name or self.is_fc_model:

First Bad Commit

No response

Relevant log output

srv  update_slots: all slots are idle
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 48450 | processing task
slot update_slots: id  0 | task 48450 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 326
slot update_slots: id  0 | task 48450 | kv cache rm [67, end)
slot update_slots: id  0 | task 48450 | prompt processing progress, n_past = 326, n_tokens = 259, progress = 0.794479
slot update_slots: id  0 | task 48450 | prompt done, n_past = 326, n_tokens = 259
slot      release: id  0 | task 48450 | stop processing: n_past = 504, truncated = 0
slot print_timing: id  0 | task 48450 | 
prompt eval time =     104.08 ms /   259 tokens (    0.40 ms per token,  2488.52 tokens per second)
       eval time =    3465.17 ms /   179 tokens (   19.36 ms per token,    51.66 tokens per second)
      total time =    3569.24 ms /   438 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 48630 | processing task
slot update_slots: id  0 | task 48630 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 326
slot update_slots: id  0 | task 48630 | kv cache rm [67, end)
slot update_slots: id  0 | task 48630 | prompt processing progress, n_past = 326, n_tokens = 259, progress = 0.794479
slot update_slots: id  0 | task 48630 | prompt done, n_past = 326, n_tokens = 259
/home/ed/.local/share/dorothy/user/commands/llama-cpp-server: line 8: 709629 Segmentation fault      (core dumped) ~/Projects/llama.cpp/build/bin/llama-server --ctx-size $CTX_SIZE --jinja -fa -hf "$MODEL" --host 0.0.0.0 -ngl $OFFLOAD_NUM $OTHERARGS
edmcman (Author) commented Mar 6, 2025

This time around it was the java_47 test that failed. I think the other crashes were also related to the java tests.

I don't think there is a way to run a single specific test in BFCL, but we can at least narrow it down with --test-category java. I'm going to try something like llama-server --verbose 2>&1 | tail -n1000 to see if I can pick up anything helpful before it crashes.

edmcman (Author) commented Mar 6, 2025

java_47 seems to be a consistent problem. In a new run, it hasn't crashed yet, but it has been performing inference on it for about five minutes now...

Here is the "question":

{"id": "java_47", "question": [[{"role": "user", "content": "Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name 'CERTIFICATE' and the value being a 1024-character long Base64 string with 'MIIFdTCCBF2gAwIBAgISESG'?"}]], "function": [{"name": "LargeHandshakeTest.format", "description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters.", "parameters": {"type": "dict", "properties": {"name": {"type": "String", "description": "The name of the Java constant."}, "value": {"type": "String", "description": "The value of the Java constant, which will be split into multiple lines if it's too long."}}, "required": ["name", "value"]}}]}

and an answer:

{"id": "java_47", "ground_truth": [{"LargeHandshakeTest.format": {"name": ["CERTIFICATE"], "value": ["MIIFdTCCBF2gAwIBAgISESG"]}}]}

I'm not sure why that would be causing an issue.

edmcman (Author) commented Mar 6, 2025

Adding -n -2 to the llama-server args avoids the crash, but then all of the results -- not just java_47 -- become just <tool_call> 😓 Yup, that's the entire output: no content and no closing tag. I'm not sure what's going on there either; maybe that's a separate issue?
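
For reference, this is roughly what I mean (the same command as in the issue description, with only the -n flag added; -n -2 should let generation run until the context is full):

# sketch: original invocation from above, plus -n -2
~/Projects/llama.cpp/build/bin/llama-server --ctx-size 0 --jinja -fa \
    -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --host 0.0.0.0 -ngl 100 -n -2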

On the bright side, since the -n -2 run "succeeded", I did capture the query:

{
  "id": "java_47",
  "result": "<tool_call>",
  "inference_log": [
    {
      "role": "inference_input",
      "content": {
        "message": "[{'role': 'user', 'content': \"Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name 'CERTIFICATE' and the value being a 1024-character long Base64 string with 'MIIFdTCCBF2gAwIBAgISESG'?\"}]",
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "LargeHandshakeTest_format",
              "description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters. Note that the provided function is in Java 8 SDK syntax.",
              "parameters": {
                "type": "object",
                "properties": {
                  "name": {
                    "type": "string",
                    "description": "The name of the Java constant. This is Java String type parameter in string representation."
                  },
                  "value": {
                    "type": "string",
                    "description": "The value of the Java constant, which will be split into multiple lines if it's too long. This is Java String type parameter in string representation."
                  }
                },
                "required": [
                  "name",
                  "value"
                ]
              }
            }
          }
        ]
      }
    }
  ],
  "input_token_count": 326,
  "output_token_count": 1,
  "latency": 0.11321735382080078
}

I'll try to convert this into a curl-based test.

edmcman (Author) commented Mar 6, 2025

curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user", 
      "content": "Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name '\''CERTIFICATE'\'' and the value being a 1024-character long Base64 string with '\''MIIFdTCCBF2gAwIBAgISESG'\''"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "LargeHandshakeTest_format",
        "description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters. Note that the provided function is in Java 8 SDK syntax.",
        "parameters": {
          "type": "object",
          "properties": {
            "name": {
              "type": "string",
              "description": "The name of the Java constant. This is Java String type parameter in string representation."
            },
            "value": {
              "type": "string",
              "description": "The value of the Java constant, which will be split into multiple lines if it'\''s too long. This is Java String type parameter in string representation."
            }
          },
          "required": [
            "name",
            "value"
          ]
        }
      }
    }
  ]
}'

This seems to trigger the issue reliably. I also grabbed the tail of the --verbose log: verbose.log

edmcman (Author) commented Mar 6, 2025

Here's the full log: log.zip

bzcat /tmp/log.bz2 | fgrep 'next token' | awk '{print $18}' | uniq -c
      1 '<tool_call>'
      1 '
      1 '{"'
      1 'name'
      1 '":'
      1 '
      1 'Large'
      1 'Hand'
      1 'shake'
      1 'Test'
      1 '_format'
      1 '",'
      1 '
      1 'arguments'
      1 '":'
      1 '
      1 'name'
      1 '":'
      1 '
      1 'CERT'
      1 'IFICATE'
      1 '",'
      1 '
      1 'value'
      1 '":'
      1 '
      1 'MI'
      1 'IF'
      1 'dT'
      1 'CC'
      1 'BF'
      1 '2'
      1 'g'
      1 'Aw'
      1 'IB'
      1 'Ag'
      1 'ISE'
      1 'SG'
   5427 'XXXXXXXX'

So the model just outputs a bunch of gibberish.

ochafik (Collaborator) commented Mar 6, 2025

So the model just outputs a bunch of gibberish.

@edmcman Adding --repeat-penalty 2.0 prevents the model from entering that infinite loop (no clue what a good penalty value is, tbh, but maybe that model needs one to behave more reasonably).
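
A sketch of what I mean, reusing the command from the issue description (the only change is the added flag; tune the value to taste):

# sketch: edmcman's original invocation, plus a repetition penalty
~/Projects/llama.cpp/build/bin/llama-server --ctx-size 0 --jinja -fa \
    -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --host 0.0.0.0 -ngl 100 \
    --repeat-penalty 2.0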

ochafik (Collaborator) commented Mar 6, 2025

@edmcman Alternatively, its cousin bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M behaves well on that example w/o a penalty.

Btw, I've also been trying to run the benchmark; I may have written more code than needed, haha.

edmcman (Author) commented Mar 6, 2025

@edmcman Adding --repeat-penalty 2.0 prevents the model from entering that infinite loop (no clue what a good penalty value is, tbh, but maybe that model needs one to behave more reasonably).

Nice, I was just starting to play with that before ending my work day, but I went in the wrong direction (0.9).

Btw, I've also been trying to run the benchmark; I may have written more code than needed, haha.

Wow, you went all out! Good for you! I felt a little guilty about my one-line hack :) I was a little surprised they didn't already have an option to use an existing OpenAI-compatible server while still passing the tools as tools.

edmcman (Author) commented Mar 7, 2025

Btw, I found that this paper recommends a repetition penalty of 1.2.

ggerganov (Member) commented Mar 11, 2025

@ochafik I noticed the discussion about the repetition penalty. Without knowing much detail about the use case, I just tested the curl command from #12234 (comment), and with greedy sampling (i.e. "samplers": ["top_k"], "top_k": 1) I get the following output:

    "content": "<tool_call>\n{\"name\": \"LargeHandshakeTest_format\", \"arguments\": {\"name\": \"CERTIFICATE\", \"value\": \"MIIFdTCCBF2gAwIBAgISESGXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",

This made me think that, for some reason, the model does not want to sample the closing quotes of the "value". I then realized that the request asks for a "1024-character long" value. I don't really understand what this is supposed to mean, but I suspect it makes the model try to generate a 1024-character long string in the "value", and that's why it keeps repeating XXXX... forever. So I tried to simply reword the request like this (i.e. remove the text about the "1024-character long" string):

#!/bin/bash

curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
  "model": "gpt-4",
  "temperature": 0.0,
  "n_predict": 48,
  "messages": [
    {
      "role": "user",
      "content": "Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name '\''CERTIFICATE'\'' and the value being a Base64 string with '\''MIIFdTCCBF2gAwIBAgISESG'\''"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "LargeHandshakeTest_format",
        "description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters. Note that the provided function is in Java 8 SDK syntax.",
        "parameters": {
          "type": "object",
          "properties": {
            "name": {
              "type": "string",
              "description": "The name of the Java constant. This is Java String type parameter in string representation."
            },
            "value": {
              "type": "string",
              "description": "The value of the Java constant, which will be split into multiple lines if it'\''s too long. This is Java String type parameter in string representation."
            }
          },
          "required": [
            "name",
            "value"
          ]
        }
      }
    }
  ]
}'

This seems to work correctly, producing:

    "content": "<tool_call>\n{\"name\": \"LargeHandshakeTest_format\", \"arguments\": {\"name\": \"CERTIFICATE\", \"value\": \"MIIFdTCCBF2gAwIBAgISESG\"}}\n</tool_call>",

Note that this does not require a repetition penalty.

So in summary, I strongly believe that the best sampling setting for any model is simple greedy sampling. This is especially true for constrained generation like in this case. Repetition penalties should always be avoided; needing one always turns out to be due to some underlying problem that should be solved instead of being masked with a penalty.
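
For completeness, greedy sampling can be requested by adding the two fields I mentioned above to the JSON body of the curl command (a minimal fragment; everything else in the request stays the same):

  "samplers": ["top_k"],
  "top_k": 1,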

Whenever you encounter a use case where it looks like greedy sampling is not optimal, please let me know and I will try to show that it's not the case. Hope this helps!

edmcman (Author) commented Mar 12, 2025

@ggerganov I have a (perhaps silly) question: Why isn't simple greedy sampling the default for llama-server?
