Eval bug: Excessive stack usage during tool calling #12234

Open

edmcman opened this issue Mar 6, 2025 · 11 comments

edmcman (Author) commented Mar 6, 2025

Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
version: 4840 (3ffbbd5)
built with Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1exp120240731145000.144) for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

i9-13900HX + NVIDIA GeForce RTX 4070

Models

bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M

Problem description & steps to reproduce

cc/@ochafik

I am attempting to run BFCL on llama-server, and so far I have triggered a crash twice. It does not appear to be deterministic, unfortunately. In one instance, I was able to catch the crash with gdb. Here is the end of the backtrace:

#87097 0x00005669dac2b7f9 in bool std::__detail::__regex_algo_impl<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> >(__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, __gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__cxx11::match_results<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&, std::regex_constants::match_flag_type, std::__detail::_RegexExecutorPolicy, bool) ()
#87098 0x00007116a7f3ac54 in llama_grammar_accept_impl(llama_grammar&, int) () from /home/ed/Projects/llama.cpp/build/bin/libllama.so
#87099 0x00005669dadb179a in common_sampler_accept(common_sampler*, int, bool) ()
#87100 0x00005669dac5c626 in server_context::update_slots() ()
#87101 0x00005669dabe4886 in server_queue::start_loop() ()
#87102 0x00005669dabb0bc8 in main ()

The remaining 87,096 stack frames were identical. While I have not yet been able to find the exact input that triggered the crash, I hope this is enough of a clue as to what is going on.
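
If it helps to narrow things down, my working assumption is that the grammar/sampling path recurses once per accepted token and eventually exhausts the default 8 MiB thread stack. A rough, purely hypothetical way to test that assumption is to raise the stack limit before launching the server and see whether the crash disappears or just happens later:

# assumption: the segfault is stack exhaustion from deep recursion;
# raising the soft stack limit is a diagnostic/workaround, not a fix
ulimit -s unlimited
~/Projects/llama.cpp/build/bin/llama-server --ctx-size 0 --jinja -fa \
    -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --host 0.0.0.0 -ngl 100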

Here is some more information about what I am doing:

  • /home/ed/Projects/llama.cpp/build/bin/llama-server --ctx-size 0 --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --host 0.0.0.0 -ngl 100
  • python /home/ed/Projects/gorilla/berkeley-function-call-leaderboard/venv/bin/bfcl generate --model gpt-4-turbo-2024-04-09-FC --test-category all --include-input-log
  • I added this patch:
diff --git a/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py b/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
index fbf7c0f..fc0da1f 100644
--- a/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
+++ b/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py
@@ -22,7 +22,7 @@ class OpenAIHandler(BaseHandler):
     def __init__(self, model_name, temperature) -> None:
         super().__init__(model_name, temperature)
         self.model_style = ModelStyle.OpenAI
-        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"), base_url="http://localhost:8080")
 
     def decode_ast(self, result, language="Python"):
         if "FC" in self.model_name or self.is_fc_model:

First Bad Commit

No response

Relevant log output

srv  update_slots: all slots are idle
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 48450 | processing task
slot update_slots: id  0 | task 48450 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 326
slot update_slots: id  0 | task 48450 | kv cache rm [67, end)
slot update_slots: id  0 | task 48450 | prompt processing progress, n_past = 326, n_tokens = 259, progress = 0.794479
slot update_slots: id  0 | task 48450 | prompt done, n_past = 326, n_tokens = 259
slot      release: id  0 | task 48450 | stop processing: n_past = 504, truncated = 0
slot print_timing: id  0 | task 48450 | 
prompt eval time =     104.08 ms /   259 tokens (    0.40 ms per token,  2488.52 tokens per second)
       eval time =    3465.17 ms /   179 tokens (   19.36 ms per token,    51.66 tokens per second)
      total time =    3569.24 ms /   438 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 48630 | processing task
slot update_slots: id  0 | task 48630 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 326
slot update_slots: id  0 | task 48630 | kv cache rm [67, end)
slot update_slots: id  0 | task 48630 | prompt processing progress, n_past = 326, n_tokens = 259, progress = 0.794479
slot update_slots: id  0 | task 48630 | prompt done, n_past = 326, n_tokens = 259
/home/ed/.local/share/dorothy/user/commands/llama-cpp-server: line 8: 709629 Segmentation fault      (core dumped) ~/Projects/llama.cpp/build/bin/llama-server --ctx-size $CTX_SIZE --jinja -fa -hf "$MODEL" --host 0.0.0.0 -ngl $OFFLOAD_NUM $OTHERARGS
edmcman (Author) commented Mar 6, 2025

This time around it was the java_47 test that failed. I think the other crashes were also related to the java tests.

I don't think there is a way to run a single specific test in BFCL, but we can at least narrow it down with --test-category java. I'm going to try something like llama-server --verbose 2>&1 | tail -n1000 to see if I can pick up anything helpful before it crashes.

edmcman (Author) commented Mar 6, 2025

java_47 seems to be a consistent problem. In a new run, it hasn't crashed yet, but it has been performing inference on it for about five minutes now...

Here is the "question":

{"id": "java_47", "question": [[{"role": "user", "content": "Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name 'CERTIFICATE' and the value being a 1024-character long Base64 string with 'MIIFdTCCBF2gAwIBAgISESG'?"}]], "function": [{"name": "LargeHandshakeTest.format", "description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters.", "parameters": {"type": "dict", "properties": {"name": {"type": "String", "description": "The name of the Java constant."}, "value": {"type": "String", "description": "The value of the Java constant, which will be split into multiple lines if it's too long."}}, "required": ["name", "value"]}}]}

and an answer:

{"id": "java_47", "ground_truth": [{"LargeHandshakeTest.format": {"name": ["CERTIFICATE"], "value": ["MIIFdTCCBF2gAwIBAgISESG"]}}]}

I'm not sure why that would be causing an issue.

edmcman (Author) commented Mar 6, 2025

Adding -n -2 to the llama-server args avoids the crash, but then all of the results -- not just java_47 -- become just <tool_call> 😓 Yup, that's the entire output: no content and no closing tag. I'm not sure what's going on there either; maybe that's a separate issue?
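
For reference, this is roughly what I mean (the same command as in the issue description, with only the -n flag added; -n -2 should let generation run until the context is full):

# sketch: original invocation from above, plus -n -2
~/Projects/llama.cpp/build/bin/llama-server --ctx-size 0 --jinja -fa \
    -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --host 0.0.0.0 -ngl 100 -n -2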

On the bright side, since the -n -2 run "succeeded", I did capture the query:

{
  "id": "java_47",
  "result": "<tool_call>",
  "inference_log": [
    {
      "role": "inference_input",
      "content": {
        "message": "[{'role': 'user', 'content': \"Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name 'CERTIFICATE' and the value being a 1024-character long Base64 string with 'MIIFdTCCBF2gAwIBAgISESG'?\"}]",
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "LargeHandshakeTest_format",
              "description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters. Note that the provided function is in Java 8 SDK syntax.",
              "parameters": {
                "type": "object",
                "properties": {
                  "name": {
                    "type": "string",
                    "description": "The name of the Java constant. This is Java String type parameter in string representation."
                  },
                  "value": {
                    "type": "string",
                    "description": "The value of the Java constant, which will be split into multiple lines if it's too long. This is Java String type parameter in string representation."
                  }
                },
                "required": [
                  "name",
                  "value"
                ]
              }
            }
          }
        ]
      }
    }
  ],
  "input_token_count": 326,
  "output_token_count": 1,
  "latency": 0.11321735382080078
}

I'll try to convert this into a curl-based test.

edmcman (Author) commented Mar 6, 2025

curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user", 
      "content": "Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name '\''CERTIFICATE'\'' and the value being a 1024-character long Base64 string with '\''MIIFdTCCBF2gAwIBAgISESG'\''"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "LargeHandshakeTest_format",
        "description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters. Note that the provided function is in Java 8 SDK syntax.",
        "parameters": {
          "type": "object",
          "properties": {
            "name": {
              "type": "string",
              "description": "The name of the Java constant. This is Java String type parameter in string representation."
            },
            "value": {
              "type": "string",
              "description": "The value of the Java constant, which will be split into multiple lines if it'\''s too long. This is Java String type parameter in string representation."
            }
          },
          "required": [
            "name",
            "value"
          ]
        }
      }
    }
  ]
}'

This seems to trigger the issue reliably. I also grabbed the tail of the --verbose log: verbose.log

edmcman (Author) commented Mar 6, 2025

Here's the full log: log.zip

bzcat /tmp/log.bz2 | fgrep 'next token' | awk '{print $18}' | uniq -c
      1 '<tool_call>'
      1 '
      1 '{"'
      1 'name'
      1 '":'
      1 '
      1 'Large'
      1 'Hand'
      1 'shake'
      1 'Test'
      1 '_format'
      1 '",'
      1 '
      1 'arguments'
      1 '":'
      1 '
      1 'name'
      1 '":'
      1 '
      1 'CERT'
      1 'IFICATE'
      1 '",'
      1 '
      1 'value'
      1 '":'
      1 '
      1 'MI'
      1 'IF'
      1 'dT'
      1 'CC'
      1 'BF'
      1 '2'
      1 'g'
      1 'Aw'
      1 'IB'
      1 'Ag'
      1 'ISE'
      1 'SG'
   5427 'XXXXXXXX'

So the model just outputs a bunch of gibberish.

ochafik (Collaborator) commented Mar 6, 2025

So the model just outputs a bunch of gibberish.

@edmcman Adding --repeat-penalty 2.0 prevents the model from entering that infinite loop (no clue what a good penalty value is, tbh, but maybe that model needs one to behave more reasonably).
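
A sketch of what I mean, reusing the command from the issue description (the only change is the added flag; tune the value to taste):

# sketch: edmcman's original invocation, plus a repetition penalty
~/Projects/llama.cpp/build/bin/llama-server --ctx-size 0 --jinja -fa \
    -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --host 0.0.0.0 -ngl 100 \
    --repeat-penalty 2.0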

ochafik (Collaborator) commented Mar 6, 2025

@edmcman Alternatively, its cousin bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M behaves well on that example w/o a penalty.

Btw, I've also been trying to run the benchmark; I may have written more code than needed, haha.

edmcman (Author) commented Mar 6, 2025

@edmcman Adding --repeat-penalty 2.0 prevents the model from entering that infinite loop (no clue what a good penalty value is, tbh, but maybe that model needs one to behave more reasonably).

Nice, I was just starting to play with that before ending my work day, but I went in the wrong direction (0.9).

Btw, I've also been trying to run the benchmark; I may have written more code than needed, haha.

Wow, you went all out! Good for you! I felt a little guilty about my one-line hack :) I was a little surprised they didn't already have an option to use an existing OpenAI-compatible server while still passing the tools as tools.

edmcman (Author) commented Mar 7, 2025

Btw, I found that this paper recommends a repetition penalty of 1.2.

ggerganov (Member) commented Mar 11, 2025

@ochafik I noticed the discussion about the repetition penalty. Without knowing much detail about the use case, I just tested the curl command from #12234 (comment), and with greedy sampling (i.e. "samplers": ["top_k"], "top_k": 1) I get the following output:

    "content": "<tool_call>\n{\"name\": \"LargeHandshakeTest_format\", \"arguments\": {\"name\": \"CERTIFICATE\", \"value\": \"MIIFdTCCBF2gAwIBAgISESGXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",

This made me think that, for some reason, the model does not want to sample the closing quotes of the "value". I then realized that the request asks for a "1024-character long" value. I don't really understand what this is supposed to mean, but I suspect it makes the model try to generate a 1024-character long string in the "value", and that's why it keeps repeating XXXX... forever. So I tried to simply reword the request like this (i.e. remove the text about the "1024-character long" string):

#!/bin/bash

curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
  "model": "gpt-4",
  "temperature": 0.0,
  "n_predict": 48,
  "messages": [
    {
      "role": "user",
      "content": "Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name '\''CERTIFICATE'\'' and the value being a Base64 string with '\''MIIFdTCCBF2gAwIBAgISESG'\''"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "LargeHandshakeTest_format",
        "description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters. Note that the provided function is in Java 8 SDK syntax.",
        "parameters": {
          "type": "object",
          "properties": {
            "name": {
              "type": "string",
              "description": "The name of the Java constant. This is Java String type parameter in string representation."
            },
            "value": {
              "type": "string",
              "description": "The value of the Java constant, which will be split into multiple lines if it'\''s too long. This is Java String type parameter in string representation."
            }
          },
          "required": [
            "name",
            "value"
          ]
        }
      }
    }
  ]
}'

This seems to work correctly, producing:

    "content": "<tool_call>\n{\"name\": \"LargeHandshakeTest_format\", \"arguments\": {\"name\": \"CERTIFICATE\", \"value\": \"MIIFdTCCBF2gAwIBAgISESG\"}}\n</tool_call>",

Note that this does not require a repetition penalty.

So in summary, I strongly believe that the best sampling setting for any model is simple greedy sampling. This is especially true for constrained generation like in this case. Repetition penalties should always be avoided; needing one always turns out to be due to some underlying problem that should be solved instead of being masked with a penalty.
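
For completeness, greedy sampling can be requested by adding the two fields I mentioned above to the JSON body of the curl command (a minimal fragment; everything else in the request stays the same):

  "samplers": ["top_k"],
  "top_k": 1,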

Whenever you encounter a use case where it looks like greedy sampling is not optimal, please let me know and I will try to show that it's not the case. Hope this helps!

edmcman (Author) commented Mar 12, 2025

@ggerganov I have a (perhaps silly) question: Why isn't simple greedy sampling the default for llama-server?
