Eval bug: Excessive stack usage during tool calling #12234
Comments
This time around it was the […]. I don't think there is a way to run a specific test in BFCL, but we can do […].
Here is the "question": {"id": "java_47", "question": [[{"role": "user", "content": "Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name 'CERTIFICATE' and the value being a 1024-character long Base64 string with 'MIIFdTCCBF2gAwIBAgISESG'?"}]], "function": [{"name": "LargeHandshakeTest.format", "description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters.", "parameters": {"type": "dict", "properties": {"name": {"type": "String", "description": "The name of the Java constant."}, "value": {"type": "String", "description": "The value of the Java constant, which will be split into multiple lines if it's too long."}}, "required": ["name", "value"]}}]} and an answer: {"id": "java_47", "ground_truth": [{"LargeHandshakeTest.format": {"name": ["CERTIFICATE"], "value": ["MIIFdTCCBF2gAwIBAgISESG"]}}]} I'm not sure why that would be causing an issue. |
Adding […]. On the bright side, I did get the query, since the […]:

{
"id": "java_47",
"result": "<tool_call>",
"inference_log": [
{
"role": "inference_input",
"content": {
"message": "[{'role': 'user', 'content': \"Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name 'CERTIFICATE' and the value being a 1024-character long Base64 string with 'MIIFdTCCBF2gAwIBAgISESG'?\"}]",
"tools": [
{
"type": "function",
"function": {
"name": "LargeHandshakeTest_format",
"description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters. Note that the provided function is in Java 8 SDK syntax.",
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "The name of the Java constant. This is Java String type parameter in string representation."
},
"value": {
"type": "string",
"description": "The value of the Java constant, which will be split into multiple lines if it's too long. This is Java String type parameter in string representation."
}
},
"required": [
"name",
"value"
]
}
}
}
]
}
}
],
"input_token_count": 326,
"output_token_count": 1,
"latency": 0.11321735382080078
}

I'll try to convert this into a curl-based test.
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "gpt-4",
"messages": [
{
"role": "user",
"content": "Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name '\''CERTIFICATE'\'' and the value being a 1024-character long Base64 string with '\''MIIFdTCCBF2gAwIBAgISESG'\''"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "LargeHandshakeTest_format",
"description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters. Note that the provided function is in Java 8 SDK syntax.",
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "The name of the Java constant. This is Java String type parameter in string representation."
},
"value": {
"type": "string",
"description": "The value of the Java constant, which will be split into multiple lines if it'\''s too long. This is Java String type parameter in string representation."
}
},
"required": [
"name",
"value"
]
}
}
}
]
}'

This reliably seems to trigger the issue. I also got the tail of the […].
Here's the full log: log.zip
So the model just outputs a bunch of gibberish.
@edmcman adding a […]
@edmcman Alternatively, its cousin […]. Btw, I've also been trying to run the benchmark; I may have written more code than needed, haha.
Nice, I was just starting to play with that before ending my work day, but I went in the wrong direction (0.9).
Wow, you went all out! Good for you! I felt a little guilty with my one-line hack :) I was a little surprised they didn't already have an option to use an existing OpenAI-compatible server but pass the tools as tools.
Btw, I found that this paper recommends a repetition penalty of 1.2.
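If anyone wants to try that value against llama-server, one hedged way is to add llama.cpp's sampling fields to the request body. Note that repeat_penalty is a llama.cpp-specific extension rather than an OpenAI parameter, and whether the OpenAI-compatible endpoint applies it may depend on the build, so treat the sketch below as an assumption to verify.

import requests

# Hedged sketch: the same kind of chat/tools request as the curl commands in this
# thread, with llama.cpp's repetition-penalty field added to the body. This is a
# llama.cpp extension, not an OpenAI parameter; verify that your server build
# actually applies it. Prompt and tools are placeholders -> reuse the ones above.
payload = {
    "model": "gpt-4",                                   # placeholder
    "messages": [{"role": "user", "content": "..."}],   # paste the CERTIFICATE prompt here
    "tools": [],                                        # paste the LargeHandshakeTest_format tool here
    "repeat_penalty": 1.2,                              # value suggested by the paper
}
print(requests.post("http://localhost:8080/v1/chat/completions", json=payload).json())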
@ochafik I noticed the discussion about the repetition penalty. Without knowing many details about the use case, I just tested the […] and got:

"content": "<tool_call>\n{\"name\": \"LargeHandshakeTest_format\", \"arguments\": {\"name\": \"CERTIFICATE\", \"value\": \"MIIFdTCCBF2gAwIBAgISESGXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",

This made me think that for some reason the model does not want to sample the closing quotes of the […]. I then ran the following:

#!/bin/bash
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "gpt-4",
"temperature": 0.0,
"n_predict": 48,
"messages": [
{
"role": "user",
"content": "Help me output a formatted Java constant declaration for a large Base64 encoded string representing a certificate, with the constant name '\''CERTIFICATE'\'' and the value being a Base64 string with '\''MIIFdTCCBF2gAwIBAgISESG'\''"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "LargeHandshakeTest_format",
"description": "Outputs a formatted Java constant declaration for a given name and value, splitting the value into multiple lines if it exceeds 60 characters. Note that the provided function is in Java 8 SDK syntax.",
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "The name of the Java constant. This is Java String type parameter in string representation."
},
"value": {
"type": "string",
"description": "The value of the Java constant, which will be split into multiple lines if it'\''s too long. This is Java String type parameter in string representation."
}
},
"required": [
"name",
"value"
]
}
}
}
]
}'

This seems to work correctly, producing:

"content": "<tool_call>\n{\"name\": \"LargeHandshakeTest_format\", \"arguments\": {\"name\": \"CERTIFICATE\", \"value\": \"MIIFdTCCBF2gAwIBAgISESG\"}}\n</tool_call>",

Note that this does not require a repetition penalty.

So, in summary, I strongly believe that the best sampling setting for any model is simple greedy sampling. This is especially true for constrained generation like in this case. Repetition penalties should always be avoided; needing them always turns out to be due to some underlying problem that should be solved instead of adding a repetition penalty. Whenever you encounter a use case where it looks like greedy sampling is not optimal, please let me know and I will try to show that it's not the case. Hope this helps!
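One hedged way to check this empirically is to send the same request twice, once with greedy decoding and once with the server's default sampling, and compare whether the generated tool-call arguments terminate cleanly. The sketch below is illustrative only; the endpoint, model name, and abbreviated tool description are placeholders, not the benchmark's code.

import requests

# Illustrative comparison: same tool-calling request with greedy decoding
# (temperature 0) vs. default sampling. Endpoint/model are placeholders and the
# tool description is abbreviated; this is not part of BFCL or llama.cpp.
URL = "http://localhost:8080/v1/chat/completions"
tool = {"type": "function", "function": {
    "name": "LargeHandshakeTest_format",
    "description": "Formats a Java constant declaration, splitting long values.",
    "parameters": {"type": "object",
                   "properties": {"name": {"type": "string"}, "value": {"type": "string"}},
                   "required": ["name", "value"]}}}
base = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content":
        "Help me output a formatted Java constant declaration with the constant name "
        "'CERTIFICATE' and the value being a Base64 string with 'MIIFdTCCBF2gAwIBAgISESG'"}],
    "tools": [tool],
    "max_tokens": 128,
}
for label, extra in [("greedy", {"temperature": 0.0}), ("default", {})]:
    msg = requests.post(URL, json={**base, **extra}, timeout=300).json()["choices"][0]["message"]
    # A well-formed result should surface as tool_calls rather than runaway content.
    print(label, "->", "tool_calls" if msg.get("tool_calls") else repr(msg.get("content"))[:120])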
@ggerganov I have a (perhaps silly) question: why isn't simple greedy sampling the default for […]?
Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
version: 4840 (3ffbbd5)
built with Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144) for x86_64-pc-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
i9-13900HX + NVIDIA GeForce RTX 4070
Models
bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
Problem description & steps to reproduce
cc/@ochafik
I am attempting to run BFCL on llama-server, and so far I have triggered a crash twice. It does not appear to be deterministic, unfortunately. In one instance, I was able to catch the crash with gdb. Here is the end of the backtrace:
The remaining 87096 stack frames were identical. So while I have not been able to find the exact input that triggered the crash yet, I hoped that this might be enough of a clue as to what is going on.
Here is some more information about what I am doing:
/home/ed/Projects/llama.cpp/build/bin/llama-server --ctx-size 0 --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --host 0.0.0.0 -ngl 100
python /home/ed/Projects/gorilla/berkeley-function-call-leaderboard/venv/bin/bfcl generate --model gpt-4-turbo-2024-04-09-FC --test-category all --include-input-log
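Since the crash is not deterministic, one way to hunt for a failing input is to replay requests in a loop until the server stops responding. A minimal sketch, assuming the same local endpoint as above; the payload is a placeholder to be filled with one of the BFCL requests.

import time
import requests

# Hedged helper, not part of the report: replay a request repeatedly and stop when
# llama-server becomes unreachable, which is what a server crash looks like from the
# client side. URL and payload are placeholders.
URL = "http://localhost:8080/v1/chat/completions"
payload = {"model": "gpt-4", "messages": [{"role": "user", "content": "..."}], "tools": []}

for i in range(1000):
    try:
        requests.post(URL, json=payload, timeout=300).raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"request {i} failed ({exc}); the server has likely crashed")
        break
    time.sleep(0.1)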
First Bad Commit
No response
Relevant log output