v1/chat/completions endpoint does not honor cache_prompt
#4329
Comments
Can you verify that the issue is resolved with: #4347
Thanks for the response! Unfortunately, it's still not honoring the cache. I pulled, did a make clean and a make. Here are server log outputs for two requests. The first began with 707 tokens and a request to generate 50 more. It took 7 seconds for prompt eval and 4 seconds for prediction. The second request sent the catenation of the previous prompt and the generated text with no alterations. It took 7.4 seconds for prompt eval and 4 for prediction.
The server command line was:
and the first part of the startup log is:
I've verified that the JSON I'm submitting contains "cache_prompt": true.
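For reference, a minimal sketch of the two-request pattern being described, assuming a server listening on 127.0.0.1:8080; the prompt text and model name are placeholders rather than the actual payloads from the logs:

# First request: the full prompt is evaluated and ~50 tokens are generated.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "cache_prompt": true,
    "max_tokens": 50,
    "messages": [
      { "role": "user", "content": "<original prompt>" }
    ]
  }'

# Second request: the same prompt with the generated text appended, unchanged.
# If cache_prompt is honored, only the newly appended tokens should need evaluation.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "cache_prompt": true,
    "max_tokens": 50,
    "messages": [
      { "role": "user", "content": "<original prompt + generated text>" }
    ]
  }'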
I'm testing with the following:

curl -s http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d '{
    "model": "gpt-3.5-turbo", "cache_prompt": true,
    "messages": [
        {
            "role": "system",
            "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
        },
        {
            "role": "user",
            "content": "Write a limerick about python exceptions"
        }
    ]
}' | jq

If possible, please provide the server log output.
I ran your curl command twice in succession and got the following log output:
The only difference between my command and yours was that I changed the host IP from 127.0.0.1 to 192.168.86.31. (I'm running the server on my M1 mac mini and sending requests from my laptop.) I then restarted the server to serve on 127.0.0.1 and ran the command from a terminal window on the mac mini. Same outcome. What other tests or output can I run to help sort this out?
Just noticed this in your log:
It means you are still on an older commit, so your server does not have the change that I implemented in #4347. The commit you should be testing on is: ef455cb
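For anyone following along, switching to that commit and rebuilding might look roughly like this, assuming a plain llama.cpp checkout built with make as earlier in the thread:

git fetch origin          # pick up recent commits
git checkout ef455cb      # the commit referenced above (contains the change from #4347)
make clean && make        # rebuild the server
# then restart the server with the same command line as in the report, e.g.:
./server -m $MODELS/zephyr-7b-beta.Q4_K_M.gguf -c 8192 -t 8 --host 192.168.86.31 -v >> /tmp/llog2.txt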
That did it! Sorry for not noticing your fix was on a different branch. See below the log output when I send a sizeable prompt via my ficta application. The first request has to process all 809 prompt tokens (8 seconds) but the second evaluates only 73 tokens (1.1 seconds). Huge improvement. Thanks ever so much. Just so I'm clear on how caching works,
Expected Behavior
This is a follow-on to issue #4287. The README for server says that generation tags, e.g. mirostat_tau, are honored by the OAI-compatible v1/chat/completions interface. That doesn't seem to be happening with either cache_prompt or slot_id.

Current Behavior
The cache is not used by follow-on requests that include previous prompts and generated text. Instead, server re-evaluates the entire prompt. See the verbose log output appended below.

Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
Physical (or virtual) hardware you are using, e.g. for Linux:
Model Name: Mac mini
Model Identifier: Macmini9,1
Model Number: Z12P000KGLL/A
Chip: Apple M1
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 16 GB
System Firmware Version: 10151.41.12
OS Loader Version: 10151.41.12
Operating System: macOS Sonoma
Failure Information (for bugs)
To reproduce:

1. Start the server, e.g.:
   ./server -m $MODELS/zephyr-7b-beta.Q4_K_M.gguf -c 8192 -t 8 --host 192.168.86.31 -v >> /tmp/llog2.txt
2. Send a request to the v1/chat/completions endpoint with "cache_prompt": true and "slot_id": 1 as JSON fields in the request (see the example request after this list).
Failure Logs