v1/chat/completions endpoint does not honor cache_prompt #4329

Closed
Michael-F-Ellis opened this issue Dec 4, 2023 · 7 comments · Fixed by #4347

@Michael-F-Ellis

Michael-F-Ellis commented Dec 4, 2023

Expected Behavior

This is a follow-on to issue #4287. The README for server says that generation parameters, e.g. mirostat_tau, are honored by the OAI-compatible v1/chat/completions interface. That doesn't seem to be happening with either cache_prompt or slot_id.

Current Behavior

The cache is not used by follow-on requests that include previous prompts and generated text. Instead, the server re-evaluates the entire prompt. See the verbose log output appended below.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.

  • Physical (or virtual) hardware you are using:
    Model Name: Mac mini
    Model Identifier: Macmini9,1
    Model Number: Z12P000KGLL/A
    Chip: Apple M1
    Total Number of Cores: 8 (4 performance and 4 efficiency)
    Memory: 16 GB
    System Firmware Version: 10151.41.12
    OS Loader Version: 10151.41.12

  • Operating System: macOS Sonoma

$ python3 --version
Python 3.10.9
$ make --version
GNU Make 3.81
$ g++ --version
Apple clang version 15.0.0 (clang-1500.0.40.1)

Failure Information (for bugs)

To reproduce:

  1. Start the server with a suitable GGUF model (I've tested with zephyr-7B and mistral-7B-open-orca),
    e.g. ./server -m $MODELS/zephyr-7b-beta.Q4_K_M.gguf -c 8192 -t 8 --host 192.168.86.31 -v >> /tmp/llog2.txt
  2. Send a request to the v1/chat/completions endpoint with "cache_prompt": true and "slot_id": 1 as JSON fields in the request (see the sketch after this list).
  3. Concatenate the generated text with the first prompt. Add a little new text.
  4. Send a new request with the concatenated text as the prompt.
  5. Observe that the server evaluates the entire prompt instead of only the new text.
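
For reference, a minimal Python sketch of steps 2-4 (hypothetical client code, not part of the original report; it assumes the requests package and the server address from step 1):

import requests

# Server from step 1: ./server ... --host 192.168.86.31 (default port 8080)
URL = "http://192.168.86.31:8080/v1/chat/completions"

def complete(prompt: str) -> dict:
    # "cache_prompt" and "slot_id" are the fields this issue is about
    payload = {
        "model": "url",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.4,
        "max_tokens": 50,
        "cache_prompt": True,
        "slot_id": 1,
    }
    return requests.post(URL, json=payload).json()

story = "Continue the story that starts below:\nOnce upon a time, \n"
first = complete(story)                                    # step 2
story += first["choices"][0]["message"]["content"] + "\n"  # step 3
second = complete(story)                                   # step 4
# Step 5: with a working cache the second request should only need to
# evaluate the new tokens; instead the server re-evaluates everything.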

Failure Logs

{"timestamp":1701723744,"level":"INFO","function":"main","line":2658,"message":"build info","build":1595,"commit":"880f579"}
{"timestamp":1701723744,"level":"INFO","function":"main","line":2665,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
{"timestamp":1701723744,"level":"INFO","function":"main","line":3043,"message":"HTTP server listening","hostname":"192.168.86.31","port":8080}
{"timestamp":1701723787,"level":"VERBOSE","function":"update_slots","line":1783,"message":"prompt ingested","n_past":0,"cached":"","to_eval":" <|im_start|>user\n\"Continue the story that starts below:\\nOnce upon a time, \\n\"<|im_end|>\n<|im_start|>assistant\n"}
{"timestamp":1701723788,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":262,"token_text":"in","has_next_token":true,"n_remain":-1,"num_tokens_predicted":0,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723788,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":264,"token_text":" a","has_next_token":true,"n_remain":-1,"num_tokens_predicted":1,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723788,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":2082,"token_text":" far","has_next_token":true,"n_remain":-1,"num_tokens_predicted":2,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723788,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":28733,"token_text":"-","has_next_token":true,"n_remain":47,"num_tokens_predicted":3,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723788,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":1769,"token_text":"off","has_next_token":true,"n_remain":46,"num_tokens_predicted":4,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
<--- SNIP --->
{"timestamp":1701723791,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":264,"token_text":" a","has_next_token":true,"n_remain":2,"num_tokens_predicted":48,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723791,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":20180,"token_text":" prince","has_next_token":true,"n_remain":1,"num_tokens_predicted":49,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723791,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":477,"token_text":" from","has_next_token":false,"n_remain":0,"num_tokens_predicted":50,"stopped_eos":false,"stopped_word":false,"stopped_limit":true,"stopping_word":""}
{"timestamp":1701723791,"level":"INFO","function":"log_server_request","line":2608,"message":"request","remote_addr":"192.168.86.145","remote_port":49447,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
{"timestamp":1701723791,"level":"VERBOSE","function":"log_server_request","line":2613,"message":"request","request":"{\"model\":\"url\",\"messages\":[{\"role\":\"user\",\"content\":\"\\\"Continue the story that starts below:\\\\nOnce upon a time, \\\\n\\\"\"}],\"temperature\":1.4,\"n\":1,\"max_tokens\":50,\"cache_prompt\":true,\"slot_id\":1}","response":"{\"__verbose\":{\"content\":\"in a far-off kingdom, there was a young princess named Isabella. She was kind, intelligent, and had a heart full of love for her people. However, her father, the king, had arranged for her to marry a prince from\",\"generation_settings\":{\"frequency_penalty\":0.0,\"grammar\":\"\",\"ignore_eos\":false,\"logit_bias\":[],\"min_p\":0.05000000074505806,\"mirostat\":0,\"mirostat_eta\":0.0,\"mirostat_tau\":0.0,\"model\":\"/Users/mellis/Library/ApplicationSupport/faraday/models/zephyr-7b-beta.Q4_K_M.gguf\",\"n_ctx\":8192,\"n_keep\":0,\"n_predict\":50,\"n_probs\":0,\"penalize_nl\":false,\"presence_penalty\":0.0,\"repeat_last_n\":0,\"repeat_penalty\":1.100000023841858,\"seed\":0,\"stop\":[\"<|im_end|>\"],\"stream\":false,\"temp\":1.399999976158142,\"tfs_z\":0.0,\"top_k\":40,\"top_p\":0.949999988079071,\"typical_p\":0.0},\"model\":\"gpt-3.5-turbo-0613\",\"oaicompat_token_ctr\":50,\"prompt\":\"<|im_start|>user\\n\\\"Continue the story that starts below:\\\\nOnce upon a time, \\\\n\\\"<|im_end|>\\n<|im_start|>assistant\\n\",\"slot_id\":0,\"stop\":true,\"stopped_eos\":false,\"stopped_limit\":true,\"stopped_word\":false,\"stopping_word\":\"\",\"timings\":{\"predicted_ms\":3684.106,\"predicted_n\":50,\"predicted_per_second\":13.571813623169366,\"predicted_per_token_ms\":73.68212,\"prompt_ms\":817.203,\"prompt_n\":46,\"prompt_per_second\":56.28956330312053,\"prompt_per_token_ms\":17.765282608695653},\"tokens_cached\":96,\"tokens_evaluated\":46,\"tokens_predicted\":50,\"truncated\":false},\"choices\":[{\"finish_reason\":\"stop\",\"index\":0,\"message\":{\"content\":\"in a far-off kingdom, there was a young princess named Isabella. She was kind, intelligent, and had a heart full of love for her people. However, her father, the king, had arranged for her to marry a prince from\",\"role\":\"assistant\"}}],\"created\":1701723791,\"id\":\"chatcmpl-Zl1BZGt8wJ2cJZqtGVhqmlD29AXRqMke\",\"model\":\"gpt-3.5-turbo-0613\",\"object\":\"chat.completion\",\"usage\":{\"completion_tokens\":50,\"prompt_tokens\":46,\"total_tokens\":96}}"}

<--- SECOND REQUEST --->
{"timestamp":1701723858,"level":"VERBOSE","function":"update_slots","line":1783,"message":"prompt ingested","n_past":0,"cached":"","to_eval":" <|im_start|>user\n\"Continue the story that starts below:\\nOnce upon a time, \\nin a far-off kingdom, there was a young princess named Isabella. She was kind, intelligent, and had a heart full of love for her people. However, her father, the king, had arranged for her to marry a prince from a neighboring kingdom.\\n\\n\"<|im_end|>\n<|im_start|>assistant\n"}
{"timestamp":1701723860,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":2301,"token_text":"Is","has_next_token":true,"n_remain":0,"num_tokens_predicted":0,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723860,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":375,"token_text":"ab","has_next_token":true,"n_remain":0,"num_tokens_predicted":1,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723860,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":1994,"token_text":"ella","has_next_token":true,"n_remain":0,"num_tokens_predicted":2,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723860,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":403,"token_text":" was","has_next_token":true,"n_remain":47,"num_tokens_predicted":3,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
<--- SNIP --->
{"timestamp":1701723863,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":1885,"token_text":" conf","has_next_token":true,"n_remain":4,"num_tokens_predicted":46,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723863,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":1932,"token_text":"ided","has_next_token":true,"n_remain":3,"num_tokens_predicted":47,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723863,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":297,"token_text":" in","has_next_token":true,"n_remain":2,"num_tokens_predicted":48,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723863,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":559,"token_text":" her","has_next_token":true,"n_remain":1,"num_tokens_predicted":49,"stopped_eos":false,"stopped_word":false,"stopped_limit":false,"stopping_word":""}
{"timestamp":1701723864,"level":"VERBOSE","function":"process_token","line":1087,"message":"next token","token":16424,"token_text":" closest","has_next_token":false,"n_remain":0,"num_tokens_predicted":50,"stopped_eos":false,"stopped_word":false,"stopped_limit":true,"stopping_word":""}
{"timestamp":1701723864,"level":"INFO","function":"log_server_request","line":2608,"message":"request","remote_addr":"192.168.86.145","remote_port":49465,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
{"timestamp":1701723864,"level":"VERBOSE","function":"log_server_request","line":2613,"message":"request","request":"{\"model\":\"url\",\"messages\":[{\"role\":\"user\",\"content\":\"\\\"Continue the story that starts below:\\\\nOnce upon a time, \\\\nin a far-off kingdom, there was a young princess named Isabella. She was kind, intelligent, and had a heart full of love for her people. However, her father, the king, had arranged for her to marry a prince from a neighboring kingdom.\\\\n\\\\n\\\"\"}],\"temperature\":1.4,\"n\":1,\"max_tokens\":50,\"cache_prompt\":true,\"slot_id\":1}","response":"{\"__verbose\":{\"content\":\"Isabella was not opposed to the idea of marriage, but she had her heart set on a different suitor. His name was Edward, a commoner who had captured her heart with his kindness and bravery. She had confided in her closest\",\"generation_settings\":{\"frequency_penalty\":0.0,\"grammar\":\"\",\"ignore_eos\":false,\"logit_bias\":[],\"min_p\":0.05000000074505806,\"mirostat\":0,\"mirostat_eta\":0.0,\"mirostat_tau\":0.0,\"model\":\"/Users/mellis/Library/ApplicationSupport/faraday/models/zephyr-7b-beta.Q4_K_M.gguf\",\"n_ctx\":8192,\"n_keep\":0,\"n_predict\":50,\"n_probs\":0,\"penalize_nl\":false,\"presence_penalty\":0.0,\"repeat_last_n\":0,\"repeat_penalty\":1.100000023841858,\"seed\":0,\"stop\":[\"<|im_end|>\"],\"stream\":false,\"temp\":1.399999976158142,\"tfs_z\":0.0,\"top_k\":40,\"top_p\":0.949999988079071,\"typical_p\":0.0},\"model\":\"gpt-3.5-turbo-0613\",\"oaicompat_token_ctr\":50,\"prompt\":\"<|im_start|>user\\n\\\"Continue the story that starts below:\\\\nOnce upon a time, \\\\nin a far-off kingdom, there was a young princess named Isabella. She was kind, intelligent, and had a heart full of love for her people. However, her father, the king, had arranged for her to marry a prince from a neighboring kingdom.\\\\n\\\\n\\\"<|im_end|>\\n<|im_start|>assistant\\n\",\"slot_id\":0,\"stop\":true,\"stopped_eos\":false,\"stopped_limit\":true,\"stopped_word\":false,\"stopping_word\":\"\",\"timings\":{\"predicted_ms\":3714.829,\"predicted_n\":50,\"predicted_per_second\":13.459569740626016,\"predicted_per_token_ms\":74.29658,\"prompt_ms\":1406.272,\"prompt_n\":105,\"prompt_per_second\":74.66549856642243,\"prompt_per_token_ms\":13.393066666666666},\"tokens_cached\":155,\"tokens_evaluated\":105,\"tokens_predicted\":50,\"truncated\":false},\"choices\":[{\"finish_reason\":\"stop\",\"index\":0,\"message\":{\"content\":\"Isabella was not opposed to the idea of marriage, but she had her heart set on a different suitor. His name was Edward, a commoner who had captured her heart with his kindness and bravery. She had confided in her closest\",\"role\":\"assistant\"}}],\"created\":1701723864,\"id\":\"chatcmpl-UgvXb8MPIqGdEiAbnIJR7rfFlm8XX09W\",\"model\":\"gpt-3.5-turbo-0613\",\"object\":\"chat.completion\",\"usage\":{\"completion_tokens\":50,\"prompt_tokens\":105,\"total_tokens\":155}}"}

@ggerganov
Owner

Can you verify that the issue is resolved with: #4347

@Michael-F-Ellis
Author

Michael-F-Ellis commented Dec 6, 2023

Thanks for the response! Unfortunately, it's still not honoring the cache. I pulled, did a make clean and a make. Here are the server log outputs for two requests. The first began with 707 tokens and a request to generate 50 more; it took 7 seconds for prompt eval and 4 seconds for prediction. The second request sent the concatenation of the previous prompt and the generated text with no alterations; it took 7.4 seconds for prompt eval and 4 for prediction.

print_timings: prompt eval time =    7039.53 ms /   707 tokens (    9.96 ms per token,   100.43 tokens per second)
print_timings:        eval time =    3989.39 ms /    50 runs   (   79.79 ms per token,    12.53 tokens per second)
print_timings:       total time =   11028.91 ms
slot 0 released (758 tokens in cache)
{"timestamp":1701875948,"level":"INFO","function":"log_server_request","line":2606,"message":"request","remote_addr":"192.168.86.145","remote_port":61732,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 4]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    7367.88 ms /   763 tokens (    9.66 ms per token,   103.56 tokens per second)
print_timings:        eval time =    4001.73 ms /    50 runs   (   80.03 ms per token,    12.49 tokens per second)
print_timings:       total time =   11369.60 ms
slot 0 released (814 tokens in cache)
{"timestamp":1701875980,"level":"INFO","function":"log_server_request","line":2606,"message":"request","remote_addr":"192.168.86.145","remote_port":61738,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

The server command line was:

./server -m $MODELS/zephyr-7b-beta.Q4_K_M.gguf -c 8192 -t 8 --host 192.168.86.31

and the first part of the startup log is:

{"timestamp":1701875634,"level":"INFO","function":"main","line":2656,"message":"build info","build":1616,"commit":"caa9249"}
{"timestamp":1701875634,"level":"INFO","function":"main","line":2663,"message":"system info","n_threads":8,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /Users/mellis/Library/ApplicationSupport/faraday/models/zephyr-7b-beta.Q4_K_M.gguf (version GGUF V3 (latest))

I've verified that the JSON I'm submitting contains "cache_prompt": true at the same level as temperature, n, and max_tokens. I also checked that your changes from #4347 are present in server.cpp.

@ggerganov
Owner

I'm testing with the following curl command and it seems to work:

curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
        "model": "gpt-3.5-turbo", "cache_prompt": true,
        "messages": [
            {
                "role": "system",
                "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
            },
            {
                "role": "user",
                "content": "Write a limerick about python exceptions"
            }
        ]
    }' | jq

If possible, please provide curl commands that fail to utilize the prompt cache so it is easier for me to find the issue.

@Michael-F-Ellis
Author

I ran your curl command twice in succession and got the following log output:

print_timings: prompt eval time =    1123.17 ms /    77 tokens (   14.59 ms per token,    68.56 tokens per second)
print_timings:        eval time =    3354.28 ms /    46 runs   (   72.92 ms per token,    13.71 tokens per second)
print_timings:       total time =    4477.45 ms
slot 0 released (124 tokens in cache)
{"timestamp":1701881683,"level":"INFO","function":"log_server_request","line":2606,"message":"request","remote_addr":"192.168.86.145","remote_port":62746,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 6]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    1077.96 ms /    77 tokens (   14.00 ms per token,    71.43 tokens per second)
print_timings:        eval time =    3365.27 ms /    46 runs   (   73.16 ms per token,    13.67 tokens per second)
print_timings:       total time =    4443.23 ms
slot 0 released (124 tokens in cache)
{"timestamp":1701881848,"level":"INFO","function":"log_server_request","line":2606,"message":"request","remote_addr":"192.168.86.145","remote_port":62769,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

The only difference between my command and yours was that I changed the host IP from 127.0.0.1 to 192.168.86.31. (I'm running the server on my M1 Mac mini and sending requests from my laptop.) I then restarted the server to serve on 127.0.0.1 and ran the command from a terminal window on the Mac mini. Same outcome.

What other tests or output can I run to help sort this out?

@ggerganov
Owner

Just noticed this in your log:

{"timestamp":1701875634,"level":"INFO","function":"main","line":2656,"message":"build info","build":1616,"commit":"caa9249"}

It means you are still on master: caa9249

So your server does not have the change that I implemented in #4347

The commit you should be testing on is: ef455cb

@Michael-F-Ellis
Author

That did it! Sorry for not noticing your fix was on a different branch. See below for the log output when I send a sizeable prompt via my ficta application. The first request has to process all 809 prompt tokens (8 seconds), but the second evaluates only 73 tokens (1.1 seconds). Huge improvement. Thanks ever so much.

Just so I'm clear on how caching works,

  1. Am I correct that each pass caches only the prompt and not the generated text from that pass?
  2. If I edit some part of the prior prompt before re-sending, will the cache still be applied to everything before the edit?

llama server listening at http://192.168.86.31:8080

{"timestamp":1701884746,"level":"INFO","function":"main","line":3042,"message":"HTTP server listening","hostname":"192.168.86.31","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 809 tokens
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =    7938.18 ms /   809 tokens (    9.81 ms per token,   101.91 tokens per second)
print_timings:        eval time =    4027.66 ms /    50 runs   (   80.55 ms per token,    12.41 tokens per second)
print_timings:       total time =   11965.85 ms
slot 0 released (860 tokens in cache)
{"timestamp":1701884846,"level":"INFO","function":"log_server_request","line":2607,"message":"request","remote_addr":"192.168.86.145","remote_port":63396,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}
slot 0 is processing [task id: 1]
slot 0 : in cache: 790 tokens | to process: 73 tokens
slot 0 : kv cache rm - [790, end)

print_timings: prompt eval time =    1118.32 ms /    73 tokens (   15.32 ms per token,    65.28 tokens per second)
print_timings:        eval time =    4055.14 ms /    50 runs   (   81.10 ms per token,    12.33 tokens per second)
print_timings:       total time =    5173.46 ms
slot 0 released (914 tokens in cache)
{"timestamp":1701884911,"level":"INFO","function":"log_server_request","line":2607,"message":"request","remote_addr":"192.168.86.145","remote_port":63412,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}}

@ggerganov
Owner

  1. It also caches the generated text
  2. Yes
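
To make these answers concrete, here is a minimal Python sketch (an illustration, not llama.cpp's actual C++ implementation) of the longest-common-prefix reuse reflected in the "in cache: 790 tokens | to process: 73 tokens" log lines above:

def common_prefix_len(cached: list[int], new: list[int]) -> int:
    """Count the leading tokens shared by the cached and new prompts."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# The cache holds the previous prompt plus the tokens generated from it
# (answer 1), so resending the old text, the generation, and a little new
# text only evaluates the suffix. An edit invalidates the cache from the
# edit point onward, but everything before it is still reused (answer 2).
cached = [101, 7, 42, 9, 3, 55]          # previous prompt + generated text
new    = [101, 7, 42, 9, 3, 55, 12, 88]  # follow-on prompt
keep = common_prefix_len(cached, new)
print(keep, new[keep:])                  # 6 [12, 88] -> only 2 tokens to eval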
