Bug: llama-server api first query very slow #9492
Comments
I have the same problem, but only with CUDA.
Could you try without CUDA graphs? Set [...]
Doesn't seem to be making any difference:
Has this started happening recently? Does it happen without Docker? I can't reproduce it on my CUDA workstation.
Happens both with and without Docker. I wasn't using llama-server before, so I can't say if it's new or not.
Hm, does adding [...] help?
Doesn't help unfortunately. |
@ggerganov I was able to reproduce the problem on HF endpoints with an A10G GPU (I didn't notice this issue before). The first [...] Here is the log with [...]
I just had it happen on an A10 machine as well using [...]
Still happens for me every time when using the Docker image, but not when I build from source (even when building from source inside a Docker container).
Happened on my RTX 2060 workstation using the following commands:

GGML_CUDA=1 make -j && ./llama-server --host 0.0.0.0 --port 7020 --alias Meta-Llama-3.1-8B-Instruct-GGUF-Q6_K_L --gpu-layers 33 --model ~/Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf --threads-http 1 --ctx-size 1024 --metrics --chat-template llama3 --verbose

curl -s --request POST --url http://127.0.0.1:7020/v1/chat/completions --header "Content-Type: application/json" --data '{"messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "Hello, how are you today?" } ], "n_predict": 512}' | jq

Here is the log where the first decode takes ~8s:

0.25.727.078 I slot update_slots: id 0 | task 0 | kv cache rm [0, end)
0.25.727.086 I slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 28, n_tokens = 28, progress = 1.000000
0.25.727.089 I slot update_slots: id 0 | task 0 | prompt done, n_past = 28, n_tokens = 28
0.25.727.089 D srv update_slots: decoding batch, n_tokens = 28
0.33.928.787 D slot process_toke: id 0 | task 0 | n_decoded = 1, n_remaining = 511, next token: 'I'
0.33.928.791 D srv update_slots: run slots completed
0.33.928.791 D que start_loop: waiting for new tasks

0.00.359.190 I ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
0.00.359.192 I ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
0.00.359.193 I ggml_cuda_init: found 1 CUDA devices:
0.00.362.168 I Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
0.00.490.706 I llm_load_tensors: ggml ctx size = 0.27 MiB
0.00.910.535 I llm_load_tensors: offloading 32 repeating layers to GPU
0.00.910.539 I llm_load_tensors: offloading non-repeating layers to GPU
0.00.910.539 I llm_load_tensors: offloaded 33/33 layers to GPU
0.00.910.545 I llm_load_tensors: CPU buffer size = 532.31 MiB
0.00.910.545 I llm_load_tensors: CUDA0 buffer size = 5993.34 MiB
......................................................................................
0.01.735.246 I llama_new_context_with_model: n_ctx = 1024
0.01.735.248 I llama_new_context_with_model: n_batch = 1024
0.01.735.249 I llama_new_context_with_model: n_ubatch = 512
0.01.735.249 I llama_new_context_with_model: flash_attn = 0
0.01.735.253 I llama_new_context_with_model: freq_base = 500000.0
0.01.735.254 I llama_new_context_with_model: freq_scale = 1
0.01.735.826 I llama_kv_cache_init: CUDA0 KV buffer size = 128.00 MiB
0.01.735.830 I llama_new_context_with_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
0.01.736.939 I llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
0.01.742.493 I llama_new_context_with_model: CUDA0 compute buffer size = 258.50 MiB
0.01.742.497 I llama_new_context_with_model: CUDA_Host compute buffer size = 10.01 MiB
0.01.742.497 I llama_new_context_with_model: graph nodes = 1030
0.01.742.497 I llama_new_context_with_model: graph splits = 2
0.01.742.499 W llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.01.811.008 I srv init: initializing slots, n_slots = 1
0.01.811.012 I slot init: id 0 | task -1 | new slot n_ctx_slot = 1024
0.01.811.015 D slot reset: id 0 | task -1 |
0.01.811.095 I main: model loaded
0.01.811.112 I main: chat template, built_in: 0, chat_example: '<|start_header_id|>system<|end_header_id|>
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>
How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
'
0.01.811.112 I main: server is listening on 0.0.0.0:7020 - starting the main loop
0.01.811.113 D que start_loop: processing new tasks
0.01.811.113 D que start_loop: update slots
0.01.811.114 I srv update_slots: all slots are idle
0.01.811.114 D srv kv_cache_cle: clearing KV cache
0.01.811.430 D que start_loop: waiting for new tasks
0.25.726.656 D formatted_chat: '<|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello, how are you today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
'
0.25.726.673 D srv add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
0.25.726.674 D que post: new task, id = 0/1, front = 0
0.25.726.981 D que start_loop: processing new tasks
0.25.726.987 D que start_loop: processing task, id = 0
0.25.726.990 D slot get_availabl: id 0 | task -1 | selected slot by lru, t_last = -1
0.25.726.991 D slot reset: id 0 | task -1 |
0.25.727.011 I slot launch_slot_: id 0 | task 0 | processing task
0.25.727.013 D que start_loop: update slots
0.25.727.013 D srv update_slots: posting NEXT_RESPONSE
0.25.727.014 D que post: new task, id = 1, front = 0
0.25.727.016 I slot update_slots: id 0 | task 0 | tokenizing prompt, len = 1
0.25.727.069 I slot update_slots: id 0 | task 0 | prompt tokenized, n_ctx_slot = 1024, n_keep = 0, n_prompt_tokens = 28
0.25.727.078 I slot update_slots: id 0 | task 0 | kv cache rm [0, end)
0.25.727.086 I slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 28, n_tokens = 28, progress = 1.000000
0.25.727.089 I slot update_slots: id 0 | task 0 | prompt done, n_past = 28, n_tokens = 28
0.25.727.089 D srv update_slots: decoding batch, n_tokens = 28
0.33.928.787 D slot process_toke: id 0 | task 0 | n_decoded = 1, n_remaining = 511, next token: 'I'
0.33.928.791 D srv update_slots: run slots completed
0.33.928.791 D que start_loop: waiting for new tasks
0.33.928.792 D que start_loop: processing new tasks
0.33.928.793 D que start_loop: processing task, id = 1
0.33.928.793 D que start_loop: update slots
0.33.928.794 D srv update_slots: posting NEXT_RESPONSE
0.33.928.795 D que post: new task, id = 2, front = 0
0.33.928.796 D slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 1024, n_past = 29, n_system_tokens = 0, n_cache_tokens = 0, truncated = 0
0.33.928.797 D srv update_slots: decoding batch, n_tokens = 1
0.33.952.647 D slot process_toke: id 0 | task 0 | n_decoded = 2, n_remaining = 510, next token: ''m'
0.33.952.649 D srv update_slots: run slots completed
0.33.952.650 D que start_loop: waiting for new tasks
0.33.952.650 D que start_loop: processing new tasks
0.33.952.650 D que start_loop: processing task, id = 2
0.33.952.650 D que start_loop: update slots
0.33.952.650 D srv update_slots: posting NEXT_RESPONSE
0.33.952.651 D que post: new task, id = 3, front = 0
0.33.952.652 D slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 1024, n_past = 30, n_system_tokens = 0, n_cache_tokens = 0, truncated = 0
0.33.952.652 D srv update_slots: decoding batch, n_tokens = 1
0.33.976.285 D slot process_toke: id 0 | task 0 | n_decoded = 3, n_remaining = 509, next token: ' doing'

After that, I tried to restart the server and submit the same query many times, but the first decode was always fast, as expected:

0.04.133.752 I slot update_slots: id 0 | task 0 | kv cache rm [0, end)
0.04.133.758 I slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 28, n_tokens = 28, progress = 1.000000
0.04.133.759 I slot update_slots: id 0 | task 0 | prompt done, n_past = 28, n_tokens = 28
0.04.133.759 D srv update_slots: decoding batch, n_tokens = 28
0.04.204.428 D slot process_toke: id 0 | task 0 | n_decoded = 1, n_remaining = 511, next token: 'I'
0.04.204.430 D srv update_slots: run slots completed
0.04.204.431 D que start_loop: waiting for new tasks
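To quantify the difference, here is a minimal timing sketch in Python that sends the same request twice against the server started with the commands above (the port 7020 and the use of the requests package are assumptions; adjust to your setup):

import time
import requests  # third-party HTTP client, assumed to be installed

URL = "http://127.0.0.1:7020/v1/chat/completions"  # port taken from the commands above
PAYLOAD = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you today?"},
    ],
    "n_predict": 512,
}

# Send the identical request twice and compare wall-clock latency; on an
# affected setup the first call is expected to be much slower than the second.
for i in range(2):
    t0 = time.time()
    r = requests.post(URL, json=PAYLOAD, timeout=600)
    r.raise_for_status()
    print(f"request {i + 1}: {time.time() - t0:.2f}s")

This makes it easy to separate the expected small warm-up difference from the ~8s gap reported above.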
What happened?
I'm using the openai library to interact with the llama-server Docker image on an A6000:

docker run -p 8080:8080 --name llama-server -v ~/gguf_models:/models --gpus all ghcr.io/ggerganov/llama.cpp:server-cuda -m models/Meta-Llama-3.1-70B-Instruct-Q4_K_L.gguf -c 65536 -fa --host 0.0.0.0 --port 8080 --n-gpu-layers 99 -ctk q4_0 -ctv q4_0 -t 4
The first request I send takes about 80 seconds, during which a single CPU core sits at 100% load for maybe ~55s (with GPU usage at 0%) and only then does the GPU kick in. The second time I execute the exact same call, it takes ~26s to respond and starts with both the CPU (one core at 100%) and the GPU (~87%) working at the same time.
The API call itself is:
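The exact snippet isn't reproduced here, but a minimal sketch of such a call with the openai Python client (openai >= 1.0) looks roughly like the following; the base URL matches the port 8080 mapping above, while the model name and dummy API key are placeholders (llama-server only validates the key when started with --api-key):

from openai import OpenAI

# Point the client at llama-server's OpenAI-compatible endpoint; the key is a
# dummy value because the server above was not started with --api-key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="Meta-Llama-3.1-70B-Instruct",  # placeholder; the server answers with whatever model it loaded
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you today?"},
    ],
)
print(response.choices[0].message.content)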
Name and Version
$ ./llama-server --version
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
^^^ not very helpful, but I have just pulled a fresh Docker image today, i.e. 15/09/2024:
docker pull ghcr.io/ggerganov/llama.cpp:server-cuda
What operating system are you seeing the problem on?
Linux
Relevant log output