
Server slowing down with each request (requests are identical) #4201

Closed
ruped opened this issue Nov 24, 2023 · 8 comments
Assignee: ggerganov
Labels: bug, stale

Comments

@ruped

ruped commented Nov 24, 2023

Pre-Prerequisite

Thanks to all the contributors for all the great work on llama.cpp!

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behaviour

  • Identical requests to the server should take a similar amount of time.
    • I made 4 identical requests to the server, sequentially, with a 1 second pause after each.

Current Behaviour

  • Each request to the server takes longer to process than the previous one.
  • Note particularly the prompt_eval time gets much slower.
  • Also: After the last request it says "slot 3 released (1155 tokens in cache)" - is that expected?
    • Why should there be data in the cache after the slot has been released and there are no active requests?
print_timings: prompt eval time =    1110.28 ms /  1134 tokens (    0.98 ms per token,  1021.37 tokens per second)
print_timings:        eval time =     561.83 ms /    20 runs   (   28.09 ms per token,    35.60 tokens per second)
print_timings:       total time =    1672.10 ms
slot 0 released (1155 tokens in cache)

slot 1 is processing [task id: 1]
slot 1 : kv cache rm - [0, end)
print_timings: prompt eval time =    1483.84 ms /  1134 tokens (    1.31 ms per token,   764.23 tokens per second)
print_timings:        eval time =     591.59 ms /    20 runs   (   29.58 ms per token,    33.81 tokens per second)
print_timings:       total time =    2075.43 ms
slot 1 released (1155 tokens in cache)

slot 2 is processing [task id: 2]
slot 2 : kv cache rm - [0, end)
print_timings: prompt eval time =    1764.20 ms /  1134 tokens (    1.56 ms per token,   642.78 tokens per second)
print_timings:        eval time =     618.07 ms /    20 runs   (   30.90 ms per token,    32.36 tokens per second)
print_timings:       total time =    2382.28 ms
slot 2 released (1155 tokens in cache)

slot 3 is processing [task id: 3]
slot 3 : kv cache rm - [0, end)
print_timings: prompt eval time =    2229.91 ms /  1134 tokens (    1.97 ms per token,   508.54 tokens per second)
print_timings:        eval time =     642.50 ms /    20 runs   (   32.12 ms per token,    31.13 tokens per second)
print_timings:       total time =    2872.41 ms
slot 3 released (1155 tokens in cache)

Environment and Context

  • Physical (or virtual) hardware you are using: Physical hardware, Nvidia GPU

  • Operating System: Linux

Failure Information (for bugs)

Please help provide information about the failure / bug.

Steps to Reproduce

  • Start the server:
    ./build/bin/server -m ./ggml-model-q4.bin -ngl 9999 --ctx-size 8000 --host 0.0.0.0 --port 7777 --cont-batching --parallel 4
  • Make identical requests to the server with curl, sequentially (one at a time); a minimal example request is sketched below.
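
For reference, a minimal sketch of the kind of request loop described above. This is illustrative only: the original prompt is not included in the issue, and the request shape assumes the server example's /completion endpoint with a JSON body.

    # Illustrative reproduction: four identical sequential requests with a 1 second pause.
    # The prompt is a placeholder for the original ~1134-token prompt, which is not shown in the issue;
    # n_predict is set to 20 to match the "20 runs" seen in the log.
    for i in 1 2 3 4; do
      curl -s http://localhost:7777/completion \
        -H "Content-Type: application/json" \
        -d '{"prompt": "PLACEHOLDER_LONG_PROMPT", "n_predict": 20}' \
        > /dev/null
      sleep 1
    done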

Thanks!

@FSSRepo
Collaborator

FSSRepo commented Nov 24, 2023

It's not something that only happens with the server example; it also occurs with the ./main example, so it's an internal issue with the function that has a regression in performance as the context is filled.
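
As a rough check outside the server, the ./main example can be run with a long prompt and the timings it prints compared as the context fills. A sketch only; the model path and prompt file below are placeholders:

    # Sketch: run the ./main example with a long prompt file and inspect the timings printed at the end.
    # Model path and prompt file are placeholders.
    ./build/bin/main -m ./ggml-model-q4.bin -ngl 9999 -c 8000 -n 64 -f long_prompt.txt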

@ruped
Author

ruped commented Nov 27, 2023

Thanks for confirming @FSSRepo.

Should this be moved to bug-confirmed status? Or is that not really necessary for it?

@ggerganov added the bug label and removed the bug-unconfirmed label Nov 27, 2023
@ggerganov self-assigned this Nov 27, 2023
@ggerganov
Owner

ggerganov commented Nov 27, 2023

> It's not something that only happens with the server example; it also occurs with the ./main example, so it's an internal issue with the function that has a regression in performance as the context is filled.

If the cache is cleared correctly, it would not be slower after each request, so there seems to be a server-specific problem.
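
A simple client-side check is to time identical sequential requests and see whether the latency grows. A sketch only, reusing the assumed /completion endpoint from the reproduction above; payload.json is a placeholder for the original request body:

    # Time four identical sequential requests; if the KV cache is handled correctly,
    # the total time (and the server's prompt eval time) should stay roughly constant.
    for i in 1 2 3 4; do
      curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
        http://localhost:7777/completion \
        -H "Content-Type: application/json" \
        -d @payload.json
      sleep 1
    done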

@FSSRepo
Collaborator

FSSRepo commented Nov 27, 2023

Upon reviewing carefully, it seems so, although I believe this only happens if the requests are launched at the same time in different slots.

Also, keep in mind that the reported token speed is an average taken from the start of the task to the end of the task.

Sorry for editing your comment; I sometimes confuse Quote with Edit in the GitHub Android app.
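
For example, the token speeds in the first print_timings block above are just the token count divided by the elapsed wall-clock time (a quick check using the numbers from the log):

    # 1134 prompt tokens evaluated in 1110.28 ms -> ~1021 tokens per second, matching the log.
    awk 'BEGIN { printf "%.2f tokens per second\n", 1134 / (1110.28 / 1000) }'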

@LoopControl

I've also seen this issue with ./server when used with the -cb and -np flags on my Linux system.

The first request runs at the expected full speed, but the following requests are generated more slowly (with an identical prompt).

@github-actions github-actions bot added the stale label Mar 19, 2024

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 3, 2024
@NeuralNotwerk

Was this fixed? I'm still having this issue.

@ggerganov
Owner

I think it should be fixed. If you reproduce it, please provide the curl command that you are using to test.
