server-cuda closes connection while still processing tasks #6545
Comments
Hello, probably your ingress closes the connection first; try increasing its timeout. llama.cpp has no such timeout, at least I have never faced it. FYI, I have started to work on a Kubernetes Helm chart for the server, but I need to update it with the latest changes, especially loading models from HF directly.
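(For instance, if the ingress in front of the server is ingress-nginx, the read/send timeouts can usually be raised via the nginx.ingress.kubernetes.io/proxy-read-timeout and nginx.ingress.kubernetes.io/proxy-send-timeout annotations; the exact mechanism depends on the ingress controller in use, so treat this as a pointer rather than a confirmed fix.)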
I'm sorry, it seems like my comment is out of context. I misread it.
The server has evolved a lot since 438c2ca, but I am happy to note you were part of the server's early days. Please ping if the issue is not solved.
@FSSRepo this is not really a limitation of the llama.cpp API. To support what you described, the server would need to use a low batch size and constantly create new batches, taking care to distribute the work fairly among all the incoming requests. But there is nothing that could be changed in the llama.cpp API that would allow modifying a batch while it is being evaluated.
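For illustration, here is a rough sketch of that kind of scheduling loop on top of the public llama_batch API: one token per active request is packed into a small batch, the batch is decoded, and requests can only join or leave between llama_decode() calls. This is not the server's actual code; model/context setup and sampling are omitted, and the batch field layout follows llama.h.

```cpp
// Sketch: interleave one token per active request per decode call, so new
// requests can be scheduled in between iterations. Not the server's real code.
#include "llama.h"

#include <vector>

struct request {
    llama_seq_id seq_id;      // one KV-cache sequence per request
    llama_token  next_token;  // token to feed on this iteration
    llama_pos    n_past;      // tokens already in the cache for this sequence
};

// run one decode step over all currently active requests
static bool decode_step(llama_context * ctx, std::vector<request> & reqs) {
    llama_batch batch = llama_batch_init((int32_t) reqs.size(), 0, 1);

    for (size_t i = 0; i < reqs.size(); ++i) {
        batch.token   [batch.n_tokens]    = reqs[i].next_token;
        batch.pos     [batch.n_tokens]    = reqs[i].n_past;
        batch.n_seq_id[batch.n_tokens]    = 1;
        batch.seq_id  [batch.n_tokens][0] = reqs[i].seq_id;
        batch.logits  [batch.n_tokens]    = true;  // need logits to sample the next token
        batch.n_tokens++;
    }

    const int ret = llama_decode(ctx, batch);
    llama_batch_free(batch);

    // the scheduler can add or remove requests between calls; the batch itself
    // cannot be modified while llama_decode() is running
    return ret == 0;
}
```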
Hi @phymbert.
This is the issue to be published in the llama.cpp GitHub:
I am using the Docker image ghcr.io/ggerganov/llama.cpp:server-cuda to deploy the server in a Kubernetes cluster on AWS using four A10G GPUs. This is the configuration setup:
(not sure if it adds context, but I'm using a PersistentVolumeClaim where I download and persist the model)
I have already reviewed the server README and all the command-line options, and I have also tested different server-cuda image tags from the past few days.
Based on this discussion, I understand I have 10 slots for processing parallel requests, and I should be able to process 10 sequences of 10000 tokens each. The GPUs I'm using should be able to handle this load.
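(For example, assuming the usual behavior where the server splits the total context evenly across the slots, -c 100000 with -np 10 would give each slot 100000 / 10 = 10000 tokens of context; these flag values are only illustrative, not my exact configuration.)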
With this configuration, I executed a test sending 5 concurrent requests of ~2300 tokens each. I understand this should be well below the maximum processable load, but I'm getting the connection closed by the server while it is still processing the tasks in the used slots. The process is the following:
I am trying to understand if there is some additional configuration I'm missing, or how I can improve concurrency in these cases without having to handle connection errors from outside (additionally, when the connection gets closed, I cannot reprocess the requests immediately since the server is still processing the previous requests).
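As a rough client-side mitigation in the meantime, the request can be retried with a backoff once the connection drops. The sketch below uses libcurl against the server's /completion endpoint; the URL, payload, timeout, and retry values are placeholders rather than anything taken from this setup.

```cpp
// Sketch: POST a completion request and retry with a linear backoff if the
// transport fails (e.g. the connection is closed while slots are still busy).
#include <curl/curl.h>

#include <chrono>
#include <cstdio>
#include <string>
#include <thread>

static size_t write_cb(char * ptr, size_t size, size_t nmemb, void * userdata) {
    static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);

    const std::string url  = "http://localhost:8080/completion";      // placeholder address
    const std::string body = R"({"prompt": "Hello", "n_predict": 32})";

    for (int attempt = 0; attempt < 5; ++attempt) {
        CURL * curl = curl_easy_init();
        if (!curl) {
            break;
        }

        std::string response;
        curl_slist * headers = curl_slist_append(nullptr, "Content-Type: application/json");

        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
        curl_easy_setopt(curl, CURLOPT_TIMEOUT, 600L);                // generous read timeout
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

        const CURLcode res = curl_easy_perform(curl);
        curl_slist_free_all(headers);
        curl_easy_cleanup(curl);

        if (res == CURLE_OK) {
            std::printf("%s\n", response.c_str());
            break;
        }

        // back off before retrying so the busy slots have time to drain
        std::this_thread::sleep_for(std::chrono::seconds(5 * (attempt + 1)));
    }

    curl_global_cleanup();
    return 0;
}
```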