
server-cuda closes connection while still processing tasks #6545

Closed
lucasBerardiniMarvik opened this issue Apr 8, 2024 · 5 comments

@lucasBerardiniMarvik


I am using the Docker image ghcr.io/ggerganov/llama.cpp:server-cuda to deploy the server in a Kubernetes cluster on AWS with four A10G GPUs. This is the configuration setup:

- name: llama-cpp-server
  image: ghcr.io/ggerganov/llama.cpp:server-cuda
  args:
    - "--model"
    - "/models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf"
    - "--port"
    - "8000"
    - "--host"
    - "0.0.0.0"
    - "--ctx-size"
    - "100000"
    - "--n-gpu-layers"
    - "256"
    - "--cont-batching"
    - "--parallel"
    - "10"
    - "--batch-size"
    - "4096"

(not sure if it adds context, but I'm using a persistentVolumeClaim where I download and persist the model)
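For completeness, here is a minimal sketch of how the four GPUs and that PersistentVolumeClaim might be wired into the container spec above. This assumes the NVIDIA device plugin is installed in the cluster; the claim name models-pvc and the app label are illustrative only, not taken from this issue:

  resources:
    limits:
      nvidia.com/gpu: 4              # request the four A10G GPUs for this container
  volumeMounts:
    - name: models
      mountPath: /models             # where the --model path above expects the GGUF file
# at the pod spec level, alongside "containers":
volumes:
  - name: models
    persistentVolumeClaim:
      claimName: models-pvc          # hypothetical PVC name, for illustration only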

I have already reviewed the server README and all of the command-line options, and I have also tested different server-cuda image tags from the past few days.

Based on this discussion, I understand that I have 10 slots for processing parallel requests, so I should be able to process 10 sequences of 10000 tokens each (--ctx-size 100000 split across --parallel 10 slots). The GPUs I'm using should be able to handle this load.

With this configuration, I ran a test that sends 5 concurrent requests of ~2300 tokens each. This should be well below the maximum processable limit, but the server closes the connection while it is still processing the tasks in the occupied slots. The process is the following:

  1. I send multiple requests (5) to the server
  2. The server closes the connection without sending a response for some of the requests
  3. I check /health again and see that the slots are still running
  4. I check the server logs and see that all tasks finish successfully; I don't see any error logs in the server

I am trying to understand whether there is some additional configuration I'm missing, or how I can improve concurrency in these cases without handling connection errors from the outside. (Additionally, when the connection gets closed, I cannot resend the requests immediately, since the server is still processing the previous ones.)

@phymbert
Collaborator

phymbert commented Apr 8, 2024

Hello,

Your ingress probably closes the connection first; try increasing its timeout. llama.cpp has no such timeout, at least I have never run into it.

There is actually a --timeout option, but it does not cover the prompt processing + token generation (pp+tg) time.
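If it is the ingress, here is a minimal sketch of raising the proxy timeouts, assuming an NGINX ingress controller in front of the server; the resource names and the 3600-second value are illustrative only, not taken from this issue:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llama-cpp-server                      # hypothetical Ingress name
  annotations:
    # Allow long prompt processing + generation before the proxy closes the connection.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: llama-cpp-server        # hypothetical Service name
                port:
                  number: 8000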

FYI, I have started working on a Kubernetes Helm chart for llama.cpp: https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes

But I need to update it with the latest changes, especially loading models directly from HF.

phymbert added the kubernetes (Helm & Kubernetes) label Apr 8, 2024
@FSSRepo
Collaborator

FSSRepo commented Apr 8, 2024

I'm sorry, it seems like my comment is out of context. I misread it.

@phymbert
Collaborator

phymbert commented Apr 8, 2024

The server has evolved a lot since 438c2ca. But I am happy to note you were part of the server's inception. Please ping me if the issue is not solved.

@slaren
Member

slaren commented Apr 8, 2024

@FSSRepo this is not really a limitation of the llama.cpp API. To support what you described, the server would need to use a low batch size and constantly create new batches, taking care to distribute the work fairly among all the new requests. But there is nothing that could be changed in the llama.cpp API that would allow modifying a batch while it is being evaluated.

@lucasBerardiniMarvik
Author

Hi @phymbert.
The problem was, as you said, a timeout limitation in the Kubernetes service and not a problem on the llama.cpp side.
Many thanks for your support here. Looking forward to testing the Helm chart in the future.
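(For anyone who lands here with the same symptom: a minimal sketch of raising that timeout, assuming the server is exposed through an AWS classic ELB provisioned by a type: LoadBalancer Service; the names and the 3600-second value are illustrative, not taken from this issue.)

apiVersion: v1
kind: Service
metadata:
  name: llama-cpp-server          # hypothetical Service name
  annotations:
    # Raise the ELB idle timeout so long generations are not cut off mid-request.
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "3600"
spec:
  type: LoadBalancer
  selector:
    app: llama-cpp-server         # hypothetical pod label
  ports:
    - port: 8000
      targetPort: 8000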
