Setting up Llamadeploy for multiagent deployment on k8s #357

Open
hz6yc3 opened this issue Nov 11, 2024 · 9 comments

@hz6yc3

hz6yc3 commented Nov 11, 2024

There is no documentation that provides guidance on how to set up Llamadeploy (control plane, message queue and service deployment) on Kubernetes. The example provided in the code is a little confusing, and our company badly needs some guidance on setting up Llamadeploy for enterprise deployment. Any relevant documentation or sample configuration that someone can share would be really helpful.

@masci masci added this to Framework Nov 11, 2024
@masci masci self-assigned this Nov 12, 2024
@hz6yc3
Author

hz6yc3 commented Nov 12, 2024

@masci thanks a lot for looking into my question above. We are kind of blocked and there is some urgency in completing the PoC for agentic workflows using LlamaIndex, so we would greatly appreciate any guidance you can provide on the request above.

@logan-markewich
Collaborator

@hz6yc3 while it might not be totally clear from the docs/examples, it's fairly straightforward. You'd need to use the lower-level API:
https://docs.llamaindex.ai/en/stable/module_guides/llama_deploy/30_manual_orchestration/

Basically, you can set up a docker image that deploys the core

Then another docker image from there that deploys a workflow service (or several, depending on how you want to manage scaling)

Once you have it running in docker, it's fairly transferable to then launching those docker images in a k8s cluster

This example walks through all of this, including k8s
https://github.com/run-llama/llama_deploy/tree/main/examples/message-queue-integrations
https://github.com/run-llama/llama_deploy/tree/main/examples/message-queue-integrations/rabbitmq/kubernetes

We are working on updates to make this easier though, using a simpler top-level yaml file rather than writing code for all the deployments. But in lieu of that, the above is the best approach.
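
A rough sketch of that two-image split, assuming the `deploy_core` / `deploy_workflow` helpers and config classes described in the manual orchestration guide linked above (signatures may differ in the version you install, so check before relying on this). The `LLAMA_DEPLOY_ROLE` variable, `pick_role`, and the `my_workflow` service name are made up for illustration. The idea is a single entrypoint module that each image (or each k8s Deployment) starts with a different role:

```python
import os

# Hypothetical environment variable: which component this container runs.
ROLE_ENV_VAR = "LLAMA_DEPLOY_ROLE"
VALID_ROLES = ("core", "workflow")


def pick_role(env=None):
    """Return the role for this container, defaulting to 'core'."""
    env = os.environ if env is None else env
    role = env.get(ROLE_ENV_VAR, "core").strip().lower()
    if role not in VALID_ROLES:
        raise ValueError(f"{ROLE_ENV_VAR} must be one of {VALID_ROLES}, got {role!r}")
    return role


async def run_core():
    """Start the control plane and message queue (the 'core' image)."""
    # Imported lazily so this module can be loaded without llama_deploy.
    # Assumption: these names match the manual orchestration guide.
    from llama_deploy import ControlPlaneConfig, SimpleMessageQueueConfig, deploy_core

    # Default configs are used here; a cluster deployment would likely need
    # explicit host/port values so other pods can reach these services.
    await deploy_core(
        control_plane_config=ControlPlaneConfig(),
        message_queue_config=SimpleMessageQueueConfig(),
    )


async def run_workflow(workflow):
    """Start one workflow service (the 'workflow' image)."""
    from llama_deploy import ControlPlaneConfig, WorkflowServiceConfig, deploy_workflow

    await deploy_workflow(
        workflow=workflow,
        workflow_config=WorkflowServiceConfig(
            host="0.0.0.0", port=8002, service_name="my_workflow"  # illustrative values
        ),
        # Must describe where the control plane is reachable from this pod.
        control_plane_config=ControlPlaneConfig(),
    )
```

The container command would then call `asyncio.run(run_core())` or `asyncio.run(run_workflow(MyWorkflow()))` depending on `pick_role()`, where `MyWorkflow` stands for your own workflow class.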

@hz6yc3
Author

hz6yc3 commented Nov 12, 2024

@logan-markewich thanks a lot! Let me read through the documents. We were not sure about the guidance for centrally deploying the core components because, based on the architecture in the documentation, it seemed like we have to deploy the core components (control plane, message queue) for each deployment separately. In our company every application is deployed within its own namespace on the cluster, so we weren't sure how to set up the deployment pattern using llama deploy.

@rehevkor5

Yeah, https://www.llamaindex.ai/blog/introducing-llama-deploy-a-microservice-based-way-to-deploy-llamaindex-workflows is somewhat misleading about these:

"llama-deploy launches llama-index Workflows as scalable microservices"

"everything’s an independently-scalable microservice"

"microservices architecture of llama-deploy enables easy scaling of individual components, ensuring your system can handle growing demands"

If you use the API Server / llamactl, the control plane and all the services run in-process, either as an asyncio task or as a uvicorn HTTP server. So they are inherently centralized and not independently scalable. If you want your services to be independently scalable, you have to implement your own solution for that.

@masci
Member

masci commented Nov 16, 2024

@rehevkor5 first of all, thanks for the feedback!

What you read in the article is still true, but it dates back to before we introduced the apiserver (see how we changed the architecture diagram here), so I see how this can be misleading. A quick recap to clarify the situation:

  • If you manually orchestrate the different components you see in the diagram, your system is consistent with what's in the article (every component is independent and talks to the others via HTTP, hence scalable and close to an actual microservice).
  • If you're using the apiserver, the components are wrapped into a single thread so they can't scale independently.

Why is the apiserver monolithic, then? The apiserver is a key component of what we want Llama Deploy to become in terms of user experience. We wanted to quickly validate the concept of "deployments" and their yaml definition with our users and get feedback as soon as possible, so we optimized the current "backend" of the apiserver for running in a single-process/single-container environment that was easy to set up.

But we're already planning an actual scalable implementation of the apiserver backend. Currently we're leaning towards building on top of existing container orchestrators to move faster and avoid reinventing the wheel.

I'll expand the docs to include these considerations and call out that the apiserver is a work in progress. Let me know if you have any questions!

@hz6yc3
Author

hz6yc3 commented Nov 16, 2024

@masci sounds like your suggested approach is manual orchestration of the individual components for now, until a scalable solution using the api server is developed. Based on the updated architecture diagram you shared, it sounds like we have to create separate "deployments", each with its own control plane and message queue config, for deploying the associated workflows?

@abdulhuq-cimulate

abdulhuq-cimulate commented Nov 19, 2024

@masci I have been working on a POC to set up Llama Deploy workflows using the manual orchestration approach. I managed to set it up with docker-compose and a custom docker image, using both the simple message queue and redis. As the next step of the POC, I tried deploying the services to k8s. The setup I was going for is a centralized deployment of the control plane and message queue (with multiple replicas), with workflows deployed as a separate deployment (with multiple replicas) and registered with the central control plane.
I believe I am running into issues because each replica of the control plane might have its own isolated service metadata and each pod of the workflow deployment might register its own version of the service on the control plane.

Is there a way to share the service metadata information across the control plane deployments and to register one instance of the workflow?

In the meantime, I can scale down my replicas to 1 to mitigate the issue but curious to see if there is already a fix available.

Edit: A quick fix could be to allow passing in a KV store URI that is a separate service for the control plane to use here via env var like CONTROL_PLANE_SERVICE_KV_STORE_URI.
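
To illustrate the hypothesis and the proposed fix, here is a toy model (not llama_deploy code). `InMemoryKVStore`, `ControlPlaneReplica`, and `store_from_env` are invented for this sketch, and `CONTROL_PLANE_SERVICE_KV_STORE_URI` is only the variable name suggested above, not an existing setting. Without the variable each replica keeps private state; with it, all replicas resolve to the same store:

```python
# Variable name proposed in this comment; assumed, not an existing setting.
KV_STORE_URI_ENV = "CONTROL_PLANE_SERVICE_KV_STORE_URI"

# Simulates external KV services, keyed by URI. A real setup would connect
# to something like Redis instead of sharing a Python object.
_EXTERNAL_STORES = {}


class InMemoryKVStore:
    """Minimal key-value store used as a stand-in for a real backend."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def keys(self):
        return sorted(self._data)


def store_from_env(env):
    """Return a private store, or the shared one if a URI is configured."""
    uri = env.get(KV_STORE_URI_ENV)
    if uri is None:
        return InMemoryKVStore()  # state is local to this replica
    return _EXTERNAL_STORES.setdefault(uri, InMemoryKVStore())


class ControlPlaneReplica:
    """Toy control plane that records registered services in its store."""

    def __init__(self, store):
        self.store = store

    def register_service(self, name, url):
        # Keyed by service name, so several workflow pods registering the
        # same service overwrite one entry instead of adding duplicates.
        self.store.put(name, {"url": url})

    def list_services(self):
        return self.store.keys()
```

With an empty environment, a service registered on one replica is invisible to the other. With the URI set (for example `redis://redis:6379`, an illustrative value), both replicas list the same single `workflow` entry even if every workflow pod registers itself.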

@masci
Member

masci commented Nov 20, 2024

"Edit: A quick fix could be to allow passing in a KV store URI that is a separate service for the control plane to use here via env var like CONTROL_PLANE_SERVICE_KV_STORE_URI."

Yes, I believe that would be the solution. We already have a bunch of stores that can run on separate services (https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/storage/kvstore), so for example you could use the Redis implementation.

I'll look into it, tracking the feature with #370

@abdulhuq-cimulate

abdulhuq-cimulate commented Dec 27, 2024

@masci @logan-markewich I have set up control-plane, message-queue, and workflow services using manual orchestration, and the setup is as follows:

  • Redis Cluster is used as the message queue.
  • Control Plane k8s deployment with 2 replica pods (simple-message-queue is disabled).
    • Redis KV store for managing Control Plane state store.
    • Use default topic_namespace which is llama_deploy.control_plane
  • Workflow k8s deployment with 3 replica pods.

I am referencing the services using the k8s service URL:
Control Plane: http://control-plane:8000/
Workflow: http://workflow:8002/
(Both services are deployed in the same namespace)
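
For reference, the short names above resolve only within the same namespace. Kubernetes DNS also exposes a Service under `<service>.<namespace>.svc.<cluster-domain>`, which matters if the core and the workflows end up in different namespaces. A small helper to build either form (the `service_url` name and the `llama-core` namespace are illustrative, and `cluster.local` is the usual default domain but can differ per cluster):

```python
def service_url(service, port, namespace=None, cluster_domain="cluster.local"):
    """Build an in-cluster HTTP URL for a Kubernetes Service.

    With no namespace, return the short form that only resolves from pods
    in the same namespace; otherwise return the fully qualified name.
    """
    if namespace is None:
        host = service
    else:
        host = f"{service}.{namespace}.svc.{cluster_domain}"
    return f"http://{host}:{port}/"


# Same-namespace form, as used in this setup.
control_plane_local = service_url("control-plane", 8000)

# Cross-namespace form, e.g. workflows calling a centrally deployed core.
control_plane_remote = service_url("control-plane", 8000, namespace="llama-core")
```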

When I interact with this deployment using the LlamaDeployClient, I notice that all replicas of the workflow service consume the same message from Redis and each runs the workflow. As a result, the client receives duplicated responses because multiple workflows act on the same message. Once the control plane receives a final_result from one of the workflow replicas, it stops consuming messages for the same task_id, which is expected.

How do I ensure that only one pod replica of the workflow is consuming the message from the message queue and processing a single request instead of all replicas of the workflow?

Edit: I wonder if issue 363 will resolve this problem but maybe not completely? I tried using the simple message queue with 1 replica (because consumers and queues are managed in memory), with 2 replicas of the control plane using the redis KV store and 3 replicas of the workflow service. As the message queue service is now responsible for publishing and consuming messages, it is able to process one request at a time. But if I scale the message queue, each replica will need to register the consumers and publishers to work as expected (a problem that could be solved in the same way as for the control plane, with a separate KV store?).
But with redis, since the workflow service executes the message consumer client, each replica of the workflow consumes the message, which results in duplicated runs. At least, this is my hypothesis from looking at the code here: https://github.com/run-llama/llama_deploy/blob/main/llama_deploy/message_queues/redis.py#L152-L155
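
The hypothesis can be modelled without Redis. Redis Pub/Sub delivers each published message to every subscriber, whereas a work-queue pattern (for example Redis Streams consumer groups, or a list popped with BLPOP) hands each message to only one consumer. The classes below are a toy simulation of those two delivery modes, not llama_deploy or Redis client code, and whether llama_deploy can be configured for the second mode is not established in this thread:

```python
class BroadcastChannel:
    """Pub/sub style delivery: every subscriber gets every message."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, message):
        for handler in self.subscribers:
            handler(message)


class WorkQueue:
    """Competing consumers: each message goes to exactly one consumer."""

    def __init__(self):
        self.consumers = []
        self._next = 0

    def subscribe(self, handler):
        self.consumers.append(handler)

    def publish(self, message):
        # Round-robin stands in for "whichever consumer claims it first".
        handler = self.consumers[self._next % len(self.consumers)]
        self._next += 1
        handler(message)


def run(channel, replicas, task_ids):
    """Attach `replicas` workers, publish the tasks, and return the
    (replica_index, task_id) pairs that were processed."""
    processed = []
    for index in range(replicas):
        # Bind index as a default argument so each handler keeps its own.
        channel.subscribe(lambda task, index=index: processed.append((index, task)))
    for task_id in task_ids:
        channel.publish(task_id)
    return processed


broadcast_runs = run(BroadcastChannel(), replicas=3, task_ids=["task-1"])
queue_runs = run(WorkQueue(), replicas=3, task_ids=["task-1"])
```

Here `broadcast_runs` contains three entries for the single task (one per replica, matching the duplicated runs described above), while `queue_runs` contains one.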
