Add Keep-Alive Functionality for GPU Resource Optimization in LitServe #304

Open
skyking363 opened this issue Sep 29, 2024 · 12 comments

@skyking363


🚀 Feature

I would like to propose adding a feature to LitServe that enables models to be deployed with a keep-alive functionality, similar to what Ollama provides. This feature would allow the model to be unloaded from GPU memory when not in use and automatically loaded back when required.

Motivation

This feature would be helpful for users working with limited GPU resources. Currently, the GPU can become a bottleneck when multiple models are deployed. By releasing the GPU resources when a model is idle and reloading them on demand, we could improve efficiency and free up resources for other tasks.

Pitch

The main objective is to add a mechanism, perhaps through environment variables, that allows the system to automatically unload models when idle and reload them when needed, similar to Ollama's keep-alive functionality.
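
To make the pitch concrete, here is a rough sketch of what this could look like on top of LitServe's existing LitAPI hooks (setup, decode_request, predict, encode_response). The LITSERVE_KEEP_ALIVE environment variable, the _load_model placeholder, and the watchdog thread are purely illustrative, not existing LitServe features:

```python
import os
import threading
import time

import torch
import litserve as ls

# Hypothetical knob: seconds of inactivity before the model is unloaded.
KEEP_ALIVE_SECONDS = float(os.getenv("LITSERVE_KEEP_ALIVE", "300"))


class KeepAliveAPI(ls.LitAPI):
    def setup(self, device):
        self.device = device
        self.model = None                      # loaded lazily on the first request
        self._last_used = time.monotonic()
        self._lock = threading.Lock()
        threading.Thread(target=self._idle_watchdog, daemon=True).start()

    def _load_model(self):
        # Placeholder: swap in real model-loading code here.
        return torch.nn.Linear(16, 4).to(self.device)

    def _idle_watchdog(self):
        # Periodically drop the model if no request arrived within the keep-alive window.
        while True:
            time.sleep(30)
            with self._lock:
                idle = time.monotonic() - self._last_used
                if self.model is not None and idle > KEEP_ALIVE_SECONDS:
                    self.model = None          # drop the only reference to the weights
                    torch.cuda.empty_cache()   # return cached GPU memory to the driver

    def decode_request(self, request):
        return torch.tensor(request["input"], dtype=torch.float32)

    def predict(self, x):
        with self._lock:                       # block the watchdog while a request runs
            if self.model is None:             # reload on demand after an unload
                self.model = self._load_model()
            self._last_used = time.monotonic()
            return self.model(x.to(self.device))

    def encode_response(self, output):
        return {"output": output.tolist()}


if __name__ == "__main__":
    server = ls.LitServer(KeepAliveAPI(), accelerator="auto")
    server.run(port=8000)
```

The same lifecycle could of course live inside LitServe itself (configured per server or per model) rather than in user code; the sketch is only meant to show the load/unload behavior.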

Alternatives

An alternative solution could involve manually managing GPU resources at the deployment level, but this can be cumbersome and error-prone. Automation via LitServe would streamline this process.

Additional context

This idea is inspired by a similar feature discussed in the Ollama repository: Ollama keep-alive environment variables. It could significantly optimize resource usage in environments where GPUs are scarce.

skyking363 added the enhancement (New feature or request) label on Sep 29, 2024
@aniketmaurya
Collaborator

hi @skyking363, thank you for your interest in LitServe and for suggesting a new feature. LitServe is designed for building high-throughput model servers at scale, while Ollama is intended to run LLMs on personal devices.

  • You can also use Ollama along with LitServe by loading the Ollama client in the LitAPI.setup method (see the sketch after this list).
  • Incorporating this feature directly into LitServe would take it in a different direction than our target.
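
A rough sketch of that first option, assuming the ollama Python package (pip install ollama) and an Ollama server already running locally on its default port; the model tag and response handling here are illustrative:

```python
import litserve as ls
from ollama import Client  # pip install ollama


class OllamaProxyAPI(ls.LitAPI):
    def setup(self, device):
        # The heavy model lives inside Ollama, which already handles keep-alive;
        # LitServe only keeps a lightweight HTTP client here.
        self.client = Client(host="http://localhost:11434")

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # "llama3" is just an example model tag pulled into Ollama beforehand.
        result = self.client.chat(
            model="llama3",
            messages=[{"role": "user", "content": prompt}],
        )
        return result["message"]["content"]

    def encode_response(self, output):
        return {"output": output}


if __name__ == "__main__":
    ls.LitServer(OllamaProxyAPI(), accelerator="cpu").run(port=8000)
```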

Tagging @lantiga @williamFalcon to hear their thoughts.

@aceliuchanghong

I am looking for this feature too.

I was trying to delete the model or empty the GPU, but it doesn't work.
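
Roughly what I tried looks like the snippet below (assuming a CUDA device). From what I understand, if anything else still holds a reference to the model or to tensors it produced, the memory is not actually released, which may be why it appeared to do nothing for me:

```python
import gc
import torch

# Build a throwaway model on the GPU just to have something to release.
model = torch.nn.Linear(4096, 4096).cuda()
print(torch.cuda.memory_allocated() // 2**20, "MiB allocated")

# This only frees the weights if `model` is the LAST reference to them; caches,
# closures, optimizer state, or stored outputs will keep the memory alive.
del model
gc.collect()                  # collect reference cycles still pointing at the weights
torch.cuda.empty_cache()      # return cached blocks from PyTorch's allocator to the driver
print(torch.cuda.memory_allocated() // 2**20, "MiB allocated after unload")
```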

Maybe it could be an option. That would be great.

Thank you for reading and replying.

@aniketmaurya
Collaborator

hi @aceliuchanghong, thank you for joining the discussion. A few questions:

  • To empty the GPU, you can stop the server. I'm curious why it needs to be unloaded programmatically.
  • What kind of model (and what size) are you serving and wanting to unload?

@williamFalcon
Contributor

williamFalcon commented Sep 30, 2024

@skyking363 @aceliuchanghong thanks for your requests!

can you explain the motivation a bit more clearly with a concrete example?

  • are you running multiple servers on the same GPU?
  • what problem is this solving for you? keeping GPU RAM free? if so, why? are you running other servers on the machine?
  • are you expecting the processes to disconnect also?
  • remember ollama is very different from lightning. if all you are doing is serving an llm (like ollama) then turning a server on/off might be okay. But what if you are serving something more complex like RAG with multiple models, DB connections, vector caches, etc… which model do you offload? when? why?

etc….

basically i have about a million questions here haha. So, it would be better to understand concretely based on a real-world example that shows what problem you want to solve and how this would solve it (a lifecycle diagram might help too).

@aceliuchanghong

aceliuchanghong commented Sep 30, 2024

It happens when I use a vision model to OCR some complicated images, so I use LitServe as the API for it.

But I only use it occasionally, so I want the GPU to be free during the long stretches when I'm not using it. I just want the API service to load the model weights when I send a request, and to release the GPU memory after maybe 5 or 10 minutes without any requests.

Thank you for replying~

To add: I only have one machine with 4 L20 GPUs, so there are many services running on it all the time... xd

@grumpyp
Contributor

grumpyp commented Oct 1, 2024

@aniketmaurya Would it make sense to introduce some kind of model unloading when the server has gone a certain amount of time without any requests, and then lazy-load the model back into memory, similar to an idle state, let's say?

I have not thought of an implementation scenario yet, but as @aceliuchanghong mentions, some people run many services on one machine.

@aniketmaurya
Collaborator

I have not thought of an implementation scenario yet, but as @aceliuchanghong mentions, some people run many services on one machine.

I think the main question here is: would you do this in a production environment?

@aceliuchanghong

I have not thought of an implementation scenario yet, but as @aceliuchanghong mentions, some people run many services on one machine.

I think the main question here is: would you do this in a production environment?

Yeah, we use LitServe in a production environment, and that's exactly why I don't use FastAPI or something else: LitServe supports LLMs (etc.) very well.

@skyking363
Author

hi @skyking363, thank you for your interest in LitServe and for suggesting a new feature. LitServe is designed for building high-throughput model servers at scale, while Ollama is intended to run LLMs on personal devices.

  • You can also use Ollama along with LitServe by loading the Ollama client in the LitAPI.setup method.
  • Incorporating this feature directly into LitServe would take it in a different direction than our target.

Tagging @lantiga @williamFalcon to hear their thoughts.

Thank you for your reply. I currently choose to use LitServe instead of Ollama for two main reasons:

  • LitServe offers more flexibility compared to Ollama, such as the ability to return both sparse and dense embedding vectors during the embedding process, something that Ollama cannot do.
  • LitServe demonstrates superior GPU efficiency and throughput, which is crucial for my application needs.

This is why I prefer to move away from Ollama, and therefore I won't be adopting the suggestion to load the Ollama client.

Thank you again for your suggestions and support!

@skyking363
Author

@skyking363 @aceliuchanghong thanks for your requests!

can you explain the motivation a bit more clearly with a concrete example?

  • are you running multiple servers on the same GPU?
  • what problem is this solving for you? keeping GPU RAM free? if so, why? are you running other servers on the machine?
  • are you expecting the processes to disconnect also?
  • remember ollama is very different from lightning. if all you are doing is serving an llm (like ollama) then turning a server on/off might be okay. But what if you are serving something more complex like RAG with multiple models, DB connections, vector caches, etc… which model do you offload? when? why?

etc….

basically i have about a million questions here haha. So, it would be better to understand concretely based on a real-world example that shows what problem you want to solve and how this would solve it (a lifecycle diagram might help too).

Thank you for your reply, @williamFalcon !

To provide a more concrete example of my use case:

I am running multiple services on a machine with 8 A100 GPUs. These services involve running multiple LLMs simultaneously (e.g., Llama 3.1 405B, Llama 3.2 90B, etc.), which are either used for user chat interactions or periodic tasks (such as ingesting data into a database). Additionally, I have some long-running API services that utilize multiple models, including visual models for tasks like optical character recognition (OCR). However, these models are not always in use—there are often long idle periods between requests.

My goal is to release GPU memory during these idle periods so that other services can utilize the resources without shutting down the API service itself. Ideally, the models would automatically load when a request comes in and unload after a prolonged period of inactivity. This way, we can more efficiently utilize GPU resources without manual management or service restarts.

This mechanism would allow us to manage limited GPU resources more flexibly and efficiently, especially when running services involving RAG or multi-model combinations. Of course, I understand that in more complex production environments, automatic unloading may not always be appropriate, but in scenarios where models are only used at specific times, this feature could be extremely beneficial.

Thank you again for your detailed response and suggestions! I will consider using a lifecycle diagram to further clarify how this functionality could be implemented.

@fcakyon

fcakyon commented Oct 6, 2024

This feature is already available in Lightning Studio (scale to zero). Can you add a simpler version of it to LitServe?

cc: @williamFalcon @aniketmaurya

@aniketmaurya
Collaborator

Thank you for the detailed response, @skyking363!! We will take up this feature request and keep you updated.
