This Docker configuration uses OpenLLM for both GPU and serverless deployments on Runpod. It relies on an environment variable, `MODE_TO_RUN`, to dictate startup behavior: depending on its value, the container either launches `handler.py` for serverless operation or starts OpenSSH and Jupyter Lab for GPU pods. This adaptable setup allows straightforward modifications to meet various deployment requirements.
Note that some of these models, like Llama2 70B and Mixtral, are inherently very large. I recommend Mistral 7B or Llama2 13B for serverless, or looking into network volumes yourself if you want to use larger models. Network volumes are not a use case I am personally interested in testing, since loading such large models from a network volume into VRAM adds significant latency anyway.
Also, with these base files, you can use the Dockerfile_Iteration to update only handler.py without having to redownload the model. You can also modify Dockerfile_Iteration to add more dependencies and whatever else you want, knowing you have a stable base image to work from and don't have to preload such large models yourself.
- OpenLLM Mistral 7B - GPU Pod and Serverless Ready (what I personally work with, and confirmed working)
- OpenLLM Llama2 13B - Pod and Serverless Ready (confirmed working; I had to change the MAX_MODEL_LEN env variable to 4096 in the Runpod template)

The following larger models would be better stored on a network volume than baked into a Docker image:

- OpenLLM Llama2 70B - GPU Pod and Serverless Ready
- OpenLLM Mixtral 8x7B - Pod and Serverless Ready
Below is a table of the environment variables that can be passed to the Docker container. These variables enable customization of the deployment's behavior, offering flexibility for different scenarios.
| Variable | Description | Expected Values | Default Value |
|---|---|---|---|
| `MODE_TO_RUN` | Determines the container's operational mode, affecting whether `handler.py` runs and which services start. | `serverless`, `pod`, `both` | `pod` |
| `MODEL` | Identifier for the OpenLLM model to use; specifies which model is preloaded during the BUILD step of the Dockerfile. | Model identifier (string) | `mistralai/Mistral-7B-Instruct-v0.1` |
| `CONCURRENCY_MODIFIER` | Factor used to adjust the concurrency level for handling requests, allowing tuning based on workload. | Integer | `1` |
| `MAX_MODEL_LEN` | Maximum sequence length the model can handle, impacting memory usage and processing capabilities. | Integer | `25000` |
To find specific model identifiers for use with OpenLLM, visit the OpenLLM GitHub repository. This resource offers a comprehensive list of available models and their identifiers, which can be used to set the `MODEL` environment variable. THIS VARIABLE IS ONLY USED DURING THE BUILD STEP OF THE DOCKERFILE. If the runtime value differs from the build-time value and handler.py cannot find the model, it will try to download a new one, which can slow down your API. Only create this discrepancy if you want to change the model during development on a GPU pod.
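To make the build-time versus runtime distinction concrete, here is a purely illustrative sketch (not the actual handler.py; the variable names are assumptions) of why a mismatched MODEL value leads to a fresh download:

```python
# Illustrative only: why a runtime MODEL that differs from the build-time model
# forces a download. BUILD_TIME_MODEL mirrors what was baked in during `docker build`.
import os

BUILD_TIME_MODEL = "mistralai/Mistral-7B-Instruct-v0.1"  # preloaded into the image
runtime_model = os.environ.get("MODEL", BUILD_TIME_MODEL)

if runtime_model != BUILD_TIME_MODEL:
    # The weights for runtime_model are not cached in the image, so the first
    # request has to download them, adding significant cold-start latency.
    print(f"Warning: {runtime_model} is not preloaded; expect a slow first start.")
```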
The `MAX_MODEL_LEN` environment variable defines the maximum number of tokens (sequence length) that the model can process in a single operation. This setting is crucial for balancing performance against the capabilities of the underlying hardware, especially GPU memory utilization. Adjusting this variable can help avoid errors related to exceeding the maximum Key-Value (KV) cache length. If you encounter an error suggesting that the model's sequence length is larger than the KV cache can support, lower the `MAX_MODEL_LEN` value.
An example error might look like:
The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (27296). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Or:
ValueError: User-specified max_model_len (25000) is greater than the derived max_model_len (None=4096 in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value is correct and within the model context size.
If this is the case, either adjust handler.py with a hard-coded value or set the env variable to a lower value. I cannot test every model's maximum length, so if you see this error, just adjust it manually.
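As a rough illustration (not the actual handler.py; the 4096 limit is just an assumed example from a model's config.json), the value can be read from the environment and clamped before initializing the engine:

```python
# Illustrative only: read MAX_MODEL_LEN and clamp it to the model's own limit.
import os

max_model_len = int(os.environ.get("MAX_MODEL_LEN", "25000"))

model_limit = 4096  # assumed example limit taken from a model's config.json
engine_max_len = min(max_model_len, model_limit)
print(f"Using max_model_len={engine_max_len}")
```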
- `serverless`: Executes handler.py for serverless request handling, optimal for scalable, efficient deployments.
- `pod`: Initiates essential services like OpenSSH and Jupyter Lab, bypassing handler.py; suitable for development or tasks requiring GPU resources.
- `both`: Combines the functionality of both modes by executing handler.py and initiating the essential services. This is ideal for debugging serverless deployments with additional tool access, because you can set a minimum of 1 active worker and then use Jupyter Notebook / SSH to debug the worker. (Note: this `both` behavior might no longer be supported by Runpod, because worker templates currently don't let us specify HTTP / TCP ports.) A minimal sketch of this dispatch logic follows the list.
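The sketch below shows the MODE_TO_RUN dispatch for orientation only. It assumes a Python entrypoint purely for illustration; the actual image most likely uses a shell start script, and the service commands are stand-ins:

```python
# Illustrative sketch of the MODE_TO_RUN dispatch; not the image's real entrypoint.
import os
import subprocess

mode = os.environ.get("MODE_TO_RUN", "pod")

if mode in ("pod", "both"):
    # Development services for GPU pods (commands are stand-ins).
    subprocess.Popen(["service", "ssh", "start"])
    subprocess.Popen(["jupyter", "lab", "--allow-root", "--ip=0.0.0.0", "--no-browser"])

if mode in ("serverless", "both"):
    # Hand control to the serverless request handler.
    subprocess.run(["python", "handler.py"])
```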
- Build the Docker Image: Create your image using the Dockerfile, optionally specifying the `MODEL`, `CONCURRENCY_MODIFIER`, and `MAX_MODEL_LEN` variables as needed.

docker build --build-arg MODEL_ARG=<your_model_identifier> --build-arg MAX_MODEL_LEN=<desired_max_length> -t your_image_name .

- Run the Container: Start your container with the desired `MODE_TO_RUN`, `MAX_MODEL_LEN`, and any other environment variables.

docker run -e MODE_TO_RUN=serverless -e CONCURRENCY_MODIFIER=2 -e MAX_MODEL_LEN=25000 your_image_name
You can also leave the docker run command empty and let the defaults take over. If you are running on Runpod and want to modify the env variables, use Runpod's GUI to set them in the template instead of in the run command. This is helpful if, for example, you are debugging in a GPU pod and then want to deploy on serverless by just changing the env variable when making a new template.
- Accessing Services: Depending on the chosen mode:
  - In `serverless` and `both`, interact with the deployed model through the specified handler.
  - In `pod` and `both`, access Jupyter Lab and SSH services as provided.
For those using Depot to build and deploy containers, the command structure is slightly different. Here's how you can include the environment variables as build arguments with Depot:
depot build -t yourusername/serverlessllm:1.0 . \
--build-arg MODE_TO_RUN=pod \
--build-arg MODEL=mistralai/Mistral-7B-Instruct-v0.1 \
--build-arg CONCURRENCY_MODIFIER=1 \
--build-arg MAX_MODEL_LEN=25000 \
--push --platform linux/amd64
For traditional Docker builds, you can incorporate the environment variables into your build command like so:
docker build -t yourusername/serverlessllm:1.0 . \
--build-arg MODE_TO_RUN=pod \
--build-arg MODEL=mistralai/Mistral-7B-Instruct-v0.1 \
--build-arg CONCURRENCY_MODIFIER=1 \
--build-arg MAX_MODEL_LEN=25000
At least when deploying on Runpod, the other variables can be changed at runtime through Runpod's environment settings. The only one that really matters at build time is `MODEL`, so the command might look closer to this:
depot build -t yourusername/serverlessllm:1.0 . \
--build-arg MODEL=mistralai/Mistral-7B-Instruct-v0.1 \
--push --platform linux/amd64
Start using the container with GPU_POD (this is my Runpod public URL to the template, all ready to go for you).
If you want to deploy on serverless, it's super easy! Essentially copy the template but set the MODE environment variable to serverless. Check that the model repository names match up, as I might update template names, or you might be using a different model; find the right one by referring to Docker Hub for the model you want to use. MAKE SURE TO INCLUDE THE USERNAME/IMAGE:TAG; if you are missing the username, image, or tag, it won't work!
Adjust `MAX_MODEL_LEN` depending on what you are running. For example, what I show in the picture works for Mistral, but maybe not for Llama, which uses 4096. You can test in a GPU pod to see whether you get any errors when using a model that I haven't tested.
For the container size, look at the compressed Docker image size on Docker Hub; I generally add another 5 GB to 10 GB on top of that. You can play around to see what works.
If you end up wanting to change handler.py, I recommend building with a flag to target "Dockerfile_Iteration" after you have built the standard "Dockerfile" once. The models are then cached in the base image from the first build, so you only update handler.py and avoid the long wait to redownload the model.
docker build -f Dockerfile_Iteration -t your_image_name .
The methodology is:
- Build your first Dockerfile, which preloads the model plus everything else the first time.
- Launch a GPU pod on Runpod and test handler.py. If everything looks good, use the same image for serverless and change the env variable to alter how it starts up. If you need to iterate, iterate on Runpod, copy handler.py back locally, and build using the Dockerfile_Iteration file, so you don't have to redownload the whole model and can just keep iterating on handler.py.
- Once ready, relaunch on GPU pod and serverless until everything works.
Note! If the base models are already built for you by my CI/CD pipeline, you can just use the Dockerfile_Iteration file to target that specific repository and modify only handler.py. This way you avoid the long wait to redownload the model and just update handler.py.
Click here to see the example code I wrote for 'normal' and 'stream' mode: stream_client_side.py
I purposely kept the API very simple. You can construct the prompt however you want rather than separating out things like system / user prompts: just build it as f"{system_prompt} {user_prompt}" and send it to the API, which generates text based on the prompt and returns it. If you disagree, you can easily modify handler.py to behave differently for your use case, especially if you want to add more properties such as temperature. But I wanted to keep it as simple as possible at first.
The API expects input provided as a JSON object with the following structure:
{
"input": {
"prompt": "<user_prompt>",
"answerType": "<stream | normal>"
}
}
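As a minimal illustration of a normal-mode call (the real examples are in stream_client_side.py), the sketch below posts this payload to a Runpod serverless endpoint; the endpoint ID and API key are placeholders, and it assumes Runpod's synchronous /runsync route:

```python
# Hypothetical normal-mode request to a Runpod serverless endpoint running this image.
# ENDPOINT_ID and API_KEY are placeholders; see stream_client_side.py for real examples.
import requests

ENDPOINT_ID = "your_endpoint_id"
API_KEY = "your_runpod_api_key"

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Explain what a Docker image is.", "answerType": "normal"}},
    timeout=120,
)
print(response.json())
```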
Note you can see more examples in my stream_client_side.py
Stream Mode ("stream"): The API yields multiple JSON objects, each containing a part of the generated text:
How to call streammode on runpod:
Documentation on stream mode invocation:
{"text": "generated text part 1"}
{"text": "generated text part 2"}
...
Normal Mode ("normal"): The API returns a single JSON object with the complete generated text:
{
"text": "generated text"
}
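To show how the two modes relate, here is a hedged sketch of how handler.py might branch on answerType; generate() and generate_stream() are stand-ins for the real OpenLLM calls, not the actual implementation:

```python
# Hypothetical sketch of the answerType branch in handler.py.
# generate() / generate_stream() are placeholders for the real OpenLLM calls.

def generate(prompt: str) -> str:
    return f"echo: {prompt}"  # placeholder for the real model call

def generate_stream(prompt: str):
    for word in generate(prompt).split():  # placeholder token stream
        yield word + " "

def handler(job: dict):
    payload = job["input"]
    prompt = payload["prompt"]
    if payload.get("answerType") == "stream":
        # Stream mode: one JSON object per generated chunk.
        return ({"text": chunk} for chunk in generate_stream(prompt))
    # Normal mode: a single JSON object with the complete text.
    return {"text": generate(prompt)}
```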
The preload.py specifically looks for backends that list 'vllm', which is an optimized serving backend. If you want to use a different backend, you will need to modify the preload function and handler.py; otherwise, if nothing is specified, it will usually default to vllm.
Running:
openllm models
You can check which models support vllm by looking for it in their supported_backends list:
{
"baichuan": {
"architecture": "BaichuanForCausalLM",
"example_id": "baichuan-inc/baichuan2-7b-chat",
"supported_backends": [
"pt",
"vllm"
],
"installation": "pip install \"openllm[baichuan]\"",
"items": []
},
"chatglm": {
"architecture": "ChatGLMModel",
"example_id": "thudm/chatglm2-6b-int4",
"supported_backends": [
"pt",
"vllm"
],
"installation": "pip install \"openllm[chatglm]\"",
"items": []
},
"dolly_v2": {
"architecture": "GPTNeoXForCausalLM",
"example_id": "databricks/dolly-v2-3b",
"supported_backends": [
"pt",
"vllm",
"ctranslate"
],
"installation": "pip install openllm",
"items": []
},
"falcon": {
"architecture": "FalconForCausalLM",
"example_id": "tiiuae/falcon-40b-instruct",
"supported_backends": [
"pt",
"vllm",
"ctranslate"
],
"installation": "pip install \"openllm[falcon]\"",
"items": []
},
"flan_t5": {
"architecture": "T5ForConditionalGeneration",
"example_id": "google/flan-t5-large",
"supported_backends": [
"pt"
],
"installation": "pip install openllm",
"items": []
},
"gpt_neox": {
"architecture": "GPTNeoXForCausalLM",
"example_id": "eleutherai/gpt-neox-20b",
"supported_backends": [
"pt",
"vllm",
"ctranslate"
],
"installation": "pip install openllm",
"items": []
},
"llama": {
"architecture": "LlamaForCausalLM",
"example_id": "NousResearch/llama-2-70b-hf",
"supported_backends": [
"pt",
"vllm",
"ctranslate"
],
"installation": "pip install openllm",
"items": []
},
"mistral": {
"architecture": "MistralForCausalLM",
"example_id": "mistralai/Mistral-7B-v0.1",
"supported_backends": [
"pt",
"vllm",
"ctranslate"
],
"installation": "pip install openllm",
"items": [
"vllm-mistralai--mistral-7b-instruct-v0.1:9ab9e76e2b09f9f29ea2d56aa5bd139e4445c59e"
]
},
"mixtral": {
"architecture": "MixtralForCausalLM",
"example_id": "mistralai/Mixtral-8x7B-v0.1",
"supported_backends": [
"pt",
"vllm",
"ctranslate"
],
"installation": "pip install openllm",
"items": []
},
"mpt": {
"architecture": "MPTForCausalLM",
"example_id": "mosaicml/mpt-30b-chat",
"supported_backends": [
"pt",
"vllm",
"ctranslate"
],
"installation": "pip install \"openllm[mpt]\"",
"items": []
},
"opt": {
"architecture": "OPTForCausalLM",
"example_id": "facebook/opt-1.3b",
"supported_backends": [
"pt",
"vllm",
"ctranslate"
],
"installation": "pip install openllm",
"items": []
},
"phi": {
"architecture": "PhiForCausalLM",
"example_id": "microsoft/phi-1_5",
"supported_backends": [
"pt",
"vllm"
],
"installation": "pip install openllm",
"items": []
},
"qwen": {
"architecture": "QWenLMHeadModel",
"example_id": "qwen/Qwen-14B-Chat",
"supported_backends": [
"pt",
"vllm"
],
"installation": "pip install \"openllm[qwen]\"",
"items": []
},
"stablelm": {
"architecture": "GPTNeoXForCausalLM",
"example_id": "stabilityai/stablelm-base-alpha-3b",
"supported_backends": [
"pt",
"vllm",
"ctranslate"
],
"installation": "pip install openllm",
"items": []
},
"starcoder": {
"architecture": "GPTBigCodeForCausalLM",
"example_id": "bigcode/starcoderbase",
"supported_backends": [
"pt",
"vllm",
"ctranslate"
],
"installation": "pip install \"openllm[starcoder]\"",
"items": []
},
"yi": {
"architecture": "YiForCausalLM",
"example_id": "01-ai/Yi-34B",
"supported_backends": [
"pt",
"vllm"
],
"installation": "pip install openllm",
"items": []
}
}
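If you want to check this programmatically, the sketch below filters the JSON above for models whose supported_backends include vllm; it assumes you have saved the output to a file named models.json (a hypothetical filename):

```python
# Filter the `openllm models` JSON for models that support the vllm backend.
# Assumes the output shown above was saved to models.json (hypothetical filename).
import json

with open("models.json") as f:
    models = json.load(f)

vllm_models = [name for name, info in models.items()
               if "vllm" in info.get("supported_backends", [])]
print(vllm_models)
```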