Made with Open Source ❤️ from the team at Spryngtime (YC W22).
The OpenAI Load Balancer is a Python library that distributes API requests across multiple endpoints (both OpenAI and Azure are supported). It implements round-robin load balancing and retries failed requests with exponential backoff.
Supported OpenAI functions: ChatCompletion, Embedding, and Completion (soon to be deprecated)
- Round Robin Load Balancing: Distributes requests evenly across a set of API endpoints.
- Exponential Backoff: Includes retry logic with exponential backoff for each API call (both mechanisms are sketched after this list).
- Failure Detection: Temporarily removes failed endpoints based on configurable thresholds.
- Flexible Configuration: Customizable settings for endpoints, failure thresholds, cooldown periods, and more.
- Easy Integration: Designed to be easily integrated into projects that use OpenAI's API.
- Fallback: If OpenAI's endpoint goes down but your Azure endpoint is still up, your service stays up, and vice versa.
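Conceptually, the round-robin rotation and the backoff retry combine like the following minimal sketch (illustrative only, not the library's internal code; `call` is a hypothetical stand-in for any function that issues a request to a given endpoint):

```python
import itertools
import random
import time

def round_robin_with_backoff(endpoints, call, max_retries=3):
    """Rotate through endpoints, retrying failures with exponential backoff."""
    rotation = itertools.cycle(endpoints)  # round-robin order
    for attempt in range(max_retries):
        endpoint = next(rotation)
        try:
            return call(endpoint)
        except Exception:
            # Back off exponentially (1s, 2s, 4s, ...) with a little jitter
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("All endpoints failed after retries")
```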
To install the OpenAI Load Balancer, run the following command:

```bash
pip install openai-load-balancer
```

PyPI: https://pypi.org/project/openai-load-balancer/
First, set up your API endpoints. The API keys (and, for Azure, the base URL) are read from the environment variables named in the configuration.
```python
# Example configuration
ENDPOINTS = [
    {
        "api_type": "azure",
        "base_url": "AZURE_API_BASE_URL_1",
        "api_key_env": "AZURE_API_KEY_1",
        "version": "2023-05-15"
    },
    {
        "api_type": "open_ai",
        "base_url": "https://api.openai.com/v1",
        "api_key_env": "OPENAI_API_KEY_1",
        "version": None
    }
    # Add more configurations as needed
]
```
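Because the keys (and the Azure base URL) are looked up in the environment at runtime, a quick pre-flight check can save debugging time. This is a hypothetical helper, not part of the library:

```python
import os

# Hypothetical pre-flight check (not part of the library): fail fast if an
# environment variable named in the config is missing. For Azure entries,
# base_url is also the name of an environment variable.
for endpoint in ENDPOINTS:
    required = [endpoint["api_key_env"]]
    if endpoint["api_type"] == "azure":
        required.append(endpoint["base_url"])
    for name in required:
        if name not in os.environ:
            raise EnvironmentError(f"Missing environment variable: {name}")
```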
If you are using both Azure and OpenAI, specify the mapping from OpenAI model names to your Azure engine (deployment) names:
```python
MODEL_ENGINE_MAPPING = {
    "gpt-4": "gpt4",
    "gpt-3.5-turbo": "gpt-35-turbo",
    "text-embedding-ada-002": "text-embedding-ada-002"
    # Add more mappings as needed
}
```
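The mapping matters because Azure addresses deployments by engine name, while OpenAI uses the model name directly. A simplified illustration of the lookup (not the library's code; `resolve_engine` is a hypothetical name):

```python
def resolve_engine(model: str, api_type: str) -> str:
    # Azure deployments are addressed by engine name; OpenAI uses the model
    # name directly. Fall back to the model name if no mapping exists.
    if api_type == "azure":
        return MODEL_ENGINE_MAPPING.get(model, model)
    return model

assert resolve_engine("gpt-3.5-turbo", "azure") == "gpt-35-turbo"
assert resolve_engine("gpt-3.5-turbo", "open_ai") == "gpt-3.5-turbo"
```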
Import and initialize the load balancer with the endpoints and mapping:
```python
from openai_load_balancer import initialize_load_balancer

openai_load_balancer = initialize_load_balancer(
    endpoints=ENDPOINTS, model_engine_mapping=MODEL_ENGINE_MAPPING)
```
Simply replace `openai` with `openai_load_balancer` in your function calls:
```python
response = openai_load_balancer.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! This is a request."}
    ],
    # Additional parameters...
)
```
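The same substitution works for the other supported functions. For example, an embedding request, assuming the wrapper mirrors the `openai.Embedding.create` signature:

```python
response = openai_load_balancer.Embedding.create(
    model="text-embedding-ada-002",
    input="Hello! This is a request."
)
embedding = response["data"][0]["embedding"]
```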
You can also configure the load balancer with the following variables:

```python
from datetime import timedelta

# Number of consecutive failed requests before an endpoint is temporarily
# marked as inactive
FAILURE_THRESHOLD = 5

# Minimum amount of time an endpoint stays inactive before it is reset to active
COOLDOWN_PERIOD = timedelta(minutes=10)

# Whether to enable load balancing. If disabled, the first active endpoint is
# always used, and the other endpoints are used only if it fails.
LOAD_BALANCING_ENABLED = True
```
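To illustrate how FAILURE_THRESHOLD and COOLDOWN_PERIOD interact, here is a conceptual sketch of per-endpoint failure tracking (not the library's implementation):

```python
from datetime import datetime

class EndpointState:
    """Conceptual sketch: track consecutive failures for one endpoint."""

    def __init__(self):
        self.consecutive_failures = 0
        self.inactive_since = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            self.inactive_since = datetime.now()  # mark inactive

    def record_success(self):
        self.consecutive_failures = 0
        self.inactive_since = None

    def is_active(self):
        if self.inactive_since is None:
            return True
        # Reactivate once the cooldown period has elapsed
        if datetime.now() - self.inactive_since >= COOLDOWN_PERIOD:
            self.record_success()
            return True
        return False
```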
Contributions to the OpenAI Load Balancer are welcome!
This project is licensed under the MIT License.