Releases: BerriAI/litellm
v1.23.8
Full Changelog: v1.23.7...v1.23.8
v1.23.7
What's Changed
- [FEAT] ui - view total proxy spend / budget by @ishaan-jaff in #1915
- [FEAT] Bedrock set timeouts on litellm.completion by @ishaan-jaff in #1919
- [FEAT] Use LlamaIndex with Proxy - Support azure deployments for /embeddings - by @ishaan-jaff in #1921
- [FIX] Verbose Logger - don't double print CURL command by @ishaan-jaff in #1924
- [FEAT] Set timeout for bedrock on proxy by @ishaan-jaff in #1922
- feat(proxy_server.py): show admin global spend as time series data by @krrishdholakia in #1920
1. Bedrock Set Timeouts
Usage - litellm.completion
import litellm

response = litellm.completion(
    model="bedrock/anthropic.claude-instant-v1",
    timeout=0.01,
    messages=[{"role": "user", "content": "hello, write a 20 pg essay"}],
)
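With a timeout this aggressive the call is expected to fail fast; here is a minimal sketch of handling that, assuming litellm surfaces it via its exception mapping as litellm.exceptions.Timeout:

import litellm

try:
    response = litellm.completion(
        model="bedrock/anthropic.claude-instant-v1",
        timeout=0.01,  # deliberately tiny so the request times out
        messages=[{"role": "user", "content": "hello, write a 20 pg essay"}],
    )
except litellm.exceptions.Timeout as e:
    # hypothetical fallback: retry with a larger timeout or route to another deployment
    print(f"Bedrock call timed out: {e}")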
Usage on Proxy config.yaml
model_list:
  - model_name: BEDROCK_GROUP
    litellm_params:
      model: bedrock/cohere.command-text-v14
      timeout: 0.0001
2. View total proxy spend / budget
3. Use LlamaIndex with Proxy - Support azure deployments for /embeddings
Send Embedding requests like this
http://0.0.0.0:4000/openai/deployments/azure-embedding-model/embeddings?api-version=2023-07-01-preview
This allows users to use LlamaIndex's AzureOpenAI client with LiteLLM.
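For a quick check without LlamaIndex, the same route can be exercised with the OpenAI SDK's AzureOpenAI client pointed at the proxy. A minimal sketch; the endpoint, key, and model name below are the placeholder values from this example:

from openai import AzureOpenAI

# point the Azure client at the LiteLLM proxy; it builds the
# /openai/deployments/<model>/embeddings?api-version=... route shown above
client = AzureOpenAI(
    azure_endpoint="http://0.0.0.0:4000",
    api_key="sk-1234",
    api_version="2023-07-01-preview",
)

response = client.embeddings.create(
    model="azure-embedding-model",  # model_name / deployment set on the proxy
    input=["good morning from litellm"],
)
print(response.data[0].embedding[:5])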
Use LlamaIndex with LiteLLM Proxy
import os
from dotenv import load_dotenv

load_dotenv()

from llama_index.llms import AzureOpenAI
from llama_index.embeddings import AzureOpenAIEmbedding
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

llm = AzureOpenAI(
    engine="azure-gpt-3.5",
    temperature=0.0,
    azure_endpoint="http://0.0.0.0:4000",
    api_key="sk-1234",
    api_version="2023-07-01-preview",
)

embed_model = AzureOpenAIEmbedding(
    deployment_name="azure-embedding-model",
    azure_endpoint="http://0.0.0.0:4000",
    api_key="sk-1234",
    api_version="2023-07-01-preview",
)

# response = llm.complete("The sky is a beautiful blue and")
# print(response)

documents = SimpleDirectoryReader("llama_index_data").load_data()

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
Full Changelog: v1.23.5...v1.23.7
v1.23.5
What's Changed
- fix(proxy_server.py): enable aggregate queries via /spend/keys by @krrishdholakia in #1901
- fix(factory.py): mistral message input fix by @krrishdholakia in #1902
Full Changelog: v1.23.4...v1.23.5
v1.23.4
What's Changed
- [FEAT] 76 % Faster s3 logging Proxy / litellm.acompletion / router.acompletion 🚀 by @ishaan-jaff in #1892
- (feat) Add support for AWS credentials from profile file by @dleen in #1895
- Litellm langfuse error logging - log input by @krrishdholakia in #1898
- Admin UI - View Models, TPM, RPM Limit of a Key by @ishaan-jaff in #1903
- Admin UI - show delete confirmation when deleting keys by @ishaan-jaff in #1904
Full Changelog: v1.23.3...v1.23.4
v1.23.3
What's Changed
- [FEAT] 78% Faster s3 Cache⚡️- Proxy/ litellm.acompletion/ litellm.Router.acompletion by @ishaan-jaff in #1891
Full Changelog: v1.23.2...v1.23.3
v1.23.2
What's Changed 🐬
- [FEAT] Azure Pricing - based on base_model in model_info
- [Feat] Semantic Caching - Track Cost of using embedding, Use Langfuse Trace ID
- [Feat] Slack Alert when budget tracking fails
1. [FEAT] Azure Pricing - based on base_model in model_info by @ishaan-jaff in #1874
Azure Pricing - Use Base model for cost calculation
Why?
Azure returns gpt-4 in the response when azure/gpt-4-1106-preview is used, so we were previously using gpt-4 pricing when calculating response_cost.
How to use - set base_model on config.yaml
model_list:
  - model_name: azure-gpt-3.5
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
    model_info:
      base_model: azure/gpt-4-1106-preview
View Cost calculated on Langfuse
This used the correct pricing for azure/gpt-4-1106-preview:
cost = (9 prompt tokens * $0.00001) + (28 completion tokens * $0.00003)
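A minimal sketch of the same arithmetic, assuming litellm.cost_per_token is used with the base model name (the token counts are the ones from the example above):

import litellm

# cost for 9 prompt tokens and 28 completion tokens at gpt-4-1106-preview rates
prompt_cost, completion_cost = litellm.cost_per_token(
    model="azure/gpt-4-1106-preview",
    prompt_tokens=9,
    completion_tokens=28,
)
print(prompt_cost + completion_cost)  # ~= (9 * 0.00001) + (28 * 0.00003)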
2. [Feat] Semantic Caching - Track Cost of using embedding, Use Langfuse Trace ID by @ishaan-jaff in #1878
- If a trace_id is passed, we'll place the semantic cache embedding call in the same trace (see the sketch below)
- We now track cost for the API key that makes the embedding call for semantic caching
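A minimal sketch of passing an existing Langfuse trace_id, assuming it is forwarded through the metadata param as in litellm's Langfuse integration (the trace id below is a placeholder):

import litellm
from litellm import completion

litellm.success_callback = ["langfuse"]  # log calls (including cache embedding calls) to Langfuse

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "what's the weather in SF"}],
    metadata={"trace_id": "my-existing-trace-id"},  # hypothetical existing trace
)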
3. [Feat] Slack Alert when budget tracking fails by @ishaan-jaff in #1877
Full Changelog: v1.23.1...v1.23.2
v1.23.1
What's Changed
- [Feat] add azure/gpt-4-0125-preview by @ishaan-jaff in #1876
Full Changelog: v1.23.0...v1.23.1
v1.23.0
What's Changed
- feat(ui): enable admin to view all valid keys created on the proxy by @krrishdholakia in #1843
- fix(proxy_server.py): prisma client fixes for high traffic by @krrishdholakia in #1860
Full Changelog: v1.22.11...v1.23.0
v1.22.11
Full Changelog: v1.22.10...v1.22.11
v1.22.10
What's Changed
- fix(proxy_server.py): do a health check on db before returning if proxy ready (if db connected) by @krrishdholakia in #1856
- fix(utils.py): return finish reason for last vertex ai chunk by @krrishdholakia in #1847
- fix(proxy/utils.py): if langfuse trace id passed in, include in slack alert by @krrishdholakia in #1839
- [Feat] Budgets for 'user' param passed to /chat/completions, /embeddings etc by @ishaan-jaff in #1859
Semantic Caching Support - Add Semantic Caching to litellm💰 by @ishaan-jaff in #1829
- Use with LiteLLM Proxy https://docs.litellm.ai/docs/proxy/caching
- Use with litellm.completion https://docs.litellm.ai/docs/caching/redis_cache
Usage with Proxy
Step 1: Add cache to the config.yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
  - model_name: azure-embedding-model
    litellm_params:
      model: azure/azure-embedding-model
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"

litellm_settings:
  set_verbose: True
  cache: True # set cache responses to True, litellm defaults to using a redis cache
  cache_params:
    type: "redis-semantic"
    similarity_threshold: 0.8 # similarity threshold for semantic cache
    redis_semantic_cache_embedding_model: azure-embedding-model # set this to a model_name set in model_list
Step 2: Add Redis Credentials to .env
Set either REDIS_URL or REDIS_HOST in your OS environment to enable caching.
REDIS_URL = "" # REDIS_URL='redis://username:password@hostname:port/database'
## OR ##
REDIS_HOST = "" # REDIS_HOST='redis-18841.c274.us-east-1-3.ec2.cloud.redislabs.com'
REDIS_PORT = "" # REDIS_PORT='18841'
REDIS_PASSWORD = "" # REDIS_PASSWORD='liteLlmIsAmazing'
Additional kwargs
You can pass in any additional redis.Redis arg by storing the variable + value in your OS environment, like this:
REDIS_<redis-kwarg-name> = ""
Step 3: Run proxy with config
$ litellm --config /path/to/config.yaml
That's it!
(You'll see semantic-similarity on Langfuse if you set langfuse as a success_callback.)
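To sanity-check the cache end to end, here is a minimal sketch using the OpenAI SDK against the proxy (the endpoint and key are the placeholder values from this example; the second, semantically similar request should be served from the cache):

from openai import OpenAI

client = OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")

# two semantically similar prompts - the second should hit the redis-semantic cache
for prompt in ["write a one sentence poem about the moon",
               "write a 1 sentence poem about the moon"]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)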
Usage with litellm.completion
import os
import random

import litellm
from litellm import completion
from litellm.caching import Cache

litellm.cache = Cache(
    type="redis-semantic",
    host=os.environ["REDIS_HOST"],
    port=os.environ["REDIS_PORT"],
    password=os.environ["REDIS_PASSWORD"],
    similarity_threshold=0.8,
    redis_semantic_cache_embedding_model="text-embedding-ada-002",
)

random_number = random.randint(1, 100000)
response1 = completion(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": f"write a one sentence poem about: {random_number}",
        }
    ],
    max_tokens=20,
)
print(f"response1: {response1}")

random_number = random.randint(1, 100000)
response2 = completion(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": f"write a one sentence poem about: {random_number}",
        }
    ],
    max_tokens=20,
)
print(f"response2: {response2}")

assert response1.id == response2.id  # semantic cache hit - the same cached response is returned
Budgets for 'user' param passed to /chat/completions, /embeddings etc
Set budgets for the 'user' param passed to /chat/completions, without needing to create a key for every user.
docs: https://docs.litellm.ai/docs/proxy/users
How to Use
- Define a litellm.max_user_budget on your config
litellm_settings:
  max_budget: 10 # global budget for proxy
  max_user_budget: 0.0001 # budget for 'user' passed to /chat/completions
- Make a /chat/completions call, pass 'user' - First call Works
curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer sk-zi5onDRdHGD24v0Zdn7VBA' \
    --data '{
        "model": "azure-gpt-3.5",
        "user": "ishaan3",
        "messages": [
            {
                "role": "user",
                "content": "what time is it"
            }
        ]
    }'
- Make a /chat/completions call, pass 'user' - Call Fails, since 'ishaan3' over budget
curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer sk-zi5onDRdHGD24v0Zdn7VBA' \
    --data '{
        "model": "azure-gpt-3.5",
        "user": "ishaan3",
        "messages": [
            {
                "role": "user",
                "content": "what time is it"
            }
        ]
    }'
Error
{"error":{"message":"Authentication Error, ExceededBudget: User ishaan3 has exceeded their budget. Current spend: 0.0008869999999999999; Max Budget: 0.0001","type":"auth_error","param":"None","code":401}}%
Full Changelog: v1.22.9...v1.22.10