
Releases: BerriAI/litellm

v1.23.8

10 Feb 17:14

Full Changelog: v1.23.7...v1.23.8

v1.23.7

10 Feb 04:59

1. Bedrock Set Timeouts

Usage - litellm.completion

response = litellm.completion(
    model="bedrock/anthropic.claude-instant-v1",
    timeout=0.01,
    messages=[{"role": "user", "content": "hello, write a 20 pg essay"}],
)
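
If the deadline is exceeded, the request raises a timeout error you can catch - a minimal sketch, assuming the error surfaces as litellm.exceptions.Timeout:

import litellm

try:
    response = litellm.completion(
        model="bedrock/anthropic.claude-instant-v1",
        timeout=0.01,  # intentionally tiny, so the call trips the timeout
        messages=[{"role": "user", "content": "hello, write a 20 pg essay"}],
    )
except litellm.exceptions.Timeout as e:
    print(f"Bedrock request timed out: {e}")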

Usage on Proxy config.yaml

model_list:
  - model_name: BEDROCK_GROUP
    litellm_params:
      model: bedrock/cohere.command-text-v14
      timeout: 0.0001
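
Then send a request through the proxy as usual - a minimal sketch, assuming the proxy runs on 0.0.0.0:4000 and sk-1234 is a valid key:

curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer sk-1234' \
    --data '{
    "model": "BEDROCK_GROUP",
    "messages": [{"role": "user", "content": "hello, write a 20 pg essay"}]
    }'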

2. View total proxy spend / budget


3. Use LlamaIndex with Proxy - Support azure deployments for /embeddings

Send embedding requests like this:

http://0.0.0.0:4000/openai/deployments/azure-embedding-model/embeddings?api-version=2023-07-01-preview
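
For example, with curl (assuming the proxy is running on 0.0.0.0:4000 and sk-1234 is a valid key):

curl --location 'http://0.0.0.0:4000/openai/deployments/azure-embedding-model/embeddings?api-version=2023-07-01-preview' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer sk-1234' \
    --data '{"input": ["hello world"]}'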

This allows users to use LlamaIndex's AzureOpenAI client with LiteLLM.

Use LlamaIndex with LiteLLM Proxy

import os

from dotenv import load_dotenv

load_dotenv()

from llama_index.llms import AzureOpenAI
from llama_index.embeddings import AzureOpenAIEmbedding
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

llm = AzureOpenAI(
    engine="azure-gpt-3.5",
    temperature=0.0,
    azure_endpoint="http://0.0.0.0:4000",
    api_key="sk-1234",
    api_version="2023-07-01-preview",
)

embed_model = AzureOpenAIEmbedding(
    deployment_name="azure-embedding-model",
    azure_endpoint="http://0.0.0.0:4000",
    api_key="sk-1234",
    api_version="2023-07-01-preview",
)


# response = llm.complete("The sky is a beautiful blue and")
# print(response)

documents = SimpleDirectoryReader("llama_index_data").load_data()
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)

Full Changelog: v1.23.5...v1.23.7

v1.23.5

09 Feb 07:30

What's Changed

Full Changelog: v1.23.4...v1.23.5

v1.23.4

09 Feb 06:08

What's Changed

  • [FEAT] 76% Faster s3 logging Proxy / litellm.acompletion / router.acompletion 🚀 by @ishaan-jaff in #1892
  • (feat) Add support for AWS credentials from profile file by @dleen in #1895 (see the sketch after this list)
  • Litellm langfuse error logging - log input by @krrishdholakia in #1898
  • Admin UI - View Models, TPM, RPM Limit of a Key by @ishaan-jaff in #1903
  • Admin UI - show delete confirmation when deleting keys by @ishaan-jaff in #1904
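
For the AWS-profile change, the idea is that Bedrock calls can read credentials from a named profile in ~/.aws/credentials instead of environment variables - a minimal sketch, where the aws_profile_name parameter and the "bedrock-dev" profile are assumptions:

import litellm

# credentials are resolved from the named profile in ~/.aws/credentials
# ("bedrock-dev" and the aws_profile_name parameter are assumptions here)
response = litellm.completion(
    model="bedrock/anthropic.claude-instant-v1",
    aws_profile_name="bedrock-dev",
    messages=[{"role": "user", "content": "hello"}],
)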


Full Changelog: v1.23.3...v1.23.4

v1.23.3

08 Feb 19:45

What's Changed

  • [FEAT] 78% Faster s3 Cache⚡️- Proxy/ litellm.acompletion/ litellm.Router.acompletion by @ishaan-jaff in #1891

Full Changelog: v1.23.2...v1.23.3

v1.23.2

08 Feb 04:41

What's Changed 🐬

  1. [FEAT] Azure Pricing - based on base_model in model_info
  2. [Feat] Semantic Caching - Track Cost of using embedding, Use Langfuse Trace ID
  3. [Feat] Slack Alert when budget tracking fails

1. [FEAT] Azure Pricing - based on base_model in model_info by @ishaan-jaff in #1874

Azure Pricing - Use Base model for cost calculation

Why?

Azure returns gpt-4 in the response when azure/gpt-4-1106-preview is used, so we were using gpt-4's pricing when calculating response_cost.

How to use - set base_model in your config.yaml

model_list:
  - model_name: azure-gpt-3.5
    litellm_params:
      model: azure/chatgpt-v-2
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"
    model_info:
      base_model: azure/gpt-4-1106-preview

View Cost calculated on Langfuse

This used the correct pricing for azure/gpt-4-1106-preview: (9 prompt tokens * $0.00001) + (28 completion tokens * $0.00003) = $0.00093.
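
Spelled out as a quick sanity check (token counts taken from the trace above):

# gpt-4-1106-preview pricing: $0.00001 per prompt token, $0.00003 per completion token
prompt_tokens, completion_tokens = 9, 28
cost = prompt_tokens * 0.00001 + completion_tokens * 0.00003
print(f"${cost:.5f}")  # $0.00093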

2. [Feat] Semantic Caching - Track Cost of using embedding, Use Langfuse Trace ID by @ishaan-jaff in #1878

  • If a trace_id is passed we'll place the semantic cache embedding call in the same trace
  • We now track cost for the API key that will make the embedding call for semantic caching

3. [Feat] Slack Alert when budget tracking fails by @ishaan-jaff in #1877


Full Changelog: v1.23.1...v1.23.2

v1.23.1

08 Feb 02:37

What's Changed

Full Changelog: v1.23.0...v1.23.1

v1.23.0

07 Feb 09:13

What's Changed

Full Changelog: v1.22.11...v1.23.0

v1.22.11

07 Feb 04:09

Full Changelog: v1.22.10...v1.22.11

v1.22.10

07 Feb 02:54

What's Changed

  • fix(proxy_server.py): do a health check on db before returning if proxy ready (if db connected) by @krrishdholakia in #1856
  • fix(utils.py): return finish reason for last vertex ai chunk by @krrishdholakia in #1847
  • fix(proxy/utils.py): if langfuse trace id passed in, include in slack alert by @krrishdholakia in #1839
  • [Feat] Budgets for 'user' param passed to /chat/completions, /embeddings etc by @ishaan-jaff in #1859

Semantic Caching Support - Add Semantic Caching to litellm💰 by @ishaan-jaff in #1829

Usage with Proxy

Step 1: Add cache to the config.yaml

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
  - model_name: azure-embedding-model
    litellm_params:
      model: azure/azure-embedding-model
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: "2023-07-01-preview"

litellm_settings:
  set_verbose: True
  cache: True          # set cache responses to True, litellm defaults to using a redis cache
  cache_params:
    type: "redis-semantic"  
    similarity_threshold: 0.8   # similarity threshold for semantic cache
    redis_semantic_cache_embedding_model: azure-embedding-model # set this to a model_name set in model_list

Step 2: Add Redis Credentials to .env

Set either REDIS_URL or REDIS_HOST in your os environment to enable caching.

REDIS_URL = ""        # REDIS_URL='redis://username:password@hostname:port/database'
## OR ## 
REDIS_HOST = ""       # REDIS_HOST='redis-18841.c274.us-east-1-3.ec2.cloud.redislabs.com'
REDIS_PORT = ""       # REDIS_PORT='18841'
REDIS_PASSWORD = ""   # REDIS_PASSWORD='liteLlmIsAmazing'

Additional kwargs
You can pass in any additional redis.Redis arg by storing the variable and value in your os environment, like this:

REDIS_<redis-kwarg-name> = ""
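
For example, redis.Redis accepts an ssl argument, so (assuming the same naming pattern) a TLS connection could be configured with:

REDIS_SSL = "True"   # maps to redis.Redis(ssl=True)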

Step 3: Run proxy with config

$ litellm --config /path/to/config.yaml

That's it!
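
To see the cache in action, send a request and then a semantically similar one - a minimal sketch, assuming the proxy runs on port 4000 and sk-1234 is a valid key:

curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer sk-1234' \
    --data '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "what is the capital of France?"}]
    }'

A second request with the same meaning (e.g. "tell me France's capital city") that scores above similarity_threshold should be served from the cache.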

(You'll see semantic-similarity on langfuse if you set langfuse as a success_callback)

Usage with litellm.completion

import os
import random

import litellm
from litellm import Cache, completion

litellm.cache = Cache(
    type="redis-semantic",
    host=os.environ["REDIS_HOST"],
    port=os.environ["REDIS_PORT"],
    password=os.environ["REDIS_PASSWORD"],
    similarity_threshold=0.8,
    redis_semantic_cache_embedding_model="text-embedding-ada-002",
)

random_number = random.randint(1, 100000)
response1 = completion(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": f"write a one sentence poem about: {random_number}",
        }
    ],
    max_tokens=20,
)
print(f"response1: {response1}")

random_number = random.randint(1, 100000)

response2 = completion(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": f"write a one sentence poem about: {random_number}",
        }
    ],
    max_tokens=20,
)
print(f"response2: {response2}")

# the two prompts differ only in the random number, so the semantic cache returns the same response
assert response1.id == response2.id

Budgets for 'user' param passed to /chat/completions, /embeddings etc

Set a budget for the 'user' param passed to /chat/completions, without needing to create a key for every user.
docs: https://docs.litellm.ai/docs/proxy/users

How to Use

  1. Define a litellm.max_user_budget on your config
litellm_settings:
  max_budget: 10      # global budget for proxy 
  max_user_budget: 0.0001 # budget for 'user' passed to /chat/completions
  2. Make a /chat/completions call, pass 'user' - the first call works
curl --location 'http://0.0.0.0:4000/chat/completions' \
        --header 'Content-Type: application/json' \
        --header 'Authorization: Bearer sk-zi5onDRdHGD24v0Zdn7VBA' \
        --data ' {
        "model": "azure-gpt-3.5",
        "user": "ishaan3",
        "messages": [
            {
            "role": "user",
            "content": "what time is it"
            }
        ]
        }'
  3. Make a /chat/completions call, pass 'user' - the call fails, since 'ishaan3' is over budget
curl --location 'http://0.0.0.0:4000/chat/completions' \
        --header 'Content-Type: application/json' \
        --header 'Authorization: Bearer sk-zi5onDRdHGD24v0Zdn7VBA' \
        --data ' {
        "model": "azure-gpt-3.5",
        "user": "ishaan3",
        "messages": [
            {
            "role": "user",
            "content": "what time is it"
            }
        ]
        }'

Error

{"error":{"message":"Authentication Error, ExceededBudget: User ishaan3 has exceeded their budget. Current spend: 0.0008869999999999999; Max Budget: 0.0001","type":"auth_error","param":"None","code":401}}%                

Full Changelog: v1.22.9...v1.22.10