[Feature]: Prompt caching friendly routing strategy #6784
Labels: enhancement (New feature or request)
Comments
Need a way to:
krrishdholakia added a commit that referenced this issue on Dec 7, 2024:
Allows user to identify if messages/tools have prompt caching. Related issue: #6784
krrishdholakia added a commit that referenced this issue on Dec 7, 2024:
feat(router.py): support routing prompt caching enabled models to previous deployments. Closes #6784
Merged
Thanks. As I mentioned, the current implementation of this new routing scheme will have limited impact in our environment: Azure OpenAI does not align prompt caching with message boundaries, whereas your current algorithm does.
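For illustration only (neither helper below is litellm code; the names are hypothetical), the distinction raised here is roughly the difference between keying cache affinity on whole messages versus on a fixed-length token prefix of the flattened prompt:

```python
# Illustration of the two keying strategies discussed above; hypothetical helpers.
import hashlib
import json


def key_by_message_boundaries(messages: list[dict], n_messages: int = 2) -> str:
    # Keys on the first N whole messages; can miss when the provider's cache
    # boundary falls mid-message, as reported for Azure OpenAI above.
    head = json.dumps(messages[:n_messages], sort_keys=True)
    return hashlib.sha256(head.encode()).hexdigest()


def key_by_token_prefix(token_ids: list[int], prefix_len: int = 1024) -> str:
    # Keys on a fixed-length token prefix, independent of message boundaries.
    head = ",".join(str(t) for t in token_ids[:prefix_len])
    return hashlib.sha256(head.encode()).hexdigest()
```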
Can you reopen this since you reverted the PR?
rajatvig pushed a commit to rajatvig/litellm that referenced this issue on Jan 16, 2025:
* fix(main.py): support passing max retries to azure/openai embedding integrations Fixes BerriAI#7003
* feat(team_endpoints.py): allow updating team model aliases Closes BerriAI#6956
* feat(router.py): allow specifying model id as fallback - skips any cooldown check Allows a default model to be checked if all models in cooldown s/o @micahjsmith
* docs(reliability.md): add fallback to specific model to docs
* fix(utils.py): new 'is_prompt_caching_valid_prompt' helper util Allows user to identify if messages/tools have prompt caching Related issue: BerriAI#6784
* feat(router.py): store model id for prompt caching valid prompt Allows routing to that model id on subsequent requests
* fix(router.py): only cache if prompt is valid prompt caching prompt prevents storing unnecessary items in cache
* feat(router.py): support routing prompt caching enabled models to previous deployments Closes BerriAI#6784
* test: fix linting errors
* feat(databricks/): convert basemodel to dict and exclude none values allow passing pydantic message to databricks
* fix(utils.py): ensure all chat completion messages are dict
* (feat) Track `custom_llm_provider` in LiteLLMSpendLogs (BerriAI#7081)
* add custom_llm_provider to SpendLogsPayload
* add custom_llm_provider to SpendLogs
* add custom llm provider to SpendLogs payload
* test_spend_logs_payload
* Add MLflow to the side bar (BerriAI#7031) Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
* (bug fix) SpendLogs update DB catch all possible DB errors for retrying (BerriAI#7082)
* catch DB_CONNECTION_ERROR_TYPES
* fix DB retry mechanism for SpendLog updates
* use DB_CONNECTION_ERROR_TYPES in auth checks
* fix exp back off for writing SpendLogs
* use _raise_failed_update_spend_exception to ensure errors print as NON blocking
* test_update_spend_logs_multiple_batches_with_failure
* (Feat) Add StructuredOutputs support for Fireworks.AI (BerriAI#7085)
* fix model cost map fireworks ai "supports_response_schema": true,
* fix supports_response_schema
* fix map openai params fireworks ai
* test_map_response_format
* test_map_response_format
* added deepinfra/Meta-Llama-3.1-405B-Instruct (BerriAI#7084)
* bump: version 1.53.9 → 1.54.0
* fix deepinfra
* litellm db fixes LiteLLM_UserTable (BerriAI#7089)
* ci/cd queue new release
* fix llama-3.3-70b-versatile
* refactor - use consistent file naming convention `AI21/` -> `ai21` (BerriAI#7090)
* fix refactor - use consistent file naming convention
* ci/cd run again
* fix naming structure
* fix use consistent naming (BerriAI#7092)
---------
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: Yuki Watanabe <31463517+B-Step62@users.noreply.github.com>
Co-authored-by: ali sayyah <ali.sayyah2@gmail.com>
The Feature
Prompt caching is harder to trigger when litellm load-balances across several deployments (using Azure as an example). If the litellm gateway is configured with, say, 3 deployments for a given model, it may take 3 or more calls before prompt caching kicks in and the cost savings and lower latency are realized. The more deployments a single model has, the more calls it takes to "warm up" the prompt cache on each deployment.
I am suggesting the following prompt-caching-friendly routing strategy, sketched below: whenever a prompt of over 1024 tokens is detected, litellm would cache the beginning of the prompt along with the model id it landed on. On subsequent calls whose prompts start with the same 1024 tokens, litellm would route the request to that cached model id. The cache entries would only need to live for as long as the prompt-caching TTL of the LLM providers themselves (which varies from 5 minutes to one hour).
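A minimal sketch of what such an affinity cache could look like, written against hypothetical names (PromptCacheRouter, pick_deployment) rather than litellm's actual router API; tiktoken is used here only as a convenient tokenizer, and message contents are assumed to be plain strings:

```python
# Hypothetical sketch of the proposed prompt-caching-aware routing.
# PromptCacheRouter and pick_deployment are illustrative names, not litellm APIs.
import hashlib
import random
import time

import tiktoken  # used only to count and slice tokens; any tokenizer would do

_ENC = tiktoken.get_encoding("cl100k_base")
PREFIX_TOKENS = 1024        # providers typically cache prefixes of >= 1024 tokens
AFFINITY_TTL_SECONDS = 300  # match the provider's prompt-cache TTL (5 minutes to 1 hour)


class PromptCacheRouter:
    def __init__(self, deployments: list[str]):
        self.deployments = deployments
        # prefix hash -> (deployment id, expiry timestamp)
        self._affinity: dict[str, tuple[str, float]] = {}

    def _prefix_key(self, messages: list[dict]) -> str | None:
        # Flatten message contents and hash the first 1024 tokens of the prompt.
        text = "".join(m.get("content") or "" for m in messages)
        tokens = _ENC.encode(text)
        if len(tokens) < PREFIX_TOKENS:
            return None  # too short to benefit from prompt caching
        return hashlib.sha256(str(tokens[:PREFIX_TOKENS]).encode()).hexdigest()

    def pick_deployment(self, messages: list[dict]) -> str:
        key = self._prefix_key(messages)
        now = time.time()
        if key is not None:
            hit = self._affinity.get(key)
            if hit is not None and hit[1] > now:
                return hit[0]  # route back to the deployment whose cache is warm
        # Otherwise fall back to normal load balancing and remember the choice.
        choice = random.choice(self.deployments)
        if key is not None:
            self._affinity[key] = (choice, now + AFFINITY_TTL_SECONDS)
        return choice
```

Keying on a hash of the token prefix (rather than on whole messages) keeps the affinity independent of message boundaries, and the TTL bounds how long requests keep sticking to one deployment before normal load balancing resumes.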
Motivation, pitch
Lower costs and lower latencies, with prompt caching that kicks in immediately and without sacrificing load balancing.
Twitter / LinkedIn details
https://www.linkedin.com/in/jeromeroussin/