[Feature]: Prompt caching friendly routing strategy #6784
Labels: enhancement (New feature or request)
Comments
Need a way to:
krrishdholakia added a commit that referenced this issue on Dec 7, 2024:
Allows user to identify if messages/tools have prompt caching. Related issue: #6784
krrishdholakia added a commit that referenced this issue on Dec 7, 2024:
feat(router.py): support routing prompt caching enabled models to previous deployments. Closes #6784
Merged
Thanks. As I mentioned, the current implementation of this new routing scheme will have limited impact in our environment: Azure OpenAI does not align prompt caching with message boundaries, whereas your current algorithm does.
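For illustration only (neither helper below is litellm code; the names are hypothetical), the distinction raised here is roughly the difference between keying cache affinity on whole messages versus on a fixed-length token prefix of the flattened prompt:

```python
# Illustration of the two keying strategies discussed above; hypothetical helpers.
import hashlib
import json


def key_by_message_boundaries(messages: list[dict], n_messages: int = 2) -> str:
    # Keys on the first N whole messages; can miss when the provider's cache
    # boundary falls mid-message, as reported for Azure OpenAI above.
    head = json.dumps(messages[:n_messages], sort_keys=True)
    return hashlib.sha256(head.encode()).hexdigest()


def key_by_token_prefix(token_ids: list[int], prefix_len: int = 1024) -> str:
    # Keys on a fixed-length token prefix, independent of message boundaries.
    head = ",".join(str(t) for t in token_ids[:prefix_len])
    return hashlib.sha256(head.encode()).hexdigest()
```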
Can you reopen this since you reverted the PR?
rajatvig pushed a commit to rajatvig/litellm that referenced this issue on Jan 16, 2025:
* fix(main.py): support passing max retries to azure/openai embedding integrations Fixes BerriAI#7003
* feat(team_endpoints.py): allow updating team model aliases Closes BerriAI#6956
* feat(router.py): allow specifying model id as fallback - skips any cooldown check Allows a default model to be checked if all models in cooldown s/o @micahjsmith
* docs(reliability.md): add fallback to specific model to docs
* fix(utils.py): new 'is_prompt_caching_valid_prompt' helper util Allows user to identify if messages/tools have prompt caching Related issue: BerriAI#6784
* feat(router.py): store model id for prompt caching valid prompt Allows routing to that model id on subsequent requests
* fix(router.py): only cache if prompt is valid prompt caching prompt prevents storing unnecessary items in cache
* feat(router.py): support routing prompt caching enabled models to previous deployments Closes BerriAI#6784
* test: fix linting errors
* feat(databricks/): convert basemodel to dict and exclude none values allow passing pydantic message to databricks
* fix(utils.py): ensure all chat completion messages are dict
* (feat) Track `custom_llm_provider` in LiteLLMSpendLogs (BerriAI#7081)
* add custom_llm_provider to SpendLogsPayload
* add custom_llm_provider to SpendLogs
* add custom llm provider to SpendLogs payload
* test_spend_logs_payload
* Add MLflow to the side bar (BerriAI#7031) Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
* (bug fix) SpendLogs update DB catch all possible DB errors for retrying (BerriAI#7082)
* catch DB_CONNECTION_ERROR_TYPES
* fix DB retry mechanism for SpendLog updates
* use DB_CONNECTION_ERROR_TYPES in auth checks
* fix exp back off for writing SpendLogs
* use _raise_failed_update_spend_exception to ensure errors print as NON blocking
* test_update_spend_logs_multiple_batches_with_failure
* (Feat) Add StructuredOutputs support for Fireworks.AI (BerriAI#7085)
* fix model cost map fireworks ai "supports_response_schema": true,
* fix supports_response_schema
* fix map openai params fireworks ai
* test_map_response_format
* test_map_response_format
* added deepinfra/Meta-Llama-3.1-405B-Instruct (BerriAI#7084)
* bump: version 1.53.9 → 1.54.0
* fix deepinfra
* litellm db fixes LiteLLM_UserTable (BerriAI#7089)
* ci/cd queue new release
* fix llama-3.3-70b-versatile
* refactor - use consistent file naming convention `AI21/` -> `ai21` (BerriAI#7090)
* fix refactor - use consistent file naming convention
* ci/cd run again
* fix naming structure
* fix use consistent naming (BerriAI#7092)
---------
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: Yuki Watanabe <31463517+B-Step62@users.noreply.github.com>
Co-authored-by: ali sayyah <ali.sayyah2@gmail.com>
The Feature
Prompt caching is harder to trigger when litellm load-balances across several deployments (using Azure as an example). If the litellm gateway is configured with, say, 3 deployments for a given model, it may take 3 or more calls before prompt caching kicks in and the cost savings and lower latency are realized. The more deployments a single model has, the more calls it takes to "warm up" the prompt cache on each deployment.
I am suggesting the following prompt-caching-friendly routing strategy, sketched below: whenever a prompt of over 1024 tokens is detected, litellm would cache the beginning of the prompt along with the model id it landed on. On subsequent calls whose prompts start with the same 1024 tokens, litellm would route the request to that cached model id. The cache entries would only need to live for as long as the prompt-caching TTL of the LLM providers themselves (which varies from 5 minutes to one hour).
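A minimal sketch of what such an affinity cache could look like, written against hypothetical names (PromptCacheRouter, pick_deployment) rather than litellm's actual router API; tiktoken is used here only as a convenient tokenizer, and message contents are assumed to be plain strings:

```python
# Hypothetical sketch of the proposed prompt-caching-aware routing.
# PromptCacheRouter and pick_deployment are illustrative names, not litellm APIs.
import hashlib
import random
import time

import tiktoken  # used only to count and slice tokens; any tokenizer would do

_ENC = tiktoken.get_encoding("cl100k_base")
PREFIX_TOKENS = 1024        # providers typically cache prefixes of >= 1024 tokens
AFFINITY_TTL_SECONDS = 300  # match the provider's prompt-cache TTL (5 minutes to 1 hour)


class PromptCacheRouter:
    def __init__(self, deployments: list[str]):
        self.deployments = deployments
        # prefix hash -> (deployment id, expiry timestamp)
        self._affinity: dict[str, tuple[str, float]] = {}

    def _prefix_key(self, messages: list[dict]) -> str | None:
        # Flatten message contents and hash the first 1024 tokens of the prompt.
        text = "".join(m.get("content") or "" for m in messages)
        tokens = _ENC.encode(text)
        if len(tokens) < PREFIX_TOKENS:
            return None  # too short to benefit from prompt caching
        return hashlib.sha256(str(tokens[:PREFIX_TOKENS]).encode()).hexdigest()

    def pick_deployment(self, messages: list[dict]) -> str:
        key = self._prefix_key(messages)
        now = time.time()
        if key is not None:
            hit = self._affinity.get(key)
            if hit is not None and hit[1] > now:
                return hit[0]  # route back to the deployment whose cache is warm
        # Otherwise fall back to normal load balancing and remember the choice.
        choice = random.choice(self.deployments)
        if key is not None:
            self._affinity[key] = (choice, now + AFFINITY_TTL_SECONDS)
        return choice
```

Keying on a hash of the token prefix (rather than on whole messages) keeps the affinity independent of message boundaries, and the TTL bounds how long requests keep sticking to one deployment before normal load balancing resumes.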
Motivation, pitch
Lower costs and lower latencies, with prompt caching that kicks in immediately and without sacrificing load balancing.
Twitter / LinkedIn details
https://www.linkedin.com/in/jeromeroussin/