
(bug fix) SpendLogs update DB catch all possible DB errors for retrying #7082

Merged: 6 commits into main, Dec 7, 2024

Conversation

@ishaan-jaff (Contributor) commented Dec 7, 2024

SpendLogs reliability improvements

  • The retry mechanism only caught `except httpx.ReadTimeout:`, so users saw unhandled `httpx.ConnectError` exceptions during partial DB outages; all connection-related DB errors are now caught for retrying (see the sketch below)
  • Fixes exponential backoff when updating SpendLogs in the DB
  • Correctly raises the error once the number of retries is exceeded
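
A minimal sketch of this pattern, assuming `DB_CONNECTION_ERROR_TYPES` covers the httpx connection errors named in the tests below; `_write_spend_logs_to_db` is a hypothetical helper, not litellm's actual internal function:

```python
import asyncio
import httpx

# Catch every connection-related error httpx can raise during a partial DB outage,
# not just ReadTimeout (assumed membership for illustration).
DB_CONNECTION_ERROR_TYPES = (httpx.ConnectError, httpx.ReadError, httpx.ReadTimeout)


async def update_spend_logs_with_retry(logs: list, max_retries: int = 3) -> None:
    """Retry the DB write on connection errors, backing off exponentially."""
    for attempt in range(max_retries + 1):
        try:
            await _write_spend_logs_to_db(logs)  # hypothetical DB write helper
            return
        except DB_CONNECTION_ERROR_TYPES as e:
            if attempt >= max_retries:
                # Re-raise once the retry budget is exhausted so callers see the failure.
                raise e
            await asyncio.sleep(2**attempt)  # exponential backoff: 1s, 2s, 4s, ...
```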

Relevant issues

Type

🐛 Bug Fix
✅ Test

Changes

[REQUIRED] Testing - Attach a screenshot of any new tests passing locally

If UI changes, send a screenshot/GIF of working UI fixes

  • Tests retry mechanism for various database connection errors (ConnectError, ReadError, ReadTimeout), verifying successful completion after multiple retries and proper transaction cleanup
  • Validates proper failure handling when maximum retry attempts are exceeded, ensuring correct error propagation and failure handler notification
  • Verifies immediate failure for non-connection related errors without retry attempts, confirming proper error handling and failure notification
  • Tests exponential backoff implementation between retry attempts, confirming correct delay intervals (1s, 2s) and successful completion
  • Validates processing of large batches of spend logs (150 logs in batches of 100), ensuring correct batch splitting, processing order, and complete log handling
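
A rough illustration of the batch splitting exercised by the last test above (a sketch with assumed helper names; the actual litellm implementation may differ):

```python
def split_logs_into_batches(logs: list, batch_size: int = 100) -> list[list]:
    """Split spend logs into fixed-size batches, preserving the original order."""
    return [logs[i : i + batch_size] for i in range(0, len(logs), batch_size)]


# 150 logs -> two batches: 100 logs, then the remaining 50, in order.
batches = split_logs_into_batches(list(range(150)))
assert [len(b) for b in batches] == [100, 50]
```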


vercel bot commented Dec 7, 2024

The latest updates on your projects:

litellm: ✅ Ready, updated Dec 7, 2024 10:16pm (UTC)


codecov bot commented Dec 7, 2024

Codecov Report

Attention: Patch coverage is 51.35135% with 18 lines in your changes missing coverage. Please review.

Files with missing lines: litellm/proxy/utils.py (patch coverage 47.05%, 18 lines missing ⚠️)


@BerriAI BerriAI deleted a comment from delve-auditor bot Dec 7, 2024
@BerriAI BerriAI deleted a comment from delve-auditor bot Dec 7, 2024
@BerriAI BerriAI deleted a comment from delve-auditor bot Dec 7, 2024
@ishaan-jaff ishaan-jaff merged commit c33cebb into main Dec 7, 2024
26 of 28 checks passed
krrishdholakia pushed a commit that referenced this pull request Dec 8, 2024
(bug fix) SpendLogs update DB catch all possible DB errors for retrying (#7082)

* catch DB_CONNECTION_ERROR_TYPES

* fix DB retry mechanism for SpendLog updates

* use DB_CONNECTION_ERROR_TYPES in auth checks

* fix exp back off for writing SpendLogs

* use _raise_failed_update_spend_exception to ensure errors print as NON blocking

* test_update_spend_logs_multiple_batches_with_failure
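
As a hedged sketch of the non-blocking error pattern named in the commit messages above, one plausible shape for `_raise_failed_update_spend_exception` (the real helper's signature and logging calls in litellm are not reproduced here; `error_logger` is a hypothetical failure-handler callback):

```python
import traceback


def _raise_failed_update_spend_exception(e: Exception, error_logger=None) -> None:
    """Log the spend-update failure as non-blocking, notify the failure handler, then re-raise."""
    error_msg = f"[Non-Blocking] Error updating spend in DB: {e}"
    print(error_msg)  # surfaced as a non-blocking log line
    traceback.print_exc()
    if error_logger is not None:
        error_logger(error_msg)  # hypothetical proxy failure handler callback
    raise e
```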
krrishdholakia added a commit that referenced this pull request Dec 8, 2024
* fix(main.py): support passing max retries to azure/openai embedding integrations

Fixes #7003

* feat(team_endpoints.py): allow updating team model aliases

Closes #6956

* feat(router.py): allow specifying model id as fallback - skips any cooldown check

Allows a default model to be checked if all models in cooldown

s/o @micahjsmith

* docs(reliability.md): add fallback to specific model to docs

* fix(utils.py): new 'is_prompt_caching_valid_prompt' helper util

Allows user to identify if messages/tools have prompt caching

Related issue: #6784

* feat(router.py): store model id for prompt caching valid prompt

Allows routing to that model id on subsequent requests

* fix(router.py): only cache if prompt is valid prompt caching prompt

prevents storing unnecessary items in cache

* feat(router.py): support routing prompt caching enabled models to previous deployments

Closes #6784

* test: fix linting errors

* feat(databricks/): convert basemodel to dict and exclude none values

allow passing pydantic message to databricks

* fix(utils.py): ensure all chat completion messages are dict

* (feat) Track `custom_llm_provider` in LiteLLMSpendLogs (#7081)

* add custom_llm_provider to SpendLogsPayload

* add custom_llm_provider to SpendLogs

* add custom llm provider to SpendLogs payload

* test_spend_logs_payload

* Add MLflow to the side bar (#7031)

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>

* (bug fix) SpendLogs update DB catch all possible DB errors for retrying  (#7082)

* catch DB_CONNECTION_ERROR_TYPES

* fix DB retry mechanism for SpendLog updates

* use DB_CONNECTION_ERROR_TYPES in auth checks

* fix exp back off for writing SpendLogs

* use _raise_failed_update_spend_exception to ensure errors print as NON blocking

* test_update_spend_logs_multiple_batches_with_failure

* (Feat) Add StructuredOutputs support for Fireworks.AI (#7085)

* fix model cost map fireworks ai "supports_response_schema": true,

* fix supports_response_schema

* fix map openai params fireworks ai

* test_map_response_format

* test_map_response_format

* added deepinfra/Meta-Llama-3.1-405B-Instruct (#7084)

* bump: version 1.53.9 → 1.54.0

* fix deepinfra

* litellm db fixes LiteLLM_UserTable (#7089)

* ci/cd queue new release

* fix llama-3.3-70b-versatile

* refactor - use consistent file naming convention `AI21/` -> `ai21`  (#7090)

* fix refactor - use consistent file naming convention

* ci/cd run again

* fix naming structure

* fix use consistent naming (#7092)

---------

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: Yuki Watanabe <31463517+B-Step62@users.noreply.github.com>
Co-authored-by: ali sayyah <ali.sayyah2@gmail.com>
rajatvig pushed a commit to rajatvig/litellm that referenced this pull request Jan 16, 2025
rajatvig pushed a commit to rajatvig/litellm that referenced this pull request Jan 16, 2025