Handle the transient errors in efcore when using Cosmos DB #28629

Closed
rmt2021 opened this issue Aug 8, 2022 · 2 comments · Fixed by #28686
Labels
area-cosmos closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. community-contribution customer-reported type-enhancement

Comments

@rmt2021
Contributor

rmt2021 commented Aug 8, 2022

Transient errors can occur when using efcore with Cosmos DB, and efcore does not currently seem to handle them (for example, 410 Gone). Would it be better to retry on the transient error codes? Otherwise, APIs such as CreateDatabaseIfNotExistsAsync, CreateContainerStreamAsync, CreateItemStreamAsync, ReadNextAsync, DeleteItemStreamAsync, ReplaceItemStreamAsync, and ReadItemStreamAsync fail immediately on a transient fault.

As this document shows, there are some transient error codes (408, 410, 429, 449, and 503) that we can retry to make efcore more resilient:
https://docs.microsoft.com/en-us/azure/cosmos-db/sql/conceptual-resilient-sdk-applications#should-my-application-retry-on-errors
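To illustrate the kind of retry being asked for, here is a minimal sketch of a retry loop with exponential backoff around an operation that reports an HTTP status code. `TransientRetry`, `ExecuteAsync`, and all parameter names are hypothetical and are not part of efcore or the Cosmos SDK; which codes count as transient is left to the caller-supplied predicate.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

public static class TransientRetry
{
    // Hypothetical helper: runs `operation`, and re-runs it while the status
    // code extracted from its result is considered transient by `isTransient`,
    // up to `maxAttempts` attempts in total.
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        Func<T, HttpStatusCode> getStatusCode,
        Func<HttpStatusCode, bool> isTransient,
        int maxAttempts = 3,
        TimeSpan? baseDelay = null)
    {
        var delay = baseDelay ?? TimeSpan.FromMilliseconds(100);

        for (var attempt = 1; ; attempt++)
        {
            var result = await operation();

            // Return on a non-transient status, or when attempts are exhausted.
            if (!isTransient(getStatusCode(result)) || attempt >= maxAttempts)
            {
                return result;
            }

            await Task.Delay(delay);
            delay = delay + delay; // double the wait before the next attempt
        }
    }
}
```

A production version would probably also honor any retry-after hint returned with a 429 response and add jitter to the delay; both are omitted here to keep the sketch short.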

@roji
Member

roji commented Aug 9, 2022

@rmt2021 see the conversation in #8443 (comment). As the docs you linked to specify, the error codes you listed should be retried by the SDK, which the EF provider uses. Are you seeing a different behavior?

@rmt2021
Contributor Author

rmt2021 commented Aug 12, 2022

Thanks for sharing this conversation @roji.

The SDK only retries on these error codes when multiple regions are available. For example, if a 503 occurs and the Cosmos DB account has only one region, no retry happens, as the SDK code shows:
https://github.com/Azure/azure-cosmos-dotnet-v3/blob/b713ce4cb3e482175a8a6a9b8fc7051d9c7b5e91/Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs#L373-L378

I notice that efcore actually has this code snippet to retry for transient error codes:

    static bool IsTransient(HttpStatusCode statusCode)
        => statusCode == HttpStatusCode.ServiceUnavailable
            || statusCode == HttpStatusCode.TooManyRequests;

This retries only on 503 (ServiceUnavailable) and 429 (TooManyRequests).

I believe we should also retry on 408, 410 and 449.
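As a sketch of what that could look like (not necessarily the change that was eventually merged), the check could be extended as follows. The wrapper class name `TransientErrors` is hypothetical; 449 has no named member in `HttpStatusCode`, so it is cast from its numeric value.

```csharp
using System.Net;

public static class TransientErrors
{
    // Treats the five status codes discussed in this issue as transient.
    public static bool IsTransient(HttpStatusCode statusCode)
        => statusCode == HttpStatusCode.ServiceUnavailable   // 503
            || statusCode == HttpStatusCode.TooManyRequests  // 429
            || statusCode == HttpStatusCode.RequestTimeout   // 408
            || statusCode == HttpStatusCode.Gone             // 410
            || statusCode == (HttpStatusCode)449;            // 449: no named enum member
}
```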

AndriySvyryd pushed a commit that referenced this issue Aug 12, 2022
@AndriySvyryd AndriySvyryd added this to the 7.0.0-rc1 milestone Aug 12, 2022
@ajcvickers ajcvickers added the closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. label Aug 14, 2022
@ajcvickers ajcvickers modified the milestones: 7.0.0-rc1, 7.0.0 Nov 5, 2022