AggregateException (instead of CosmosException) being thrown on GetFeedRanges when Gateway fails #4528
Comments
Can you please attach the full exception? Normally, TransportExceptions materialize as a public CosmosException with 503 as the status code. DocumentClientException is still there, but it is internal; that is expected. The key part is understanding what the upper-most type was, which should be CosmosException (regardless of the InnerException property value).
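For reference, a minimal caller-side sketch of that expected surface (the helper name and logging are illustrative only): catch CosmosException and branch on StatusCode, without ever referencing the internal DocumentClientException.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class FeedRangeProbe
{
    // Illustrative helper: if transport failures surface as a CosmosException
    // with StatusCode 503, this is all the caller-side handling that is needed.
    public static async Task<IReadOnlyList<FeedRange>> GetFeedRangesSafelyAsync(Container container)
    {
        try
        {
            return await container.GetFeedRangesAsync();
        }
        catch (CosmosException ce) when (ce.StatusCode == HttpStatusCode.ServiceUnavailable)
        {
            // 503 is transient: log the diagnostics and let the caller decide whether to retry.
            Console.WriteLine(ce.Diagnostics?.ToString());
            throw;
        }
    }
}
```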
@ealsur Thanks for getting back to me on this. System.AggregateException: One or more errors occurred. (Channel is closed …
@albertofori This seems to be a failure on the service.
This means that there was a response from the Cosmos DB Gateway endpoint, and the content of that Gateway response is what contains the AggregateException. The body of the Gateway response is what is being printed here; these TCP errors are not happening on the client. It sounds like the Gateway service is being overly verbose in including the failure. The response was a ServiceUnavailable error (503).
Using the details in this error, I can see the same exception details in the service logs. From the SDK side, this is a Service Unavailable and should be treated as such: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/conceptual-resilient-sdk-applications#timeouts-and-connectivity-related-failures-http-408503 But there are no SDK changes that can prevent the service from hitting this error or producing a body containing these DocumentClientExceptions. Looking at the service logs, there also appear to be only two failures in a 24h period.
If these are happening more frequently, please file a support ticket. They seem to be transient failures on the service, but the volume does not seem to be affecting the SLA.
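Following the linked guidance, a minimal retry sketch for those transient status codes (408 and 503); the attempt count and backoff are assumptions for illustration, not recommended values.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class TransientRetry
{
    // Retries an operation a few times when it fails with 503 (ServiceUnavailable)
    // or 408 (RequestTimeout), the status codes called out in the resilience doc.
    public static async Task<T> ExecuteAsync<T>(Func<Task<T>> operation, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (CosmosException ce) when (
                attempt < maxAttempts &&
                (ce.StatusCode == HttpStatusCode.ServiceUnavailable ||
                 ce.StatusCode == HttpStatusCode.RequestTimeout))
            {
                // Simple linear backoff for the sketch.
                await Task.Delay(TimeSpan.FromSeconds(attempt));
            }
        }
    }
}
```

Usage would be along the lines of `var ranges = await TransientRetry.ExecuteAsync(() => container.GetFeedRangesAsync());`.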
Accidentally closed
@ealsur Thanks!
I would also expect the outcome to be a CosmosException. It seems we do account for similar cases:
But in this case, you are performing a GetFeedRanges call, which is purely a metadata operation. The gap might be that there is no handling of these potential failures here, as the operation does not flow through the Handler pipeline. Reference: azure-cosmos-dotnet-v3/Microsoft.Azure.Cosmos/src/Resource/Container/ContainerCore.cs Line 273 in 18a677a
In this case, it would be ideal to avoid the AggregateException on the AsyncNonBlockingCache or do the conversion on GetFeedRanges. I'll tag this issue appropriately to signal it needs addressing.
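As a rough caller-side illustration of that conversion idea (not the SDK's actual fix; the 503/20006 mapping is an assumption taken from the diagnostics in this issue, and it relies on the public CosmosException constructor that takes message, status code, sub-status code, activity id, and request charge):

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class FeedRangeExtensions
{
    // Wraps GetFeedRangesAsync so that any non-CosmosException failure is
    // rethrown as a CosmosException, letting callers rely on StatusCode.
    public static async Task<IReadOnlyList<FeedRange>> GetFeedRangesAsCosmosAsync(this Container container)
    {
        try
        {
            return await container.GetFeedRangesAsync();
        }
        catch (CosmosException)
        {
            throw; // Already the expected shape.
        }
        catch (Exception ex)
        {
            // 503 with sub-status 20006 (ConnectionBroken) mirrors the trace in this issue.
            throw new CosmosException(ex.Message, HttpStatusCode.ServiceUnavailable, 20006, string.Empty, 0);
        }
    }
}
```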
The "reference" however (what is mentioned in the title) cannot be removed, as it's part of the content of the Gateway response and internal. |
Sounds good! Getting this as a CosmosException would then make handling of the error more consistent with other errors via the StatusCode property, without having to dive into the InnerException. Thanks a lot @ealsur!
Understood, and I agree with this as it provides more information on the underlying cause.
Hi @ealsur, I am wondering if any of the updates after version 3.33.1 address this issue. I just wanted to confirm whether this behavior should no longer be expected with the latest version of the Microsoft.Azure.Cosmos package.
I don't see any PRs that fixed this issue yet. |
@ealsur We are currently seeing more of these errors in our microservices. We have logic to handle the inner exceptions of an AggregateException, but since the inner DocumentClientException is not public in the v3 SDK, we cannot write appropriate error handling logic for this particular 503 error, and our services crash as a result of this inconsistency. Since we already agreed on a proposed fix to surface this as a 503 CosmosException (similar to other requests that are not feed range queries), is it possible to commit to this fix in the next release?
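Purely as a hypothetical interim workaround (brittle by design, because the inner type cannot be referenced while it is internal), the AggregateException can be flattened and its inner exceptions matched by type name and message:

```csharp
using System;
using System.Linq;

public static class TransientFailureDetector
{
    // Returns true when the exception (or any flattened inner exception) looks
    // like the internal DocumentClientException describing a broken channel.
    public static bool IsLikelyTransientGatewayFailure(Exception ex)
    {
        if (ex is AggregateException aggregate)
        {
            return aggregate.Flatten().InnerExceptions.Any(IsLikelyTransientGatewayFailure);
        }

        return ex.GetType().Name == "DocumentClientException"
            && ex.Message.IndexOf("Channel is closed", StringComparison.Ordinal) >= 0;
    }
}
```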
@albertofori In our attempts to repro, the only thing we could find is that the GetFeedRangesAsync call would throw a DocumentClientException, which should instead be a CosmosException. But there is no AggregateException. Could the AggregateException be coming from the fact that you are receiving the outcome of this call in something like a ContinueWith?
Hi @ealsur, indeed, we do use a ContinueWith block, so the top-level AggregateException might come from our own flow. So the SDK might be directly throwing a DocumentClientException, as you have reproed. So, as you mentioned, a change to throw a CosmosException instead would address this for us.
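For completeness, a small self-contained demo of why the top-level type differs: awaiting a faulted task rethrows the original exception, while reading Task.Result (as ContinueWith continuations often do) wraps it in an AggregateException.

```csharp
using System;
using System.Threading.Tasks;

public static class ExceptionUnwrappingDemo
{
    public static async Task RunAsync()
    {
        Task<int> faulted = Task.FromException<int>(new InvalidOperationException("boom"));

        try
        {
            _ = await faulted; // rethrows InvalidOperationException directly
        }
        catch (InvalidOperationException) { /* original type, no wrapper */ }

        try
        {
            _ = faulted.Result; // wraps the failure in an AggregateException
        }
        catch (AggregateException ae) when (ae.InnerException is InvalidOperationException) { /* wrapped */ }
    }
}
```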
Thanks @ealsur! May I ask when the next release of the SDK will be?
We are using Cosmos DB SDK version 3.33.1. We currently see a Microsoft.Azure.Documents.DocumentClientException as the immediate inner exception under an AggregateException that is thrown when there is a network issue.
This seems to be a reference to a V2 SDK exception type which is not directly exposed via the current SDK.
Is it intended that such an exception is exposed directly within the V3 SDK's AggregateException under these circumstances?
Please find the stack trace below (the first inner exception can be found on the last line):
System.AggregateException: One or more errors occurred. (Channel is closed ActivityId: 29bda807-7374-4033-a295-dd3ba89246ab, RequestStartTime: 2024-05-31T17:07:41.1592990Z, RequestEndTime: 2024-05-31T17:07:47.5512837Z, Number of regions attempted:1 {"systemHistory":[{"dateUtc":"2024-05-31T17:06:53.5338902Z","cpu":0.452,"memory":663574236.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0278,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:03.5440037Z","cpu":0.232,"memory":663573756.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.1174,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:13.5537779Z","cpu":0.124,"memory":663597240.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0793,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:23.5635773Z","cpu":0.295,"memory":663585964.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.2314,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:33.5737057Z","cpu":0.129,"memory":663588336.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0326,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:43.5834609Z","cpu":0.176,"memory":663585136.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.246,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401}]} RequestStart: 2024-05-31T17:07:41.1595176Z; ResponseTime: 2024-05-31T17:07:47.5512837Z; StoreResult: StorePhysicalAddress: rntbd://10.0.1.17:11000/apps/3036edb8-a5b7-4779-89ab-9a0eb0a2340f/services/9c308ccc-4819-4bac-ad9a-8078f1783b80/partitions/a97a7a26-06e5-4e9d-92ee-489d1a774bc2/replicas/133574518272615524s, LSN: -1, GlobalCommittedLsn: -1, PartitionKeyRangeId: , IsValid: False, StatusCode: 503, SubStatusCode: 20006, RequestCharge: 0, ItemLSN: -1, SessionToken: , UsingLocalLSN: False, TransportException: A client transport error occurred: The connection failed. 
(Time: 2024-05-31T17:07:47.5484179Z, activity ID: 29bda807-7374-4033-a295-dd3ba89246ab, error code: ConnectionBroken [0x0012], base error: socket error TimedOut [0x0000274C], URI: rntbd://10.0.1.17:11000/apps/3036edb8-a5b7-4779-89ab-9a0eb0a2340f/services/9c308ccc-4819-4bac-ad9a-8078f1783b80/partitions/a97a7a26-06e5-4e9d-92ee-489d1a774bc2/replicas/133574518272615524s, connection: 10.0.1.8:47632 -> 10.0.1.17:11000, payload sent: True), BELatencyMs: , ActivityId: 29bda807-7374-4033-a295-dd3ba89246ab, RetryAfterInMs: , ReplicaHealthStatuses: [(port: 11300 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11000 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11300 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11000 | status: Unknown | lkt: 5/31/2024 5:07:41 PM)], TransportRequestTimeline: {"requestTimeline":[{"event": "Created", "startTimeUtc": "2024-05-31T17:07:41.1594362Z", "durationInMs": 0.0166},{"event": "ChannelAcquisitionStarted", "startTimeUtc": "2024-05-31T17:07:41.1594528Z", "durationInMs": 0.0086},{"event": "Pipelined", "startTimeUtc": "2024-05-31T17:07:41.1594614Z", "durationInMs": 0.0422},{"event": "Transit Time", "startTimeUtc": "2024-05-31T17:07:41.1595036Z", "durationInMs": 6389.3898},{"event": "Failed", "startTimeUtc": "2024-05-31T17:07:47.5488934Z", "durationInMs": 0}],"serviceEndpointStats":{"inflightRequests":8,"openConnections":1},"connectionStats":{"waitforConnectionInit":"False","callsPendingReceive":7,"lastSendAttempt":"2024-05-31T17:07:40.7640721Z","lastSend":"2024-05-31T17:07:40.7640802Z","lastReceive":"2024-05-31T17:07:26.7535939Z"},"requestSizeInBytes":725,"requestBodySizeInBytes":275}; ResourceType: DatabaseAccount, OperationType: MetadataCheckAccess , Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Linux/2.0 cosmos-netstandard-sdk/3.33.1) ---> Microsoft.Azure.Documents.DocumentClientException: Channel is closed ActivityId: ....