
AggregateException (instead of CosmosException) being thrown on GetFeedRanges when Gateway fails #4528

Closed
albertofori opened this issue Jun 4, 2024 · 17 comments · Fixed by #4640
Labels: bug (Something isn't working), customer-reported (Issue created by a customer)

Comments

albertofori (Member) commented Jun 4, 2024

We are using Cosmos DB SDK version 3.33.1. We currently see a Microsoft.Azure.Documents.DocumentClientException as the immediate inner exception of an AggregateException that is thrown when there is a network issue.

This appears to be an exception type from the V2 SDK, which is not directly exposed via the current SDK. Is it intended that such an exception is surfaced directly inside a V3 SDK AggregateException under these circumstances?

Please find the stack trace below (the first inner exception is on the last line):

System.AggregateException: One or more errors occurred. (Channel is closed ActivityId: 29bda807-7374-4033-a295-dd3ba89246ab, RequestStartTime: 2024-05-31T17:07:41.1592990Z, RequestEndTime: 2024-05-31T17:07:47.5512837Z, Number of regions attempted:1 {"systemHistory":[{"dateUtc":"2024-05-31T17:06:53.5338902Z","cpu":0.452,"memory":663574236.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0278,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:03.5440037Z","cpu":0.232,"memory":663573756.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.1174,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:13.5537779Z","cpu":0.124,"memory":663597240.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0793,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:23.5635773Z","cpu":0.295,"memory":663585964.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.2314,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:33.5737057Z","cpu":0.129,"memory":663588336.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0326,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:43.5834609Z","cpu":0.176,"memory":663585136.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.246,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401}]} RequestStart: 2024-05-31T17:07:41.1595176Z; ResponseTime: 2024-05-31T17:07:47.5512837Z; StoreResult: StorePhysicalAddress: rntbd://10.0.1.17:11000/apps/3036edb8-a5b7-4779-89ab-9a0eb0a2340f/services/9c308ccc-4819-4bac-ad9a-8078f1783b80/partitions/a97a7a26-06e5-4e9d-92ee-489d1a774bc2/replicas/133574518272615524s, LSN: -1, GlobalCommittedLsn: -1, PartitionKeyRangeId: , IsValid: False, StatusCode: 503, SubStatusCode: 20006, RequestCharge: 0, ItemLSN: -1, SessionToken: , UsingLocalLSN: False, TransportException: A client transport error occurred: The connection failed. 
(Time: 2024-05-31T17:07:47.5484179Z, activity ID: 29bda807-7374-4033-a295-dd3ba89246ab, error code: ConnectionBroken [0x0012], base error: socket error TimedOut [0x0000274C], URI: rntbd://10.0.1.17:11000/apps/3036edb8-a5b7-4779-89ab-9a0eb0a2340f/services/9c308ccc-4819-4bac-ad9a-8078f1783b80/partitions/a97a7a26-06e5-4e9d-92ee-489d1a774bc2/replicas/133574518272615524s, connection: 10.0.1.8:47632 -> 10.0.1.17:11000, payload sent: True), BELatencyMs: , ActivityId: 29bda807-7374-4033-a295-dd3ba89246ab, RetryAfterInMs: , ReplicaHealthStatuses: [(port: 11300 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11000 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11300 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11000 | status: Unknown | lkt: 5/31/2024 5:07:41 PM)], TransportRequestTimeline: {"requestTimeline":[{"event": "Created", "startTimeUtc": "2024-05-31T17:07:41.1594362Z", "durationInMs": 0.0166},{"event": "ChannelAcquisitionStarted", "startTimeUtc": "2024-05-31T17:07:41.1594528Z", "durationInMs": 0.0086},{"event": "Pipelined", "startTimeUtc": "2024-05-31T17:07:41.1594614Z", "durationInMs": 0.0422},{"event": "Transit Time", "startTimeUtc": "2024-05-31T17:07:41.1595036Z", "durationInMs": 6389.3898},{"event": "Failed", "startTimeUtc": "2024-05-31T17:07:47.5488934Z", "durationInMs": 0}],"serviceEndpointStats":{"inflightRequests":8,"openConnections":1},"connectionStats":{"waitforConnectionInit":"False","callsPendingReceive":7,"lastSendAttempt":"2024-05-31T17:07:40.7640721Z","lastSend":"2024-05-31T17:07:40.7640802Z","lastReceive":"2024-05-31T17:07:26.7535939Z"},"requestSizeInBytes":725,"requestBodySizeInBytes":275}; ResourceType: DatabaseAccount, OperationType: MetadataCheckAccess , Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Linux/2.0 cosmos-netstandard-sdk/3.33.1) ---> Microsoft.Azure.Documents.DocumentClientException: Channel is closed ActivityId: ....

@albertofori albertofori changed the title Reference to exceptions from V2 library in V3 errors. Reference to exceptions from V2 library in V3 AggregateException. Jun 4, 2024
ealsur (Member) commented Jun 5, 2024

Can you please attach the full exception? Normally TransportExceptions materialize as a public CosmosException with a 503 status code.

DocumentClientException is still there, but internal; that is expected. The key part is understanding what the upper-most type was, which should be CosmosException (regardless of the InnerException property value).
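For illustration, this is the handling pattern a public CosmosException enables. A minimal sketch, assuming a hypothetical container and item type:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Inside an async method; "container" is a hypothetical Container instance.
try
{
    ItemResponse<dynamic> response = await container.ReadItemAsync<dynamic>(
        id: "item-id",
        partitionKey: new PartitionKey("pk-value"));
}
catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.ServiceUnavailable)
{
    // 503: transient connectivity failure. Capture the diagnostics, then retry or fail over.
    Console.WriteLine(ex.Diagnostics.ToString());
}
```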

albertofori (Member, Author) commented:

@ealsur Thanks for getting back to me on this.
Below is the full exception. The upper-most type appears to be an AggregateException, which is very generic; with inner exceptions that are internal, handling this exception becomes less straightforward than it would typically be.

System.AggregateException: One or more errors occurred. (Channel is closed
ActivityId: 29bda807-7374-4033-a295-dd3ba89246ab,
RequestStartTime: 2024-05-31T17:07:41.1592990Z, RequestEndTime: 2024-05-31T17:07:47.5512837Z, Number of regions attempted:1
{"systemHistory":[{"dateUtc":"2024-05-31T17:06:53.5338902Z","cpu":0.452,"memory":663574236.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0278,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:03.5440037Z","cpu":0.232,"memory":663573756.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.1174,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:13.5537779Z","cpu":0.124,"memory":663597240.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0793,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:23.5635773Z","cpu":0.295,"memory":663585964.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.2314,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:33.5737057Z","cpu":0.129,"memory":663588336.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0326,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:43.5834609Z","cpu":0.176,"memory":663585136.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.246,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401}]}
RequestStart: 2024-05-31T17:07:41.1595176Z; ResponseTime: 2024-05-31T17:07:47.5512837Z; StoreResult: StorePhysicalAddress: rntbd://10.0.1.17:11000/apps/3036edb8-a5b7-4779-89ab-9a0eb0a2340f/services/9c308ccc-4819-4bac-ad9a-8078f1783b80/partitions/a97a7a26-06e5-4e9d-92ee-489d1a774bc2/replicas/133574518272615524s, LSN: -1, GlobalCommittedLsn: -1, PartitionKeyRangeId: , IsValid: False, StatusCode: 503, SubStatusCode: 20006, RequestCharge: 0, ItemLSN: -1, SessionToken: , UsingLocalLSN: False, TransportException: A client transport error occurred: The connection failed. (Time: 2024-05-31T17:07:47.5484179Z, activity ID: 29bda807-7374-4033-a295-dd3ba89246ab, error code: ConnectionBroken [0x0012], base error: socket error TimedOut [0x0000274C], URI: rntbd://10.0.1.17:11000/apps/3036edb8-a5b7-4779-89ab-9a0eb0a2340f/services/9c308ccc-4819-4bac-ad9a-8078f1783b80/partitions/a97a7a26-06e5-4e9d-92ee-489d1a774bc2/replicas/133574518272615524s, connection: 10.0.1.8:47632 -> 10.0.1.17:11000, payload sent: True), BELatencyMs: , ActivityId: 29bda807-7374-4033-a295-dd3ba89246ab, RetryAfterInMs: , ReplicaHealthStatuses: [(port: 11300 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11000 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11300 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11000 | status: Unknown | lkt: 5/31/2024 5:07:41 PM)], TransportRequestTimeline: {"requestTimeline":[{"event": "Created", "startTimeUtc": "2024-05-31T17:07:41.1594362Z", "durationInMs": 0.0166},{"event": "ChannelAcquisitionStarted", "startTimeUtc": "2024-05-31T17:07:41.1594528Z", "durationInMs": 0.0086},{"event": "Pipelined", "startTimeUtc": "2024-05-31T17:07:41.1594614Z", "durationInMs": 0.0422},{"event": "Transit Time", "startTimeUtc": "2024-05-31T17:07:41.1595036Z", "durationInMs": 6389.3898},{"event": "Failed", "startTimeUtc": "2024-05-31T17:07:47.5488934Z", "durationInMs": 0}],"serviceEndpointStats":{"inflightRequests":8,"openConnections":1},"connectionStats":{"waitforConnectionInit":"False","callsPendingReceive":7,"lastSendAttempt":"2024-05-31T17:07:40.7640721Z","lastSend":"2024-05-31T17:07:40.7640802Z","lastReceive":"2024-05-31T17:07:26.7535939Z"},"requestSizeInBytes":725,"requestBodySizeInBytes":275};
ResourceType: DatabaseAccount, OperationType: MetadataCheckAccess
, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Linux/2.0 cosmos-netstandard-sdk/3.33.1)
---> Microsoft.Azure.Documents.DocumentClientException: Channel is closed
ActivityId: 29bda807-7374-4033-a295-dd3ba89246ab,
RequestStartTime: 2024-05-31T17:07:41.1592990Z, RequestEndTime: 2024-05-31T17:07:47.5512837Z, Number of regions attempted:1
{"systemHistory":[{"dateUtc":"2024-05-31T17:06:53.5338902Z","cpu":0.452,"memory":663574236.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0278,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:03.5440037Z","cpu":0.232,"memory":663573756.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.1174,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:13.5537779Z","cpu":0.124,"memory":663597240.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0793,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:23.5635773Z","cpu":0.295,"memory":663585964.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.2314,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:33.5737057Z","cpu":0.129,"memory":663588336.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0326,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401},{"dateUtc":"2024-05-31T17:07:43.5834609Z","cpu":0.176,"memory":663585136.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.246,"availableThreads":32765,"minThreads":64,"maxThreads":32767},"numberOfOpenTcpConnection":401}]}
RequestStart: 2024-05-31T17:07:41.1595176Z; ResponseTime: 2024-05-31T17:07:47.5512837Z; StoreResult: StorePhysicalAddress: rntbd://10.0.1.17:11000/apps/3036edb8-a5b7-4779-89ab-9a0eb0a2340f/services/9c308ccc-4819-4bac-ad9a-8078f1783b80/partitions/a97a7a26-06e5-4e9d-92ee-489d1a774bc2/replicas/133574518272615524s, LSN: -1, GlobalCommittedLsn: -1, PartitionKeyRangeId: , IsValid: False, StatusCode: 503, SubStatusCode: 20006, RequestCharge: 0, ItemLSN: -1, SessionToken: , UsingLocalLSN: False, TransportException: A client transport error occurred: The connection failed. (Time: 2024-05-31T17:07:47.5484179Z, activity ID: 29bda807-7374-4033-a295-dd3ba89246ab, error code: ConnectionBroken [0x0012], base error: socket error TimedOut [0x0000274C], URI: rntbd://10.0.1.17:11000/apps/3036edb8-a5b7-4779-89ab-9a0eb0a2340f/services/9c308ccc-4819-4bac-ad9a-8078f1783b80/partitions/a97a7a26-06e5-4e9d-92ee-489d1a774bc2/replicas/133574518272615524s, connection: 10.0.1.8:47632 -> 10.0.1.17:11000, payload sent: True), BELatencyMs: , ActivityId: 29bda807-7374-4033-a295-dd3ba89246ab, RetryAfterInMs: , ReplicaHealthStatuses: [(port: 11300 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11000 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11300 | status: Unknown | lkt: 5/31/2024 5:07:41 PM),(port: 11000 | status: Unknown | lkt: 5/31/2024 5:07:41 PM)], TransportRequestTimeline: {"requestTimeline":[{"event": "Created", "startTimeUtc": "2024-05-31T17:07:41.1594362Z", "durationInMs": 0.0166},{"event": "ChannelAcquisitionStarted", "startTimeUtc": "2024-05-31T17:07:41.1594528Z", "durationInMs": 0.0086},{"event": "Pipelined", "startTimeUtc": "2024-05-31T17:07:41.1594614Z", "durationInMs": 0.0422},{"event": "Transit Time", "startTimeUtc": "2024-05-31T17:07:41.1595036Z", "durationInMs": 6389.3898},{"event": "Failed", "startTimeUtc": "2024-05-31T17:07:47.5488934Z", "durationInMs": 0}],"serviceEndpointStats":{"inflightRequests":8,"openConnections":1},"connectionStats":{"waitforConnectionInit":"False","callsPendingReceive":7,"lastSendAttempt":"2024-05-31T17:07:40.7640721Z","lastSend":"2024-05-31T17:07:40.7640802Z","lastReceive":"2024-05-31T17:07:26.7535939Z"},"requestSizeInBytes":725,"requestBodySizeInBytes":275};
ResourceType: DatabaseAccount, OperationType: MetadataCheckAccess
, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Linux/2.0 cosmos-netstandard-sdk/3.33.1
at Microsoft.Azure.Cosmos.GatewayStoreClient.ParseResponseAsync(HttpResponseMessage responseMessage, JsonSerializerSettings serializerSettings, DocumentServiceRequest request)
at Microsoft.Azure.Cosmos.GatewayStoreClient.InvokeAsync(DocumentServiceRequest request, ResourceType resourceType, Uri physicalAddress, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.GatewayStoreModel.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.GatewayStoreModel.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.Routing.PartitionKeyRangeCache.ExecutePartitionKeyRangeReadChangeFeedAsync(String collectionRid, INameValueCollection headers, ITrace trace, IClientSideRequestStatistics clientSideRequestStatistics, IDocumentClientRetryPolicy retryPolicy)
at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync[TParam,TPolicy](Func`1 callbackMethod, Func`3 callbackMethodWithParam, Func`2 callbackMethodWithPolicy, TParam param, IRetryPolicy retryPolicy, IRetryPolicy`1 retryPolicyWithArg, Func`1 inBackoffAlternateCallbackMethod, Func`2 inBackoffAlternateCallbackMethodWithPolicy, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)
at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)
at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync[TParam,TPolicy](Func`1 callbackMethod, Func`3 callbackMethodWithParam, Func`2 callbackMethodWithPolicy, TParam param, IRetryPolicy retryPolicy, IRetryPolicy`1 retryPolicyWithArg, Func`1 inBackoffAlternateCallbackMethod, Func`2 inBackoffAlternateCallbackMethodWithPolicy, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)
at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync[TParam,TPolicy](Func`1 callbackMethod, Func`3 callbackMethodWithParam, Func`2 callbackMethodWithPolicy, TParam param, IRetryPolicy retryPolicy, IRetryPolicy`1 retryPolicyWithArg, Func`1 inBackoffAlternateCallbackMethod, Func`2 inBackoffAlternateCallbackMethodWithPolicy, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)
at Microsoft.Azure.Cosmos.Routing.PartitionKeyRangeCache.GetRoutingMapForCollectionAsync(String collectionRid, CollectionRoutingMap previousRoutingMap, ITrace trace, IClientSideRequestStatistics clientSideRequestStatistics)
at Microsoft.Azure.Cosmos.AsyncCacheNonBlocking`2.AsyncLazyWithRefreshTask`1.CreateAndWaitForBackgroundRefreshTaskAsync(Func`2 createRefreshTask)
at Microsoft.Azure.Cosmos.AsyncCacheNonBlocking`2.UpdateCacheAndGetValueFromBackgroundTaskAsync(TKey key, AsyncLazyWithRefreshTask`1 initialValue, Func`2 callbackDelegate, String operationName)
at Microsoft.Azure.Cosmos.AsyncCacheNonBlocking`2.GetAsync(TKey key, Func`2 singleValueInitFunc, Func`2 forceRefresh)
at Microsoft.Azure.Cosmos.Routing.PartitionKeyRangeCache.TryLookupAsync(String collectionRid, CollectionRoutingMap previousValue, DocumentServiceRequest request, ITrace trace)
at Microsoft.Azure.Cosmos.Routing.PartitionKeyRangeCache.TryGetOverlappingRangesAsync(String collectionRid, Range`1 range, ITrace trace, Boolean forceRefresh)
at Microsoft.Azure.Cosmos.ContainerCore.GetFeedRangesAsync(ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](String containerName, String databaseName, OperationType operationType, ITrace trace, Func2 task, Func2 openTelemetry, String operationName, RequestOptions requestOptions)
at Microsoft.Azure.Cosmos.ClientContextCore.OperationHelperWithRootTraceAsync[TResult](String operationName, String containerName, String databaseName, OperationType operationType, RequestOptions requestOptions, Func2 task, Func2 openTelemetry, TraceComponent traceComponent, TraceLevel traceLevel)
--- End of inner exception stack trace ---

ealsur (Member) commented Jun 5, 2024

@albertofori This seems to be a failure on the service.

Microsoft.Azure.Cosmos.GatewayStoreClient.ParseResponseAsync(HttpResponseMessage responseMessage, JsonSerializerSettings serializerSettings, DocumentServiceRequest request)

This means that there was a response from the Cosmos DB Gateway endpoint. The content of the response from the Gateway service is what carries the AggregateException; the body of the Gateway response is what is being printed here, and these TCP errors are not happening on the client. It sounds like the Gateway service is being overly verbose in including the failure details.

The response was a ServiceUnavailable error (503).

ealsur (Member) commented Jun 5, 2024

Using the details from this error, I can see the same exception details in the service logs.

From the SDK side, this is a Service Unavailable and should be treated as such: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/conceptual-resilient-sdk-applications#timeouts-and-connectivity-related-failures-http-408503
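A minimal sketch of that guidance applied to this call (the retry count and backoff are illustrative, not recommended values):

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

static async Task<IReadOnlyList<FeedRange>> GetFeedRangesWithRetryAsync(Container container)
{
    const int maxAttempts = 3;
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await container.GetFeedRangesAsync();
        }
        catch (CosmosException ex) when (
            (ex.StatusCode == HttpStatusCode.ServiceUnavailable
                || ex.StatusCode == HttpStatusCode.RequestTimeout)
            && attempt < maxAttempts)
        {
            // Transient per the guidance above; back off briefly and try again.
            await Task.Delay(TimeSpan.FromMilliseconds(200 * attempt));
        }
    }
}
```

Note that this pattern only works once the failure actually surfaces as a CosmosException, which is exactly the gap being discussed here.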

But there are no SDK changes that can prevent the service from hitting this error or from producing a response body containing these DocumentClientExceptions.

Looking at the service logs, there also appear to be only 2 failures in a 24-hour period.

@ealsur ealsur added the Service Bug (The issue is created because of a Cosmos DB service bug.) label and removed the needs-investigation label Jun 5, 2024
ealsur (Member) commented Jun 5, 2024

If these are happening more frequently, please file a support ticket. They seem to be transient failures on the service, but the volume does not seem to be affecting the SLA.

@ealsur ealsur closed this as completed Jun 5, 2024
ealsur (Member) commented Jun 5, 2024

Accidentally closed

@ealsur ealsur reopened this Jun 5, 2024
albertofori (Member, Author) commented:

@ealsur Thanks!
I guess my question is: when do we expect a CosmosException to be thrown by the SDK? I would expect the Gateway's exception to be wrapped in a higher-level exception like CosmosException.

ealsur (Member) commented Jun 5, 2024

I would certainly also expect the outcome to be a CosmosException and not an AggregateException, and that is the reason I kept this open.

It seems we do account for similar cases:

But in this case, you are performing a GetFeedRanges call, which is purely a metadata operation. The gap might be that these potential cases are not handled here, because the operation does not flow through the Handler pipeline.

Reference:

IReadOnlyList<PartitionKeyRange> partitionKeyRanges = await partitionKeyRangeCache.TryGetOverlappingRangesAsync(

In this case, it would be ideal to avoid the AggregateException in the AsyncNonBlockingCache or do the conversion on GetFeedRanges. I'll tag this issue appropriately to signal it needs addressing.
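Purely as an illustration of the shape such a conversion could take (the internal helper and variable names here are hypothetical, not the SDK's actual internals):

```csharp
try
{
    partitionKeyRanges = await partitionKeyRangeCache.TryGetOverlappingRangesAsync(
        collectionRid,
        range,
        trace,
        forceRefresh);
}
catch (DocumentClientException dce)
{
    // Hypothetical sketch: re-surface the internal exception as the public
    // CosmosException so callers can handle metadata failures via StatusCode,
    // like any other SDK operation.
    throw CosmosExceptionFactory.Create(dce, trace);
}
```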

@ealsur ealsur added the bug (Something isn't working) label Jun 5, 2024
ealsur (Member) commented Jun 5, 2024

The "reference" however (what is mentioned in the title) cannot be removed, as it's part of the content of the Gateway response and internal.

@ealsur ealsur changed the title Reference to exceptions from V2 library in V3 AggregateException. AggregateException being thrown on GetFeedRanges when Gateway fails instead of CosmosException Jun 5, 2024
@ealsur ealsur changed the title AggregateException being thrown on GetFeedRanges when Gateway fails instead of CosmosException AggregateException (instead of CosmosException) being thrown on GetFeedRanges when Gateway fails Jun 5, 2024
albertofori (Member, Author) commented:

> I would certainly also expect the outcome to be a CosmosException and not an AggregateException, and that is the reason I kept this open.
>
> It seems we do account for similar cases:
>
> But in this case, you are performing a GetFeedRanges call, which is purely a metadata operation. The gap might be that these potential cases are not handled here, because the operation does not flow through the Handler pipeline.
>
> Reference:
>
> IReadOnlyList<PartitionKeyRange> partitionKeyRanges = await partitionKeyRangeCache.TryGetOverlappingRangesAsync(
>
> In this case, it would be ideal to avoid the AggregateException in the AsyncNonBlockingCache or do the conversion on GetFeedRanges. I'll tag this issue appropriately to signal it needs addressing.

Sounds good! Getting this as a CosmosException would make handling the error more consistent with other errors, via the StatusCode property, without having to dive into the InnerException. Thanks a lot @ealsur!

albertofori (Member, Author) commented:

The "reference" however (what is mentioned in the title) cannot be removed, as it's part of the content of the Gateway response and internal.

Understood, and I agree with this as it provides more information on the underlying cause.

albertofori (Member, Author) commented:

Hi @ealsur,

I am wondering whether any of the updates after version 3.33.1 address the issue discussed here. I just wanted to confirm whether this should still be expected with the latest version of the Microsoft.Azure.Cosmos package.

ealsur (Member) commented Aug 12, 2024

I don't see any PRs that have fixed this issue yet.

@ealsur ealsur removed the Service Bug (The issue is created because of a Cosmos DB service bug.) label Aug 12, 2024
albertofori (Member, Author) commented Aug 12, 2024

@ealsur We are currently seeing more of these errors in our microservices. We have logic to handle the inner exceptions of an AggregateException, but since the inner DocumentClientException is not public in the v3 SDK, we cannot write appropriate error-handling logic for this particular 503 error, and our services crash as a result of this inconsistency.

Since we already agreed on a proposed fix to surface this as a 503 CosmosException (similar to other requests that are not feed range queries), is it possible to commit to this fix in the next release?
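For context, the only interim handling available to us looks something like this brittle sketch, using type-name matching plus reflection since the type cannot be referenced; all names are illustrative:

```csharp
using System;
using System.Linq;
using System.Net;

static bool IsCosmosServiceUnavailable(AggregateException agg)
{
    Exception inner = agg.Flatten().InnerExceptions.FirstOrDefault();
    if (inner == null || inner.GetType().Name != "DocumentClientException")
    {
        return false;
    }

    // The DocumentClientException type cannot be named in a catch clause from
    // the v3 package, so its StatusCode is read via reflection here.
    object status = inner.GetType().GetProperty("StatusCode")?.GetValue(inner);
    return Equals(status, HttpStatusCode.ServiceUnavailable);
}
```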

ealsur (Member) commented Aug 12, 2024

@albertofori In our attempts to repro, the only thing we could find is that the GetFeedRangesAsync call throws a DocumentClientException, which should instead be a CosmosException.

But there is no AggregateException.

Could the AggregateException be coming from the fact that you are receiving the outcome of this call in something like a .ContinueWith block?
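For reference, a faulted Task surfaces its exception differently depending on how it is observed. A minimal sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Inside an async method; "container" is a hypothetical Container instance.
Task<IReadOnlyList<FeedRange>> task = container.GetFeedRangesAsync();

// ContinueWith observes the raw faulted Task: t.Exception is an
// AggregateException wrapping whatever the SDK actually threw.
Task logFaulted = task.ContinueWith(
    t => Console.WriteLine(t.Exception?.InnerException?.GetType().Name),
    TaskContinuationOptions.OnlyOnFaulted);

// await, by contrast, unwraps the AggregateException and rethrows the first
// inner exception directly.
try
{
    IReadOnlyList<FeedRange> ranges = await task;
}
catch (Exception ex)
{
    Console.WriteLine(ex.GetType().Name); // the SDK's own exception type
}
```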

albertofori (Member, Author) commented Aug 12, 2024

Hi @ealsur, indeed, we do use a ContinueWith block, so the top-level AggregateException is likely coming from our own flow. The SDK might be directly throwing a DocumentClientException, as you have reproed.

So, as you mentioned, a change to throw a CosmosException instead of the DocumentClientException should be sufficient for us.

albertofori (Member, Author) commented Aug 19, 2024

Thanks @ealsur! Could you let me know when the next release of the SDK will be?
