
Cancellation with Cosmos DB Exception properties #371

Closed
jimmyca15 opened this issue Jun 4, 2019 · 7 comments
Comments

jimmyca15 commented Jun 4, 2019 (Member)

Is your feature request related to a problem? Please describe.
We sometimes experience long running requests to Cosmos DB. When collaborating with Cosmos DB support to diagnose the issue, the first piece of information that is requested is a stack trace for the exception. This gives information like Activity ID.

Unfortunately, we never have these stack traces because we limit our requests to 5 seconds. This is a requirement of our service: we need to respond quickly, even in error. So when Cosmos DB has these long-running requests, our stack traces are generic TaskCanceledException traces caused by the CancellationToken being canceled.

The request becomes very difficult to track at this point.

Describe the solution you'd like
I would like a way to limit our requests using the cancellation token and still get a relevant Cosmos DB exception that contains the activity ID, request ID, correlation ID, etc., when the cancellation token has been canceled.

Describe alternatives you've considered
An alternative, but less desirable, solution would be to allow the client to set a 5-second timeout that still preserves the detailed Cosmos DB exceptions. This isn't as desirable because a 5-second timeout on the client itself doesn't offer as granular control as passing a cancellation token. Cancellation tokens can be linked, among other things.

A workaround of sorts that we considered would be to set request IDs before talking to Cosmos DB and then log them, but there is no way to set this with the version of the SDK that we use, plus it requires extra code.
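To illustrate the "cancellation tokens can be linked" point above, here is a minimal sketch (names and structure are my own, not from the issue or the SDK): a per-request deadline composed with a caller-provided token through a linked `CancellationTokenSource`, which a single client-wide timeout setting cannot express.

```csharp
using System;
using System.Threading;

class LinkedCancellationSketch
{
    static void Main()
    {
        // Hard per-request deadline: fires automatically after 5 seconds.
        using var deadline = new CancellationTokenSource(TimeSpan.FromSeconds(5));

        // Caller-initiated cancellation (e.g. the incoming request was aborted).
        using var caller = new CancellationTokenSource();

        // The Cosmos DB call would observe whichever source cancels first.
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(
            deadline.Token, caller.Token);

        caller.Cancel();
        Console.WriteLine(linked.Token.IsCancellationRequested); // True
    }
}
```

Each request can combine a different set of sources this way, which is the granularity a single client-level timeout lacks.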


j82w commented Jun 5, 2019

This is a rather interesting scenario. Here are the possible solutions that I can think of.

  1. I think the best solution for your issue is to not cancel the operation. Instead, start another task that makes the same call, and take whichever request returns fastest. Then log the request diagnostics and other information from the long-running request.

  2. Look at the trace logs to figure out the request information. It will probably be painful, though, to identify which request was having the issue.

  3. New feature where this scenario is better handled. @kirankumarkolli this is an interesting scenario that should be included for the new diagnostics that will be implemented in the v3 SDK.
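The `Task.WhenAny` pattern underlying option 1 can be sketched roughly as follows (assumed shape, not SDK code): return fast when a timeout wins the race, but keep observing the slow call so its exception and diagnostics can still be logged when it eventually completes.

```csharp
using System;
using System.Threading.Tasks;

class RaceSketch
{
    // Stand-in for the real Cosmos DB SDK call.
    static async Task<string> CosmosCall()
    {
        await Task.Delay(200);
        return "result";
    }

    static async Task Main()
    {
        Task<string> call = CosmosCall();
        Task timeout = Task.Delay(50);

        if (await Task.WhenAny(call, timeout) == timeout)
        {
            // Respond fast, but do not abandon the call: observe it in the
            // background so its diagnostics (activity ID, etc.) can be logged.
            Console.WriteLine("timed out");
            _ = call.ContinueWith(t => { /* log t.Exception or result diagnostics */ });
        }
        else
        {
            Console.WriteLine(await call);
        }
    }
}
```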

jimmyca15 commented (Member, Author)

  1. We can't start another task if this is a write operation. It's not necessarily idempotent. Also if we spin up parallel requests we are sacrificing resources for no reason.

  2. What trace logs do you mean?


j82w commented Jun 5, 2019

jimmyca15 commented (Member, Author)

@j82w We run on Linux containers, so this doesn't look like an option.


j82w commented Jun 5, 2019

The best fix for this issue is going to be a new feature to make sure this information is accessible.

You could try using ConnectionPolicy.RequestTimeout to cancel the process. The one downside is that it applies to every request made through the client.

It's possible the SDK is doing retries for throttling and other common exceptions. Do you see any errors in the portal metrics?
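For reference, a sketch of the RequestTimeout suggestion above (v2 SDK shape, assuming the Microsoft.Azure.DocumentDB package; endpoint and key are placeholders). The timeout is set once on the client and applies to every request it makes, which is the client-wide downside just mentioned.

```csharp
using System;
using Microsoft.Azure.Documents.Client;

var policy = new ConnectionPolicy
{
    // Client-wide request timeout; detailed Cosmos DB exceptions are preserved.
    RequestTimeout = TimeSpan.FromSeconds(5),
    ConnectionMode = ConnectionMode.Direct,
    ConnectionProtocol = Protocol.Tcp,
};

var client = new DocumentClient(
    new Uri("https://your-account.documents.azure.com:443/"), // placeholder endpoint
    "<auth-key>",                                             // placeholder key
    policy);
```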

jimmyca15 commented (Member, Author)

We have a hard 5-second limit on all requests, so configuring that as the maximum at the SDK level may be fine. It isn't a complete solution, though, since we need some requests to cancel even sooner. Does it apply to TCP/Direct connections?

We turn off all retries in the SDK and do retries ourselves using our own invoker implementation. In most incidents we don't see any errors in the portal metrics.


j82w commented Mar 26, 2021

This was fixed in #1550

j82w closed this as completed Mar 26, 2021