
Cancellation with Cosmos DB Exception properties #371

Closed
jimmyca15 opened this issue Jun 4, 2019 · 7 comments
Comments

jimmyca15 commented Jun 4, 2019 (Member)

Is your feature request related to a problem? Please describe.
We sometimes experience long running requests to Cosmos DB. When collaborating with Cosmos DB support to diagnose the issue, the first piece of information that is requested is a stack trace for the exception. This gives information like Activity ID.

Unfortunately, we never have these stack traces because we limit our requests to 5 seconds. This is a requirement of our service: we need to respond quickly, even in error. So when Cosmos DB has these long-running requests, our stack traces are generic TaskCanceledException traces caused by the CancellationToken being canceled.

The request becomes very difficult to track at this point.

Describe the solution you'd like
I would like a way to limit our requests using the cancellation token and still get a relevant Cosmos DB exception that contains the activity ID, request ID, correlation ID, etc., when the cancellation token has been canceled.

Describe alternatives you've considered
An alternative, but less desirable, solution would be to allow the client to set a 5-second timeout that still preserves the detailed Cosmos DB exceptions. This isn't as desirable because a 5-second timeout on the client itself doesn't offer as granular control as passing a cancellation token. Cancellation tokens can be linked, among other things.

A workaround of sorts that we considered would be to set request IDs before talking to Cosmos DB and then log them, but there is no way to set this with the version of the SDK that we use, plus it requires extra code.
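To illustrate the "cancellation tokens can be linked" point above, here is a minimal sketch (names and structure are my own, not from the issue or the SDK): a per-request deadline composed with a caller-provided token through a linked `CancellationTokenSource`, which a single client-wide timeout setting cannot express.

```csharp
using System;
using System.Threading;

class LinkedCancellationSketch
{
    static void Main()
    {
        // Hard per-request deadline: fires automatically after 5 seconds.
        using var deadline = new CancellationTokenSource(TimeSpan.FromSeconds(5));

        // Caller-initiated cancellation (e.g. the incoming request was aborted).
        using var caller = new CancellationTokenSource();

        // The Cosmos DB call would observe whichever source cancels first.
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(
            deadline.Token, caller.Token);

        caller.Cancel();
        Console.WriteLine(linked.Token.IsCancellationRequested); // True
    }
}
```

Each request can combine a different set of sources this way, which is the granularity a single client-level timeout lacks.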


j82w commented Jun 5, 2019

This is a rather interesting scenario. Here are the possible solutions that I can think of.

  1. I think the best solution for your issue is to not cancel the operation. Instead, start another task that makes the same call, and take whichever request returns fastest. Then log the request diagnostics and other information from the long-running request.

  2. Look at the trace logs to figure out the request information. It will probably be painful, though, to identify which request was having the issue.

  3. New feature where this scenario is better handled. @kirankumarkolli this is an interesting scenario that should be included for the new diagnostics that will be implemented in the v3 SDK.
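The `Task.WhenAny` pattern underlying option 1 can be sketched roughly as follows (assumed shape, not SDK code): return fast when a timeout wins the race, but keep observing the slow call so its exception and diagnostics can still be logged when it eventually completes.

```csharp
using System;
using System.Threading.Tasks;

class RaceSketch
{
    // Stand-in for the real Cosmos DB SDK call.
    static async Task<string> CosmosCall()
    {
        await Task.Delay(200);
        return "result";
    }

    static async Task Main()
    {
        Task<string> call = CosmosCall();
        Task timeout = Task.Delay(50);

        if (await Task.WhenAny(call, timeout) == timeout)
        {
            // Respond fast, but do not abandon the call: observe it in the
            // background so its diagnostics (activity ID, etc.) can be logged.
            Console.WriteLine("timed out");
            _ = call.ContinueWith(t => { /* log t.Exception or result diagnostics */ });
        }
        else
        {
            Console.WriteLine(await call);
        }
    }
}
```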

jimmyca15 commented (Member, Author)

  1. We can't start another task if this is a write operation. It's not necessarily idempotent. Also if we spin up parallel requests we are sacrificing resources for no reason.

  2. What trace logs do you mean?


j82w commented Jun 5, 2019

jimmyca15 commented (Member, Author)

@j82w We run on Linux containers, so this doesn't look like an option.


j82w commented Jun 5, 2019

The best fix for this issue is going to be a new feature to make sure this information is accessible.

You could try using ConnectionPolicy.RequestTimeout to cancel the process. The one downside is that it applies to every request made through the client.

It's possible the SDK is doing retries for throttling and other common exceptions. Do you see any errors in the portal metrics?
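For reference, a sketch of the RequestTimeout suggestion above (v2 SDK shape, assuming the Microsoft.Azure.DocumentDB package; endpoint and key are placeholders). The timeout is set once on the client and applies to every request it makes, which is the client-wide downside just mentioned.

```csharp
using System;
using Microsoft.Azure.Documents.Client;

var policy = new ConnectionPolicy
{
    // Client-wide request timeout; detailed Cosmos DB exceptions are preserved.
    RequestTimeout = TimeSpan.FromSeconds(5),
    ConnectionMode = ConnectionMode.Direct,
    ConnectionProtocol = Protocol.Tcp,
};

var client = new DocumentClient(
    new Uri("https://your-account.documents.azure.com:443/"), // placeholder endpoint
    "<auth-key>",                                             // placeholder key
    policy);
```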

jimmyca15 commented (Member, Author)

We have a hard 5-second limit on all requests, so configuring that as the maximum at the SDK level may be fine. It isn't a complete solution, though, since we need some requests to cancel even sooner. Does it apply to TCP/Direct connections?

We turn off all retries in the SDK and do retries ourselves using our own invoker implementation. In most incidents we don't see any errors in the portal metrics.


j82w commented Mar 26, 2021

This was fixed in #1550

j82w closed this as completed Mar 26, 2021