
CosmosException message sometimes contains full stack traces #1507

Open
majastrz opened this issue May 13, 2020 · 7 comments

@majastrz
Member

We are continuously addressing and improving the SDK; if possible, please make sure the problem persists in the latest SDK version.

Describe the bug
We have noticed that in throttling conditions, when the built-in retries are exhausted, the Cosmos DB SDK sometimes throws a CosmosException whose message contains full stack traces.

To Reproduce
I don't have a minimal repro here and I don't fully understand what causes the SDK to return the stack traces, but it doesn't always happen.

The issue actually caused an outage in our service. We have several background jobs that append their status (including exception messages) to a document in our collection. In throttling conditions, we would append the "final" CosmosException message to that document. Unfortunately, we saw that this exception message would sometimes exceed 100KB. Combined with further retries and failures, it would grow the document to over 600KB, making it more and more expensive to upsert and prolonging the outage.

Expected behavior
Exception messages should not exceed 100KB and should not contain full stack traces. (We truncate the exception messages before saving them, but we should not have had to do that.)
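For context, a minimal sketch of the kind of truncation guard we mean (the helper name and length cap are illustrative, not our actual code):

```csharp
using System;

// Illustrative only: cap the exception text persisted into the status document
// so an oversized CosmosException.Message cannot bloat the document itself.
const int MaxPersistedExceptionLength = 2048; // assumed cap, not an SDK constant

static string TruncateForStorage(Exception ex)
{
    string message = ex.Message ?? string.Empty;
    return message.Length <= MaxPersistedExceptionLength
        ? message
        : message.Substring(0, MaxPersistedExceptionLength) + " …[truncated]";
}
```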

Actual behavior
In some cases, the SDK throws a CosmosException with messages over 100KB containing full stack traces.

Environment summary
SDK Version: 3.5.1
OS Version: Windows

Additional context
N/A

@j82w
Contributor

j82w commented May 13, 2020

By any chance do you know what operation is causing it? @ealsur is this possibly the same bug you already fixed with the task.yield change in the direct package?

@ealsur
Member

ealsur commented May 13, 2020

Task.Yield would not affect the actual stack trace produced (just how it is maintained in the stack/heap), and 429s are exception-less.

@majastrz Are you changing the retry configuration?

Our retries are based on async/await, so if a 429 happens, we retry, but 429s don't produce an exception in themselves. The Retry handlers inspect the Response, and if it's a 429, they retry it.
What might be happening, though, is that the stack grows, because the call chain is basically:

CreateItemAsync
-- RetryHandler (simplifying, there are other handlers involved)
-- Transport (sends payload)
-- RetryHandler (receives 429, decides to retry)
-- Transport (sends payload)
(... this repeats X times, until...)
-- RetryHandler (receives 429, exhausted all retry options)
-- throws CosmosException

So the exception might contain the complete stack since the CreateItemAsync (or whatever operation) was initiated. The thing here is that there could be other retries for other reasons (connection blip, partition split, etc.), so truncating the stack is really tricky (what do you truncate, and why?).
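(For reference, a minimal sketch of where this 429 retry behavior is configured on the client, assuming the usual v3 CosmosClientOptions properties; the endpoint, key, and values shown are illustrative:)

```csharp
using System;
using Microsoft.Azure.Cosmos;

// Illustrative only: these options drive the RetryHandler behavior described above.
CosmosClient client = new CosmosClient(
    "https://your-account.documents.azure.com:443/", // placeholder endpoint
    "<account-key>",                                 // placeholder key
    new CosmosClientOptions
    {
        // How many times a 429 is retried inside the handler pipeline before the
        // operation finally surfaces a CosmosException to the caller.
        MaxRetryAttemptsOnRateLimitedRequests = 9,

        // Upper bound on the total time spent waiting across those retries.
        MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30)
    });
```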

@majastrz
Member Author

@ealsur No, we are using the default retry configuration currently. We didn't add any custom handlers to the client's pipeline, either.

Is it possible/feasible for you to use InnerException or create an InnerExceptions property (like on AggregateException) to store all of the attempts instead of dumping them into the message?

And would disabling the default retry policy be a viable workaround for this behavior?

@ealsur
Member

ealsur commented May 13, 2020

@j82w are you dumping the stack trace on the Message, or is it the ToString()?

@majastrz You can disable the retry policy, but then you will get an exception on the first 429; you can retry on your end, though.
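A minimal sketch of that workaround, assuming the usual v3 options and the CosmosException.RetryAfter hint; the helper name, back-off, and attempt count are illustrative:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class ThrottleWorkaround
{
    // Illustrative: create the client with the SDK's own 429 retries disabled.
    public static CosmosClient CreateClient(string endpoint, string key) =>
        new CosmosClient(endpoint, key, new CosmosClientOptions
        {
            MaxRetryAttemptsOnRateLimitedRequests = 0
        });

    // Application-level retry: the exception observed on each attempt stays small,
    // and the back-off policy is fully under the caller's control.
    public static async Task<ItemResponse<T>> CreateWithRetryAsync<T>(
        Container container, T item, int maxAttempts = 5)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await container.CreateItemAsync(item);
            }
            catch (CosmosException ex) when (
                ex.StatusCode == (HttpStatusCode)429 && attempt < maxAttempts)
            {
                // RetryAfter is the server's suggested wait; fall back to a short fixed delay.
                await Task.Delay(ex.RetryAfter ?? TimeSpan.FromSeconds(1));
            }
        }
    }
}
```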

@marcre

marcre commented May 13, 2020

I still have one of the captured exception messages. Here's the start:

Response status code does not indicate success: 429 Substatus: 3200 Reason: (Microsoft.Azure.Cosmos.Query.Core.Monads.ExceptionWithStackTraceException: TryCatch resulted in an exception. ---> Microsoft.Azure.Cosmos.Query.Core.Monads.ExceptionWithStackTraceException: TryCatch resulted in an exception. ---> Microsoft.Azure.Cosmos.Query.Core.Monads.ExceptionWithStackTraceException: TryCatch resulted in an exception. ---> Microsoft.Azure.Cosmos.Query.Core.Monads.ExceptionWithStackTraceException: TryCatch resulted in an exception. ---> Microsoft.Azure.Cosmos.CosmosException : Exception of type 'Microsoft.Azure.Cosmos.CosmosException' was thrown.\r\nStatusCode = 429;\r\nSubStatusCode = 3200;\r\nActivityId = dd8b6358-5ee5-4ba6-990f-4ec4a25f6445;\r\nRequestCharge = 30.01;\r\n\r\n --- End of inner exception stack trace ---\r\n at Microsoft.Azure.Cosmos.Query.Core

@ealsur
Member

ealsur commented May 13, 2020

That's a throttle too, seems to come from a query.

@j82w
Contributor

j82w commented May 13, 2020

This might be a result of the query exception handling where it is creating a stack trace.

@ealsur this was changed in version 3.7.1 with #1298. I believe previous versions do contain the stack trace.
