NullReferenceException in GoneAndRetryWithRequestRetryPolicy.TryHandleResponseSynchronously in SDK 3.18 #2485
@craigjensen Is this on the Emulator or a live account? If it's on a live account, which consistency level are you using? Does the account have Firewall/VNET/Private Link enabled? |
It's a live account. The account is configured with Bounded Staleness but we're overriding with Session on each request. No Firewall/VNET/PL enabled |
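For readers following along, a per-request override like the one described above is typically expressed through ItemRequestOptions. A minimal sketch, assuming a placeholder Order document type and an existing Container instance (names are illustrative, not from this thread):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Placeholder document type used in these sketches.
public class Order
{
    public string id { get; set; }  // Cosmos DB item id
    public string Pk { get; set; }  // partition key value
}

public static class ConsistencyOverrideSample
{
    // The account is configured with Bounded Staleness; this request opts into Session.
    public static async Task<Order> ReadWithSessionAsync(Container container, string id, string pk)
    {
        var options = new ItemRequestOptions
        {
            ConsistencyLevel = ConsistencyLevel.Session
        };

        ItemResponse<Order> response =
            await container.ReadItemAsync<Order>(id, new PartitionKey(pk), options);
        return response.Resource;
    }
}
```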
@craigjensen On which platform are you running? .NET Framework? .NET Core? Do you have traces enabled? |
NetFx in an Azure WebApp so tracing not enabled |
@craigjensen does it reproduce with the latest 3.18.0? |
Yes - It started with 3.18.0. We didn't see it until we recently deployed 3.18.0 |
Any chance you can use a private nuget drop with additional telemetry? |
This is a production service and we're in lockdown for //Build next week so not likely, at least not anytime soon. |
@craigjensen - Is your service running in EUAP? |
Do you have any way to consistently reproduce the issue? Does it not show up in testing environments? |
@ealsur Yes, we do run the service in eus2euap but there's very little traffic there and we haven't seen the exception in that region. Most of the exceptions are coming from France Central but we have seen it in several US regions as well. |
@j82w We don't have a consistent repro. It happens in bursts - here's what we're seeing over the last day: |
Oh, and we haven't seen it in any of our test/integration environments |
The spikes in exceptions are correlated with bursts of activity on our servers that caused the CPU to max out so they're likely an artifact of resource starvation on the client. I would expect errors in this situation but not NullRefExceptions. |
We are also seeing a very similar behavior. It seems to be very intermittent and also appears to happen in "bursts". Is there a version of the SDK that we can go back to that does not have this issue? |
@rjmille2 we are still in the process of root-causing this issue, so we don't know which versions are impacted or even if the bug is caused by some change in the service. Can you use a private SDK drop with additional logging? |
We have not seen this reproduce in our "testing" environments. It seems to only be happening in production (higher loads), and I don't think we would be able to put a private SDK build into production, but I will ask. |
@j82w also wanted to let you know we did not see this issue at all until we went from 3.17.1 to 3.18.0. So I think this was introduced with that build. |
@rjmille2 what consistency level is your account/SDK using? On what operations do you see the exception? |
@j82w we are using session consistency on the account and I don't think we are modifying it at all in the sdk. It seems to be happening on an UpsertAsync operation. |
@j82w it seems that we have EnableContentResponseOnWrite = false in the ItemRequestOptions. Not sure if that could be causing the issue since it says "Setting the option to false will cause the response to have a null resource. This reduces networking and CPU load by not sending the resource back over the network and serializing it on the client." |
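For reference, a minimal sketch of the setting being described, reusing the placeholder Order type and container from the earlier sketch; with EnableContentResponseOnWrite = false the write still succeeds, but the response body (Resource) is intentionally null:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class UpsertWithoutContentSample
{
    public static async Task UpsertAsync(Container container, Order order)
    {
        var options = new ItemRequestOptions
        {
            // The service does not send the document back; response.Resource will be null.
            EnableContentResponseOnWrite = false
        };

        ItemResponse<Order> response =
            await container.UpsertItemAsync(order, new PartitionKey(order.Pk), options);

        // Status code and request charge are still populated; only the body is omitted.
        Console.WriteLine($"{response.StatusCode} ({response.RequestCharge} RU)");
    }
}
```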
@rjmille2 can you try upgrading to 3.19.0, which was just released, or enable tracing? It adds diagnostics to the null reference exception to help root-cause these types of issues. |
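For anyone wanting to capture those diagnostics, a sketch of one way to log them around the failing call (placeholder names as before). CosmosException exposes a Diagnostics property; per the comment above, 3.19.0 also attaches diagnostics to the wrapped null reference exception, so logging the full exception text captures that as well:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class DiagnosticsLoggingSample
{
    public static async Task UpsertWithDiagnosticsAsync(Container container, Order order)
    {
        try
        {
            await container.UpsertItemAsync(order, new PartitionKey(order.Pk));
        }
        catch (CosmosException ex)
        {
            // SDK-surfaced failures carry a CosmosDiagnostics payload.
            Console.WriteLine(ex.Diagnostics.ToString());
            throw;
        }
        catch (Exception ex)
        {
            // The wrapped null reference exception is not a CosmosException; per the
            // comment above, 3.19.0 appends diagnostics to it, so log the full text.
            Console.WriteLine(ex.ToString());
            throw;
        }
    }
}
```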
@j82w We'll go ahead and update to 3.19.0 but it won't be fully released until sometime next week at the earliest |
@j82w & @craigjensen we were able to repro this issue in our pre-production environment and have a stack trace that seems to have more information in it. Would you be able to get the information out of the support ticket (2105250010002841)? |
@j82w, @craigjensen were you able to retrieve the additional information from the support ticket? There seems to be quite a bit of additional information in the exceptions and I'm not sure if any of it is "private" so I don't really want to post it in a public forum. Could you let me know if you're able to retrieve it from the support ticket or give me another method to send it to you directly? |
@rjmille2 Can you confirm what is the CPU usage during the times you are getting this error? Please check MAX CPU, not AVG, on the machine running the code. Do you see an increase in the MAX CPU during these times? |
@rjmille2 Thanks. CPU is rather on the high end of the spectrum, which might cause TCP timeouts. I wonder if that is the case. Do you have any times when CPU goes lower and the exceptions disappear? |
There is a custom-built NuGet version |
@ealsur I tried to build the project with the 3.19.0-debug NuGet and now it is having a problem with the package or something. This is a Service Fabric project and it utilizes fabutil during the build process. If I use the 3.19.0 stable version I don't get this error. I'm wondering if maybe your debug NuGet doesn't dual-target full framework and .NET Core? Here is the error message: |
@ealsur it seems like even though I've updated to a "local" NuGet package reference, FabActUtil is still loading the signed 3.19.0 version. I'm not sure how we can get FabActUtil to load a different version? I got this data using fuslogvw to trace .NET assembly bindings:
*** Assembly Binder Log Entry (6/1/2021 @ 4:19:21 PM) ***
The operation was successful.
Assembly manager loaded from: C:\Windows\Microsoft.NET\Framework64\v4.0.30319\clr.dll
=== Pre-bind state information === |
I don't know what "fabutil" is, but this is an unsigned local package; it is not published on NuGet. If "fabutil" relies on NuGet to resolve references, that might be the issue. Or do you have other components that also reference the SDK, but against the NuGet package rather than the custom build? Unsigned, unpublished packages can only be used locally, unless you publish the DLLs directly to the instances so they can use them? |
@ealsur FabActUtil is part of a service fabric build process. I think it is doing some magic to call assembly.load during the Service Fabric build process. So I guess, if we can't get the SF team involved to try to figure out why it continues to load the signed version, then I'm not going to be able to try out your -debug version of the SDK. |
We're still seeing these exceptions in 3.19.0. Here's the latest exception message: "outerType": Microsoft.Azure.Cosmos.CosmosNullReferenceException, |
@craigjensen can you also include the stack trace? |
"innermostType": System.NullReferenceException, |
Seems like timeouts induced by CPU exhaustion. The data in those diagnostics shows CPU going from 30% to 100% during the operation: (2021-06-15T13:27:28.9806276Z 30.321), (2021-06-15T13:27:38.9733279Z 29.867), (2021-06-15T13:27:48.9807079Z 33.529), (2021-06-15T13:27:58.9843369Z 26.371), (2021-06-15T13:28:08.9789417Z 29.350), (2021-06-15T13:28:19.0090321Z 64.253), |
Yes - these are related to high cpu on the client. As I mentioned above, I would expect errors in this situation but not NullReferenceExceptions. |
Here is another data point for y'all to consider. System.NullReferenceException: Object reference not set to an instance of an object. |
We figured out the issue. It was introduced in the following PR: the original RegionsContacted hashset is never getting set, so it is always null. There are certain service-unavailable code paths that check whether multiple regions were contacted; since the hashset is never set, it is null, which causes the null reference exceptions. (See line 75 in commit 867c8c8.) |
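In simplified form, the bug pattern being described looks like the following. This is an illustrative sketch with made-up type names, not the actual SDK code:

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of the described bug pattern; not the actual SDK code.
internal sealed class RequestStatistics
{
    // Bug: declared but never assigned on the affected code path, so it stays null.
    public HashSet<Uri> RegionsContacted { get; set; }
}

internal static class ServiceUnavailableHelper
{
    public static bool ContactedMultipleRegions(RequestStatistics stats)
    {
        // Throws NullReferenceException when RegionsContacted was never set.
        // The fix amounts to initializing the set or guarding for null:
        //   return stats.RegionsContacted != null && stats.RegionsContacted.Count > 1;
        return stats.RegionsContacted.Count > 1;
    }
}
```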
Adding StackTrace with line numbers: |
We are working on a fix. It will be included in the next SDK release, which should be out in the next week or so. The null reference exception should only happen in scenarios where the request was going to fail with a service unavailable exception, most likely caused by high CPU or port exhaustion. The null reference will not happen in other scenarios. |
3.20.0 was released with the fix. |
We're seeing a null ref exception occasionally in production code running in Azure on SDK 3.18. It seems to happen intermittently in bursts (we saw ~1200 in a 1 min interval at 2021-05-19T15:55:00Z) across different operations (Read, Create, Query).
Stack:
"innermostType": System.NullReferenceException,
"innermostMessage": Object reference not set to an instance of an object.,
"details": at Microsoft.Azure.Documents.GoneAndRetryWithRequestRetryPolicy
1.TryHandleResponseSynchronously(DocumentServiceRequest request, TResponse response, Exception exception, ShouldRetryResult& shouldRetryResult) at Microsoft.Azure.Documents.RequestRetryUtility.<ProcessRequestAsync>d__2
2.MoveNext()--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Documents.StoreClient.d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.TransportHandler.d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.TransportHandler.d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.RouterHandler.d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.RequestHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.RequestHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.RequestHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.d__8.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.ContainerCore.d__87.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.ContainerCore.d__55.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.ClientContextCore.d__38`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.ClientContextCore.<OperationHelperWithRootTraceAsync>d__29`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Intercom.Azure.Helpers.CosmosDB.CosmosDBSqlClient`1.d__42.MoveNext() in C:\__w\1\s\Utilities\Intercom.Azure.Helpers.NetStd\CosmosDB\CosmosDBSqlClient.cs:line 442