
NullReferenceException in GoneAndRetryWithRequestRetryPolicy.TryHandleResponseSynchronously in SDK 3.18 #2485

Closed
craigjensen opened this issue May 19, 2021 · 42 comments · Fixed by #2562


craigjensen commented May 19, 2021

We're occasionally seeing a NullReferenceException in production code running in Azure on SDK 3.18. It happens intermittently in bursts (we saw ~1200 in a 1-minute interval at 2021-05-19T15:55:00Z) across different operations (Read, Create, Query).

Stack:
"innermostType": System.NullReferenceException,
"innermostMessage": Object reference not set to an instance of an object.,
"details": at Microsoft.Azure.Documents.GoneAndRetryWithRequestRetryPolicy1.TryHandleResponseSynchronously(DocumentServiceRequest request, TResponse response, Exception exception, ShouldRetryResult& shouldRetryResult) at Microsoft.Azure.Documents.RequestRetryUtility.<ProcessRequestAsync>d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Documents.StoreClient.d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.TransportHandler.d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.TransportHandler.d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.RouterHandler.d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.RequestHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.RequestHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.RequestHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.d__8.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.ContainerCore.d__87.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.ContainerCore.d__55.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.ClientContextCore.d__38`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.ClientContextCore.<OperationHelperWithRootTraceAsync>d__29`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Intercom.Azure.Helpers.CosmosDB.CosmosDBSqlClient`1.d__42.MoveNext() in C:__w\1\s\Utilities\Intercom.Azure.Helpers.NetStd\CosmosDB\CosmosDBSqlClient.cs:line 442

ealsur (Member) commented May 19, 2021

@craigjensen Is this on the Emulator or a live account? If it's on a live account, which consistency level are you using? Does the account have Firewall/VNET/Private Link enabled?

craigjensen (Author) commented

It's a live account. The account is configured with Bounded Staleness, but we're overriding with Session consistency on each request. No Firewall/VNET/Private Link enabled.
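For reference, the per-request override looks roughly like this (a minimal sketch with placeholder names for the container, document type, id, and partition key, not our actual code):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class ConsistencyOverrideExample
{
    // Reads an item with Session consistency even though the account default is Bounded Staleness.
    public static async Task<MyDocument> ReadWithSessionAsync(Container container)
    {
        ItemRequestOptions options = new ItemRequestOptions
        {
            ConsistencyLevel = ConsistencyLevel.Session // per-request override of the account default
        };

        ItemResponse<MyDocument> response = await container.ReadItemAsync<MyDocument>(
            id: "item-id",                             // placeholder id
            partitionKey: new PartitionKey("pk-value"), // placeholder partition key value
            requestOptions: options);

        return response.Resource;
    }

    public class MyDocument
    {
        public string id { get; set; }
    }
}
```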

ealsur (Member) commented May 20, 2021

@craigjensen In which platform are you running? NET Framework? NET Core? Do you have traces enabled?

craigjensen (Author) commented

.NET Framework in an Azure Web App, so tracing is not enabled.

j82w (Contributor) commented May 20, 2021

@craigjensen does it reproduce with the latest 3.18.0?

craigjensen (Author) commented

Yes, it started with 3.18.0; we didn't see it until we recently deployed that version.

j82w (Contributor) commented May 20, 2021

Any chance you can use a private NuGet drop with additional telemetry?

craigjensen (Author) commented

This is a production service and we're in lockdown for //Build next week so not likely, at least not anytime soon.

ealsur (Member) commented May 20, 2021

@craigjensen - Is your service running in EUAP?

j82w (Contributor) commented May 20, 2021

Do you have any way to consistently reproduce the issue? Does it not show up in testing environments?

craigjensen (Author) commented

@ealsur Yes, we do run the service in eus2euap but there's very little traffic there and we haven't seen the exception in that region. Most of the exceptions are coming from France Central but we have seen it in several US regions as well.

craigjensen (Author) commented

@j82w We don't have a consistent repro. It happens in bursts; here's what we're seeing over the last day:

[chart: exception counts over the last day]

craigjensen (Author) commented

Oh, and we haven't seen it in any of our test/integration environments

craigjensen (Author) commented

The spikes in exceptions are correlated with bursts of activity on our servers that caused the CPU to max out, so they're likely an artifact of resource starvation on the client. I would expect errors in this situation, but not NullReferenceExceptions.

rjmille2 commented

We are also seeing very similar behavior. It's intermittent and also appears to happen in "bursts". Is there a version of the SDK that we can roll back to that does not have this issue?

j82w (Contributor) commented May 25, 2021

@rjmille2 we are still in the process of root-causing this issue, so we don't know which versions are impacted or even whether the bug is caused by some change in the service.

Can you use a private SDK drop with additional logging?

rjmille2 commented

We have not seen this reproduce in our "testing" environments. It seems to only be happening in production (higher loads), and I don't think we would be able to put a private SDK build into production, but I will ask.

rjmille2 commented

@j82w also wanted to let you know we did not see this issue at all until we went from 3.17.1 to 3.18.0. So I think this was introduced with that build.

j82w (Contributor) commented May 25, 2021

@rjmille2 what consistency level is your account/SDK using? On which operations do you see the exception?

rjmille2 commented

@j82w we are using session consistency on the account and I don't think we are modifying it at all in the sdk. It seems to be happening on an UpsertAsync operation.

rjmille2 commented

@j82w it seems that we have EnableContentResponseOnWrite = false in the ItemRequestOptions. Not sure if that could be causing the issue since it says "Setting the option to false will cause the response to have a null resource. This reduces networking and CPU load by not sending the resource back over the network and serializing it on the client."
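For reference, the option is applied roughly like this (a sketch with a placeholder document type and partition key property; the SDK method is UpsertItemAsync, which our wrapper calls as "UpsertAsync"):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class UpsertWithoutContentExample
{
    // Upserts a document without requesting the resource back in the response,
    // matching the EnableContentResponseOnWrite = false configuration described above.
    public static async Task UpsertAsync(Container container, MyDocument document)
    {
        ItemRequestOptions options = new ItemRequestOptions
        {
            EnableContentResponseOnWrite = false // response.Resource will be null; saves bandwidth and CPU
        };

        ItemResponse<MyDocument> response = await container.UpsertItemAsync(
            item: document,
            partitionKey: new PartitionKey(document.pk), // placeholder partition key property
            requestOptions: options);

        // Status code and request charge are still populated even though the body is empty.
        System.Console.WriteLine($"{response.StatusCode} ({response.RequestCharge} RU)");
    }

    public class MyDocument
    {
        public string id { get; set; }
        public string pk { get; set; }
    }
}
```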

j82w (Contributor) commented May 25, 2021

@rjmille2 can you try upgrading to 3.19.0, which was just released, or enabling tracing? It adds diagnostics to the null reference exception to help root-cause these types of issues.
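Something along these lines (a rough sketch with placeholder names, logging to the console only as an example) would capture the extra diagnostics once you're on 3.19.0, where they are attached to the thrown exception:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class DiagnosticsCaptureExample
{
    // Logs the diagnostics that 3.19.0 attaches to failures, including the wrapped NullReferenceException.
    public static async Task<ItemResponse<T>> ReadWithLoggingAsync<T>(
        Container container, string id, PartitionKey partitionKey)
    {
        try
        {
            return await container.ReadItemAsync<T>(id, partitionKey);
        }
        catch (CosmosException ex)
        {
            // Normal SDK failures carry diagnostics on the exception itself.
            Console.Error.WriteLine(ex.Diagnostics?.ToString());
            throw;
        }
        catch (NullReferenceException ex)
        {
            // In 3.19.0 the NRE is wrapped and its message includes the request diagnostics.
            Console.Error.WriteLine(ex.ToString());
            throw;
        }
    }
}
```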

craigjensen (Author) commented

@j82w We'll go ahead and update to 3.19.0 but it won't be fully released until sometime next week at the earliest

rjmille2 commented

@j82w & @craigjensen we were able to repro this issue in our pre-production environment and have a stack trace that seems to have more information in it. Would you be able to get the information out of the support ticket (2105250010002841)?

rjmille2 commented

@j82w, @craigjensen were you able to retrieve the additional information from the support ticket? There is quite a bit of additional information in the exceptions, and I'm not sure whether any of it is "private", so I don't really want to post it in a public forum. Could you let me know if you're able to retrieve it from the support ticket, or give me another way to send it to you directly?

ealsur (Member) commented May 27, 2021

@rjmille2 Can you confirm what the CPU usage is during the times you are getting this error? Please check MAX CPU, not AVG, on the machine running the code. Do you see an increase in MAX CPU during these times?

rjmille2 commented

Here is the max CPU for the machines that were seeing the most exceptions at the time. Please let me know if there is anything else we can try in order to provide you with additional information.
[chart: max CPU for the machines with the most exceptions]

ealsur (Member) commented May 28, 2021

@rjmille2 Thanks. CPU is on the high end of the spectrum, which might cause TCP timeouts; I wonder if that is the case. Are there any times when CPU goes lower and the exceptions disappear?

ealsur (Member) commented May 28, 2021

There is a custom-built NuGet package, version 3.19.0-debug, available at https://1drv.ms/u/s!AlUdB2KlFJzvidAB7OABiHbiIi4wwg?e=o8RvAT that contains extra information on the source file of the NRE. If anyone can install an unsigned NuGet package and capture the error, it would give us more insight into the source of the issue.

rjmille2 commented

@ealsur I tried to build the project with the 3.19.0-debug NuGet package and the build now fails with a package-related error. This is a Service Fabric project and it uses FabActUtil during the build process. If I use the 3.19.0 stable version I don't get this error. I'm wondering if maybe your debug package doesn't dual-target full .NET Framework and .NET Core?

Here is the error message:
11>C:\Users\xxxxx.nuget\packages\microsoft.servicefabric.actors\4.2.477\build\net45\Microsoft.ServiceFabric.Actors.targets(21,5): error MSB3073: The command ""C:\Users\xxxxxx.nuget\packages\microsoft.servicefabric.actors\4.2.477\build\net45\FabActUtil.exe" /spp:"PackageRoot" /t:manifest /sp:"yyyyyyyyActor" /in:"bin\Debug\net48\win7-x64\yyyyyyyActor.Exe" /arp:"c:[path redacted]bin\Debug\net48\win7-x64\" " exited with code -1.
11>Done building project "yyyyyActor.csproj" -- FAILED.
15>------ Rebuild All started: Project: yyyyyyApp (ServiceFabricApps\yyyyyyApp), Configuration: Debug x64 ------
15>System.Reflection.ReflectionTypeLoadException: Unable to load one or more of the requested types. Retrieve the LoaderExceptions property for more information.
15> at System.Reflection.RuntimeModule.GetTypes(RuntimeModule module)
15> at System.Reflection.Assembly.GetTypes()
15> at FabActUtil.Tool.LoadActors(Assembly inputAssembly, IList1 actorFilters, IList1 actorTypes)
15> at FabActUtil.Tool.LoadActors(ToolContext context)
15> at FabActUtil.Tool.ProcessInput(ToolContext context)
15> at FabActUtil.Tool.Run(ToolArguments arguments)
15> at FabActUtil.Program.Main(String[] args)

rjmille2 commented Jun 1, 2021

@ealsur it seems that even though I've updated to a "local" NuGet package reference, FabActUtil is still loading the signed 3.19.0 version. I'm not sure how we can get FabActUtil to load a different version. I got this data using fuslogvw to trace .NET assembly bindings.

*** Assembly Binder Log Entry (6/1/2021 @ 4:19:21 PM) ***

The operation was successful.
Bind result: hr = 0x0. The operation completed successfully.

Assembly manager loaded from: C:\Windows\Microsoft.NET\Framework64\v4.0.30319\clr.dll
Running under executable C:\Users\randy.miller.nuget\packages\microsoft.servicefabric.actors\4.2.477\build\net45\FabActUtil.exe
--- A detailed error log follows.

=== Pre-bind state information ===
LOG: DisplayName = Microsoft.Azure.Cosmos.Client, Version=3.19.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35
(Fully-specified)
LOG: Appbase = file:///C:/Users/randy.miller/.nuget/packages/microsoft.servicefabric.actors/4.2.477/build/net45/
LOG: Initial PrivatePath = NULL
LOG: Dynamic Base = NULL
LOG: Cache Base = NULL
LOG: AppName = FabActUtil.exe
Calling assembly : Revint.AnalysisActor, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null.

LOG: This bind starts in LoadFrom load context.
WRN: Native image will not be probed in LoadFrom context. Native image will only be probed in default load context, like with Assembly.Load().
LOG: No application configuration file found.
LOG: Using host configuration file:
LOG: Using machine configuration file from C:\Windows\Microsoft.NET\Framework64\v4.0.30319\config\machine.config.
LOG: Post-policy reference: Microsoft.Azure.Cosmos.Client, Version=3.19.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35
LOG: GAC Lookup was unsuccessful.
LOG: Attempting download of new URL file:///C:/Users/randy.miller/.nuget/packages/microsoft.servicefabric.actors/4.2.477/build/net45/Microsoft.Azure.Cosmos.Client.DLL.
LOG: Attempting download of new URL file:///C:/Users/randy.miller/.nuget/packages/microsoft.servicefabric.actors/4.2.477/build/net45/Microsoft.Azure.Cosmos.Client/Microsoft.Azure.Cosmos.Client.DLL.
LOG: Attempting download of new URL file:///C:/Users/randy.miller/.nuget/packages/microsoft.servicefabric.actors/4.2.477/build/net45/Microsoft.Azure.Cosmos.Client.EXE.
LOG: Attempting download of new URL file:///C:/Users/randy.miller/.nuget/packages/microsoft.servicefabric.actors/4.2.477/build/net45/Microsoft.Azure.Cosmos.Client/Microsoft.Azure.Cosmos.Client.EXE.
LOG: Attempting download of new URL file:///c:/dev/Service Fabric File Ingestion/ServiceFabric.FileIngestion/AnalysisActor/bin/Debug/net48/win7-x64/Microsoft.Azure.Cosmos.Client.DLL.
LOG: Assembly download was successful. Attempting setup of file: c:\dev\Service Fabric File Ingestion\ServiceFabric.FileIngestion\AnalysisActor\bin\Debug\net48\win7-x64\Microsoft.Azure.Cosmos.Client.dll
LOG: Entering run-from-source setup phase.
LOG: Assembly Name is: Microsoft.Azure.Cosmos.Client, Version=3.19.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35
LOG: Where-ref bind Codebase does not match what is found in default context. Keep the result in LoadFrom context.
LOG: Binding succeeds. Returns assembly from c:\dev\Service Fabric File Ingestion\ServiceFabric.FileIngestion\AnalysisActor\bin\Debug\net48\win7-x64\Microsoft.Azure.Cosmos.Client.dll.
LOG: Assembly is loaded in LoadFrom load context.

ealsur (Member) commented Jun 1, 2021

I don't know what "FabActUtil" is, but this is an unsigned local package; it is not published on NuGet. If "FabActUtil" relies on NuGet to resolve references, that might be the issue. Or do you have other components that reference the SDK through the NuGet package rather than the custom-built version?

Unsigned, unpublished NuGet packages may only be usable locally, unless you are publishing the DLLs directly to the instances so they can use them.

rjmille2 commented Jun 1, 2021

@ealsur FabActUtil is part of the Service Fabric build process. I think it does some magic with Assembly.Load during the build. So unless we can get the Service Fabric team involved to figure out why it continues to load the signed version, I'm not going to be able to try out your -debug version of the SDK.

craigjensen (Author) commented

We're still seeing these exceptions in 3.19.0. Here's the latest exception message:

"outerType": Microsoft.Azure.Cosmos.CosmosNullReferenceException,
"outerAssembly": ,
"outerMethod": ,
"outerMessage": Object reference not set to an instance of an object.{"name":"ReadItemStreamAsync","id":"32ef07af-36ac-48c6-8075-5974a1002787","caller info":{"member":"OperationHelperWithRootTraceAsync","file":"ClientContextCore.cs","line":219},"start time":"01:29:20:102","duration in milliseconds":42020.9534,"data":{"Client Configuration":{"Client Created Time Utc":"2021-06-09T20:08:08.7524681Z","NumberOfClientsCreated":1,"User Agent":"cosmos-netstandard-sdk/3.19.0|3.19.1|02|X64|Microsoft Windows 10.0.14393 |.NET Framework 4.8.4311.0|S|AzureBotService","ConnectionConfig":{"gw":"(cps:50, urto:10, p:False, httpf: False)","rntbd":"(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: False, pr: ReuseUnicastPort)","other":"(ed:False, be:False)"},"ConsistencyConfig":"(consistency: Session, prgns:[])"}},"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler","id":"d14b34c4-64fa-481d-b11e-346754a2b403","start time":"01:29:20:102","duration in milliseconds":42020.6505,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler","id":"b658af4b-6aa1-41ab-b2a7-7277b79e3f42","start time":"01:29:20:102","duration in milliseconds":42020.519700000004,"data":{"CPU Load History":{"CPU History":"(2021-06-15T13:28:29.0088950Z 99.837), (2021-06-15T13:28:39.8987703Z 99.749), (2021-06-15T13:28:49.1411011Z 99.958), (2021-06-15T13:28:59.4259758Z 99.810), (2021-06-15T13:29:09.5918395Z 100.000), (2021-06-15T13:29:19.1021359Z 98.314)"}},"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RetryHandler","id":"47b76c44-14cc-410d-ba49-ac1bb5f7dd67","start time":"01:29:20:102","duration in milliseconds":42020.5008,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RouterHandler","id":"4c8cdb92-9dbe-47a1-b52c-4554271502d8","start time":"01:29:20:102","duration in milliseconds":42020.446800000005,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.TransportHandler","id":"8ea6a412-f6e7-405d-a704-43b09f6e14fa","start time":"01:29:20:102","duration in milliseconds":42020.4306,"children":[{"name":"Microsoft.Azure.Documents.ServerStoreModel Transport Request","id":"77a09b82-a92a-4635-b413-34989b07e8da","caller info":{"member":"ProcessMessageAsync","file":"TransportHandler.cs","line":109},"start time":"01:29:20:102","duration in milliseconds":42020.3686,"data":{"Client Side Request 
Stats":{"Id":"AggregatedClientSideRequestStatistics","ContactedReplicas":[],"RegionsContacted":["https://bc-docdb-prod-westeurope.documents.azure.com/"],"FailedReplicas":["rntbd://cdb-ms-prod-westeurope1-fd26-az02.documents.azure.com:14005/apps/d263ca2c-53c8-40ca-a394-04da3472bfee/services/ff4defca-776f-43ee-b354-ffb6e8e8d93c/partitions/eabd8bf1-206e-4387-8b5a-973076339232/replicas/132677695703307688s","rntbd://cdb-ms-prod-westeurope1-fd26-az03.documents.azure.com:14331/apps/d263ca2c-53c8-40ca-a394-04da3472bfee/services/ff4defca-776f-43ee-b354-ffb6e8e8d93c/partitions/eabd8bf1-206e-4387-8b5a-973076339232/replicas/132647290109761501s","rntbd://cdb-ms-prod-westeurope1-fd26-az01.documents.azure.com:14037/apps/d263ca2c-53c8-40ca-a394-04da3472bfee/services/ff4defca-776f-43ee-b354-ffb6e8e8d93c/partitions/eabd8bf1-206e-4387-8b5a-973076339232/replicas/132677574790212391s"],"AddressResolutionStatistics":[],"StoreResponseStatistics":[{"ResponseTimeUTC":"2021-06-15T13:29:31.9546243Z","ResourceType":"Document","OperationType":"Read","LocationEndpoint":"https://bc-docdb-prod-westeurope.documents.azure.com/","StoreResult":{"ActivityId":"c15138b1-c846-4f50-9121-87378d334585","StatusCode":"Gone","SubStatusCode":"Unknown","LSN":-1,"PartitionKeyRangeId":null,"GlobalCommittedLSN":-1,"ItemLSN":-1,"UsingLocalLSN":true,"QuorumAckedLSN":-1,"SessionToken":null,"CurrentWriteQuorum":-1,"CurrentReplicaSetSize":-1,"NumberOfReadRegions":-1,"IsClientCpuOverloaded":false,"IsValid":false,"StorePhysicalAddress":"rntbd://cdb-ms-prod-westeurope1-fd26-az02.documents.azure.com:14005/apps/d263ca2c-53c8-40ca-a394-04da3472bfee/services/ff4defca-776f-43ee-b354-ffb6e8e8d93c/partitions/eabd8bf1-206e-4387-8b5a-973076339232/replicas/132677695703307688s","RequestCharge":0,"BELatencyInMs":null,"TransportException":"A client transport error occurred: The request timed out while waiting for a server response. 
(Time: 2021-06-15T13:29:31.9546243Z, activity ID: c15138b1-c846-4f50-9121-87378d334585, error code: ReceiveTimeout [0x0010], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-prod-westeurope1-fd26-az02.documents.azure.com:14005/apps/d263ca2c-53c8-40ca-a394-04da3472bfee/services/ff4defca-776f-43ee-b354-ffb6e8e8d93c/partitions/eabd8bf1-206e-4387-8b5a-973076339232/replicas/132677695703307688s, connection: 10.0.3.228:50349 -> 52.236.189.0:14005, payload sent: True, CPU history: (2021-06-15T13:27:28.9806276Z 30.321), (2021-06-15T13:27:38.9733279Z 29.867), (2021-06-15T13:27:48.9807079Z 33.529), (2021-06-15T13:27:58.9843369Z 26.371), (2021-06-15T13:28:08.9789417Z 29.350), (2021-06-15T13:28:19.0090321Z 64.253), CPU count: 4)"}},{"ResponseTimeUTC":"2021-06-15T13:29:42.0152199Z","ResourceType":"Document","OperationType":"Read","LocationEndpoint":"https://bc-docdb-prod-westeurope.documents.azure.com/","StoreResult":{"ActivityId":"c15138b1-c846-4f50-9121-87378d334585","StatusCode":"Gone","SubStatusCode":"Unknown","LSN":-1,"PartitionKeyRangeId":null,"GlobalCommittedLSN":-1,"ItemLSN":-1,"UsingLocalLSN":true,"QuorumAckedLSN":-1,"SessionToken":null,"CurrentWriteQuorum":-1,"CurrentReplicaSetSize":-1,"NumberOfReadRegions":-1,"IsClientCpuOverloaded":false,"IsValid":false,"StorePhysicalAddress":"rntbd://cdb-ms-prod-westeurope1-fd26-az03.documents.azure.com:14331/apps/d263ca2c-53c8-40ca-a394-04da3472bfee/services/ff4defca-776f-43ee-b354-ffb6e8e8d93c/partitions/eabd8bf1-206e-4387-8b5a-973076339232/replicas/132647290109761501s","RequestCharge":0,"BELatencyInMs":null,"TransportException":"A client transport error occurred: The request timed out while waiting for a server response. (Time: 2021-06-15T13:29:42.0152199Z, activity ID: c15138b1-c846-4f50-9121-87378d334585, error code: ReceiveTimeout [0x0010], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-prod-westeurope1-fd26-az03.documents.azure.com:14331/apps/d263ca2c-53c8-40ca-a394-04da3472bfee/services/ff4defca-776f-43ee-b354-ffb6e8e8d93c/partitions/eabd8bf1-206e-4387-8b5a-973076339232/replicas/132647290109761501s, connection: 10.0.3.228:53418 -> 13.69.112.6:14331, payload sent: True, CPU history: (2021-06-15T13:27:28.9806276Z 30.321), (2021-06-15T13:27:38.9733279Z 29.867), (2021-06-15T13:27:48.9807079Z 33.529), (2021-06-15T13:27:58.9843369Z 26.371), (2021-06-15T13:28:08.9789417Z 29.350), (2021-06-15T13:28:19.0090321Z 64.253), CPU count: 4)"}},{"ResponseTimeUTC":"2021-06-15T13:29:52.0580523Z","ResourceType":"Document","OperationType":"Read","LocationEndpoint":"https://bc-docdb-prod-westeurope.documents.azure.com/","StoreResult":{"ActivityId":"c15138b1-c846-4f50-9121-87378d334585","StatusCode":"Gone","SubStatusCode":"Unknown","LSN":-1,"PartitionKeyRangeId":null,"GlobalCommittedLSN":-1,"ItemLSN":-1,"UsingLocalLSN":true,"QuorumAckedLSN":-1,"SessionToken":null,"CurrentWriteQuorum":-1,"CurrentReplicaSetSize":-1,"NumberOfReadRegions":-1,"IsClientCpuOverloaded":false,"IsValid":false,"StorePhysicalAddress":"rntbd://cdb-ms-prod-westeurope1-fd26-az01.documents.azure.com:14037/apps/d263ca2c-53c8-40ca-a394-04da3472bfee/services/ff4defca-776f-43ee-b354-ffb6e8e8d93c/partitions/eabd8bf1-206e-4387-8b5a-973076339232/replicas/132677574790212391s","RequestCharge":0,"BELatencyInMs":null,"TransportException":"A client transport error occurred: The request timed out while waiting for a server response. 
(Time: 2021-06-15T13:29:52.0580523Z, activity ID: c15138b1-c846-4f50-9121-87378d334585, error code: ReceiveTimeout [0x0010], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-prod-westeurope1-fd26-az01.documents.azure.com:14037/apps/d263ca2c-53c8-40ca-a394-04da3472bfee/services/ff4defca-776f-43ee-b354-ffb6e8e8d93c/partitions/eabd8bf1-206e-4387-8b5a-973076339232/replicas/132677574790212391s, connection: 10.0.3.228:50326 -> 13.69.66.134:14037, payload sent: True, CPU history: (2021-06-15T13:27:28.9806276Z 30.321), (2021-06-15T13:27:38.9733279Z 29.867), (2021-06-15T13:27:48.9807079Z 33.529), (2021-06-15T13:27:58.9843369Z 26.371), (2021-06-15T13:28:08.9789417Z 29.350), (2021-06-15T13:28:19.0090321Z 64.253), CPU count: 4)"}},{"ResponseTimeUTC":"2021-06-15T13:30:02.1237549Z","ResourceType":"Document","OperationType":"Read","LocationEndpoint":"https://bc-docdb-prod-westeurope.documents.azure.com/","StoreResult":{"ActivityId":"c15138b1-c846-4f50-9121-87378d334585","StatusCode":"Gone","SubStatusCode":"Unknown","LSN":-1,"PartitionKeyRangeId":null,"GlobalCommittedLSN":-1,"ItemLSN":-1,"UsingLocalLSN":true,"QuorumAckedLSN":-1,"SessionToken":null,"CurrentWriteQuorum":-1,"CurrentReplicaSetSize":-1,"NumberOfReadRegions":-1,"IsClientCpuOverloaded":false,"IsValid":false,"StorePhysicalAddress":"rntbd://cdb-ms-prod-westeurope1-fd26-az02.documents.azure.com:14005/apps/d263ca2c-53c8-40ca-a394-04da3472bfee/services/ff4defca-776f-43ee-b354-ffb6e8e8d93c/partitions/eabd8bf1-206e-4387-8b5a-973076339232/replicas/132677695703307688s","RequestCharge":0,"BELatencyInMs":null,"TransportException":"A client transport error occurred: The request timed out while waiting for a server response. (Time: 2021-06-15T13:30:02.1237549Z, activity ID: c15138b1-c846-4f50-9121-87378d334585, error code: ReceiveTimeout [0x0010], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-prod-westeurope1-fd26-az02.documents.azure.com:14005/apps/d263ca2c-53c8-40ca-a394-04da3472bfee/services/ff4defca-776f-43ee-b354-ffb6e8e8d93c/partitions/eabd8bf1-206e-4387-8b5a-973076339232/replicas/132677695703307688s, connection: 10.0.3.228:50349 -> 52.236.189.0:14005, payload sent: True, CPU history: (2021-06-15T13:27:28.9806276Z 30.321), (2021-06-15T13:27:38.9733279Z 29.867), (2021-06-15T13:27:48.9807079Z 33.529), (2021-06-15T13:27:58.9843369Z 26.371), (2021-06-15T13:28:08.9789417Z 29.350), (2021-06-15T13:28:19.0090321Z 64.253), CPU count: 4)"}}]}}}]}]}]}]}]}]},

j82w (Contributor) commented Jun 16, 2021

@craigjensen can you also include the stack trace?

craigjensen (Author) commented

"innermostType": System.NullReferenceException,
"innermostMessage": Object reference not set to an instance of an object.,
"details": at Microsoft.Azure.Documents.GoneAndRetryWithRequestRetryPolicy1.TryHandleResponseSynchronously(DocumentServiceRequest request, TResponse response, Exception exception, ShouldRetryResult& shouldRetryResult) at Microsoft.Azure.Documents.RequestRetryUtility.<ProcessRequestAsync>d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Documents.StoreClient.d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.TransportHandler.d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.TransportHandler.d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.RouterHandler.d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.RequestHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.RequestHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.RequestHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.d__8.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.ContainerCore.d__87.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.ContainerCore.d__55.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.Cosmos.ClientContextCore.d__38`1.MoveNext(),

ealsur (Member) commented Jun 16, 2021

This looks like timeouts induced by CPU exhaustion.

The data in those diagnostics shows that CPU goes from ~30% to 100% during the operation:

(2021-06-15T13:27:28.9806276Z 30.321), (2021-06-15T13:27:38.9733279Z 29.867), (2021-06-15T13:27:48.9807079Z 33.529), (2021-06-15T13:27:58.9843369Z 26.371), (2021-06-15T13:28:08.9789417Z 29.350), (2021-06-15T13:28:19.0090321Z 64.253),
(2021-06-15T13:28:29.0088950Z 99.837), (2021-06-15T13:28:39.8987703Z 99.749), (2021-06-15T13:28:49.1411011Z 99.958), (2021-06-15T13:28:59.4259758Z 99.810), (2021-06-15T13:29:09.5918395Z 100.000), (2021-06-15T13:29:19.1021359Z 98.314)

craigjensen (Author) commented

Yes, these are related to high CPU on the client. As I mentioned above, I would expect errors in this situation, but not NullReferenceExceptions.

Blackbaud-TomRamsey commented

Here is another data point for y'all to consider.

System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Azure.Documents.GoneAndRetryWithRequestRetryPolicy1.TryHandleResponseSynchronously(DocumentServiceRequest request, TResponse response, Exception exception, ShouldRetryResult& shouldRetryResult) at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func1 executeAsync, Func1 prepareRequest, IRequestRetryPolicy2 policy, CancellationToken cancellationToken, Func1 inBackoffAlternateCallbackMethod, Nullable1 minBackoffForInBackoffCallback)
at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy, Func2 prepareRequestAsyncDelegate) at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.Handlers.RouterHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.ExecuteHttpRequestAsync(Func1 callbackMethod, Func3 callShouldRetry, Func3 callShouldRetryException, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)
at Blackbaud.Cosmos.TracingRequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)
at Polly.Retry.AsyncRetryEngine.ImplementationAsync[TResult](Func3 action, Context context, CancellationToken cancellationToken, ExceptionPredicates shouldRetryExceptionPredicates, ResultPredicates1 shouldRetryResultPredicates, Func5 onRetryAsync, Int32 permittedRetryCount, IEnumerable1 sleepDurationsEnumerable, Func4 sleepDurationProvider, Boolean continueOnCapturedContext) at Polly.AsyncPolicy1.ExecuteAsync(Func3 action, Context context, CancellationToken cancellationToken, Boolean continueOnCapturedContext) at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(String resourceUriString, ResourceType resourceType, OperationType operationType, RequestOptions requestOptions, ContainerInternal cosmosContainerCore, FeedRange feedRange, Stream streamPayload, Action1 requestEnricher, ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.CosmosQueryClientCore.ExecuteItemQueryAsync(String resourceUri, ResourceType resourceType, OperationType operationType, Guid clientQueryCorrelationId, FeedRange feedRange, QueryRequestOptions requestOptions, SqlQuerySpec sqlQuerySpec, String continuationToken, Boolean isContinuationExpected, Int32 pageSize, ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.Pagination.NetworkAttachedDocumentContainer.MonadicQueryAsync(SqlQuerySpec sqlQuerySpec, FeedRangeState1 feedRangeState, QueryPaginationOptions queryPaginationOptions, ITrace trace, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.Pagination.PartitionRangePageAsyncEnumerator2.MoveNextAsync(ITrace trace)
at Microsoft.Azure.Cosmos.Pagination.BufferedPartitionRangePageAsyncEnumerator2.PrefetchAsync(ITrace trace, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.Pagination.BufferedPartitionRangePageAsyncEnumerator2.GetNextPageAsync(ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.Pagination.PartitionRangePageAsyncEnumerator2.MoveNextAsync(ITrace trace) at Microsoft.Azure.Cosmos.Pagination.CrossPartitionRangePageAsyncEnumerator2.MoveNextAsync(ITrace trace)
at Microsoft.Azure.Cosmos.Query.Core.Pipeline.CrossPartition.Parallel.ParallelCrossPartitionQueryPipelineStage.MoveNextAsync(ITrace trace)
at Microsoft.Azure.Cosmos.Query.Core.Pipeline.LazyQueryPipelineStage.MoveNextAsync(ITrace trace)
at Microsoft.Azure.Cosmos.Query.Core.Pipeline.NameCacheStaleRetryQueryPipelineStage.MoveNextAsync(ITrace trace)
at Microsoft.Azure.Cosmos.Query.Core.Pipeline.CatchAllQueryPipelineStage.MoveNextAsync(ITrace trace)
at Microsoft.Azure.Cosmos.Query.QueryIterator.ReadNextAsync(ITrace trace, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.FeedIteratorCore1.ReadNextAsync(ITrace trace, CancellationToken cancellationToken) at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func2 task)
CosmosDiagnostics: {"name":"Typed FeedIterator ReadNextAsync","id":"66edaf33-5c8c-4889-97d8-217c3f2001ae","caller info":{"member":"OperationHelperWithRootTraceAsync","file":"ClientContextCore.cs","line":219},"start time":"07:03:24:341","duration in milliseconds":67228.8279,"data":{"Client Configuration":{"Client Created Time Utc":"2021-06-15T19:02:12.0762967Z","NumberOfClientsCreated":1,"User Agent":"cosmos-netstandard-sdk/3.19.0|3.19.1|02|X64|Linux 5.4.0-1048-azure 50 18.|.NET Core 3.1.13|S|","ConnectionConfig":{"gw":"(cps:50, urto:10, p:False, httpf: False)","rntbd":"(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: False, pr: ReuseUnicastPort)","other":"(ed:False, be:False)"},"ConsistencyConfig":"(consistency: NotSet, prgns:[])"}},"children":[{"name":"Create Query Pipeline","id":"9c2e58d5-f5ab-4930-a20a-b752c6c524ae","caller info":{"member":"TryCreateCoreContextAsync","file":"CosmosQueryExecutionContextFactory.cs","line":85},"start time":"07:03:24:341","duration in milliseconds":18738.5577,"children":[{"name":"Get Container Properties","id":"92eedb99-6389-47a2-b15d-0579aef715cd","caller info":{"member":"GetCachedContainerPropertiesAsync","file":"ClientContextCore.cs","line":366},"start time":"07:03:24:341","duration in milliseconds":0.0069,"children":[{"name":"Get Collection Cache","id":"4012ae29-1cc5-454c-828d-00b3c7dc4763","caller info":{"member":"GetCollectionCacheAsync","file":"DocumentClient.cs","line":546},"start time":"07:03:24:341","duration in milliseconds":0.0007}]},{"name":"Gateway QueryPlan","id":"603dbd56-3428-42fd-8b92-766797e7e4fe","caller info":{"member":"GetQueryPlanThroughGatewayAsync","file":"QueryPlanRetriever.cs","line":112},"start time":"07:03:24:341","duration in milliseconds":0.2206},{"name":"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler","id":"8b7d945f-04b9-40f9-b1f4-0cf4bf30d2e6","start time":"07:03:24:341","duration in milliseconds":18738.3906,"children":[{"name":"Blackbaud.Cosmos.RetryRequestHandler","id":"5a8066ef-511d-4982-8208-456af0194fb2","start time":"07:03:24:341","duration in milliseconds":18738.3727,"children":[{"name":"Blackbaud.Cosmos.RetryCounterMetricRequestHandler","id":"5bf1a82d-5f56-4e6d-9304-e129f20ee87e","start time":"07:03:24:341","duration in milliseconds":18738.3626,"children":[{"name":"Blackbaud.Cosmos.TracingRequestHandler","id":"fe8f3499-5a66-4cdd-acf2-057919e72d16","start time":"07:03:24:341","duration in milliseconds":18738.3604,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler","id":"23986415-709b-4af4-bfed-f74718406017","start time":"07:03:24:341","duration in milliseconds":18738.3289,"data":{"CPU Load History":{"CPU History":"(2021-06-15T19:02:45.9521434Z 8.642), (2021-06-15T19:02:45.9566532Z 13.333), (2021-06-15T19:02:55.9541469Z 8.590), (2021-06-15T19:03:05.9518409Z 5.033), (2021-06-15T19:03:05.9561968Z 7.143), (2021-06-15T19:03:15.9565371Z 8.153)"}},"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RetryHandler","id":"6e2457a1-d18d-4fad-9960-bc0a91c97a57","start time":"07:03:24:341","duration in milliseconds":18738.3263,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RouterHandler","id":"aa7137dd-a244-427a-afbf-dcbe4c4980d2","start time":"07:03:24:341","duration in milliseconds":18738.3187,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.TransportHandler","id":"7bd8f4d5-6cef-4370-99b8-48f413801ea9","start time":"07:03:24:341","duration in milliseconds":18738.3159,"children":[{"name":"Microsoft.Azure.Cosmos.GatewayStoreModel Transport 
Request","id":"71954242-69cf-4327-a08d-bac4a5ba3ba6","caller info":{"member":"ProcessMessageAsync","file":"TransportHandler.cs","line":109},"start time":"07:03:24:341","duration in milliseconds":18738.2934,"data":{"Client Side Request Stats":{"Id":"AggregatedClientSideRequestStatistics","ContactedReplicas":[],"RegionsContacted":[],"FailedReplicas":[],"AddressResolutionStatistics":[],"StoreResponseStatistics":[],"HttpResponseStats":[{"StartTimeUTC":"2021-06-15T19:03:43.0796150Z","EndTimeUTC":"2021-06-15T19:03:43.0796154Z","RequestUri":"https://s21pusa01aueav2blkbdoc01-eastus2.documents.azure.com/dbs/UserEntitlementsAffinity/colls/UserEntitlementsAffinity/docs","ResourceType":"Document","HttpMethod":"POST","ActivityId":"c55c7220-b42c-4167-a10f-c9c86ef7d199","StatusCode":"OK"}]},"PointOperationStatisticsTraceDatum":{"Id":"PointOperationStatistics","ActivityId":null,"ResponseTimeUtc":"2021-06-15T19:03:43.0796623Z","StatusCode":200,"SubStatusCode":0,"RequestCharge":0,"RequestUri":"dbs/UserEntitlementsAffinity/colls/UserEntitlementsAffinity","ErrorMessage":null,"RequestSessionToken":null,"ResponseSessionToken":null,"BELatencyInMs":null}}}]}]}]}]}]}]}]}]},{"name":"Get Partition Key Ranges","id":"04811f42-024b-476d-ab7c-421327d869c6","caller info":{"member":"GetTargetPartitionKeyRangesAsync","file":"CosmosQueryClientCore.cs","line":242},"start time":"07:03:43:079","duration in milliseconds":0.0362,"children":[{"name":"Try Get Overlapping Ranges","id":"073c0804-bca5-4112-acf8-494aeefa13bd","caller info":{"member":"TryGetOverlappingRangesAsync","file":"PartitionKeyRangeCache.cs","line":53},"start time":"07:03:43:079","duration in milliseconds":0.0192}]}]},{"name":"MoveNextAsync","id":"bc22fc70-e3e8-433d-9e9c-4dc14af5ded4","caller info":{"member":"MoveNextAsync","file":"CrossPartitionRangePageAsyncEnumerator.cs","line":113},"start time":"07:03:43:079","duration in milliseconds":48489.6489,"children":[{"name":"Prefetching","id":"b00f6160-b279-4aa7-9f7e-101b9159f7ff","caller info":{"member":"PrefetchInParallelAsync","file":"ParallelPrefetch.cs","line":31},"start time":"07:03:43:079","duration in milliseconds":0.0009},{"name":"[,FF) move next","id":"b04edff1-cd8a-4fda-b119-49076ea9a1ff","caller info":{"member":"MoveNextAsync","file":"PartitionRangePageAsyncEnumerator.cs","line":49},"start time":"07:03:43:079","duration in milliseconds":48419.7811,"children":[{"name":"Prefetch","id":"bb12f233-3a64-4f40-b2f8-ff069754f889","caller info":{"member":"PrefetchAsync","file":"BufferedPartitionRangePageAsyncEnumerator.cs","line":50},"start time":"07:03:43:079","duration in milliseconds":48419.6799,"children":[{"name":"[,FF) move next","id":"7c3fa4f6-599f-4bf1-a504-df93ac5b4b2b","caller info":{"member":"MoveNextAsync","file":"PartitionRangePageAsyncEnumerator.cs","line":49},"start time":"07:03:43:079","duration in milliseconds":48419.6499,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler","id":"1896dc9a-2fb6-432e-8f98-eb5ecab82f10","start time":"07:03:43:079","duration in milliseconds":48418.683,"children":[{"name":"Get Collection Cache","id":"a24547f5-af75-4dd5-8358-940403788fb5","caller info":{"member":"GetCollectionCacheAsync","file":"DocumentClient.cs","line":546},"start time":"07:03:43:079","duration in milliseconds":0.0005},{"name":"Get Partition Key Range Cache","id":"b91f0ee0-116b-40b9-9503-96024f52d5fb","caller info":{"member":"GetPartitionKeyRangeCacheAsync","file":"DocumentClient.cs","line":555},"start time":"07:03:43:079","duration in milliseconds":0.0004},{"name":"Try Get 
Overlapping Ranges","id":"9a9d107e-2c03-4a11-b9b9-112afb1a1e9b","caller info":{"member":"TryGetOverlappingRangesAsync","file":"PartitionKeyRangeCache.cs","line":53},"start time":"07:03:43:079","duration in milliseconds":0.0028},{"name":"Blackbaud.Cosmos.RetryRequestHandler","id":"d83b6a6f-329d-4c87-854c-4eaa7953d5cc","start time":"07:03:43:079","duration in milliseconds":48418.5897,"children":[{"name":"Blackbaud.Cosmos.RetryCounterMetricRequestHandler","id":"0777a118-e99c-4ecb-b2e3-00be897180cc","start time":"07:03:43:079","duration in milliseconds":48418.3415,"children":[{"name":"Blackbaud.Cosmos.TracingRequestHandler","id":"3bbf22ac-d367-4d79-80da-b6048f669727","start time":"07:03:43:079","duration in milliseconds":48417.5916,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler","id":"78e8a9cf-23f0-4f0b-876e-aff22adf04d9","start time":"07:03:43:079","duration in milliseconds":48417.4143,"data":{"CPU Load History":{"CPU History":"(2021-06-15T19:02:45.9521434Z 8.642), (2021-06-15T19:02:45.9566532Z 13.333), (2021-06-15T19:02:55.9541469Z 8.590), (2021-06-15T19:03:05.9518409Z 5.033), (2021-06-15T19:03:05.9561968Z 7.143), (2021-06-15T19:03:15.9565371Z 8.153)"}},"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RetryHandler","id":"b7d3e7a5-6807-4110-a088-e74e3f4daa7c","start time":"07:03:43:079","duration in milliseconds":48417.3459,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RouterHandler","id":"226225d7-4899-4332-9651-8da1967afe16","start time":"07:03:43:079","duration in milliseconds":48415.5719,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.TransportHandler","id":"b08676d7-1d99-413c-9ea3-636e59884e13","start time":"07:03:43:079","duration in milliseconds":48415.5526,"children":[{"name":"Microsoft.Azure.Documents.ServerStoreModel Transport Request","id":"f2d65179-4ad6-4745-aeb1-ffa0f28ad988","caller info":{"member":"ProcessMessageAsync","file":"TransportHandler.cs","line":109},"start time":"07:03:43:080","duration in milliseconds":48415.4652,"data":{"Client Side Request Stats":{"Id":"AggregatedClientSideRequestStatistics","ContactedReplicas":[],"RegionsContacted":["https://s21pusa01aueav2blkbdoc01-eastus2.documents.azure.com/"],"FailedReplicas":["rntbd://cdb-ms-prod-eastus2-fd20.documents.azure.com:14310/apps/01d73e15-17c9-41a2-8281-5aba5e859c92/services/e1930bec-7e02-408c-85da-1aadb1b93fa8/partitions/8b1372d2-2901-49f1-bdb6-11419ee9a609/replicas/132643672265585693s/"],"AddressResolutionStatistics":[],"StoreResponseStatistics":[{"ResponseTimeUTC":"2021-06-15T19:04:20.7733357Z","ResourceType":"Document","OperationType":"Query","LocationEndpoint":"https://s21pusa01aueav2blkbdoc01-eastus2.documents.azure.com/","StoreResult":{"ActivityId":"c55c7220-b42c-4167-a10f-c9c86ef7d199","StatusCode":"Gone","SubStatusCode":"Unknown","LSN":-1,"PartitionKeyRangeId":null,"GlobalCommittedLSN":-1,"ItemLSN":-1,"UsingLocalLSN":true,"QuorumAckedLSN":-1,"SessionToken":null,"CurrentWriteQuorum":-1,"CurrentReplicaSetSize":-1,"NumberOfReadRegions":-1,"IsClientCpuOverloaded":true,"IsValid":false,"StorePhysicalAddress":"rntbd://cdb-ms-prod-eastus2-fd20.documents.azure.com:14310/apps/01d73e15-17c9-41a2-8281-5aba5e859c92/services/e1930bec-7e02-408c-85da-1aadb1b93fa8/partitions/8b1372d2-2901-49f1-bdb6-11419ee9a609/replicas/132643672265585693s/","RequestCharge":0,"BELatencyInMs":null,"TransportException":"A client transport error occurred: The request timed out while waiting for a server response. 
(Time: 2021-06-15T19:03:54.0839232Z, activity ID: c55c7220-b42c-4167-a10f-c9c86ef7d199, error code: ReceiveTimeout [0x0010], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-prod-eastus2-fd20.documents.azure.com:14310/apps/01d73e15-17c9-41a2-8281-5aba5e859c92/services/e1930bec-7e02-408c-85da-1aadb1b93fa8/partitions/8b1372d2-2901-49f1-bdb6-11419ee9a609/replicas/132643672265585693s/, connection: 10.100.12.192:46290 -> 104.208.231.10:14310, payload sent: True, CPU history: (2021-06-15T19:02:54.3687915Z 7.257), (2021-06-15T19:03:04.3695896Z 6.080), (2021-06-15T19:03:14.3684258Z 5.419), (2021-06-15T19:03:24.3845539Z 9.971), (2021-06-15T19:03:43.0760444Z 65.951), (2021-06-15T19:03:44.3702484Z 91.244), CPU count: 2)"}}]}}}]}]}]}]}]}]}]}]}]}]}]}]}]}

j82w (Contributor) commented Jun 16, 2021

We figured out the issue. It was introduced in the following PR.

#2312

The original RegionsContacted HashSet is never set, so it is always null. Certain service-unavailable code paths check whether multiple regions were contacted; since the HashSet is never set, dereferencing it throws the NullReferenceException (a minimal illustration is sketched at the end of this comment).

Adding StackTrace with line numbers:
System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Azure.Documents.GoneAndRetryWithRequestRetryPolicy1.TryHandleResponseSynchronously(DocumentServiceRequest request, TResponse response, Exception exception, ShouldRetryResult& shouldRetryResult) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\direct\GoneAndRetryWithRequestRetryPolicy.cs:line 179 at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func1 executeAsync, Func1 prepareRequest, IRequestRetryPolicy2 policy, CancellationToken cancellationToken, Func1 inBackoffAlternateCallbackMethod, Nullable1 minBackoffForInBackoffCallback) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\direct\RequestRetryUtility.cs:line 88
at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy, Func2 prepareRequestAsyncDelegate) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\direct\StoreClient.cs:line 116 at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\TransportHandler.cs:line 117 at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\TransportHandler.cs:line 33 at Microsoft.Azure.Cosmos.Handlers.RouterHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RouterHandler.cs:line 42 at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RequestHandler.cs:line 59 at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.ExecuteHttpRequestAsync(Func1 callbackMethod, Func3 callShouldRetry, Func3 callShouldRetryException, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\AbstractRetryHandler.cs:line 75
at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\AbstractRetryHandler.cs:line 28
at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RequestHandler.cs:line 59
at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RequestHandler.cs:line 59
at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RequestInvokerHandler.cs:line 78
at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(String resourceUriString, ResourceType resourceType, OperationType operationType, RequestOptions requestOptions, ContainerInternal cosmosContainerCore, FeedRange feedRange, Stream streamPayload, Action1 requestEnricher, ITrace trace, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Handler\RequestInvokerHandler.cs:line 285 at Microsoft.Azure.Cosmos.ContainerCore.ProcessItemStreamAsync(Nullable1 partitionKey, String itemId, Stream streamPayload, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Resource\Container\ContainerCore.Items.cs:line 1011
at Microsoft.Azure.Cosmos.ContainerCore.ReadItemAsync[T](String id, PartitionKey partitionKey, ITrace trace, ItemRequestOptions requestOptions, CancellationToken cancellationToken) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Resource\Container\ContainerCore.Items.cs:line 118
at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func2 task) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Resource\ClientContextCore.cs:line 430 at Microsoft.Azure.Cosmos.ClientContextCore.OperationHelperWithRootTraceAsync[TResult](String operationName, RequestOptions requestOptions, Func2 task, TraceComponent traceComponent, TraceLevel traceLevel) in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\src\Resource\ClientContextCore.cs:line 223
at Microsoft.Azure.Cosmos.SDK.EmulatorTests.CosmosItemTests.CreateDropItemTest() in C:\azure-cosmos-dotnet-v3\Microsoft.Azure.Cosmos\tests\Microsoft.Azure.Cosmos.EmulatorTests\CosmosItemTests.cs:line 178
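To make the failure mode concrete, here is a minimal illustrative sketch of the pattern described above; it is not the actual SDK code, just the shape of the bug: a collection field that is never assigned and is only dereferenced on the service-unavailable path, so the NRE only appears under sustained timeouts.

```csharp
using System;
using System.Collections.Generic;

// Illustration of the bug pattern only, not the SDK's implementation.
public class RetryPolicySketch
{
    private HashSet<Uri> regionsContacted; // declared but never initialized -> stays null

    public bool ShouldFailOverToServiceUnavailable()
    {
        // The happy path and ordinary retries never touch regionsContacted, so the bug hides.
        // Only when retries are exhausted (e.g. repeated 410 Gone caused by client-side
        // timeouts under high CPU) does this check run, and .Count on a null field throws.
        return this.regionsContacted.Count > 1;
    }

    // A fix would ensure the set is always created before use, e.g.:
    // private readonly HashSet<Uri> regionsContacted = new HashSet<Uri>();
}
```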

j82w (Contributor) commented Jun 16, 2021

We are working on a fix. It will be included in the next SDK release which should be done in the next week or so.

The null reference exception should only happen in scenarios where the request was going to fail with a service unavailable exception anyway, most likely caused by high CPU or port exhaustion. The null reference will not happen in other scenarios.
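Until the fixed release is available, one interim option is to treat the NullReferenceException the same way a 503/ServiceUnavailable response would be treated: back off and retry a limited number of times. The sketch below is illustrative only (placeholder method, retry count, and delay), under the assumption stated above that the NRE only surfaces where a service unavailable would have.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class InterimRetryExample
{
    // Retries reads that fail with either a 503 or the NullReferenceException from 3.18/3.19.
    public static async Task<ItemResponse<T>> ReadWithWorkaroundAsync<T>(
        Container container, string id, PartitionKey partitionKey, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await container.ReadItemAsync<T>(id, partitionKey);
            }
            catch (Exception ex) when (attempt < maxAttempts &&
                (ex is NullReferenceException ||
                 (ex is CosmosException ce && ce.StatusCode == HttpStatusCode.ServiceUnavailable)))
            {
                await Task.Delay(TimeSpan.FromSeconds(attempt)); // simple linear backoff
            }
        }
    }
}
```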

j82w (Contributor) commented Jun 23, 2021

3.20.0 was released with the fix.
