[Diagnostics]Add ContinuationToken for pkRanges request in Diagnostics #3167

Closed
xinlian12 opened this issue May 2, 2022 · 0 comments · Fixed by #3180
Labels
Diagnostics Issues around diagnostics and troubleshooting feature-request New feature or request

xinlian12 commented May 2, 2022

Add the continuation token (CT) of the pkRanges request to the diagnostics to help investigate exceptions and scenarios such as the following.

Request failed at:

{""Id"":""PointOperationStatistics"",""ActivityId"":""e035d1b0-7646-4a09-97cc-002534d7b4c4"",""ResponseTimeUtc"":""2022-04-28T17:18:49.2581143Z"",""StatusCode"":404,""SubStatusCode"":0,""RequestCharge"":0,""RequestUri"":""dbs/usersettings/colls/usersettings"",""ErrorMessage"":""Microsoft.Azure.Documents.NotFoundException: Entity with the specified id does not exist in the system. More info: https://aka.ms/cosmosdb-tsg-not-found\r\nActivityId: e035d1b0-7646-4a09-97cc-002534d7b4c4, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Windows/10.0.17763 cosmos-netstandard-sdk/3.19.3\r\n   at Microsoft.Azure.Cosmos.AddressResolver.<ResolveAddressesAndIdentityAsync>d__12.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Cosmos.AddressResolver.<ResolveAsync>d__9.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Cosmos.Routing.GlobalAddressResolver.<ResolveAsync>d__14.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.AddressSelector.<ResolveAddressesAsync>d__5.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task 
task)\r\n   at Microsoft.Azure.Documents.AddressSelector.<ResolveAllTransportAddressUriAsync>d__3.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.StoreReader.<ReadMultipleReplicasInternalAsync>d__12.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.StoreReader.<ReadMultipleReplicaAsync>d__10.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__13.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetryAsync>d__5.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)\r\n   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetryAsync>d__5.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at 

Request flow is:

This walkthrough uses pkRangeId 125 as an example. When a customer uses direct mode, the SDK needs two critical pieces of information from the gateway to find which server to send a request to: the pkRanges, and the addresses for a given partition.

1. At some point, the SDK gets the pkRanges list from the gateway, which includes 125.
2. A split happens for pkRange 125.
3. A request comes in. The SDK uses the existing pkRanges from the cache to resolve which partition the request should be routed to, which results in 125.
4. The SDK tries to get the address list for pkRange 125 from the gateway. The gateway encounters a ServiceFabricNotFoundException because the service has been deleted as part of the split process, so it returns an empty list.
5. Since the list is empty, the SDK tries to refresh its internal state, including getting the latest pkRanges changes from the gateway. However, the SDK gets a NotModified result back from the gateway.
6. Steps 3 and 4 are repeated, and then a NotFoundException is returned.
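
The flow above can be approximated with a small simulation. This is only a sketch in Python, not SDK code: `FakeGateway`, `FakeClient`, `run_scenario`, the token values `ct-1`/`ct-2`, and the child range IDs 126 and 127 are all hypothetical. The stub gateway returns NotModified whenever the client's continuation token matches its own, which simply reproduces the observed behavior; the real reason for the NotModified response is not known from this issue.

```python
from dataclasses import dataclass, field

OK = 200
NOT_MODIFIED = 304


class NotFoundError(Exception):
    """Stand-in for the NotFoundException surfaced to the caller."""


@dataclass
class FakeGateway:
    """Hypothetical gateway stub; not the real service or any SDK type."""
    pk_ranges: list   # ranges returned when the read is not short-circuited
    etag: str         # continuation token the gateway treats as current
    addresses: dict   # pkRangeId -> replica addresses

    def read_pk_ranges(self, continuation_token):
        # Incremental read: no ranges come back when the tokens match.
        if continuation_token == self.etag:
            return NOT_MODIFIED, [], self.etag
        return OK, list(self.pk_ranges), self.etag

    def read_addresses(self, pk_range_id):
        # After a split the parent range has no service, so the list is empty.
        return self.addresses.get(pk_range_id, [])


@dataclass
class FakeClient:
    gateway: FakeGateway
    cached_ranges: list
    continuation_token: str
    trace: list = field(default_factory=list)

    def resolve(self, max_attempts=2):
        for _ in range(max_attempts):
            pk_range_id = self.cached_ranges[0]                   # step 3
            addresses = self.gateway.read_addresses(pk_range_id)  # step 4
            self.trace.append(("addresses", pk_range_id, len(addresses)))
            if addresses:
                return addresses
            # Step 5: refresh pkRanges using the cached continuation token.
            sent_token = self.continuation_token
            status, ranges, token = self.gateway.read_pk_ranges(sent_token)
            self.trace.append(("pkRanges", status, sent_token))
            if status == OK:
                self.cached_ranges, self.continuation_token = ranges, token
        raise NotFoundError(f"no addresses for pkRange {pk_range_id}")  # step 6


def run_scenario(gateway_etag):
    """Run one request against a gateway where range 125 has already split."""
    gateway = FakeGateway(
        pk_ranges=["126", "127"],
        etag=gateway_etag,
        addresses={"126": ["replica-a"], "127": ["replica-b"]},
    )
    client = FakeClient(gateway, cached_ranges=["125"], continuation_token="ct-1")
    try:
        return client.resolve(), client.trace
    except NotFoundError:
        return None, client.trace
```

Calling `run_scenario("ct-1")` models the failing case: every pkRanges refresh comes back NotModified and the request ends with the not-found error. Calling `run_scenario("ct-2")` models a refresh that succeeds and reroutes to a child range. In both cases the trace only distinguishes the two outcomes because it records the continuation token that was sent.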

Because some information is missing from the current diagnostics, we cannot reason about what updates were made to the client-side pkRanges cache, why we are getting NotModified from the gateway, or why the SDK tried to get addresses for pkRange 125 again (instead of the new child ranges).

Based on the investigation above, two pieces of information would be helpful for future investigations:

  1. ContinuationToken for pkRanges
  2. For change feed pkRanges requests, log the related changes (tracked in [Diagnostics]Log pkRanges change #3178)
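
As a rough illustration of what such a diagnostics entry could carry, here is a minimal Python sketch. The class name `PkRangesRequestDatum` and every field name are assumptions made for this example; they are not the SDK's actual trace schema, and the collection identifier is a placeholder.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PkRangesRequestDatum:
    """Hypothetical diagnostics entry for one pkRanges request."""
    collection: str                             # placeholder collection identifier
    request_continuation_token: Optional[str]   # token sent to the gateway
    status_code: int                            # e.g. 200 or 304 (NotModified)
    response_continuation_token: Optional[str]  # token returned by the gateway
    returned_range_ids: List[str] = field(default_factory=list)  # item 2


def to_diagnostics_json(datum: PkRangesRequestDatum) -> str:
    """Serialize the entry so it can be embedded in a diagnostics string."""
    return json.dumps(asdict(datum), sort_keys=True)


example = PkRangesRequestDatum(
    collection="example-collection",
    request_continuation_token="ct-1",
    status_code=304,
    response_continuation_token="ct-1",
)
```

With both tokens in the entry, a NotModified response can be compared against what the gateway considers current, which is the question the current diagnostics cannot answer.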
xinlian12 added the Diagnostics (Issues around diagnostics and troubleshooting) label May 2, 2022
xinlian12 changed the title from "Add ContinuationToken for pkRanges request in Diagnostics" to "[Diagnostics]Add ContinuationToken for pkRanges request in Diagnostics" May 2, 2022
j82w added the feature-request (New feature or request) label May 2, 2022
j82w added this to the Triage milestone May 2, 2022