[Diagnostics]Log pkRanges change #3178

Open
xinlian12 opened this issue May 6, 2022 · 0 comments
Assignees
Labels
Diagnostics Issues around diagnostics and troubleshooting discussion-wanted Need a discussion on an area feature-request New feature or request
Milestone

Comments

@xinlian12
Member

Add the continuation token (CT) of pkRanges requests to the diagnostics to help investigate the following exceptions/scenarios.

Request failed at:

{""Id"":""PointOperationStatistics"",""ActivityId"":""e035d1b0-7646-4a09-97cc-002534d7b4c4"",""ResponseTimeUtc"":""2022-04-28T17:18:49.2581143Z"",""StatusCode"":404,""SubStatusCode"":0,""RequestCharge"":0,""RequestUri"":""dbs/usersettings/colls/usersettings"",""ErrorMessage"":""Microsoft.Azure.Documents.NotFoundException: Entity with the specified id does not exist in the system. More info: https://aka.ms/cosmosdb-tsg-not-found\r\nActivityId: e035d1b0-7646-4a09-97cc-002534d7b4c4, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Windows/10.0.17763 cosmos-netstandard-sdk/3.19.3\r\n   at Microsoft.Azure.Cosmos.AddressResolver.<ResolveAddressesAndIdentityAsync>d__12.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Cosmos.AddressResolver.<ResolveAsync>d__9.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Cosmos.Routing.GlobalAddressResolver.<ResolveAsync>d__14.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.AddressSelector.<ResolveAddressesAsync>d__5.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task 
task)\r\n   at Microsoft.Azure.Documents.AddressSelector.<ResolveAllTransportAddressUriAsync>d__3.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.StoreReader.<ReadMultipleReplicasInternalAsync>d__12.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.StoreReader.<ReadMultipleReplicaAsync>d__10.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__13.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetryAsync>d__5.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)\r\n   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetryAsync>d__5.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at 

Request flow is:

This walkthrough uses pkRangeId 125 as the example. When a customer uses direct mode, the SDK needs a few critical pieces of information from the gateway to determine which server to send a request to: the pkRanges list, and the addresses for a given partition.

1. At some point, the SDK gets the pkRanges list from the gateway, and the list includes 125.
2. A split happens for pkRange 125.
3. A request comes in. The SDK uses the existing pkRanges in its cache to resolve which partition the request should be routed to, and the result is 125.
4. The SDK tries to get the address list for pkRange 125 from the gateway. The gateway encounters a ServiceFabricNotFoundException because the service has been deleted as part of the split process, so it returns an empty list.
5. Because the list is empty, the SDK tries to refresh its internal state, including fetching the latest pkRanges changes from the gateway. However, the SDK gets a NotModified result back from the gateway.
6. Steps 3 and 4 repeat, and then a NotFoundException is returned.
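
The flow above can be reproduced with a small simulation. This is only an illustrative sketch, not the SDK's implementation: `FakeGateway`, `FakeClient`, the token `ct-1`, the address format, and the retry count of 2 are all hypothetical names and values chosen for the example. It shows how a stale cache combined with a NotModified refresh ends in a not-found error, and which events a trace would need to capture.

```python
from dataclasses import dataclass, field

NOT_MODIFIED = 304


class NotFoundError(Exception):
    """Stands in for the NotFoundException surfaced to the caller."""


@dataclass
class FakeGateway:
    """Toy gateway (hypothetical): pkRange 125 has split, but no change is reported."""
    deleted_ranges: set = field(default_factory=lambda: {"125"})

    def get_addresses(self, pk_range_id):
        # The backing service was deleted by the split, so the list is empty.
        if pk_range_id in self.deleted_ranges:
            return []
        return [f"replica-for-{pk_range_id}"]

    def read_pk_ranges(self, continuation_token):
        # Models the observed behavior: NotModified, same token, no changes.
        return NOT_MODIFIED, continuation_token, []


@dataclass
class FakeClient:
    """Toy client (hypothetical) with a pkRanges cache and a trace of events."""
    gateway: FakeGateway
    cached_ranges: list
    continuation_token: str
    trace: list = field(default_factory=list)

    def resolve_range(self, partition_key):
        # Toy routing: every key maps to the first cached range.
        return self.cached_ranges[0]

    def send(self, partition_key, max_attempts=2):
        for _ in range(max_attempts):
            pk_range_id = self.resolve_range(partition_key)
            addresses = self.gateway.get_addresses(pk_range_id)
            self.trace.append(("addresses", pk_range_id, len(addresses)))
            if addresses:
                return addresses[0]
            # Empty address list: try to refresh the pkRanges cache.
            status, token, changes = self.gateway.read_pk_ranges(self.continuation_token)
            self.trace.append(("pkranges_refresh", status, token))
            if status != NOT_MODIFIED:
                self.cached_ranges = changes
                self.continuation_token = token
        raise NotFoundError(f"no addresses for pkRange {pk_range_id}")


def run_scenario():
    client = FakeClient(FakeGateway(), cached_ranges=["125"], continuation_token="ct-1")
    try:
        client.send("some-key")
    except NotFoundError as exc:
        return str(exc), client.trace


if __name__ == "__main__":
    message, trace = run_scenario()
    print(message)
    for event in trace:
        print(event)
```

In this toy model the trace records the continuation token on every refresh, which is the kind of information the current diagnostics lack.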

Because some information is missing from the current diagnostics, we cannot reason about which updates were made to the client-side pkRanges cache, why the gateway returned NotModified, or why the SDK tried to get addresses for pkRange 125 again (instead of for the new child ranges).

Based on the investigation above, two pieces of information would help future investigations:

  1. ContinuationToken for pkRanges -- [Diagnostics]Add ContinuationToken for pkRanges request in Diagnostics #3167
  2. For change feed pkRanges requests, log the related changes
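
As a rough illustration of what these two items could look like together, the sketch below builds one diagnostics entry per pkRanges refresh. Everything here is an assumption made for discussion: the entry name `PkRangesRefresh`, the field names, the tokens `ct-1`/`ct-2`, and the child range ids 301 and 302 are hypothetical and do not come from the SDK.

```python
import json


def pk_ranges_refresh_datum(old_ranges, new_ranges, status_code,
                            request_token, response_token):
    """Build a hypothetical diagnostics entry for one pkRanges refresh."""
    old_ids, new_ids = set(old_ranges), set(new_ranges)
    return {
        "Id": "PkRangesRefresh",  # hypothetical entry name
        "StatusCode": status_code,
        "RequestContinuationToken": request_token,
        "ResponseContinuationToken": response_token,
        # Ranges that appeared or disappeared in the client-side cache.
        "AddedRanges": sorted(new_ids - old_ids),
        "RemovedRanges": sorted(old_ids - new_ids),
    }


# A refresh that observes the split of 125 into two (made-up) child ranges.
split = pk_ranges_refresh_datum(["124", "125"], ["124", "301", "302"],
                                200, "ct-1", "ct-2")
# A refresh that returns NotModified: same token, no changes.
stale = pk_ranges_refresh_datum(["124", "125"], ["124", "125"],
                                304, "ct-1", "ct-1")

if __name__ == "__main__":
    print(json.dumps(split))
    print(json.dumps(stale))
```

With entries of this shape, a NotModified response paired with an unchanged token would be visible directly in the diagnostics, alongside any ranges that were added or removed on earlier refreshes.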
@xinlian12 xinlian12 added Diagnostics Issues around diagnostics and troubleshooting feature-request New feature or request labels May 6, 2022
@sourabh1007 sourabh1007 added the discussion-wanted Need a discussion on an area label May 6, 2022
@kirankumarkolli kirankumarkolli modified the milestones: Triage, Backlog Aug 29, 2023
Projects
Status: Triage
Development

No branches or pull requests

3 participants