Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Availability: Fixes CancellationToken evaluation during failover #2383

Merged
merged 15 commits into from
Apr 21, 2021

Conversation

ealsur
Copy link
Member

@ealsur ealsur commented Apr 12, 2021

When using CancellationTokens, it is currently possible that a cancelling token stops the SDK from triggering the failover mechanics and marking a region unavailable if the token is canceled inbetween.

In a normal failover scenario, a request might face certain errors (HttpRequestException, 403.3, 403.1008, etc) which would trigger the retry policies. In these cases, the retry policy would mark the region unavailable and retry the request.

If a user provided a CancellationToken that canceled right after the error happened, the flow was checking the token status before initiating the failover process.

In this case, the client could be stuck because the token is preventing the SDK from failing over and new requests would face the same fate.

For example:

  1. User initiated the operation
  2. Operation failed with some failover-related error and CancellationToken canceled (for example, the token has a 5 second time and the error was an HTTP connect-timeout due to regional endpoint going down).
  3. Check for token cancellation throws OperationCanceledException
  4. Next request comes and faces the same error, with the same token expiration
  5. Check for token cancellation throws OperationCanceledException
  6. This cycle repeats for any future operation...

The ideal flow, which this PR addresses, is:

  1. User initiated the operation
  2. Operation failed with some failover-related error and CancellationToken canceled (for example, the token has a 5 second time and the error was an HTTP connect-timeout due to regional endpoint going down).
  3. Retry policy semantics kick in and mark the region unavailable even though the token is cancelled
  4. Retry flow evaluates if it can retry, but sees the canceled token, throws OperationCanceledException
  5. Next request comes and sees the new region endpoint.

We had several policies that were checking the token before applying, which are fixed also in the same PR.

Tests

The PR adds an integration test that mimics what happens in a TCP timeout scenario when the token cancels during the address refresh, which is what happens when a region goes offline:

  1. Request is initiated.
  2. The client needs to get Addresses from AddressResolver (forceRefresh false)
  3. TCP request is done -> Throws 410 with TransportException to simulate TCP connection error due to backend machine unavailability
  4. Address refresh is initiated trying to discover the new partition addresses
  5. AddressResolver throws HttpRequestException (forceRefresh true) simulating an HTTP connection error and cancels the token.
  6. ClientRetryPolicy should get the error, mark the region unavailable on the GlobalEndpointManager, even though the token is canceled.
  7. Initiate retry, which should be canceled because the token was canceled.
  8. OperationCanceledException is thrown to the user but the marking of the unavailable regions and refresh of account information was triggered.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

@ealsur ealsur added the bug Something isn't working label Apr 12, 2021
@ealsur ealsur self-assigned this Apr 12, 2021
j82w
j82w previously approved these changes Apr 12, 2021
j82w
j82w previously approved these changes Apr 19, 2021
@ealsur ealsur dismissed stale reviews from kirankumarkolli and j82w via c81a3f1 April 20, 2021 21:54
@ealsur ealsur merged commit 67067ee into master Apr 21, 2021
@ealsur ealsur deleted the users/ealsur/canceltokenpolicy branch April 21, 2021 02:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants