Availability: Fixes region failover logic on control plane hot path when gateway hangs #2279

Merged
merged 8 commits into from
Mar 9, 2021


ealsur
Member

@ealsur ealsur commented Mar 3, 2021

If the Gateway is in a hung state where the HttpRequestException only surfaces after 60 seconds, the current hot path policy waited at most 10 seconds and then surfaced a RequestTimeout, so the failover mechanics could never be applied.

This only affected the hot path (address resolution or query plans) and not other Gateway calls.
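To illustrate the failure mode described above, here is a minimal sketch in Python. It is a model of the timing problem only, not the SDK's code: `Outcome`, `observed_outcome`, and `can_fail_over` are hypothetical names, and the rule that only the connection-level exception triggers failover is an assumption taken from this discussion.

```python
from enum import Enum


class Outcome(Enum):
    REQUEST_TIMEOUT = "RequestTimeout"            # the client gave up first
    CONNECTION_FAILURE = "HttpRequestException"   # the transport error was observed


def observed_outcome(error_after_s: float, client_timeout_s: float) -> Outcome:
    """What the caller sees when a hung gateway only fails after
    `error_after_s` seconds but the client waits at most `client_timeout_s`."""
    if client_timeout_s < error_after_s:
        # The client timeout fires before the transport error can surface.
        return Outcome.REQUEST_TIMEOUT
    return Outcome.CONNECTION_FAILURE


def can_fail_over(outcome: Outcome) -> bool:
    # Assumption: failover is driven only by the connection-level exception.
    return outcome is Outcome.CONNECTION_FAILURE


if __name__ == "__main__":
    print(can_fail_over(observed_outcome(60, 10)))  # 10-second cap
    print(can_fail_over(observed_outcome(60, 65)))  # 65-second cap
```

With a 10-second cap, the 60-second failure is always reported as a timeout, so the failover path is never reached. A cap longer than 60 seconds lets the exception through.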

Introduced in 3.16.0 in PR #1954

Type of change

  • Bug fix (non-breaking change which fixes an issue)

@j82w changed the title from "Availability: Fixes detection for hot path during gateway hang" to "Availability: Fixes region failover logic on control plane hot path when gateway hangs" on Mar 3, 2021
Member

@kirankumarkolli kirankumarkolli left a comment

Let's please wait on merging this change.

Latency is equally important.
Maybe we need to revisit basing all failover on just HttpRequestException.

@j82w
Contributor

j82w commented Mar 4, 2021

> Let's please wait on merging this change.
>
> Latency is equally important.
> Maybe we need to revisit basing all failover on just HttpRequestException.

@kirankumarkolli I don't see any reason to block this PR. If we decide to revisit the failover design, that should be a separate PR after the current model is fixed.

@ealsur
Member Author

ealsur commented Mar 4, 2021

I don't see how latency is involved here. We already retry 3 times, with 0.5 and 5 seconds before the 65 seconds. If the Gateway has a latency spike, it will be clear from the diagnostics. If the Gateway had a 20-second latency spike, we previously threw a RequestTimeout, so the operation failed (which did not help latency).

The 3 retries exist to work around Gateway upgrades by trying to reach other instances. They address availability, not latency.
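The argument above can be sketched as a small simulation. This is a sketch under stated assumptions, not the SDK's retry policy: the 0.5, 5, and 65 second values come from the comment, but treating them as per-attempt timeouts is an assumption. The `(0.5, 5.0, 10.0)` tuple is a hypothetical stand-in for the earlier 10-second cap, and every function name is invented for illustration.

```python
from typing import Callable, Sequence

# Values from the comment; reading them as per-attempt timeouts is an assumption.
NEW_TIMEOUTS_S = (0.5, 5.0, 65.0)
# Hypothetical stand-in for the earlier policy, which waited at most 10 seconds.
OLD_TIMEOUTS_S = (0.5, 5.0, 10.0)


def run_hot_path(attempt: Callable[[float], str],
                 timeouts: Sequence[float] = NEW_TIMEOUTS_S) -> str:
    """Try each timeout in order. `attempt(timeout_s)` returns "ok",
    "timeout", or "connection_error"."""
    last = "timeout"
    for timeout_s in timeouts:
        last = attempt(timeout_s)
        if last == "ok":
            return "ok"
    # Assumption: only a connection-level error lets the caller fail over.
    return "failover" if last == "connection_error" else "request_timeout"


def hung_gateway(error_after_s: float) -> Callable[[float], str]:
    """Simulated gateway that raises a connection error after `error_after_s`."""
    def attempt(timeout_s: float) -> str:
        return "connection_error" if timeout_s >= error_after_s else "timeout"
    return attempt


def slow_gateway(latency_s: float) -> Callable[[float], str]:
    """Simulated gateway that answers successfully after `latency_s`."""
    def attempt(timeout_s: float) -> str:
        return "ok" if timeout_s >= latency_s else "timeout"
    return attempt


if __name__ == "__main__":
    print(run_hot_path(hung_gateway(60), OLD_TIMEOUTS_S))
    print(run_hot_path(hung_gateway(60), NEW_TIMEOUTS_S))
    print(run_hot_path(slow_gateway(20), OLD_TIMEOUTS_S))
    print(run_hot_path(slow_gateway(20), NEW_TIMEOUTS_S))
```

In this model a 20-second latency spike fails with a request timeout under the shorter cap and succeeds under the longer one. That matches the point that the earlier behavior did not help latency either.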
