release-2.1: sql,rpc/nodedialer: improve distsql node health checks #31014

Merged (2 commits), Oct 5, 2018

petermattis
Collaborator

Backport 2/2 commits from #30987.

/cc @cockroachdb/release


Improve distsql node health checks so that the presence of an open
circuit breaker is considered. Previously it was possible for distsql to
plan a processor on a node with an open circuit breaker, which guaranteed
an "unable to dial" error when the plan was run.

Fixes #29149
Fixes #28704

Release note: None

`Dialer.DialInternalClient` does not check the circuit breaker but
blindly attempts a connection and can succeed, leaving the system in a
state where there is a healthy connection to a node, but the circuit
breaker used for dialing is open. DistSQL checks for connection health
when scheduling processors, but the connection health check does not
examine the breaker. So DistSQL will proceed to schedule a processor on
a node but then be unable to use the connection to that node because
`Dialer.Dial` will return with a `breaker open` error. The code contains
a TODO to reconcile the handling of circuit breakers in the various
`Dialer` methods, but changing the handling is risky in the short
term. As a stop-gap, we reset the breaker after a connection is
successfully opened.
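
The mismatch described above, and the stop-gap of resetting the breaker, can be sketched with a toy model. This is an illustration only, not the `nodedialer` code: `toyBreaker`, `toyDialer`, `DialInternal`, and the `resetOnSuccess` flag are hypothetical names, and the real breaker and dialer types differ.

```go
package main

import (
	"errors"
	"fmt"
)

// toyBreaker is a hypothetical stand-in for a circuit breaker; it is not
// the breaker type used by CockroachDB.
type toyBreaker struct{ open bool }

func (b *toyBreaker) Trip()       { b.open = true }
func (b *toyBreaker) Reset()      { b.open = false }
func (b *toyBreaker) Ready() bool { return !b.open }

var errToyBreakerOpen = errors.New("breaker open")

// toyDialer models two dial paths that share one breaker.
type toyDialer struct {
	breaker *toyBreaker
	connect func() error // hypothetical connection attempt
}

// Dial consults the breaker first and fails fast when it is open.
func (d *toyDialer) Dial() error {
	if !d.breaker.Ready() {
		return errToyBreakerOpen
	}
	if err := d.connect(); err != nil {
		d.breaker.Trip()
		return err
	}
	return nil
}

// DialInternal connects without consulting the breaker. When
// resetOnSuccess is true it applies the stop-gap: a successful connection
// resets the breaker so that Dial stops failing fast.
func (d *toyDialer) DialInternal(resetOnSuccess bool) error {
	if err := d.connect(); err != nil {
		return err
	}
	if resetOnSuccess {
		d.breaker.Reset()
	}
	return nil
}

func main() {
	d := &toyDialer{breaker: &toyBreaker{}, connect: func() error { return nil }}
	d.breaker.Trip() // an earlier failure left the breaker open

	_ = d.DialInternal(false) // connection is healthy, breaker still open
	fmt.Println("without reset:", d.Dial())

	_ = d.DialInternal(true) // stop-gap: reset on success
	fmt.Println("with reset:", d.Dial())
}
```

In this model, a successful `DialInternal(false)` leaves `Dial` returning the breaker error even though the connection works, which is the inconsistent state the commit message describes.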

Fixes cockroachdb#29149

Release note: None

Change `DistSQLPlanner.checkNodeHealth` so that it uses
`nodedialer.Dialer.ConnHealth` instead of `rpc.Context.ConnHealth`. The
former is the right method to be calling to check a node's connection
health.

Refactor `DistSQLPlanner.checkNodeHealth` into a `distSQLNodeHealth`
struct. This removes the need for `DistSQLPlannerTestingKnobs`.

Enhance `nodedialer.Dialer.ConnHealth` to mark connections as unhealthy
if the circuit breaker is open. This prevents DistSQL from planning
processors on such nodes.
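
A minimal sketch of the idea, assuming a simplified health checker: a node is reported unhealthy when its breaker is open, even if the underlying connection looks fine, so a planner that filters on this check never picks it. The names `healthChecker`, `connHealth`, and `plannable` are hypothetical and do not reflect the actual `nodedialer.Dialer.ConnHealth` signature.

```go
package main

import (
	"errors"
	"fmt"
)

type nodeID int

var errHealthBreakerOpen = errors.New("breaker open")

// healthChecker is a hypothetical model of per-node health state.
type healthChecker struct {
	connErr     map[nodeID]error // nil or missing entry means the connection is healthy
	breakerOpen map[nodeID]bool  // true means dialing would fail fast
}

// connHealth reports an error if the breaker is open, and otherwise
// falls back to the connection-level health result.
func (h *healthChecker) connHealth(n nodeID) error {
	if h.breakerOpen[n] {
		return errHealthBreakerOpen
	}
	return h.connErr[n]
}

// plannable keeps only the nodes that pass the health check, preserving
// their input order.
func (h *healthChecker) plannable(nodes []nodeID) []nodeID {
	var out []nodeID
	for _, n := range nodes {
		if h.connHealth(n) == nil {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	h := &healthChecker{
		connErr:     map[nodeID]error{},
		breakerOpen: map[nodeID]bool{2: true}, // node 2: healthy conn, open breaker
	}
	fmt.Println(h.plannable([]nodeID{1, 2, 3}))
}
```

Checking the breaker before the connection state means a node with an open breaker is excluded at planning time rather than failing when the plan runs.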

Release note: None

@petermattis requested review from tbg and a team on October 5, 2018 17:08
@tbg
Member

tbg commented Oct 5, 2018 via email

@petermattis merged commit 00ee2a9 into cockroachdb:release-2.1 on Oct 5, 2018
@petermattis deleted the backport2.1-30987 branch on October 5, 2018 17:59