sql,rpc/nodedialer: improve distsql node health checks #30987

Merged
merged 2 commits into from
Oct 5, 2018

Conversation

petermattis
Collaborator

@petermattis petermattis commented Oct 4, 2018

Improve distsql node health checks so that the presence of an open
circuit breaker is considered. Previously it was possible for distsql to
plan a processor on a node with an open circuit breaker, which guaranteed
an "unable to dial" error when the plan was run.

Fixes #29149
Fixes #28704

Release note: None

`Dialer.DialInternalClient` does not check the circuit breaker but
blindly attempts a connection and can succeed, leaving the system in a
state where there is a healthy connection to a node, but the circuit
breaker used for dialing is open. DistSQL checks for connection health
when scheduling processors, but the connection health check does not
examine the breaker. So DistSQL will proceed to schedule a processor on
a node but then be unable to use the connection to that node because
`Dialer.Dial` will return with a `breaker open` error. The code contains
a TODO to reconcile the handling of circuit breakers in the various
`Dialer` methods, but changing the handling is risky in the short
term. As a stop-gap, we reset the breaker after a connection is
successfully opened.

Fixes cockroachdb#29149

Release note: None
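
The failure mode described in this commit message can be illustrated with a small, self-contained toy model. This is only a sketch, not the actual `nodedialer` code: `toyBreaker`, `toyDialer`, `scenario`, and the `resetOnSuccess` switch are hypothetical names invented for illustration, and the real `Dialer` methods have different signatures. It assumes one dial path that consults the breaker and one that does not.

```go
package main

import (
	"errors"
	"fmt"
)

// errBreakerOpen stands in for the "breaker open" error mentioned above.
var errBreakerOpen = errors.New("breaker open")

// toyBreaker is a hypothetical per-node circuit breaker.
type toyBreaker struct{ open bool }

func (b *toyBreaker) Trip()  { b.open = true }
func (b *toyBreaker) Reset() { b.open = false }

// toyDialer is a hypothetical model of a dialer with two dial paths.
type toyDialer struct {
	breakers       map[int]*toyBreaker
	reachable      map[int]bool
	resetOnSuccess bool // models the stop-gap from this commit
}

func newToyDialer(resetOnSuccess bool) *toyDialer {
	return &toyDialer{
		breakers:       map[int]*toyBreaker{},
		reachable:      map[int]bool{},
		resetOnSuccess: resetOnSuccess,
	}
}

func (d *toyDialer) breaker(nodeID int) *toyBreaker {
	b, ok := d.breakers[nodeID]
	if !ok {
		b = &toyBreaker{}
		d.breakers[nodeID] = b
	}
	return b
}

// Dial consults the breaker before attempting a connection.
func (d *toyDialer) Dial(nodeID int) error {
	if d.breaker(nodeID).open {
		return errBreakerOpen
	}
	if !d.reachable[nodeID] {
		d.breaker(nodeID).Trip()
		return errors.New("unable to dial")
	}
	return nil
}

// DialInternalClient ignores the breaker and attempts the connection anyway.
func (d *toyDialer) DialInternalClient(nodeID int) error {
	if !d.reachable[nodeID] {
		return errors.New("unable to dial")
	}
	if d.resetOnSuccess {
		// Stop-gap: a successful connection resets the breaker.
		d.breaker(nodeID).Reset()
	}
	return nil
}

// scenario trips the breaker for node 1, lets the node recover, connects via
// DialInternalClient, and then reports what Dial returns.
func scenario(resetOnSuccess bool) error {
	d := newToyDialer(resetOnSuccess)
	_ = d.Dial(1) // node 1 is unreachable, so the breaker trips
	d.reachable[1] = true
	if err := d.DialInternalClient(1); err != nil {
		return err
	}
	return d.Dial(1)
}

func main() {
	fmt.Println("without reset:", scenario(false)) // without reset: breaker open
	fmt.Println("with reset:", scenario(true))     // with reset: <nil>
}
```

Without the reset, the model ends up with a usable connection but a `Dial` that still fails, which is the inconsistent state the commit message describes.
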
@petermattis petermattis requested review from a team October 4, 2018 21:35
@jordanlewis
Member

Also closes #28704, right? This is great - thanks for taking it on.

@petermattis
Collaborator Author

@jordanlewis Yep.

@cockroach-teamcity
Member

This change is Reviewable

Member

@tbg tbg left a comment


:lgtm:

Reviewed 1 of 1 files at r1, 6 of 6 files at r2, 3 of 3 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/rpc/nodedialer/nodedialer.go, line 153 at r1 (raw file):

	// RPCs fail when dial fails due to an open breaker. Reset the breaker here
	// as a stop-gap before the reconciliation occurs.
	n.getBreaker(nodeID).Reset()

Thoughts about moving this below ConnectionReady? What does that even do?

Change `DistSQLPlanner.checkNodeHealth` so that it uses
`nodedialer.Dialer.ConnHealth` instead of `rpc.Context.ConnHealth`. The
former is the right method to be calling to check a node's connection
health.

Refactor `DistSQLPlanner.checkNodeHealth` into a `distSQLNodeHealth`
struct. This removes the need for `DistSQLPlannerTestingKnobs`.

Enhance `nodedialer.Dialer.ConnHealth` to mark connections as unhealthy
if the circuit breaker is open. This prevents DistSQL from planning
processors on such nodes.

Release note: None
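
A rough sketch of the idea behind the `ConnHealth` change, under stated assumptions: `healthDialer`, `nodeState`, and `healthyNodes` are hypothetical names for illustration only, and the order of the two checks is an assumption rather than something taken from the actual `nodedialer` code. The point is that a node whose breaker is open is reported as unhealthy even when its underlying connection looks fine, so a planner that filters on connection health skips it.

```go
package main

import (
	"errors"
	"fmt"
)

// errHealthBreakerOpen is a hypothetical error for an open breaker.
var errHealthBreakerOpen = errors.New("breaker open")

// nodeState is a hypothetical snapshot of what the dialer knows about a node.
type nodeState struct {
	connErr     error // result of the underlying connection health check
	breakerOpen bool  // whether the node's circuit breaker is open
}

// healthDialer is a hypothetical stand-in for a dialer that tracks node state.
type healthDialer struct {
	nodes map[int]nodeState
}

// ConnHealth reports a node as unhealthy if its breaker is open, and
// otherwise defers to the underlying connection health.
func (d *healthDialer) ConnHealth(nodeID int) error {
	n, ok := d.nodes[nodeID]
	if !ok {
		return errors.New("unknown node")
	}
	if n.breakerOpen {
		return errHealthBreakerOpen
	}
	return n.connErr
}

// healthyNodes keeps only the candidates a planner could safely use.
func healthyNodes(d *healthDialer, candidates []int) []int {
	var out []int
	for _, id := range candidates {
		if d.ConnHealth(id) == nil {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	d := &healthDialer{nodes: map[int]nodeState{
		1: {},                                        // healthy
		2: {breakerOpen: true},                       // connection fine, breaker open
		3: {connErr: errors.New("heartbeat failed")}, // connection unhealthy
	}}
	fmt.Println(healthyNodes(d, []int{1, 2, 3})) // [1]
}
```
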
Collaborator Author

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/rpc/nodedialer/nodedialer.go, line 153 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Thoughts about moving this below ConnectionReady? What does that even do?

Done. ConnectionReady checks whether the connection is in the "transient failure" state. I've added a comment about why it is useful, though I'm not 100% sure that scenario can happen: I think we tear down connections really quickly when a heartbeat fails.

Member

@tbg tbg left a comment


Reviewed 1 of 1 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@petermattis
Collaborator Author

bors r=tschottdorf

craig bot pushed a commit that referenced this pull request Oct 5, 2018
30987: sql,rpc/nodedialer: improve distsql node health checks r=tschottdorf a=petermattis

Improve distsql node health checks so that the presence of an open
circuit breaker is considered. Previously it was possible for distsql to
plan a processor on a node with an open circuit breaker, which guaranteed
an "unable to dial" error when the plan was run.

Fixes #29149
Fixes #28704

Release note: None

Co-authored-by: Peter Mattis <petermattis@gmail.com>
@craig
Contributor

craig bot commented Oct 5, 2018

Build succeeded
