Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distsql: checkNodeHealth system needs to take into account circuit breaker state #28704

Closed
jordanlewis opened this issue Aug 16, 2018 · 2 comments
Assignees
Labels
A-sql-execution Relating to SQL execution. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@jordanlewis
Copy link
Member

The new node dialer circuit breakers make it so that the current mechanism for node health checks during distsql physical planning aren't sufficient - I think we also need to make sure that the nodeDialer won't return a circuit breaker tripped error for each of the nodes.

We need to figure out a way to keep the liveness check accurate with respect to this circuit breaker, or we'll get spurious failures during flow setup.

@jordanlewis jordanlewis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-execution Relating to SQL execution. labels Aug 16, 2018
@jordanlewis jordanlewis added this to the 2.2 milestone Aug 16, 2018
@jordanlewis jordanlewis assigned asubiotto and unassigned solongordon Aug 16, 2018
@jordanlewis jordanlewis assigned solongordon and unassigned asubiotto Aug 21, 2018
@petermattis
Copy link
Collaborator

I think we also need to make sure that the nodeDialer won't return a circuit breaker tripped error for each of the nodes.

I'm not following your concern here. Can you elaborate?

We need to figure out a way to keep the liveness check accurate with respect to this circuit breaker, or we'll get spurious failures during flow setup.

Which liveness checks are you thinking of? Node liveness?

@jordanlewis
Copy link
Member Author

I'm not following your concern here. Can you elaborate?

I meant to say that the node health check mechanism needs to determine whether or not the circuit breaker has been tripped. I think your PR does that. The second question you had is answered in the same way. This issue wasn't particularly clear. If your PR causes the node health check to fail if the node's circuit breaker is tripped, then all my concerns here are answered.

@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
craig bot pushed a commit that referenced this issue Oct 5, 2018
30987: sql,rpc/nodedialer: improve distsql node health checks r=tschottdorf a=petermattis

Improve distsql node health checks so that the presence of an open
circuit breaker is consider. Previously it was possible for distsql to
plan a processor on a node with an open circuit breaker which ensured an
"unable to dial" error when the plan was run.

Fixes #29149
Fixes #28704

Release note: None

Co-authored-by: Peter Mattis <petermattis@gmail.com>
@craig craig bot closed this as completed in #30987 Oct 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-execution Relating to SQL execution. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

4 participants