fix(net): Add outer timeouts for critical network operations to avoid hangs #7869
Looks good!
Manually disconnect Zebra from the network for more than 10 minutes, then restore the network, and make sure that it reconnects within 10 minutes
This worked as expected.
I'm just wondering how you did this.
I would remove the ethernet cable on the machine, or turn off wifi. If I wanted to keep working on other things on the same machine, I'd make the Linux firewall block or drop all packets into and out of a process. Dropping all packets to or from the Zcash network port number doesn't work, because peers can have other ports. Here is a GUI that does that: Or you can use (This used to require another user account, but recent versions can do it by process ID.) |
0f42e9c to 025d0ac
I pushed another significant hang fix in commit b2385d9. Previously, we weren't resetting the lookahead pause flag if the syncer was reset while we were above the lookahead limit. We could also skip resetting the flag if the mutex was locked when the final block was downloaded. This could be the cause of some of the hangs we see during syncing. It's better for a few blocks to wait for a short time than to hang the entire syncer.
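To illustrate the two failure modes described above, here is a minimal sketch in std-only Rust. It is not Zebra's actual code: `LookaheadPause`, `update`, and `reset` are hypothetical names, and a plain `Mutex<bool>` stands in for whatever the syncer really uses to signal the pause. The sketch assumes the fix amounts to (1) clearing the flag on every syncer reset, and (2) waiting for the lock rather than skipping the update when the lock is contended.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the syncer's "pause new downloads" flag.
#[derive(Clone, Default)]
pub struct LookaheadPause {
    paused: Arc<Mutex<bool>>,
}

impl LookaheadPause {
    /// Called whenever the number of in-flight blocks changes.
    ///
    /// This waits for the lock. Using `try_lock()` and silently skipping
    /// the update on contention could leave the flag stuck at `true`
    /// after the final block was downloaded.
    pub fn update(&self, in_flight: usize, lookahead_limit: usize) {
        let mut paused = self
            .paused
            .lock()
            .expect("no thread panics while holding this lock in the sketch");
        *paused = in_flight > lookahead_limit;
    }

    /// Called when the syncer restarts and drops its in-flight downloads.
    ///
    /// Without this, a reset that happens while above the limit would
    /// leave the flag set with nothing left to clear it.
    pub fn reset(&self) {
        let mut paused = self
            .paused
            .lock()
            .expect("no thread panics while holding this lock in the sketch");
        *paused = false;
    }

    /// Returns `true` if new downloads should wait.
    pub fn is_paused(&self) -> bool {
        *self
            .paused
            .lock()
            .expect("no thread panics while holding this lock in the sketch")
    }
}
```

The trade-off is the one stated in the comment: a caller may block briefly on the lock, but the flag can no longer be left set indefinitely.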
This rate limit will be hit less often if we fix #7816.
Co-authored-by: Marek <mail@marek.onl>
This failure looks like an infrastructure issue, compiling hung part of the way through:
Here's a comment from Slack that might be useful here: The network & state release freeze applies to high-risk PRs between the start of the last full sync and the release. I'm thinking of allowing "fix(net): Add outer timeouts for critical network operations to avoid hangs", because it helps fix an important network hang bug. Here's my risk analysis of each change in the PR:
I think the biggest risk here is a panic or hang, but I can't see how that would happen in these code changes. And if it does happen frequently, it is likely to get picked up by the CI tests before we release.
Motivation
This is a quick fix for Zebra hanging rather than disconnecting when the peer set is empty. It prevents the hang on the requesting service's side.
Close #7772
Specifications
Sometimes a hang can happen if the called service doesn't correctly set up its waker (a bug), the request is really expensive, or the service is full of requests.
Complex Code or Requirements
It might seem unnecessary to have timeouts both within service implementations and in their callers. But it's very easy to add code that doesn't have a timeout, or has a subtle hang or slowness bug under some conditions. So we should move timeouts as far out of the service stack as possible, so that they cover more code.
For now this PR just adds timeouts to the outermost layer of critical network services, or explains how they are already implemented.
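As a rough illustration of why the outermost layer is the safest place for a timeout, here is a minimal std-only Rust sketch. It is not the code in this PR: Zebra's services are async, while this sketch uses a worker thread and a channel as a stand-in, and `with_outer_timeout` and `Elapsed` are hypothetical names. The idea it shows is the same: the caller is released after the limit no matter which inner layer is stuck.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Error returned when the wrapped operation does not finish in time.
#[derive(Debug, PartialEq, Eq)]
pub struct Elapsed;

/// Runs `operation` on a worker thread and waits at most `limit` for it.
///
/// The caller is unblocked after `limit` even if `operation` never
/// returns, so the timeout covers every layer underneath it.
pub fn with_outer_timeout<T, F>(limit: Duration, operation: F) -> Result<T, Elapsed>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (sender, receiver) = mpsc::channel();

    thread::spawn(move || {
        // Ignore send errors: the caller may have already given up
        // and dropped the receiver.
        let _ = sender.send(operation());
    });

    // To keep the sketch short, a disconnected channel (the worker
    // panicked) is also reported as `Elapsed`.
    receiver.recv_timeout(limit).map_err(|_| Elapsed)
}
```

For example, `with_outer_timeout(Duration::from_secs(5), || 2 + 2)` returns `Ok(4)`, while an operation that sleeps longer than the limit returns `Err(Elapsed)`. Note that the sketch only stops waiting; the stuck worker keeps running in the background, which is one reason inner timeouts can still be worth keeping.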
Solution
Timeouts:
Hang fixes:
Documentation:
Testing
Review
This PR would be good to get into the release. It is low risk, because it just adds timeouts to existing code. (Or ignores inbound peer requests under load.)
Reviewer Checklist
Follow Up Work
PR #7859 is a more complicated fix to an underlying peer set bug. But it should go in the next release to get more testing.