Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(net): Add outer timeouts for critical network operations to avoid hangs #7869

Merged
merged 16 commits into from
Nov 2, 2023

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Oct 30, 2023

Motivation

This is a quick fix for Zebra hanging rather than disconnecting when the peer set is empty. It prevents the requesting service side of the hang.

Close #7772

Specifications

Sometimes a hang can happen if the called service doesn't correctly set up its waker (a bug), the request is really expensive, or the service is full of requests.

Complex Code or Requirements

It seems unnecessary to have timeouts within service implementations, and in their callers. But it's very easy to add code that doesn't have a timeout, or has a subtle hang or slowness bug under some conditions. So we should move timeouts as far out of the service stack as possible, to include more code.

For now this PR just adds timeouts to the outermost layer of critical network services, or explains how they are already implemented.

Solution

Timeouts:

  • Add a timeout to syncer obtain, extend, and genesis block downloads and verifies
  • Add a timeout to mempool crawls
  • Add a timeout to mempool download and verify

Hang fixes:

  • Remove a potential hang from address book requests
  • Remove a potential hang when the lookahead limit is reached, then the pending blocks drop below the limit
  • Remove a potential hang when the lookahead limit is reached, then the syncer is completely reset

Documentation:

  • Explain why the inbound service never hangs

Testing

  • TODO: Manually disconnect Zebra from the network for more than 10 minutes, then restore the network, and make sure that it reconnects within 10 minutes

Review

This PR would be good to get in the release. It is low risk, because it just adds timeouts to existing code. (Or ignores inbound peers requests under load.)

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
    • Does it change concurrent code, unsafe code, or consensus rules?
  • How do you know it works? Does it have tests?

Follow Up Work

PR #7859 is a more complicated fix to an underlying peer set bug. But it should go in the next release to get more testing.

@teor2345 teor2345 added C-bug Category: This is a bug P-Medium ⚡ I-hang A Zebra component stops responding to requests A-network Area: Network protocol updates or fixes A-concurrency Area: Async code, needs extra work to make it work properly. labels Oct 30, 2023
@teor2345 teor2345 self-assigned this Oct 30, 2023
@teor2345 teor2345 changed the title fix(net): Add outer timeouts for critical network operations fix(net): Add outer timeouts for critical network operations to avoid hangs Oct 30, 2023
@teor2345 teor2345 marked this pull request as ready for review October 30, 2023 08:44
@teor2345 teor2345 requested review from a team as code owners October 30, 2023 08:44
@teor2345 teor2345 requested review from oxarbitrage and removed request for a team October 30, 2023 08:44
@mpguerra mpguerra requested review from arya2 and removed request for oxarbitrage October 30, 2023 09:23
@upbqdn upbqdn self-requested a review October 30, 2023 14:46
arya2
arya2 previously approved these changes Oct 30, 2023
Copy link
Contributor

@arya2 arya2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Manually disconnect Zebra from the network for more than 10 minutes, then restore the network, and make sure that it reconnects within 10 minutes

This worked as expected.

zebrad/src/components/sync.rs Show resolved Hide resolved
@upbqdn
Copy link
Member

upbqdn commented Oct 30, 2023

Manually disconnect Zebra from the network

I'm just wondering how you did this.

@teor2345
Copy link
Contributor Author

teor2345 commented Oct 30, 2023

Manually disconnect Zebra from the network

I'm just wondering how you did this.

I would remove the ethernet cable on the machine, or turn off wifi. If I wanted to keep working on other things on the same machine, I'd make the Linux firewall block or drop all packets into and out of a process.

Dropping all packets to or from the Zcash network port number doesn't work, because peers can have other ports.

Here is a GUI that does that:
https://github.com/evilsocket/opensnitch

Or you can use zebrad &; iptables -m owner --pid-owner $! -j REJECT:
https://stackoverflow.com/questions/4314163/create-iptables-rule-per-process-service

(This used to require another user account, but recent versions can do it by process ID.)

@teor2345
Copy link
Contributor Author

I pushed another significant hang fix in commit b2385d9.

Previously we weren't resetting the lookahead pause flag when we were above the lookahead limit, then the syncer was reset. We could also skip resetting the flag if the mutex was locked when the final block was downloaded.

This could be the cause of some of the hangs we see during syncing. It's better for a few blocks to wait for a short time than hang the entire syncer.

@teor2345
Copy link
Contributor Author

ERROR: failed to solve: rust:bullseye: unexpected status from HEAD request to https://registry-1.docker.io/v2/library/rust/manifests/bullseye: 429 Too Many Requests

https://github.com/ZcashFoundation/zebra/actions/runs/6700359536/job/18206164708?pr=7869#step:10:230

This rate-limit will be hit less if we fix #7816.

Co-authored-by: Marek <mail@marek.onl>
arya2
arya2 previously approved these changes Oct 31, 2023
@arya2 arya2 added the do-not-merge Tells Mergify not to merge this PR label Oct 31, 2023
Co-authored-by: Marek <mail@marek.onl>
arya2
arya2 previously approved these changes Oct 31, 2023
@teor2345
Copy link
Contributor Author

This failure looks like an infrastructure issue, compiling hung part of the way through:

Tue, 31 Oct 2023 21:07:01 GMT
   Compiling zebra-chain v1.0.0-beta.30 (/opt/zebrad/zebra-chain)
Tue, 31 Oct 2023 21:07:11 GMT
   Compiling metrics v0.21.1
Tue, 31 Oct 2023 21:07:12 GMT
   Compiling humantime-serde v1.1.1
Tue, 31 Oct 2023 21:07:12 GMT
   Compiling bincode v1.3.3
Tue, 31 Oct 2023 21:11:58 GMT
Error: The operation was canceled.

https://github.com/ZcashFoundation/zebra/actions/runs/6711815526/job/18240439236?pr=7869

@teor2345
Copy link
Contributor Author

teor2345 commented Nov 1, 2023

Here's a comment from Slack that might be useful here:

The network & state release freeze applies to high-risk PRs between the start of the last full sync and the release.

I'm thinking of allowing fix(net): Add outer timeouts for critical network operations to avoid hangs, because it helps fix an important network hang bug.

Here's my risk analysis of each change in the PR:

  • adding timeouts to syncer and mempool crawler functions that usually don't time out
    • in the syncer a timeout resets the syncer, this is a standard operation
    • in the mempool it just skips the crawl and does another one after the rate-limit timer, the exact timing of crawls is not important
  • removing a potential deadlock or hang from the block downloader, but we know it doesn't usually deadlock (though it might or might not be the source of the regular 10 minute hangs in the logs)
    • this might slightly delay some block verify futures, but that delay is much shorter than a hang
  • removing a potential hang from the inbound service by ignoring the request, but we know it usually works
    • ignoring inbound requests is allowed by the protocol, and zcashd does it all the time, so we know Zebra and zcashd can handle it

I think the biggest risk here is a panic or hang, but I can't see how that would happen in these code changes. And if it does and it's frequent, it is likely to get picked up by the CI tests before we release.

@teor2345 teor2345 removed the do-not-merge Tells Mergify not to merge this PR label Nov 1, 2023
mergify bot added a commit that referenced this pull request Nov 2, 2023
@mergify mergify bot merged commit 628b3e3 into main Nov 2, 2023
103 checks passed
@mergify mergify bot deleted the hang-timeout branch November 2, 2023 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-concurrency Area: Async code, needs extra work to make it work properly. A-network Area: Network protocol updates or fixes C-bug Category: This is a bug I-hang A Zebra component stops responding to requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: zebrad will not reconnect after an internet connection failure and restore
3 participants