fix(net): Add outer timeouts for critical network operations to avoid hangs #7869

teor2345 · 2023-10-30T01:09:14Z

Motivation

This is a quick fix for Zebra hanging rather than disconnecting when the peer set is empty. It prevents the requesting service side of the hang.

Close #7772

Specifications

Sometimes a hang can happen if the called service doesn't correctly set up its waker (a bug), the request is really expensive, or the service is full of requests.

Complex Code or Requirements

It seems unnecessary to have timeouts within service implementations, and in their callers. But it's very easy to add code that doesn't have a timeout, or has a subtle hang or slowness bug under some conditions. So we should move timeouts as far out of the service stack as possible, to include more code.

For now this PR just adds timeouts to the outermost layer of critical network services, or explains how they are already implemented.

Solution

Timeouts:

Add a timeout to syncer obtain, extend, and genesis block downloads and verifies
Add a timeout to mempool crawls
Add a timeout to mempool download and verify

Hang fixes:

Remove a potential hang from address book requests
Remove a potential hang when the lookahead limit is reached, then the pending blocks drop below the limit
Remove a potential hang when the lookahead limit is reached, then the syncer is completely reset

Documentation:

Explain why the inbound service never hangs

Testing

TODO: Manually disconnect Zebra from the network for more than 10 minutes, then restore the network, and make sure that it reconnects within 10 minutes

Review

This PR would be good to get in the release. It is low risk, because it just adds timeouts to existing code. (Or ignores inbound peers requests under load.)

Reviewer Checklist

Will the PR name make sense to users?
- Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
Are the PR labels correct?
Does the code do what the ticket and PR says?
- Does it change concurrent code, unsafe code, or consensus rules?
How do you know it works? Does it have tests?

Follow Up Work

PR #7859 is a more complicated fix to an underlying peer set bug. But it should go in the next release to get more testing.

zebrad/src/components/sync.rs

arya2

Looks good!

Manually disconnect Zebra from the network for more than 10 minutes, then restore the network, and make sure that it reconnects within 10 minutes

This worked as expected.

zebrad/src/components/sync.rs

upbqdn · 2023-10-30T20:13:39Z

Manually disconnect Zebra from the network

I'm just wondering how you did this.

teor2345 · 2023-10-30T23:52:10Z

Manually disconnect Zebra from the network

I'm just wondering how you did this.

I would remove the ethernet cable on the machine, or turn off wifi. If I wanted to keep working on other things on the same machine, I'd make the Linux firewall block or drop all packets into and out of a process.

Dropping all packets to or from the Zcash network port number doesn't work, because peers can have other ports.

Here is a GUI that does that:
https://github.com/evilsocket/opensnitch

Or you can use zebrad &; iptables -m owner --pid-owner $! -j REJECT:
https://stackoverflow.com/questions/4314163/create-iptables-rule-per-process-service

(This used to require another user account, but recent versions can do it by process ID.)

teor2345 · 2023-10-31T00:29:29Z

I pushed another significant hang fix in commit b2385d9.

Previously we weren't resetting the lookahead pause flag when we were above the lookahead limit, then the syncer was reset. We could also skip resetting the flag if the mutex was locked when the final block was downloaded.

This could be the cause of some of the hangs we see during syncing. It's better for a few blocks to wait for a short time than hang the entire syncer.

teor2345 · 2023-10-31T00:31:07Z

ERROR: failed to solve: rust:bullseye: unexpected status from HEAD request to https://registry-1.docker.io/v2/library/rust/manifests/bullseye: 429 Too Many Requests

https://github.com/ZcashFoundation/zebra/actions/runs/6700359536/job/18206164708?pr=7869#step:10:230

This rate-limit will be hit less if we fix #7816.

zebrad/src/components/sync.rs

Co-authored-by: Marek <mail@marek.onl>

zebrad/src/components/inbound.rs

Co-authored-by: Marek <mail@marek.onl>

zebrad/Cargo.toml

teor2345 · 2023-10-31T22:48:48Z

This failure looks like an infrastructure issue, compiling hung part of the way through:

Tue, 31 Oct 2023 21:07:01 GMT
   Compiling zebra-chain v1.0.0-beta.30 (/opt/zebrad/zebra-chain)
Tue, 31 Oct 2023 21:07:11 GMT
   Compiling metrics v0.21.1
Tue, 31 Oct 2023 21:07:12 GMT
   Compiling humantime-serde v1.1.1
Tue, 31 Oct 2023 21:07:12 GMT
   Compiling bincode v1.3.3
Tue, 31 Oct 2023 21:11:58 GMT
Error: The operation was canceled.

https://github.com/ZcashFoundation/zebra/actions/runs/6711815526/job/18240439236?pr=7869

teor2345 · 2023-11-01T21:43:09Z

Here's a comment from Slack that might be useful here:

The network & state release freeze applies to high-risk PRs between the start of the last full sync and the release.

I'm thinking of allowing fix(net): Add outer timeouts for critical network operations to avoid hangs, because it helps fix an important network hang bug.

Here's my risk analysis of each change in the PR:

adding timeouts to syncer and mempool crawler functions that usually don't time out
- in the syncer a timeout resets the syncer, this is a standard operation
- in the mempool it just skips the crawl and does another one after the rate-limit timer, the exact timing of crawls is not important
removing a potential deadlock or hang from the block downloader, but we know it doesn't usually deadlock (though it might or might not be the source of the regular 10 minute hangs in the logs)
- this might slightly delay some block verify futures, but that delay is much shorter than a hang
removing a potential hang from the inbound service by ignoring the request, but we know it usually works
- ignoring inbound requests is allowed by the protocol, and zcashd does it all the time, so we know Zebra and zcashd can handle it

I think the biggest risk here is a panic or hang, but I can't see how that would happen in these code changes. And if it does and it's frequent, it is likely to get picked up by the CI tests before we release.

zebrad/src/components/sync/downloads.rs

teor2345 added C-bug Category: This is a bug P-Medium ⚡ I-hang A Zebra component stops responding to requests A-network Area: Network protocol updates or fixes A-concurrency Area: Async code, needs extra work to make it work properly. labels Oct 30, 2023

teor2345 self-assigned this Oct 30, 2023

teor2345 changed the title ~~fix(net): Add outer timeouts for critical network operations~~ fix(net): Add outer timeouts for critical network operations to avoid hangs Oct 30, 2023

teor2345 marked this pull request as ready for review October 30, 2023 08:44

teor2345 requested review from a team as code owners October 30, 2023 08:44

teor2345 requested review from oxarbitrage and removed request for a team October 30, 2023 08:44

teor2345 commented Oct 30, 2023

View reviewed changes

zebrad/src/components/sync.rs Show resolved Hide resolved

mpguerra requested review from arya2 and removed request for oxarbitrage October 30, 2023 09:23

upbqdn self-requested a review October 30, 2023 14:46

arya2 previously approved these changes Oct 30, 2023

View reviewed changes

zebrad/src/components/sync.rs Show resolved Hide resolved

teor2345 added 11 commits October 31, 2023 09:56

Refactor out try_to_sync_once()

c99f557

Add outer timeouts for obtaining and extending tips

2858dcd

Refactor out request_genesis_once()

b8c7f6a

Wrap genesis download once in a timeout

ce7177b

Increase the genesis timeout to avoid denial of service from old nodes

6bda46e

Add an outer timeout to mempool crawls

720d3c6

Add an outer timeout to mempool download/verify

ade6e41

Remove threaded mutex blocking from the inbound service

0f44ce1

Explain why inbound readiness never hangs

64efd83

Fix whitespace that cargo fmt doesn't

da5597a

Avoid hangs by always resetting the past lookahead limit flag

b2385d9

Document block-specific and syncer-wide errors

025d0ac

teor2345 dismissed arya2’s stale review via 025d0ac October 31, 2023 00:27

teor2345 force-pushed the hang-timeout branch from 0f42e9c to 025d0ac Compare October 31, 2023 00:27

upbqdn reviewed Oct 31, 2023

View reviewed changes

zebrad/src/components/sync.rs Outdated Show resolved Hide resolved

Update zebrad/src/components/sync.rs

9cd8325

Co-authored-by: Marek <mail@marek.onl>

arya2 previously approved these changes Oct 31, 2023

View reviewed changes

arya2 added the do-not-merge Tells Mergify not to merge this PR label Oct 31, 2023

upbqdn reviewed Oct 31, 2023

View reviewed changes

zebrad/src/components/inbound.rs Outdated Show resolved Hide resolved

Use correct condition for log messages

3021d22

Co-authored-by: Marek <mail@marek.onl>

teor2345 dismissed arya2’s stale review via 3021d22 October 31, 2023 19:58

Merge branch 'main' into hang-timeout

f225918

arya2 previously approved these changes Oct 31, 2023

View reviewed changes

upbqdn reviewed Oct 31, 2023

View reviewed changes

zebrad/Cargo.toml Show resolved Hide resolved

teor2345 commented Nov 1, 2023

View reviewed changes

zebrad/src/components/sync/downloads.rs Show resolved Hide resolved

Keep lookahead reset metric

f682915

teor2345 dismissed arya2’s stale review via f682915 November 1, 2023 21:44

teor2345 removed the do-not-merge Tells Mergify not to merge this PR label Nov 1, 2023

upbqdn approved these changes Nov 2, 2023

View reviewed changes

mergify bot added a commit that referenced this pull request Nov 2, 2023

Merge of #7869

2bc429d

mergify bot mentioned this pull request Nov 2, 2023

merge queue: embarking main (afbe807) and #7869 together #7895

Closed

mergify bot merged commit 628b3e3 into main Nov 2, 2023
103 checks passed

mergify bot deleted the hang-timeout branch November 2, 2023 15:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(net): Add outer timeouts for critical network operations to avoid hangs #7869

fix(net): Add outer timeouts for critical network operations to avoid hangs #7869

teor2345 commented Oct 30, 2023 •

edited

Loading

arya2 left a comment

upbqdn commented Oct 30, 2023

teor2345 commented Oct 30, 2023 •

edited

Loading

teor2345 commented Oct 31, 2023

teor2345 commented Oct 31, 2023

teor2345 commented Oct 31, 2023

teor2345 commented Nov 1, 2023

fix(net): Add outer timeouts for critical network operations to avoid hangs #7869

fix(net): Add outer timeouts for critical network operations to avoid hangs #7869

Conversation

teor2345 commented Oct 30, 2023 • edited Loading

Motivation

Specifications

Complex Code or Requirements

Solution

Testing

Review

Reviewer Checklist

Follow Up Work

arya2 left a comment

Choose a reason for hiding this comment

upbqdn commented Oct 30, 2023

teor2345 commented Oct 30, 2023 • edited Loading

teor2345 commented Oct 31, 2023

teor2345 commented Oct 31, 2023

teor2345 commented Oct 31, 2023

teor2345 commented Nov 1, 2023

teor2345 commented Oct 30, 2023 •

edited

Loading

teor2345 commented Oct 30, 2023 •

edited

Loading