Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peer connections hang temporarily or fail permanently #1435

Closed
1 task
oxarbitrage opened this issue Dec 2, 2020 · 21 comments · Fixed by #1709
Closed
1 task

Peer connections hang temporarily or fail permanently #1435

oxarbitrage opened this issue Dec 2, 2020 · 21 comments · Fixed by #1709
Assignees
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug I-hang A Zebra component stops responding to requests
Milestone

Comments

@oxarbitrage
Copy link
Contributor

oxarbitrage commented Dec 2, 2020

Analysis

Peer connection failures can result in a slow sync, temporary sync hangs, or permanent sync hangs.

These failures can happen after network interruptions, or due to normal connection churn. They might also happen because Zebra's protocol state machine gets in an invalid or unrecoverable state.

Next Steps

Here are some things we could try:

  • Run Zebra with a single local peer, and make it panic as soon as that peer disappears. It's ok for the peer to go from ready to unready and back, and maybe disconnect, but it should never disappear.

Version

zebrad 3.0.0-alpha.0

Current main branch
Edit: as of 2020-12-02

Platform
Linux oxarbitrage 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description

During sync on the mainnet the sync component stopped responding. The program didn't crashed or hanged totally as the inbound requests where still being responded however no more block downloads were made. I have a log of this at https://gist.github.com/oxarbitrage/2f067fed9588c3d942e499d1252fc777 where the last msg from the sync component is at https://gist.github.com/oxarbitrage/2f067fed9588c3d942e499d1252fc777#file-gistfile1-txt-L2843 and no further msg after 30 minutes.

I tried stooping the program and start again. The sync resumed.

--
When trying to sync the zcash blockchain using zebra i will expect to download all the blocks from start to end without intervention. Instead, i had to stop(ctrl-c) the program and restart to keep going.

@oxarbitrage oxarbitrage added C-bug Category: This is a bug S-needs-triage Status: A bug report needs triage labels Dec 2, 2020
@teor2345
Copy link
Contributor

teor2345 commented Dec 2, 2020

Your logs say that you're using commit bfbc737.

This looks like issue #1409, which is fixed by PR #1425 in a later commit b449fe9.

Please pull the latest main branch, recompile, and relaunch Zebra. If the issue happens again, please re-open this ticket.

@teor2345 teor2345 closed this as completed Dec 2, 2020
@oxarbitrage
Copy link
Contributor Author

Yea, the PR was not merged when i started that test earlier today. Thanks, i will reopen if i see it again.

@oxarbitrage
Copy link
Contributor Author

oxarbitrage commented Dec 2, 2020

I am seeing the same issue with an updated branch that includes #1425

No response from the sync component after 30 minutes.

A restart fixes it.

Click triangle to reveal logs

Dec 02 19:46:32.007  INFO {zebrad="7d1dc174"}:peer{addr=39.97.178.10:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14491) response_len=158 chain_tip_height=Height(419181) stop_height=None intersection_height=Some(Height(14333))
Dec 02 19:46:35.684  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2372 lookahead_limit=2000
Dec 02 19:46:35.685  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=2000 lookahead_limit=2000
Dec 02 19:46:36.668  INFO {zebrad="7d1dc174"}:sync:extend_tips:checkpoint: zebra_consensus::checkpoint: verified checkpoint range block_count=400 current_range=(Excluded(Height(419181)), Included(Height(419581)))
Dec 02 19:46:45.185  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2372 lookahead_limit=2000
Dec 02 19:46:45.185  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=2000 lookahead_limit=2000
Dec 02 19:46:49.016  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2372 lookahead_limit=2000
Dec 02 19:46:49.017  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=2000 lookahead_limit=2000
Dec 02 19:46:52.900  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2374 lookahead_limit=2000
Dec 02 19:46:52.901  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=2000 lookahead_limit=2000
Dec 02 19:46:57.378  INFO {zebrad="7d1dc174"}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2489 lookahead_limit=2000
Dec 02 19:47:52.769  INFO {zebrad="7d1dc174"}:peer{addr=39.97.178.10:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=000000000036349c3579797cff490c80a6d337c83ab20839b7b1c831caf5cedf}: zebra_state::service: pruned utxo requests old_len=5 new_len=0 prune_count=5 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 19:51:31.471  INFO {zebrad="7d1dc174"}:peer{addr=47.254.69.198:8233}:msg_as_req{msg=getheaders}:inbound: zebra_state::service: pruned utxo requests old_len=104 new_len=103 prune_count=1 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 19:51:31.521  INFO {zebrad="7d1dc174"}:peer{addr=47.254.69.198:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14496) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14338))
Dec 02 19:56:13.322  INFO {zebrad="7d1dc174"}:peer{addr=59.110.243.62:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=00000000010eaac9ff05c2db4c17a64ea5ec024efb300f517c6fec60236817e8}: zebra_state::service: pruned utxo requests old_len=203 new_len=0 prune_count=203 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:01:13.848  INFO {zebrad="7d1dc174"}:peer{addr=47.254.69.198:8233}:msg_as_req{msg=getheaders}:inbound: zebra_state::service: pruned utxo requests old_len=96 new_len=91 prune_count=5 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:01:13.867  INFO {zebrad="7d1dc174"}:peer{addr=47.254.69.198:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14507) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14349))
Dec 02 20:04:41.853  INFO {zebrad="7d1dc174"}:peer{addr=120.79.50.67:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=000000000010e5d01e28c86a99c18906fd6cf6c975e565908fc471c5fbe42e83}: zebra_state::service: pruned utxo requests old_len=109 new_len=0 prune_count=109 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:04:57.586  INFO {zebrad="7d1dc174"}:peer{addr=66.70.180.144:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14510) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14352))
Dec 02 20:08:51.840  INFO {zebrad="7d1dc174"}:peer{addr=139.99.123.157:8233}:msg_as_req{msg=getheaders}:inbound: zebra_state::service: pruned utxo requests old_len=275 new_len=9 prune_count=266 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:08:51.856  INFO {zebrad="7d1dc174"}:peer{addr=139.99.123.157:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14513) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14355))
Dec 02 20:12:51.891  INFO {zebrad="7d1dc174"}:peer{addr=120.79.50.67:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=000000000060fd9d327ae2680d946af863c191a96b1b8981574a6310f1830e20}: zebra_state::service: pruned utxo requests old_len=21 new_len=0 prune_count=21 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:14:14.204  INFO {zebrad="7d1dc174"}:peer{addr=39.99.44.175:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14516) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14358))
Dec 02 20:14:18.081  INFO {zebrad="7d1dc174"}:peer{addr=176.74.154.45:8233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(14516) response_len=158 chain_tip_height=Height(419581) stop_height=None intersection_height=Some(Height(14358))
Dec 02 20:16:27.824  INFO {zebrad="7d1dc174"}:peer{addr=47.254.69.198:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=000000000152e1077b370d87f446e04c10c7c586c0645266b0ba8c3c94d24451}: zebra_state::service: pruned utxo requests old_len=74 new_len=2 prune_count=72 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))
Dec 02 20:18:41.734  INFO {zebrad="7d1dc174"}:peer{addr=39.97.178.10:8233}:msg_as_req{msg=inv}:inbound:download_and_verify{hash=0000000000b8bf272dcc55cfc71b478c59116ccdf713a73dab85882c975fe3da}: zebra_state::service: pruned utxo requests old_len=2 new_len=0 prune_count=2 tip=Some((Height(419581), block::Hash("000000000003de545eedfd47b6ae10327fa3b317a6d1eb66a29c3bd6029a243d")))

@oxarbitrage oxarbitrage reopened this Dec 2, 2020
@teor2345
Copy link
Contributor

teor2345 commented Dec 3, 2020

@oxarbitrage if you have a 10s of gigabytes of spare disk space, you can help us diagnose the issue using a trace log:

would it be possible to run a sync with filter = 'trace' and save a complete execution trace? It will likely be quite large, but we'd have the ability to trace exactly what data is causing the timeout.
#1406 (comment)

If you don't have lots of disk space, using filter = 'debug' might help us diagnose the issue - the logs would be a lot smaller than trace.

@oxarbitrage
Copy link
Contributor Author

I was able to find the problem in debug mode so i have some logs https://gist.github.com/oxarbitrage/d2369fd4e5abb24e683e5a6301fcc6ff#file-sync_debug_logs_mainnet-txt

I didn't researched much yet but one possibility is i am losing connection locally(my adsl change IP every 12 hours or so) and by the amount of marking candidate as failed candidate.addr= msgs in that log all candidates are blacklisted when the connection returns back.

I will confirm this, can post bigger logs if needed.

@oxarbitrage
Copy link
Contributor Author

oxarbitrage commented Dec 3, 2020

Yea, it seems that is what is happening. I resumed and blocks are downloaded fine, then i closed my connection in purpose, while zebra is still running. Waited like 30 seconds and connected back, zebra was never able to download more blocks long after my connection was active again.

I have some logs for this too:

https://gist.github.com/oxarbitrage/916ed4d49db39116917ad6e938555e8e#file-sync_disconnect_mainnet-txt

@oxarbitrage
Copy link
Contributor Author

Is the cut is short enough(tested a cut of 3-5 seconds) and there is no time to mark all the peers as failing then the sync will continue

@oxarbitrage
Copy link
Contributor Author

The cut haves to be long(of at least 1 minute here) to actually be totally unrecoverable in the same session.

In a 1>minute cut sometimes i get this warnings long after the connection is restored:

Dec 03 19:49:46.245  WARN crawl_and_dial: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection

They at least indicate something is going on with the network with default info level. In some other runs i am not seeing this warns, i suppose this can depend on the amount of time the connection was off.

@hdevalence
Copy link
Contributor

Hmm, I don't think that the failure of an existing connection causes a peer address to be put in the "failed candidates" list, that's a list of peers where the connection attempt failed. (It's not a permanent rejection, it's just the lowest-preference for reconnection attempts). The peer set will attempt to reconnect to peers it disconnected from, by waiting until they could no longer be connected (ie. waiting beyond a timeout interval), then treating them as high-priority candidates for new connections.

The peer set is supposed to send a demand signal when it does not have enough peers, causing attempts to connect to new peers. It's unclear why that mechanism isn't working.

@teor2345 teor2345 changed the title sync component dead during sync Peer connections fail permanently after network interruptions Dec 4, 2020
@teor2345 teor2345 added this to the First Alpha Release milestone Dec 4, 2020
@teor2345
Copy link
Contributor

teor2345 commented Dec 4, 2020

I'm also seeing this issue, it looks like my machine or router stopped allowing connections for a few hours this morning. When it came back up, Zebra reconnected to 1 peer, but has failed to reconnect to any other peers.

Edit: this is on testnet, so I should be seeing 5-20 active peers

@teor2345
Copy link
Contributor

teor2345 commented Dec 4, 2020

I've just seen a Testnet Zebra rapidly lose all its peers, and never reconnect:

Click triangle to reveal logs

Dec 04 12:52:33.915  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 15 peers to answer requests
Dec 04 12:52:49.767  INFO {zebrad="af01a78e"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=1606 lookahead_limit=2000
Dec 04 12:53:06.779  INFO {zebrad="af01a78e"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=1977 lookahead_limit=2000
Dec 04 12:53:32.834  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_state::service: pruned utxo requests old_len=1365 new_len=40 prune_count=1325 tip=Some((Height(282491), block::Hash("0013da35d7196632f70cfa9fcc6e22906db20d21430ec08cfd2ba231bfd594b4")))
Dec 04 12:53:34.275  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 15 peers to answer requests
Dec 04 12:54:17.672  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_state::service: pruned utxo requests old_len=729 new_len=45 prune_count=684 tip=Some((Height(282491), block::Hash("0013da35d7196632f70cfa9fcc6e22906db20d21430ec08cfd2ba231bfd594b4")))
Dec 04 12:54:35.569  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 5 peers to answer requests
Dec 04 12:55:19.382  INFO {zebrad="af01a78e"}:peer{addr=[2600:3c00::f03c:92ff:fe48:bd98]:18233}:msg_as_req{msg=getheaders}:inbound: zebra_state::service: pruned utxo requests old_len=423 new_len=49 prune_count=374 tip=Some((Height(282491), block::Hash("0013da35d7196632f70cfa9fcc6e22906db20d21430ec08cfd2ba231bfd594b4")))
Dec 04 12:55:19.384  INFO {zebrad="af01a78e"}:peer{addr=[2600:3c00::f03c:92ff:fe48:bd98]:18233}:msg_as_req{msg=getheaders}:inbound:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(141750) response_len=158 chain_tip_height=Height(282491) stop_height=None intersection_height=Some(Height(141592))
Dec 04 12:55:40.353  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 1 peers to answer requests
Dec 04 12:55:50.508  INFO {zebrad="af01a78e"}:sync:extend_tips: zebra_state::service: pruned utxo requests old_len=4801 new_len=62 prune_count=4739 tip=Some((Height(282491), block::Hash("0013da35d7196632f70cfa9fcc6e22906db20d21430ec08cfd2ba231bfd594b4")))
Dec 04 12:56:45.185  WARN {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 04 12:57:47.290  WARN {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 04 12:59:02.290  WARN {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 04 13:00:17.291  WARN {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 04 13:01:32.291  WARN {zebrad="af01a78e"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 04 13:02:47.291  WARN crawl_and_dial: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection

@teor2345 teor2345 added A-rust Area: Updates to Rust code and removed S-needs-triage Status: A bug report needs triage labels Dec 4, 2020
@teor2345
Copy link
Contributor

teor2345 commented Dec 4, 2020

I've taken the triage off this bug, because we already triaged some of its duplicates into the first alpha

@oxarbitrage
Copy link
Contributor Author

oxarbitrage commented Dec 6, 2020

Between each warn here is what i can get in a trace level:

Click triangle to reveal logs

...
Dec 06 18:42:32.966 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::peer_set::set: ready_peers=0 unready_peers=0
Dec 06 18:42:32.966  WARN {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 06 18:42:32.967 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::peer_set::set: preselected service was not ready, reselecting
Dec 06 18:42:32.967 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::peer_set::set: no ready services, sending demand signal
Dec 06 18:42:32.967 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: tower::buffer::worker: service.ready=false delay
Dec 06 18:42:52.968 DEBUG {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::policies: retrying req=BlocksByHash({block::Hash("0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a")}) e=Elapsed(()) remaining_tries=1
Dec 06 18:42:52.968 DEBUG {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a}: zebra_network::policies: retrying req=BlocksByHash({block::Hash("0000000000176a3344d93598c018fddef80f2edd59b194a7c0a0b7a59aa52c4a")}) e=Elapsed(()) remaining_tries=2
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: worker polling for next message
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: dropping cancelled buffered request
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: dropping cancelled request
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: tower::buffer::service: sending request to buffer worker
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: worker polling for next message
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: processing new request
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: tower::buffer::worker: resumed=false worker received request; waiting for service readiness
Dec 06 18:43:47.966 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=000000000100c07024462b6221c73daca37afc18047cc0a5fcf2e4520573c821}: tower::hedge::rotating_histogram: Time since last rotation is 189.693851337s.  clearing!
Dec 06 18:43:47.967 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: ready_peers=0 unready_peers=0
Dec 06 18:43:47.967 TRACE {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=000000000100c07024462b6221c73daca37afc18047cc0a5fcf2e4520573c821}: tower::buffer::service: sending request to buffer worker
Dec 06 18:43:47.967  WARN {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 06 18:43:47.967 DEBUG {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=000000000100c07024462b6221c73daca37afc18047cc0a5fcf2e4520573c821}: zebrad::components::sync::downloads: requested block
Dec 06 18:43:47.967 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: preselected service was not ready, reselecting
Dec 06 18:43:47.967 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: no ready services, sending demand signal
Dec 06 18:43:47.967 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: tower::buffer::worker: service.ready=false delay
Dec 06 18:43:47.967 DEBUG {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=000000000254e1d8cf90681e84f877d56fb4f7d4965b3eb09c175ca56820b635}: zebrad::components::sync::downloads: waiting to request block
Dec 06 18:44:07.968 DEBUG {zebrad="5b687f87"}:sync:extend_tips:download_and_verify{hash=000000000100c07024462b6221c73daca37afc18047cc0a5fcf2e4520573c821}: zebra_network::policies: retrying req=BlocksByHash({block::Hash("000000000100c07024462b6221c73daca37afc18047cc0a5fcf2e4520573c821")}) e=Elapsed(()) remaining_tries=2
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: worker polling for next message
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/.cargo/git/checkouts/tower-b098c32cf5a1bcca/d4d1c67/tower/src/buffer/service.rs:67:9}: tower::buffer::worker: resuming buffered request
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: tower::buffer::worker: resumed=true worker received request; waiting for service readiness
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: ready_peers=0 unready_peers=0
Dec 06 18:45:03.017  WARN {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: preselected service was not ready, reselecting
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: zebra_network::peer_set::set: no ready services, sending demand signal
Dec 06 18:45:03.017 TRACE {zebrad="5b687f87"}:task{kind=task spawn.location=/home/oxarbitrage/zebra/zebra_sync/zebra/zebra-network/src/peer_set/initialize.rs:162:23}:crawl_and_dial: tower::buffer::worker: service.ready=false delay
...

At around every 1 minute(connected or not connected) we get this one: https://github.com/ZcashFoundation/zebra/blob/main/zebra-network/src/peer_set/set.rs#L452 so we have a working loop but it seems that the demand signal at the following line is not working.

@teor2345
Copy link
Contributor

We just merged #1468, which should provide better diagnostics for this issue, and panics if our assumptions about the peer state machine don't hold.

@teor2345 teor2345 changed the title Peer connections fail permanently after network interruptions Peer connections are slow or fail permanently Dec 14, 2020
@teor2345
Copy link
Contributor

teor2345 commented Jan 5, 2021

We think #1531 will fix most of these issues - once it's merged we'll close this ticket.

Any new hangs should have separate tickets - ideally based on the part of Zebra that's hanging.

@teor2345
Copy link
Contributor

teor2345 commented Jan 6, 2021

Testing shows that #1531 doesn't fix this issue

@teor2345
Copy link
Contributor

teor2345 commented Jan 16, 2021

This issue persists after #1586, but it seems less frequent, and has a lower impact (until there are no peers at all, which still results in permanent failure.)

@teor2345
Copy link
Contributor

teor2345 commented Jan 25, 2021

When zebrad is stuck in a "network request with no peer connections" loop, starting a peer that connects to zebrad's listener port gets the syncer running again.

So there appear to be two issues here:

@teor2345
Copy link
Contributor

@mpguerra #1620 only fixes one cause of this issue, there could me more instances of #1593, and other causes of peer hangs, like #1633.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug I-hang A Zebra component stops responding to requests
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants