Zebra gradually loses peer connections, until it has none left #1905

teor2345 · 2021-03-15T11:29:05Z

Version

zebrad 1.0.0-alpha.3 commit 6473ed99

Platform

Linux ... 5.4.96 #1-NixOS SMP Sun Feb 7 14:35:50 UTC 2021 x86_64 GNU/Linux

Description

Sometimes, on testnet, Zebra gradually loses peers, until it has none left.

Sometimes the peers are lost after Zebra disconnects them due to block download or verify errors. Other times, the peers just seem to disconnect themselves.

Zebra should try to reconnect to these peers, but it doesn't seem to get any peers back.

Maybe Zebra isn't trying to connect. Or maybe the connections keep failing. Zebra might be triggering zcashd's peer blocklist, due to bugs like #1848. Or there could be a peer address or connection state handling issue in Zebra's network stack.

This issue doesn't seem to happen that often (I've only seen it once in weeks of local testing). Our current reliability standard of 2 weeks continuous runtime: Zebra easily achieves that standard on mainnet, and mostly achieves that standard on testnet.

Therefore, this issue is a low priority, until it affects CI or other node deployments.

Related Tickets

Fixing #1904 will get Zebra better peers from the DNS seeders on testnet
Fixing #1848 and other network security issues will make Zebra's network stack more reliable

This ticket might block #1791, if it happens often enough in our CI

Commands

zebrad start, configured to use testnet with the DNS seeders and a few local peers.

Logs

The relevant logs are:

Mar 12 12:20:52.596  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: connecting to initial peer set initial_peers={120.77.82.190:18233, 91.121.88.52:18233, 54.209.49.207:18233, 37.59.57.96:18233, 157.245.113.22:18233, 209.141.47.197:18233, 127.0.0.1:38233, 192.168.215.106:58233, 127.0.0.1:58233, 192.168.215.108:38233, 51.15.166.148:18233, 47.98.224.31:18233, 13.115.17.0:18233, 192.168.215.106:38233, 95.216.242.46:18233, 138.197.216.253:18233, 3.112.190.48:18233, 192.168.215.108:58233, 101.37.77.121:18233, 217.160.142.49:18233, 198.199.112.230:18233, 138.68.230.245:18233, 88.99.162.254:18233, 13.228.164.120:18233}
Mar 12 12:20:52.596  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=ConnectionClosed
Mar 12 12:20:52.957  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }
Mar 12 12:20:53.340  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=ObsoleteVersion(Version(170008))
Mar 12 12:20:53.612  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=ObsoleteVersion(Version(170007))
Mar 12 12:20:53.645  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=ObsoleteVersion(Version(170009))
Mar 12 12:20:55.568  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=Os { code: 113, kind: Other, message: "No route to host" }
Mar 12 12:20:56.597  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=Elapsed(())
...
Mar 12 12:21:53.815  INFO {zebrad="6473ed99" net="Test"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 4 peers to answer requests
Mar 12 12:22:53.815  INFO {zebrad="6473ed99" net="Test"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 3 peers to answer requests
Mar 12 12:29:38.849  INFO {zebrad="6473ed99" net="Test"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 2 peers to answer requests
Mar 12 12:39:48.600  WARN {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: error downloading and verifying block e=Elapsed(())
Mar 12 12:40:49.606  INFO {zebrad="6473ed99" net="Test"}:sync:obtain_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 1 peers to answer requests
Mar 12 12:40:49.703  INFO {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=372 lookahead_limit=2000
Mar 12 12:40:49.703  INFO {zebrad="6473ed99" net="Test"}:sync:extend_tips: zebrad::components::sync: trying to extend chain tips tips=1
Mar 12 12:40:49.791  INFO {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=743 lookahead_limit=2000
...
Mar 12 12:40:49.962  INFO {zebrad="6473ed99" net="Test"}:sync:extend_tips: zebrad::components::sync: trying to extend chain tips tips=1
Mar 12 12:40:50.060  WARN {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: error downloading and verifying block e=Block(Block { source: AlreadyInChain(block::Hash("00015eebadb01f74e09c0f7cbd03673c61fa5689d9b3fef3a98930c623a26ca0"), 308) })
Mar 12 12:40:50.060  INFO {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: waiting to restart sync timeout=61s
Mar 12 12:41:25.515  INFO {zebrad="6473ed99" net="Test"}:peer{addr=127.0.0.1:38233}:msg_as_req{msg=getblocks}:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(499) response_len=500 chain_tip_height=Height(282491) stop_height=None intersection_height=None
Mar 12 12:41:51.063  INFO {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: starting sync, obtaining new tips
Mar 12 12:41:51.063  INFO {zebrad="6473ed99" net="Test"}:sync:obtain_tips:state: zebra_state::util: created block locator tip_height=Height(282491) min_locator_height=282392 locators=[Height(282491), Height(282490), Height(282489), Height(282487), Height(282483), Height(282475), Height(282459), Height(282427), Height(282392)]
Mar 12 12:41:51.063  INFO {zebrad="6473ed99" net="Test"}:sync:obtain_tips: zebrad::components::sync: trying to obtain new chain tips tip=block::Hash("0013da35d7196632f70cfa9fcc6e22906db20d21430ec08cfd2ba231bfd594b4")
Mar 12 12:41:51.063  INFO {zebrad="6473ed99" net="Test"}:sync:obtain_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 1 peers to answer requests
Mar 12 12:43:22.597  WARN {zebrad="6473ed99" net="Test"}:sync:obtain_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Mar 12 12:44:37.597  WARN {zebrad="6473ed99" net="Test"}:sync:obtain_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection

The text was updated successfully, but these errors were encountered:

teor2345 · 2021-03-23T06:10:01Z

It looks like Zebra gets into a state where it stops crawling and dialing, even though there are many never_attempted peers, many attempt_pending peers, and some recently_stopped_responding (disconnected) peers.

This might be happening because attempt_pending peers never timeout, and the queue is full. It might also be happening because the crawler, candidate set, or some other queue is never read.

That makes this ticket a potential blocker for a bunch of network security fixes - because the fixes could make this problem much worse.

Mar 23 16:05:29.462 WARN {zebrad="837132a3" net="Test"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection address_metrics=AddressMetrics { responded: 3, never_attempted: 16, failed: 0, attempt_pending: 35, recently_live: 0, recently_stopped_responding: 3 }

teor2345 · 2021-03-23T06:15:56Z

This might be happening because attempt_pending peers never timeout, and the queue is full. It might also be happening because the crawler, candidate set, or some other queue is never read.

It looks like the attempt_pending peers never stop being pending. This could be a bug in the handshaker timeout, the error handling for the handshake timeout, or a queue that processes these requests or responses.

Mar 23 16:11:44.462 WARN {zebrad="837132a3" net="Test"}:crawl_and_dial: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection address_metrics=AddressMetrics { responded: 0, never_attempted: 5, failed: 0, attempt_pending: 49, recently_live: 0, recently_stopped_responding: 0 }

teor2345 · 2021-03-23T20:44:12Z

From another instance - it's definitely attempt_pending that's broken somehow.

And this issue might be more frequent than we thought - I've had it happen to 2 testnet instances in 24 hours.

Mar 24 06:41:59.496 WARN {zebrad="837132a3" net="Test"}:sync:obtain_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection address_metrics=AddressMetrics { responded: 0, never_attempted: 2, failed: 0, attempt_pending: 49, recently_live: 0, recently_stopped_responding: 0 }

teor2345 · 2021-03-25T01:29:32Z

Here is a metrics dump from a peer in this state.

I waited 5 minutes, and dumped the metrics again. There weren't any changes in any metric.

# metrics snapshot (ts=1616635686) (prometheus exposition format)
# TYPE sync_verified_block_count counter
sync_verified_block_count 282001

# TYPE peer_canceled counter
peer_canceled 831

# TYPE state_finalized_committed_block_count counter
state_finalized_committed_block_count 282001

# TYPE state_finalized_cumulative_transparent_prevouts counter
state_finalized_cumulative_transparent_prevouts 229206

# TYPE checkpoint_waiting_count counter
checkpoint_waiting_count 282368

# TYPE state_finalized_cumulative_sprout_nullifiers counter
state_finalized_cumulative_sprout_nullifiers 75200

# TYPE checkpoint_verified_block_count counter
checkpoint_verified_block_count 282001

# TYPE state_requests counter
state_requests{type="depth"} 507
state_requests{type="block"} 27
state_requests{type="commit_finalized_block"} 282001
state_requests{type="block_locator"} 1
state_requests{type="tip"} 1

# TYPE zcash_net_out_bytes_total counter
zcash_net_out_bytes_total{addr="217.160.142.49"} 163
zcash_net_out_bytes_total{addr="51.15.166.148"} 163
zcash_net_out_bytes_total{addr="3.112.190.48"} 235449
zcash_net_out_bytes_total{addr="127.0.0.1"} 18162107
zcash_net_out_bytes_total{addr="54.209.49.207"} 527767
zcash_net_out_bytes_total{addr="88.99.162.254"} 163
zcash_net_out_bytes_total{addr="138.68.230.245"} 163
zcash_net_out_bytes_total{addr="95.216.242.46"} 163

# TYPE sync_downloaded_block_count counter
sync_downloaded_block_count 282368

# TYPE zcash_net_in_bytes_total counter
zcash_net_in_bytes_total{addr="51.15.166.148"} 155
zcash_net_in_bytes_total{addr="138.68.230.245"} 153
zcash_net_in_bytes_total{addr="54.209.49.207"} 10862442
zcash_net_in_bytes_total{addr="3.112.190.48"} 19295
zcash_net_in_bytes_total{addr="88.99.162.254"} 153
zcash_net_in_bytes_total{addr="95.216.242.46"} 153
zcash_net_in_bytes_total{addr="127.0.0.1"} 655461517
zcash_net_in_bytes_total{addr="217.160.142.49"} 153

# TYPE zcash_net_in_messages counter
zcash_net_in_messages{command="getheaders",addr="54.209.49.207:18233"} 1
zcash_net_in_messages{command="getdata",addr="127.0.0.1:38233"} 27
zcash_net_in_messages{command="pong",addr="54.209.49.207:18233"} 10
zcash_net_in_messages{command="addr",addr="127.0.0.1:38233"} 27322
zcash_net_in_messages{command="getheaders",addr="3.112.190.48:18233"} 1
zcash_net_in_messages{command="ping",addr="127.0.0.1:38233"} 10
zcash_net_in_messages{command="ping",addr="3.112.190.48:18233"} 1
zcash_net_in_messages{command="block",addr="127.0.0.1:38233"} 282169
zcash_net_in_messages{command="pong",addr="127.0.0.1:38233"} 10
zcash_net_in_messages{command="getblocks",addr="127.0.0.1:38233"} 5
zcash_net_in_messages{command="ping",addr="54.209.49.207:18233"} 5
zcash_net_in_messages{command="addr",addr="54.209.49.207:18233"} 1
zcash_net_in_messages{command="getaddr",addr="127.0.0.1:38233"} 4
zcash_net_in_messages{command="inv",addr="54.209.49.207:18233"} 131
zcash_net_in_messages{command="block",addr="54.209.49.207:18233"} 4520
zcash_net_in_messages{command="inv",addr="3.112.190.48:18233"} 1
zcash_net_in_messages{command="inv",addr="127.0.0.1:38233"} 2109

# TYPE zcash_net_out_messages counter
zcash_net_out_messages{command="pong",addr="127.0.0.1:38233"} 10
zcash_net_out_messages{command="inv",addr="127.0.0.1:38233"} 5
zcash_net_out_messages{command="getaddr",addr="54.209.49.207:18233"} 1
zcash_net_out_messages{command="getdata",addr="127.0.0.1:38233"} 282172
zcash_net_out_messages{command="getblocks",addr="127.0.0.1:38233"} 2110
zcash_net_out_messages{command="pong",addr="3.112.190.48:18233"} 1
zcash_net_out_messages{command="ping",addr="127.0.0.1:38233"} 10
zcash_net_out_messages{command="getblocks",addr="54.209.49.207:18233"} 159
zcash_net_out_messages{command="addr",addr="127.0.0.1:38233"} 4
zcash_net_out_messages{command="getdata",addr="54.209.49.207:18233"} 4544
zcash_net_out_messages{command="ping",addr="3.112.190.48:18233"} 1
zcash_net_out_messages{command="headers",addr="54.209.49.207:18233"} 1
zcash_net_out_messages{command="pong",addr="54.209.49.207:18233"} 5
zcash_net_out_messages{command="getblocks",addr="3.112.190.48:18233"} 1
zcash_net_out_messages{command="ping",addr="54.209.49.207:18233"} 10
zcash_net_out_messages{command="headers",addr="3.112.190.48:18233"} 1
zcash_net_out_messages{command="getaddr",addr="127.0.0.1:38233"} 27322

# TYPE zebrad_build_info counter
zebrad_build_info{version="1.0.0-alpha.4"} 1

# TYPE state_finalized_cumulative_transactions counter
state_finalized_cumulative_transactions 378773

# TYPE state_finalized_cumulative_transparent_newouts counter
state_finalized_cumulative_transparent_newouts 776477

# TYPE state_finalized_cumulative_sapling_nullifiers counter
state_finalized_cumulative_sapling_nullifiers 1

# TYPE pool_num_ready gauge
pool_num_ready 0

# TYPE state_finalized_queued_block_count gauge
state_finalized_queued_block_count 0

# TYPE candidate_set_pending gauge
candidate_set_pending 49

# TYPE checkpoint_verified_height gauge
checkpoint_verified_height 282000

# TYPE state_finalized_queued_max_height gauge
state_finalized_queued_max_height NaN

# TYPE candidate_set_failed gauge
candidate_set_failed 0

# TYPE gossip_queued_block_count gauge
gossip_queued_block_count 1

# TYPE zcash_net_peers gauge
zcash_net_peers 0

# TYPE candidate_set_responded gauge
candidate_set_responded 0

# TYPE crawler_in_flight_handshakes gauge
crawler_in_flight_handshakes 50

# TYPE candidate_set_disconnected gauge
candidate_set_disconnected 0

# TYPE sync_prospective_tips_len gauge
sync_prospective_tips_len 1

# TYPE candidate_set_recently_live gauge
candidate_set_recently_live 0

# TYPE checkpoint_processing_next_height gauge
checkpoint_processing_next_height 282000

# TYPE candidate_set_gossiped gauge
candidate_set_gossiped 5

# TYPE checkpoint_queued_continuous_height gauge
checkpoint_queued_continuous_height 282367

# TYPE pool_num_unready gauge
pool_num_unready 0

# TYPE state_finalized_committed_block_height gauge
state_finalized_committed_block_height 281952

# TYPE checkpoint_queued_slots gauge
checkpoint_queued_slots 367

# TYPE sync_downloads_in_flight gauge
sync_downloads_in_flight 2000

# TYPE checkpoint_queued_max_height gauge
checkpoint_queued_max_height 282367

teor2345 · 2021-03-25T03:25:11Z

Things we could try to debug this issue:

log metrics for client request states whenever the PeerSet or CandidateSet hangs
add gauges for those metrics
log metrics for address timestamps after Refactor MetaAddr fields to enable security fixes #1849
wrap each task to gather metrics for polling times, so we know if a task is hung or always live
wrap each buffer to gather metrics for input, output, length, and capacity

Here's the substrate task polling monitor code:
https://github.com/paritytech/substrate/blob/b0667821e61f4790da84930b7cdb80fb20b48596/client/service/src/task_manager/prometheus_future.rs

teor2345 · 2021-03-25T03:29:56Z

We should also look for places where select! drops futures (or where select returns a future that we want to keep, but we discard that future as part of the match).

teor2345 mentioned this issue Mar 15, 2021

Log address book metrics when PeerSet or CandidateSet get in a bad state #1906

Merged

2 tasks

teor2345 mentioned this issue Mar 22, 2021

Re-Apply "Refactor connection.rs to make fail_with errors impossible" (double revert) #1817

Closed

3 tasks

mpguerra removed the S-needs-triage Status: A bug report needs triage label Mar 23, 2021

teor2345 added P-Medium and removed P-Low labels Mar 23, 2021

This was referenced Mar 25, 2021

Zebra lags behind the best tip on testnet #1941

Closed

Enable joinsplit and spend auth sighash verification #1940

Merged

Fix a deadlock between the crawler and dialer, and other hangs #1950

Merged

Document correctness of select! and Poll::Pending #1954

Merged

mpguerra added this to the 2021 Sprint 6 milestone Apr 6, 2021

oxarbitrage closed this as completed in #1950 Apr 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zebra gradually loses peer connections, until it has none left #1905

Zebra gradually loses peer connections, until it has none left #1905

teor2345 commented Mar 15, 2021

teor2345 commented Mar 23, 2021 •

edited

Loading

teor2345 commented Mar 23, 2021

teor2345 commented Mar 23, 2021

teor2345 commented Mar 25, 2021 •

edited

Loading

teor2345 commented Mar 25, 2021

teor2345 commented Mar 25, 2021

Zebra gradually loses peer connections, until it has none left #1905

Zebra gradually loses peer connections, until it has none left #1905

Comments

teor2345 commented Mar 15, 2021

teor2345 commented Mar 23, 2021 • edited Loading

teor2345 commented Mar 23, 2021

teor2345 commented Mar 23, 2021

teor2345 commented Mar 25, 2021 • edited Loading

teor2345 commented Mar 25, 2021

teor2345 commented Mar 25, 2021

teor2345 commented Mar 23, 2021 •

edited

Loading

teor2345 commented Mar 25, 2021 •

edited

Loading