Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zebra gradually loses peer connections, until it has none left #1905

Closed
teor2345 opened this issue Mar 15, 2021 · 6 comments · Fixed by #1950
Closed

Zebra gradually loses peer connections, until it has none left #1905

teor2345 opened this issue Mar 15, 2021 · 6 comments · Fixed by #1950
Labels
A-infrastructure Area: Infrastructure changes A-rust Area: Updates to Rust code C-bug Category: This is a bug I-hang A Zebra component stops responding to requests I-integration-fail Continuous integration fails, including build and test failures I-slow Problems with performance or responsiveness I-usability Zebra is hard to understand or use S-needs-investigation Status: Needs further investigation
Milestone

Comments

@teor2345
Copy link
Contributor

Version

zebrad 1.0.0-alpha.3 commit 6473ed99

Platform

Linux ... 5.4.96 #1-NixOS SMP Sun Feb 7 14:35:50 UTC 2021 x86_64 GNU/Linux

Description

Sometimes, on testnet, Zebra gradually loses peers, until it has none left.

Sometimes the peers are lost after Zebra disconnects them due to block download or verify errors. Other times, the peers just seem to disconnect themselves.

Zebra should try to reconnect to these peers, but it doesn't seem to get any peers back.

Maybe Zebra isn't trying to connect. Or maybe the connections keep failing. Zebra might be triggering zcashd's peer blocklist, due to bugs like #1848. Or there could be a peer address or connection state handling issue in Zebra's network stack.

This issue doesn't seem to happen that often (I've only seen it once in weeks of local testing). Our current reliability standard of 2 weeks continuous runtime: Zebra easily achieves that standard on mainnet, and mostly achieves that standard on testnet.

Therefore, this issue is a low priority, until it affects CI or other node deployments.

Related Tickets

Fixing #1904 will get Zebra better peers from the DNS seeders on testnet
Fixing #1848 and other network security issues will make Zebra's network stack more reliable

This ticket might block #1791, if it happens often enough in our CI

Commands

zebrad start, configured to use testnet with the DNS seeders and a few local peers.

Logs

The relevant logs are:

Mar 12 12:20:52.596  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: connecting to initial peer set initial_peers={120.77.82.190:18233, 91.121.88.52:18233, 54.209.49.207:18233, 37.59.57.96:18233, 157.245.113.22:18233, 209.141.47.197:18233, 127.0.0.1:38233, 192.168.215.106:58233, 127.0.0.1:58233, 192.168.215.108:38233, 51.15.166.148:18233, 47.98.224.31:18233, 13.115.17.0:18233, 192.168.215.106:38233, 95.216.242.46:18233, 138.197.216.253:18233, 3.112.190.48:18233, 192.168.215.108:58233, 101.37.77.121:18233, 217.160.142.49:18233, 198.199.112.230:18233, 138.68.230.245:18233, 88.99.162.254:18233, 13.228.164.120:18233}
Mar 12 12:20:52.596  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=ConnectionClosed
Mar 12 12:20:52.957  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=Os { code: 111, kind: ConnectionRefused, message: "Connection refused" }
Mar 12 12:20:53.340  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=ObsoleteVersion(Version(170008))
Mar 12 12:20:53.612  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=ObsoleteVersion(Version(170007))
Mar 12 12:20:53.645  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=ObsoleteVersion(Version(170009))
Mar 12 12:20:55.568  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=Os { code: 113, kind: Other, message: "No route to host" }
Mar 12 12:20:56.597  INFO {zebrad="6473ed99" net="Test"}:add_initial_peers: zebra_network::peer_set::initialize: an initial peer connection failed e=Elapsed(())
...
Mar 12 12:21:53.815  INFO {zebrad="6473ed99" net="Test"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 4 peers to answer requests
Mar 12 12:22:53.815  INFO {zebrad="6473ed99" net="Test"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 3 peers to answer requests
Mar 12 12:29:38.849  INFO {zebrad="6473ed99" net="Test"}:sync:extend_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 2 peers to answer requests
Mar 12 12:39:48.600  WARN {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: error downloading and verifying block e=Elapsed(())
Mar 12 12:40:49.606  INFO {zebrad="6473ed99" net="Test"}:sync:obtain_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 1 peers to answer requests
Mar 12 12:40:49.703  INFO {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=372 lookahead_limit=2000
Mar 12 12:40:49.703  INFO {zebrad="6473ed99" net="Test"}:sync:extend_tips: zebrad::components::sync: trying to extend chain tips tips=1
Mar 12 12:40:49.791  INFO {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=743 lookahead_limit=2000
...
Mar 12 12:40:49.962  INFO {zebrad="6473ed99" net="Test"}:sync:extend_tips: zebrad::components::sync: trying to extend chain tips tips=1
Mar 12 12:40:50.060  WARN {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: error downloading and verifying block e=Block(Block { source: AlreadyInChain(block::Hash("00015eebadb01f74e09c0f7cbd03673c61fa5689d9b3fef3a98930c623a26ca0"), 308) })
Mar 12 12:40:50.060  INFO {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: waiting to restart sync timeout=61s
Mar 12 12:41:25.515  INFO {zebrad="6473ed99" net="Test"}:peer{addr=127.0.0.1:38233}:msg_as_req{msg=getblocks}:state: zebra_state::service: responding to peer GetBlocks or GetHeaders final_height=Height(499) response_len=500 chain_tip_height=Height(282491) stop_height=None intersection_height=None
Mar 12 12:41:51.063  INFO {zebrad="6473ed99" net="Test"}:sync: zebrad::components::sync: starting sync, obtaining new tips
Mar 12 12:41:51.063  INFO {zebrad="6473ed99" net="Test"}:sync:obtain_tips:state: zebra_state::util: created block locator tip_height=Height(282491) min_locator_height=282392 locators=[Height(282491), Height(282490), Height(282489), Height(282487), Height(282483), Height(282475), Height(282459), Height(282427), Height(282392)]
Mar 12 12:41:51.063  INFO {zebrad="6473ed99" net="Test"}:sync:obtain_tips: zebrad::components::sync: trying to obtain new chain tips tip=block::Hash("0013da35d7196632f70cfa9fcc6e22906db20d21430ec08cfd2ba231bfd594b4")
Mar 12 12:41:51.063  INFO {zebrad="6473ed99" net="Test"}:sync:obtain_tips: zebra_network::peer_set::set: network request with no ready peers: finding more peers, waiting for 1 peers to answer requests
Mar 12 12:43:22.597  WARN {zebrad="6473ed99" net="Test"}:sync:obtain_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
Mar 12 12:44:37.597  WARN {zebrad="6473ed99" net="Test"}:sync:obtain_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection
@teor2345 teor2345 added C-bug Category: This is a bug A-rust Area: Updates to Rust code A-infrastructure Area: Infrastructure changes S-needs-triage Status: A bug report needs triage S-needs-investigation Status: Needs further investigation P-Low I-hang A Zebra component stops responding to requests I-slow Problems with performance or responsiveness I-integration-fail Continuous integration fails, including build and test failures I-usability Zebra is hard to understand or use labels Mar 15, 2021
@teor2345
Copy link
Contributor Author

teor2345 commented Mar 23, 2021

It looks like Zebra gets into a state where it stops crawling and dialing, even though there are many never_attempted peers, many attempt_pending peers, and some recently_stopped_responding (disconnected) peers.

This might be happening because attempt_pending peers never timeout, and the queue is full. It might also be happening because the crawler, candidate set, or some other queue is never read.

That makes this ticket a potential blocker for a bunch of network security fixes - because the fixes could make this problem much worse.

Mar 23 16:05:29.462 WARN {zebrad="837132a3" net="Test"}:sync:extend_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection address_metrics=AddressMetrics { responded: 3, never_attempted: 16, failed: 0, attempt_pending: 35, recently_live: 0, recently_stopped_responding: 3 }

@teor2345
Copy link
Contributor Author

This might be happening because attempt_pending peers never timeout, and the queue is full. It might also be happening because the crawler, candidate set, or some other queue is never read.

It looks like the attempt_pending peers never stop being pending. This could be a bug in the handshaker timeout, the error handling for the handshake timeout, or a queue that processes these requests or responses.

Mar 23 16:11:44.462 WARN {zebrad="837132a3" net="Test"}:crawl_and_dial: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection address_metrics=AddressMetrics { responded: 0, never_attempted: 5, failed: 0, attempt_pending: 49, recently_live: 0, recently_stopped_responding: 0 }

@mpguerra mpguerra removed the S-needs-triage Status: A bug report needs triage label Mar 23, 2021
@teor2345
Copy link
Contributor Author

From another instance - it's definitely attempt_pending that's broken somehow.

And this issue might be more frequent than we thought - I've had it happen to 2 testnet instances in 24 hours.

Mar 24 06:41:59.496 WARN {zebrad="837132a3" net="Test"}:sync:obtain_tips: zebra_network::peer_set::set: network request with no peer connections. Hint: check your network connection address_metrics=AddressMetrics { responded: 0, never_attempted: 2, failed: 0, attempt_pending: 49, recently_live: 0, recently_stopped_responding: 0 }

@teor2345 teor2345 added P-Medium and removed P-Low labels Mar 23, 2021
@teor2345
Copy link
Contributor Author

teor2345 commented Mar 25, 2021

Here is a metrics dump from a peer in this state.

I waited 5 minutes, and dumped the metrics again. There weren't any changes in any metric.

# metrics snapshot (ts=1616635686) (prometheus exposition format)
# TYPE sync_verified_block_count counter
sync_verified_block_count 282001

# TYPE peer_canceled counter
peer_canceled 831

# TYPE state_finalized_committed_block_count counter
state_finalized_committed_block_count 282001

# TYPE state_finalized_cumulative_transparent_prevouts counter
state_finalized_cumulative_transparent_prevouts 229206

# TYPE checkpoint_waiting_count counter
checkpoint_waiting_count 282368

# TYPE state_finalized_cumulative_sprout_nullifiers counter
state_finalized_cumulative_sprout_nullifiers 75200

# TYPE checkpoint_verified_block_count counter
checkpoint_verified_block_count 282001

# TYPE state_requests counter
state_requests{type="depth"} 507
state_requests{type="block"} 27
state_requests{type="commit_finalized_block"} 282001
state_requests{type="block_locator"} 1
state_requests{type="tip"} 1

# TYPE zcash_net_out_bytes_total counter
zcash_net_out_bytes_total{addr="217.160.142.49"} 163
zcash_net_out_bytes_total{addr="51.15.166.148"} 163
zcash_net_out_bytes_total{addr="3.112.190.48"} 235449
zcash_net_out_bytes_total{addr="127.0.0.1"} 18162107
zcash_net_out_bytes_total{addr="54.209.49.207"} 527767
zcash_net_out_bytes_total{addr="88.99.162.254"} 163
zcash_net_out_bytes_total{addr="138.68.230.245"} 163
zcash_net_out_bytes_total{addr="95.216.242.46"} 163

# TYPE sync_downloaded_block_count counter
sync_downloaded_block_count 282368

# TYPE zcash_net_in_bytes_total counter
zcash_net_in_bytes_total{addr="51.15.166.148"} 155
zcash_net_in_bytes_total{addr="138.68.230.245"} 153
zcash_net_in_bytes_total{addr="54.209.49.207"} 10862442
zcash_net_in_bytes_total{addr="3.112.190.48"} 19295
zcash_net_in_bytes_total{addr="88.99.162.254"} 153
zcash_net_in_bytes_total{addr="95.216.242.46"} 153
zcash_net_in_bytes_total{addr="127.0.0.1"} 655461517
zcash_net_in_bytes_total{addr="217.160.142.49"} 153

# TYPE zcash_net_in_messages counter
zcash_net_in_messages{command="getheaders",addr="54.209.49.207:18233"} 1
zcash_net_in_messages{command="getdata",addr="127.0.0.1:38233"} 27
zcash_net_in_messages{command="pong",addr="54.209.49.207:18233"} 10
zcash_net_in_messages{command="addr",addr="127.0.0.1:38233"} 27322
zcash_net_in_messages{command="getheaders",addr="3.112.190.48:18233"} 1
zcash_net_in_messages{command="ping",addr="127.0.0.1:38233"} 10
zcash_net_in_messages{command="ping",addr="3.112.190.48:18233"} 1
zcash_net_in_messages{command="block",addr="127.0.0.1:38233"} 282169
zcash_net_in_messages{command="pong",addr="127.0.0.1:38233"} 10
zcash_net_in_messages{command="getblocks",addr="127.0.0.1:38233"} 5
zcash_net_in_messages{command="ping",addr="54.209.49.207:18233"} 5
zcash_net_in_messages{command="addr",addr="54.209.49.207:18233"} 1
zcash_net_in_messages{command="getaddr",addr="127.0.0.1:38233"} 4
zcash_net_in_messages{command="inv",addr="54.209.49.207:18233"} 131
zcash_net_in_messages{command="block",addr="54.209.49.207:18233"} 4520
zcash_net_in_messages{command="inv",addr="3.112.190.48:18233"} 1
zcash_net_in_messages{command="inv",addr="127.0.0.1:38233"} 2109

# TYPE zcash_net_out_messages counter
zcash_net_out_messages{command="pong",addr="127.0.0.1:38233"} 10
zcash_net_out_messages{command="inv",addr="127.0.0.1:38233"} 5
zcash_net_out_messages{command="getaddr",addr="54.209.49.207:18233"} 1
zcash_net_out_messages{command="getdata",addr="127.0.0.1:38233"} 282172
zcash_net_out_messages{command="getblocks",addr="127.0.0.1:38233"} 2110
zcash_net_out_messages{command="pong",addr="3.112.190.48:18233"} 1
zcash_net_out_messages{command="ping",addr="127.0.0.1:38233"} 10
zcash_net_out_messages{command="getblocks",addr="54.209.49.207:18233"} 159
zcash_net_out_messages{command="addr",addr="127.0.0.1:38233"} 4
zcash_net_out_messages{command="getdata",addr="54.209.49.207:18233"} 4544
zcash_net_out_messages{command="ping",addr="3.112.190.48:18233"} 1
zcash_net_out_messages{command="headers",addr="54.209.49.207:18233"} 1
zcash_net_out_messages{command="pong",addr="54.209.49.207:18233"} 5
zcash_net_out_messages{command="getblocks",addr="3.112.190.48:18233"} 1
zcash_net_out_messages{command="ping",addr="54.209.49.207:18233"} 10
zcash_net_out_messages{command="headers",addr="3.112.190.48:18233"} 1
zcash_net_out_messages{command="getaddr",addr="127.0.0.1:38233"} 27322

# TYPE zebrad_build_info counter
zebrad_build_info{version="1.0.0-alpha.4"} 1

# TYPE state_finalized_cumulative_transactions counter
state_finalized_cumulative_transactions 378773

# TYPE state_finalized_cumulative_transparent_newouts counter
state_finalized_cumulative_transparent_newouts 776477

# TYPE state_finalized_cumulative_sapling_nullifiers counter
state_finalized_cumulative_sapling_nullifiers 1

# TYPE pool_num_ready gauge
pool_num_ready 0

# TYPE state_finalized_queued_block_count gauge
state_finalized_queued_block_count 0

# TYPE candidate_set_pending gauge
candidate_set_pending 49

# TYPE checkpoint_verified_height gauge
checkpoint_verified_height 282000

# TYPE state_finalized_queued_max_height gauge
state_finalized_queued_max_height NaN

# TYPE candidate_set_failed gauge
candidate_set_failed 0

# TYPE gossip_queued_block_count gauge
gossip_queued_block_count 1

# TYPE zcash_net_peers gauge
zcash_net_peers 0

# TYPE candidate_set_responded gauge
candidate_set_responded 0

# TYPE crawler_in_flight_handshakes gauge
crawler_in_flight_handshakes 50

# TYPE candidate_set_disconnected gauge
candidate_set_disconnected 0

# TYPE sync_prospective_tips_len gauge
sync_prospective_tips_len 1

# TYPE candidate_set_recently_live gauge
candidate_set_recently_live 0

# TYPE checkpoint_processing_next_height gauge
checkpoint_processing_next_height 282000

# TYPE candidate_set_gossiped gauge
candidate_set_gossiped 5

# TYPE checkpoint_queued_continuous_height gauge
checkpoint_queued_continuous_height 282367

# TYPE pool_num_unready gauge
pool_num_unready 0

# TYPE state_finalized_committed_block_height gauge
state_finalized_committed_block_height 281952

# TYPE checkpoint_queued_slots gauge
checkpoint_queued_slots 367

# TYPE sync_downloads_in_flight gauge
sync_downloads_in_flight 2000

# TYPE checkpoint_queued_max_height gauge
checkpoint_queued_max_height 282367

@teor2345
Copy link
Contributor Author

Things we could try to debug this issue:

  • log metrics for client request states whenever the PeerSet or CandidateSet hangs
  • add gauges for those metrics
  • log metrics for address timestamps after Refactor MetaAddr fields to enable security fixes #1849
  • wrap each task to gather metrics for polling times, so we know if a task is hung or always live
  • wrap each buffer to gather metrics for input, output, length, and capacity

Here's the substrate task polling monitor code:
https://github.com/paritytech/substrate/blob/b0667821e61f4790da84930b7cdb80fb20b48596/client/service/src/task_manager/prometheus_future.rs

@teor2345
Copy link
Contributor Author

We should also look for places where select! drops futures (or where select returns a future that we want to keep, but we discard that future as part of the match).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-infrastructure Area: Infrastructure changes A-rust Area: Updates to Rust code C-bug Category: This is a bug I-hang A Zebra component stops responding to requests I-integration-fail Continuous integration fails, including build and test failures I-slow Problems with performance or responsiveness I-usability Zebra is hard to understand or use S-needs-investigation Status: Needs further investigation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants