Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Security: Spawn a separate task for each initial handshake, Credit: Equilibrium #2181

Closed
wants to merge 7 commits into from

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented May 21, 2021

Motivation

This fix prevents hangs and deadlocks during initialization, particularly when there are a small number of valid peers in the initial peer config (or from the DNS seeders).

It also cleans up some variable names and error handling, so it's easier to understand what the code is doing.

Reported by Equilibrium.

Solution

  • Spawn a separate task for each initial handshake, using the existing dial function
    • Sleep for MIN_PEER_CONNECTION_INTERVAL between each handshake
  • Correctly use MIN_PEER_CONNECTION_INTERVAL in CandidateSet
    • Previously, we would allow a lot of connections if we hadn't connected for a while

Also:

  • rename some variables for clarity
  • provide errors as part of the HandshakeFailed crawler action
  • add extra progress logging to help diagnose hangs

The code in this pull request has:

  • Documentation Comments
  • Integration Tests

Review

I'd like @dconnolly to review this security fix. Or @jvff might like to take a look.

It's blocking the rest of #2160, so it would be great to have it reviewed before my Monday.

Related Issues

Part of #2163

Blocks #2160

Follow Up Work

Rest of #2160

@teor2345 teor2345 added C-bug Category: This is a bug A-rust Area: Updates to Rust code P-Medium C-security Category: Security issues I-hang A Zebra component stops responding to requests I-slow Problems with performance or responsiveness I-usability Zebra is hard to understand or use labels May 21, 2021
@teor2345 teor2345 requested review from dconnolly and jvff May 21, 2021 03:38
@teor2345 teor2345 self-assigned this May 21, 2021
@teor2345
Copy link
Contributor Author

@dconnolly / @jvff these changes could be causing the macOS failures I'm seeing on #2160.

So I've marked macOS stable as a required CI check. We can go back to just using Linux once #2160 is all split up and merged.

@teor2345
Copy link
Contributor Author

So I've marked macOS stable as a required CI check. We can go back to just using Linux once #2160 is all split up and merged.

Thinking about this some more, we might want to leave macOS enabled.

Linux and Windows have the larger network tests turned off, because they were unreliable in CI. (After NU5, we should check if they're still unreliable - maybe all our fixes have made them better.)

teor2345 added 4 commits May 21, 2021 17:10
This fix prevents hangs and deadlocks during initialization, particularly
when there are a small number of valid peers in the initial peer config
(or from the DNS seeders).
Previously, if we hadn't had a connection for a while, we'd allow a lot
of connections all at once, until we'd caught up.
This prevents denial of service if the local network is constrained, and
the seeders return a large number of peers.
@teor2345 teor2345 force-pushed the initial-handshake-task branch from 52850e1 to 9328a88 Compare May 21, 2021 07:19
// next time off our sleep time. Otherwise, use the current time.
self.next_peer_sleep_until = max(self.next_peer_sleep_until, Instant::now());
self.next_peer_sleep_until += constants::MIN_PEER_CONNECTION_INTERVAL;
sleep_until(current_deadline).await;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: Correctly use MIN_PEER_CONNECTION_INTERVAL in CandidateSet

  • Make sure the next deadline is at least MIN_PEER_CONNECTION_INTERVAL in the future
  • Replace the Sleep future with an Instant

Comment on lines +240 to +309
// Each `FuturesUnordered` can hold one `Buffer` reservation for an
// indefinite period. We can use `FuturesUnordered` without filling the
// underlying `Handshake` buffer, because we immediately drive this single
// `FuturesUnordered` to completion, and handshakes have a short timeout.
let mut handshakes = FuturesUnordered::new();
// TODO: replace with *success_count_tx.borrow() in Tokio 1.6
let mut success_count = 0;
for candidate in initial_meta_addr {
// 1. Spawn a task to handshake with a new peer
let next_peer_sleep_until = Instant::now() + constants::MIN_PEER_CONNECTION_INTERVAL;
let hs_join = tokio::spawn(dial(candidate, outbound_connector.clone()))
.map(move |res| match res {
Ok(crawler_action) => crawler_action,
Err(e) => {
panic!(
"panic during initial handshake with {:?}: {:?} ",
candidate, e
);
}
})
.instrument(Span::current());
handshakes.push(Box::pin(hs_join));

// 2. Check if any peers have finished their handshakes
// If we've had enough successes, wait for all the handshakes to finish
while let Some(handshake_action) = if success_count < peerset_initial_target_size {
handshakes.next().now_or_never().flatten()
} else {
handshakes.next().await
} {
use CrawlerAction::*;
match handshake_action {
HandshakeConnected { peer_set_change } => {
success_count += 1;
if let Change::Insert(ref addr, _) = peer_set_change {
debug!(
?addr,
?success_count,
?peerset_initial_target_size,
"successfully dialed initial peer"
);
} else {
unreachable!("unexpected handshake result: all changes should be Insert");
}
// the peer set is handled by an independent task, so this send
// shouldn't hang
peerset_tx.send(Ok(peer_set_change)).await?;
}
HandshakeFailed { failed_addr, error } => {
if success_count <= constants::GET_ADDR_FANOUT {
// this creates verbose logs, but it's better than just hanging on
// startup with no output
info!(addr = ?failed_addr.addr,
?error,
?success_count,
?peerset_initial_target_size,
"an initial peer connection failed");
} else {
// switch to debug when we have enough peers
debug!(addr = ?failed_addr.addr,
?error,
?success_count,
?peerset_initial_target_size,
"an initial peer connection failed");
}
}
DemandCrawl | DemandDrop | DemandHandshake { .. } | TimerCrawl { .. } => {
unreachable!("unexpected CrawlerAction: should be handshake result")
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: Spawn a separate task for each initial handshake

Comment on lines +312 to +315
// 3. If we haven't had enough successes, sleep for a while
if success_count < peerset_initial_target_size {
tokio::time::sleep_until(next_peer_sleep_until).await;
} else {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: Limit the initial peer connection rate, part 2

let mut success_count = 0;
for candidate in initial_meta_addr {
// 1. Spawn a task to handshake with a new peer
let next_peer_sleep_until = Instant::now() + constants::MIN_PEER_CONNECTION_INTERVAL;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security: Limit the initial peer connection rate, part 1

Comment on lines +265 to +269
while let Some(handshake_action) = if success_count < peerset_initial_target_size {
handshakes.next().now_or_never().flatten()
} else {
handshakes.next().await
} {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code isn't pretty, but it's hard to refactor, because it has a lot of futures and channels.

zebra-network/src/constants.rs Show resolved Hide resolved
zebra-network/src/constants.rs Show resolved Hide resolved
zebra-network/src/meta_addr.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@jvff jvff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this approved! 🎉

(I didn't mark it as approved to avoid auto-merging)

zebra-network/src/peer_set/initialize.rs Show resolved Hide resolved
zebra-network/src/peer_set/initialize.rs Show resolved Hide resolved
zebra-network/src/peer_set/initialize.rs Show resolved Hide resolved
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
@teor2345 teor2345 marked this pull request as draft May 22, 2021 09:35
@teor2345
Copy link
Contributor Author

#2160 removes initial handshakes entirely, and just sends all the seed addresses to the candidate set.

Then there is a bit of special handling for initial crawls in the crawler.

So this PR isn't needed any more.

@teor2345 teor2345 closed this May 25, 2021
@teor2345 teor2345 deleted the initial-handshake-task branch March 21, 2022 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-network Area: Network protocol updates or fixes A-rust Area: Updates to Rust code C-bug Category: This is a bug C-security Category: Security issues I-hang A Zebra component stops responding to requests I-slow Problems with performance or responsiveness I-usability Zebra is hard to understand or use
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants