Security: only apply the outbound connection rate-limit to actual connections #2278
Conversation
Previously, we were advancing the timer even when we returned `None`. This created large wait times when there were no eligible peers.
force-pushed from e8e2f23 to 05a3673
Is that proptest a blocker to merging this?
Also refactor peer numbers into constants.
Also cleanup the test constants.
This is my third attempt at implementing/reviewing this rate-limit, so I need better tests for it. I have an idea for one more proptest, but it needs to be done after PR #2275 merges to avoid conflicts. The existing proptests put pretty strict bounds on the wait time, so I don't think we need extra tests any time soon.
macOS VMs seem to need this extra time to pass their tests.
This change avoids test failures due to cumulative errors. Also use a single call to `Instant::now` for each test round. And print the times when the tests fail.
The candidate set proptests will fail if enough generated peers are invalid for outbound connections.
cc e27ea62986a1ace4e16dde4b99d8ae8557a6b1e92e3f9a7c6aba0d7b179eb817 # shrinks to peers = [MetaAddr { addr: 0.0.0.0:0, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 0.0.0.0:0, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 0.0.0.0:1151, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 129.65.108.101:9019, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 95.68.199.21:46887, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 167.206.253.131:43927, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 160.138.155.239:63430, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: [6a4e:a543:8c93:e074:8a2a:5cc4:96f8:ce95]:51071, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623572152, calendar: 2021-06-13T08:15:52Z }, last_connection_state: NeverAttemptedAlternate }], initial_candidates = 0, extra_candidates = 2
cc 788fe2c47a4af559e9a921764d1ce8fcfd0743361c19e6e7a06adc093bc19078 # shrinks to peers = [MetaAddr { addr: [::e9:8050:dbf7:5665]:27188, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 106.27.232.177:51742, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 127.0.0.1:30827, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 0.0.0.0:21948, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 0.0.0.0:38186, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: [4db2:96e6:38c5:fd62:e964:9338:4ce8:bfa0]:32686, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 0.0.0.0:22952, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }, MetaAddr { addr: 114.164.73.233:10203, services: NODE_NETWORK, last_seen: DateTime32 { timestamp: 1623573029, calendar: 2021-06-13T08:30:29Z }, last_connection_state: NeverAttemptedAlternate }], initial_candidates = 3, extra_candidates = 3
Addresses containing any of the following are invalid for outbound connections:
- IPv4: 0.0.0.0
- IPv6: ::
- Port: 0
I did about 5000 runs of this test overnight; it seems like the generated test cases are (almost) all valid now.
Coverage failed with:
This seems to be happening a bit lately. It would be good to wait between retries, but there are no retry or timeout options on the action, so this likely needs a …
pub fn alternate_strategy() -> BoxedStrategy<Self> {
    (canonical_socket_addr(), any::<PeerServices>())
        .prop_map(|(socket_addr, services)| MetaAddr::new_alternate(&socket_addr, &services))
        .boxed()
}
Fix the alternate strategy so it returns the full range of possible service bits. (Node and client.)
//
// TODO: create a "Zebra supported services" constant
let meta_addr = MetaAddr::new_alternate(&address, &PeerServices::NODE_NETWORK);
if meta_addr.is_valid_for_outbound() {
Create a specific strategy for valid and ready outbound peers.
This was the missing check in the previous code:
if meta_addr.is_valid_for_outbound() {
// SECURITY: rate-limit new outbound peer connections
(&mut self.wait_next_handshake).await;
let mut sleep = sleep(constants::MIN_PEER_CONNECTION_INTERVAL);
mem::swap(&mut self.wait_next_handshake, &mut sleep);
Two fixes:
- only wait if we're returning `Some`
- schedule the next sleep time after the previous sleep returns, avoiding issues with overlapping times
// Allow enough time for the maximum number of candidates,
// plus some extra time for test machines with high CPU load.
// (The max delay asserts usually fail before hitting this timeout.)
let max_rate_limit_sleep = 3 * MAX_TEST_CANDIDATES * MIN_PEER_CONNECTION_INTERVAL;
let max_extra_delay = (2 * MAX_TEST_CANDIDATES + 1) * MAX_SLEEP_EXTRA_DELAY;
assert!(runtime.block_on(timeout(max_rate_limit_sleep + max_extra_delay, checks)).is_ok());
Increase the test timeout to match any increases in the candidates or times.
(Currently it's 10.2 seconds rather than 10 seconds!)
Also use named constants rather than integer literals.
let mut now = Instant::now();
let mut minimum_reconnect_instant = now;
// Allow extra time for test machines with high CPU load
let mut maximum_reconnect_instant = now + MAX_SLEEP_EXTRA_DELAY;

for _ in 0..candidates {
    minimum_reconnect_instant += MIN_PEER_CONNECTION_INTERVAL;
    assert!(
        candidate_set.next().await.is_some(),
        "there are enough available candidates"
    );

    now = Instant::now();
    assert!(
        now >= minimum_reconnect_instant,
        "all candidates should obey the minimum rate-limit: now: {:?} min: {:?}",
        now,
        minimum_reconnect_instant,
    );
    assert!(
        now <= maximum_reconnect_instant,
        "rate-limited candidates should not be delayed too long: now: {:?} max: {:?}. Hint: is the test machine overloaded?",
        now,
        maximum_reconnect_instant,
    );

    minimum_reconnect_instant = now + MIN_PEER_CONNECTION_INTERVAL;
    maximum_reconnect_instant = now + MIN_PEER_CONNECTION_INTERVAL + MAX_SLEEP_EXTRA_DELAY;
}
Two test changes:
- add a maximum time that's a second after the minimum time (unfortunately macOS tends to fail on shorter times)
- get the value of `now` once per loop iteration
/// TODO: after PR #2275 merges, add a similar proptest,
/// using a "not ready for attempt" peer generation strategy
#[test]
fn skipping_outbound_peer_connection_skips_rate_limit(next_peer_attempts in 0..TEST_ADDRESSES) {
The `skipping_outbound_peer_connection_skips_rate_limit` test checks the bug fix in the code.
(The other test is useful for making sure we're not delaying way too long when we actually have peers, but it can't be too strict due to OS scheduling delays.)
…nections (#2278)

* Only advance the outbound connection timer when it returns an address. Previously, we were advancing the timer even when we returned `None`. This created large wait times when there were no eligible peers.
* Refactor to avoid overlapping sleep timers
* Add a maximum next peer delay test. Also refactor peer numbers into constants.
* Make the number of proptests overridable by the standard env var. Also cleanup the test constants.
* Test that skipping peer connections also skips their rate limits
* Allow an extra second after each sleep on loaded machines. macOS VMs seem to need this extra time to pass their tests.
* Restart test time bounds from the current time. This change avoids test failures due to cumulative errors. Also use a single call to `Instant::now` for each test round. And print the times when the tests fail.
* Stop generating invalid outbound peers in proptests. The candidate set proptests will fail if enough generated peers are invalid for outbound connections.
Motivation
Zebra advances the next peer timer even when there are no eligible peers (and the method is about to return `None`). This creates large peer wait times when there are no eligible peers, which can be a significant issue on testnet.
Solution
Review
@jvff can review this PR.
Reviewer Checklist