Defines Retrier polling waiting time as a constant and fixes tests #184
Conversation
I saw some instances in the tests that wait on API_DELAY instead of POLLING_TIME.
Funny tho, the instance linked above actually waits zero seconds, yet the test succeeds every time (without ever waiting for the polling time duration).
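For context, a minimal, self-contained sketch of why that wait comes out to zero seconds: the sleep duration is computed as `(API_DELAY / 2.0) as u64`, and the cast truncates toward zero. The `API_DELAY` value below is a hypothetical stand-in for illustration (the zero-second observation above implies the real value is below 2.0).

fn main() {
    // Hypothetical stand-in for the test's API_DELAY constant; illustrative only.
    const API_DELAY: f64 = 1.5;

    // `as u64` truncates toward zero, so halving any API_DELAY below 2.0
    // yields a zero-second sleep duration.
    let secs = (API_DELAY / 2.0) as u64;
    assert_eq!(secs, 0);
    println!("sleep duration: {secs}s");
}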
watchtower-plugin/src/retrier.rs (Outdated)
@@ -16,6 +16,8 @@
 use crate::net::http::{self, AddAppointmentError};
 use crate::wt_client::{RevocationData, WTClient};
 use crate::{MisbehaviorProof, TowerStatus};
+
+const POOLING_TIME: u64 = 1;
typo: I guess you meant POLLING here.
Oh yeah, I corrected that on the other PR and forgot about it here, let me see if I can patch it without breaking it 😅
More on why instances with API_DELAY never failed in the CI:

// Init the manager
let task = tokio::spawn(async move {
    RetryManager::new(
        wt_client_clone,
        rx,
        MAX_ELAPSED_TIME,
        LONG_AUTO_RETRY_DELAY,
        MAX_INTERVAL_TIME,
    )
    .manage_retry()
    .await
});
// Send the data
tx.send((tower_id, RevocationData::Fresh(appointment.locator)))
    .unwrap();
// Wait for the elapsed time and check how the tower status changed
tokio::time::sleep(Duration::from_secs((API_DELAY / 2.0) as u64)).await;
// Check
assert!(wt_client
    .lock()
    .unwrap()
    .get_retrier_status(&tower_id)
    .unwrap()
    .is_running());

It looks like that […]. To make the test fail you can try adding a little sleep before […]. We can reduce the number of cases where we sleep […].
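As an aside, a minimal sketch of how the intended fraction-of-API_DELAY wait could be expressed without the integer truncation, using Duration::from_secs_f64 as the later revisions of this diff do. The API_DELAY value is an illustrative placeholder and tokio is assumed as a dependency; this is not the actual test code.

use std::time::Duration;

// Illustrative stand-in; the real constant lives in the test module.
const API_DELAY: f64 = 1.5;

#[tokio::main]
async fn main() {
    // from_secs_f64 keeps the fractional seconds instead of truncating them to 0.
    tokio::time::sleep(Duration::from_secs_f64(API_DELAY / 2.0)).await;
    println!("waited {:.2}s", API_DELAY / 2.0);
}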
Yes! That's it! Looks like I can finally make the timings make proper sense. Awesome catch 🎉 🎉 🎉
Some nits, LGTM otherwise :)
watchtower-plugin/src/retrier.rs (Outdated)
// Wait for the remaining time and re-check (giving this is a failure we need to check for the next polling cycle)
tokio::time::sleep(Duration::from_secs_f64(
    HALF_API_DELAY + MAX_RUN_TIME + POLLING_TIME as f64,
))
.await;
Why do we need to wait POLLING_TIME in failure cases?
Okaay, so assert!(!wt_client.lock().unwrap().retriers.contains_key(&tower_id)); will fail without it, since we process retriers every POLLING_TIME.

nit: We can split this sleep into two for clarity: one here with HALF_API_DELAY + MAX_RUN_TIME, and another one just above assert!(!wt_client.lock().unwrap().retriers.contains_key(&tower_id)); with only POLLING_TIME.
Agreed, I think that may be less confusing for newcomers
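As an illustration, a minimal, self-contained sketch of what that split could look like. The constant values are placeholders chosen for the example (not the ones defined in the test module), tokio is assumed as a dependency, and the in-between checks are represented by a comment.

use std::time::Duration;

// Illustrative placeholders; the real constants live in the retrier test module.
const HALF_API_DELAY: f64 = 0.5;
const MAX_RUN_TIME: f64 = 0.2;
const POLLING_TIME: u64 = 1;

#[tokio::main]
async fn main() {
    // First wait: the remaining half of the API delay plus one retry round.
    tokio::time::sleep(Duration::from_secs_f64(HALF_API_DELAY + MAX_RUN_TIME)).await;
    // ...the intermediate tower-status checks would go here in the real test...

    // Second wait: one extra polling cycle, placed right above the
    // `retriers.contains_key` assertion so the manager has had a chance to
    // drop the finished retrier.
    tokio::time::sleep(Duration::from_secs(POLLING_TIME)).await;
    println!("both waits elapsed");
}

Splitting the two waits keeps the reason for each delay visible next to the assertion it protects.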
watchtower-plugin/src/retrier.rs (Outdated)
@@ -1212,7 +1230,7 @@ mod tests {
     .unwrap();

 {
-    tokio::time::sleep(Duration::from_secs(2)).await;
+    tokio::time::sleep(Duration::from_secs(POLLING_TIME)).await;
I think this should be 2 * POLLING_TIME to ensure correctness.
or could actually be brought down to POLLING_TIME + MAX_RUN_TIME (a sketch follows below):

- POLLING_TIME: accounts for the worst case of tx.send being sent exactly at the start of the sleep in the retrier manager.
- MAX_RUN_TIME: gives the retrier the chance to try and send out the fresh appointment and change the tower status. The retrier won't actually do this since it's idle, but we should give it that time anyway to assert that it really doesn't.
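A minimal, self-contained sketch of that combined wait. The constant values are placeholders for illustration, not the ones from the actual tests, and tokio is assumed as a dependency.

use std::time::Duration;

// Illustrative placeholders; not the values used by the actual tests.
const POLLING_TIME: u64 = 1;
const MAX_RUN_TIME: f64 = 0.2;

#[tokio::main]
async fn main() {
    // Worst case: the data is sent right after the manager started its polling
    // sleep (one full POLLING_TIME), plus one retry round for the retrier to react.
    tokio::time::sleep(Duration::from_secs_f64(POLLING_TIME as f64 + MAX_RUN_TIME)).await;
    println!("waited one polling cycle plus one retry round");
}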
// Wait for one retry round and check the tower status
tokio::time::sleep(Duration::from_secs_f64(MAX_RUN_TIME)).await;
// Wait for one retry round and check the tower status
754| tokio::time::sleep(Duration::from_secs_f64(MAX_RUN_TIME)).await;
assert!(temp_unreachable);
assert!(running);
// Wait until the task gives up and check again (this gives up due to accumulation of transient errors,
// so the retriers will be idle).
770| tokio::time::sleep(Duration::from_secs(MAX_ELAPSED_TIME as u64)).await;
assert!(unreachable);
assert!(idle);

nit: We can make the second sleep (line 770) be MAX_ELAPSED_TIME + MAX_RUN_TIME instead of MAX_ELAPSED_TIME for clarity, as this accounts for an extra MAX_RUN_TIME (last self.run round) after MAX_ELAPSED_TIME has passed. But it would be correct either way since L754 already waits MAX_RUN_TIME.
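For illustration, a self-contained sketch of the suggested second sleep. The constant values are placeholders I picked for the example (the real ones come from the retrier test module), and tokio is assumed as a dependency.

use std::time::Duration;

// Placeholder values for illustration only.
const MAX_ELAPSED_TIME: u64 = 2;
const MAX_RUN_TIME: f64 = 0.2;

#[tokio::main]
async fn main() {
    // MAX_ELAPSED_TIME for the retrier to give up, plus one last `self.run` round.
    tokio::time::sleep(Duration::from_secs_f64(MAX_ELAPSED_TIME as f64 + MAX_RUN_TIME)).await;
    println!("the retrier should be idle by now");
}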
I think I'll add a comment there specifying that we are not adding it because it has already been accounted for right before, so we don't have to add an unnecessary wait (not that they make sense haha)
The polling time for the Retrier was hardcoded to 1, let's at least use a constant for that. Also, `retrier::tests::test_manage_retry_while_idle` was randomly failing (for Ubuntu) when checking whether the Retrier was idle after giving up on a retry. This is due to the time of running a round not being taken into account.
Addressed the comments and slightly reduced […]. Tested it in a loop for both OSX and low-resource Ubuntu without issues.
I tried to repro the failed test in the CI and wasn't able to (in a loop for about 12h). The only possible logical explanation I can think of is that CI boxes spend so much time in […]. I'm not totally sure whether the 2 explanations are actually the reasons for this failure (or even make much sense 😕), but I am sure that the test is logically correct at this point, and experimentation also says the same.
#[tokio::main]
async fn main() {
    let start = tokio::time::Instant::now();
    let res = reqwest::Client::new()
        .get(format!("{}", "http://unreachable.tower"))
        .send()
        .await;
    println!("Response: {:?}", res);
    println!("Took: {:?}", tokio::time::Instant::now() - start);
}

Looks like it's possible for a request to take that much time before getting a DNS error (at least on my slow network). So network issues might be one valid reason.
The polling time for the Retrier was hardcoded to 1, let's at least use a constant for that.

Also, `retrier::tests::test_manage_retry_while_idle` was randomly failing (for Ubuntu) when checking whether the Retrier was idle after giving up on a retry. This is due to the time of running a round not being taken into account.

~~Turns out this is super brittle, so I'm going to modify the timers as little as possible for now.~~ Many timers have been modified to make actual sense.
Supersedes #182 (Close #182)
Edit: @mariocynicys found why this was brittle in #184 (comment), looks like it isn't anymore 😄
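To make the described change concrete, here is a minimal, self-contained sketch of replacing a hardcoded one-second wait with a named constant. The loop body is a stand-in of my own, not the actual manage_retry implementation, and tokio is assumed as a dependency.

use std::time::Duration;

// Naming the previously hardcoded `1` documents what the value means.
const POLLING_TIME: u64 = 1;

// Stand-in for the manager loop; NOT the actual manage_retry implementation.
async fn manage_retry_sketch() {
    loop {
        // ...pick up freshly sent revocation data and drive the retriers here...

        // Check for new work once every POLLING_TIME seconds.
        tokio::time::sleep(Duration::from_secs(POLLING_TIME)).await;

        break; // keep the sketch terminating
    }
}

#[tokio::main]
async fn main() {
    manage_retry_sketch().await;
}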