
Defines Retrier polling waiting time as a constant and fixes tests #184

Merged
merged 1 commit on Feb 6, 2023

Conversation

sr-gi
Member

@sr-gi sr-gi commented Jan 26, 2023

The polling time for the Retrier was hardcoded to 1; let's at least use a constant for that.

Also, retrier::tests::test_manage_retry_while_idle was randomly failing (on Ubuntu) when checking whether the Retrier was idle after giving up on a retry. This was due to the time it takes to run a round not being taken into account.

~~Turns out this is super brittle, so I'm going to modify the timers as little as possible for now.~~
Many timers have been modified to make actual sense.

Supersedes #182 (Closes #182)

Edit:

@mariocynicys found why this was brittle in #184 (comment), looks like it isn't anymore 😄

Collaborator

@mariocynicys mariocynicys left a comment

I saw some instances in the tests that wait on API_DELAY instead of POLLING_TIME.

Funny though, the instance linked above actually waits zero seconds, yet the test still succeeds every time (without ever waiting for the polling time duration).
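
As a side note on why that wait is zero: the sleep is built from (API_DELAY / 2.0) as u64, and the cast truncates toward zero. A tiny illustration (the API_DELAY value below is assumed purely for the example, not the real constant):

fn main() {
    // Assumed value for illustration only; any value below 2.0 truncates to 0 after the cast.
    const API_DELAY: f64 = 0.5;
    let secs = (API_DELAY / 2.0) as u64; // 0.25_f64 as u64 == 0
    assert_eq!(secs, 0); // so Duration::from_secs(secs) sleeps for zero seconds
}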

@@ -16,6 +16,8 @@ use crate::net::http::{self, AddAppointmentError};
use crate::wt_client::{RevocationData, WTClient};
use crate::{MisbehaviorProof, TowerStatus};

const POOLING_TIME: u64 = 1;
Collaborator

typo: I guess you meant POLLING here.

watchtower-plugin/src/retrier.rs
@sr-gi
Member Author

sr-gi commented Jan 26, 2023

I saw some instances in the tests that wait on API_DELAY instead of POLLING_TIME.

Funny though, the instance linked above actually waits zero seconds, yet the test still succeeds every time (without ever waiting for the polling time duration).

Oh yeah, I corrected that on the other PR and forgot about it here, let me see if I can patch it without breaking it 😅

@mariocynicys
Collaborator

More on why the instances waiting on API_DELAY never failed in the CI:

// Init the manager
let task = tokio::spawn(async move {
    RetryManager::new(
        wt_client_clone,
        rx,
        MAX_ELAPSED_TIME,
        LONG_AUTO_RETRY_DELAY,
        MAX_INTERVAL_TIME,
    )
    .manage_retry()
    .await
});
// Send the data
tx.send((tower_id, RevocationData::Fresh(appointment.locator)))
    .unwrap();

// Wait for the elapsed time and check how the tower status changed
tokio::time::sleep(Duration::from_secs((API_DELAY / 2.0) as u64)).await;
// Check
assert!(wt_client
    .lock()
    .unwrap()
    .get_retrier_status(&tower_id)
    .unwrap()
    .is_running());

It looks like tokio::spawn always starts the async future late enough after tx.send is performed that the retry manager always hits the Ok(...) branch when try_recv()ing from the channel.
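
A minimal sketch of the polling pattern being described (names and structure are assumed for illustration, not the actual manage_retry code):

use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

const POLLING_TIME: u64 = 1;

async fn manage(mut rx: mpsc::UnboundedReceiver<u32>) {
    loop {
        match rx.try_recv() {
            // Data was already queued when we polled: process it right away.
            Ok(data) => println!("processing {data}"),
            // Nothing queued yet: sleep one polling cycle and try again.
            Err(_) => sleep(Duration::from_secs(POLLING_TIME)).await,
        }
    }
}

If tx.send happens before the task's first try_recv, the Ok branch fires immediately and the POLLING_TIME sleep is never hit, which is why the test passes even with a zero-second wait.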

To make the test fail, you can try adding a little sleep before tx.send (e.g. tokio::time::sleep(Duration::from_secs_f64(0.01)).await;). This delays tx.send enough that the async future starts first, fails the try_recv(), goes to the Err(...) branch, and waits POLLING_TIME. We then wait (API_DELAY / 2.0) as u64 seconds, which is actually zero, and the assertion fails since POLLING_TIME isn't over yet.

We can reduce the number of cases where we sleep POLLING_TIME if we always do tx.send before tokio::spawning the retry manager task. If we send before spawning we will be sure that the spawned task won't wait POLLING_TIME before processing our sent data. Not really important for correctness, but will only lower the testing time by a few seconds or so.
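
Concretely, the suggestion amounts to reordering the snippet above roughly like this (a sketch, not the exact test code):

// Queue the data first so the manager's very first try_recv() already finds it...
tx.send((tower_id, RevocationData::Fresh(appointment.locator)))
    .unwrap();

// ...then spawn the manager, which never has to sleep POLLING_TIME before processing it.
let task = tokio::spawn(async move {
    RetryManager::new(
        wt_client_clone,
        rx,
        MAX_ELAPSED_TIME,
        LONG_AUTO_RETRY_DELAY,
        MAX_INTERVAL_TIME,
    )
    .manage_retry()
    .await
});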

@sr-gi
Member Author

sr-gi commented Jan 26, 2023

We can reduce the number of cases where we sleep POLLING_TIME if we always do tx.send before tokio::spawning the retry manager task. If we send before spawning we will be sure that the spawned task won't wait POLLING_TIME before processing our sent data. Not really important for correctness, but will only lower the testing time by a few seconds or so.

Yes! That's it!

Looks like I can finally make the timings make proper sense.

Awesome catch 🎉 🎉 🎉

Collaborator

@mariocynicys mariocynicys left a comment

Some nits, LGTM otherwise :)

Comment on lines 1003 to 1007
// Wait for the remaining time and re-check (given this is a failure we need to check for the next polling cycle)
tokio::time::sleep(Duration::from_secs_f64(
    HALF_API_DELAY + MAX_RUN_TIME + POLLING_TIME as f64,
))
.await;
Collaborator

Why do we need to wait POLLING_TIME in failure cases?

Collaborator

Okaay, so assert!(!wt_client.lock().unwrap().retriers.contains_key(&tower_id)); will fail without it, since we process retriers every POLLING_TIME.

nit: We can split this sleep into two for clarity. One here with HALF_API_DELAY + MAX_RUN_TIME and another one just above assert!(!wt_client.lock().unwrap().retriers.contains_key(&tower_id)); with only POLLING_TIME.
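
A hedged sketch of that split (reusing the constants and assertion from this test; the surrounding code is assumed):

// Wait for the retry round itself to finish...
tokio::time::sleep(Duration::from_secs_f64(HALF_API_DELAY + MAX_RUN_TIME)).await;
// ...then wait one polling cycle so the manager gets the chance to drop the idle retrier.
tokio::time::sleep(Duration::from_secs(POLLING_TIME)).await;
assert!(!wt_client.lock().unwrap().retriers.contains_key(&tower_id));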

Member Author

Agreed, I think that may be less confusing for newcomers

@@ -1212,7 +1230,7 @@ mod tests {
.unwrap();

{
tokio::time::sleep(Duration::from_secs(2)).await;
tokio::time::sleep(Duration::from_secs(POLLING_TIME)).await;
Collaborator

I think this should be 2 * POLLING_TIME to ensure correctness.

Collaborator

or it could actually be brought down to POLLING_TIME + MAX_RUN_TIME (see the sketch right after these points):

  • POLLING_TIME: accounts for the worst case of tx.send happening exactly at the start of the sleep in the retrier manager.
  • MAX_RUN_TIME: gives the retrier the chance to try and send out the fresh appointment and change the tower status. The retrier won't actually do this since it's idle, but we should give it that time anyway to assert that it really doesn't.
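
A minimal sketch of that alternative (assuming, as elsewhere in these tests, that POLLING_TIME is an integer number of seconds and MAX_RUN_TIME an f64):

// Worst case of one full polling cycle before the message is seen,
// plus one run round for the (idle) retrier to have had the chance to act.
tokio::time::sleep(Duration::from_secs_f64(POLLING_TIME as f64 + MAX_RUN_TIME)).await;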

Comment on lines +753 to +754
// Wait for one retry round and check the tower status
tokio::time::sleep(Duration::from_secs_f64(MAX_RUN_TIME)).await;
Collaborator

// Wait for one retry round and check the tower status
754|tokio::time::sleep(Duration::from_secs_f64(MAX_RUN_TIME)).await;
assert!(temp_unreachable);
assert!(running);
// Wait until the task gives up and check again (this gives up due to accumulation of transient errors,
// so the retriers will be idle).
770|tokio::time::sleep(Duration::from_secs(MAX_ELAPSED_TIME as u64)).await;
assert!(unreachable);
assert!(idle);

nit: We can make the second sleep (line 770) be MAX_ELAPSED_TIME + MAX_RUN_TIME instead of MAX_ELAPSED_TIME for clarity.
As this accounts for an extra MAX_RUN_TIME (the last self.run round) after MAX_ELAPSED_TIME has passed. But it would be correct either way since L754 already waits MAX_RUN_TIME.
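
For illustration, the clearer version of line 770 being suggested would look roughly like this (types assumed):

// MAX_ELAPSED_TIME for the manager to give up, plus one last self.run round.
tokio::time::sleep(Duration::from_secs_f64(MAX_ELAPSED_TIME as f64 + MAX_RUN_TIME)).await;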

Member Author

// Wait for one retry round and check the tower status
754|tokio::time::sleep(Duration::from_secs_f64(MAX_RUN_TIME)).await;
assert!(temp_unreachable);
assert!(running);
// Wait until the task gives up and check again (this gives up due to accumulation of transient errors,
// so the retriers will be idle).
770|tokio::time::sleep(Duration::from_secs(MAX_ELAPSED_TIME as u64)).await;
assert!(unreachable);
assert!(idle);

nit: We can make the second sleep (line 770) be MAX_ELAPSED_TIME + MAX_RUN_TIME instead of MAX_ELAPSED_TIME for clarity. As this accounts for an extra MAX_RUN_TIME (the last self.run round) after MAX_ELAPSED_TIME has passed. But it would be correct either way since L754 already waits MAX_RUN_TIME.

I think I'll add a comment there specifying that we are not adding it because it has already been accounted for right before, so we don't have to add an unnecessary wait (not that they make much sense haha)

The polling time for the Retrier was hardcoded to 1, let's at least use a constant for that.

Also, `retrier::tests::test_manage_retry_while_idle` was randomly failing (for Ubuntu) when checking whether the Retrier was
idle after giving up on a retry. This is due to the time of running a round not being taken into account.
@sr-gi
Member Author

sr-gi commented Jan 27, 2023

Addressed the comments and slightly reduced MAX_RUN_TIME (from 0.5 to 0.2).

Tested it in a loop for both OSX and low-resource Ubuntu without issues.

@mariocynicys
Collaborator

I tried to repro the failed test in the CI and wasn't able to (running it in a loop for about 12h).
But I was able to repro it when I lowered MAX_RUN_TIME to 0.01 in just a couple of runs, so it should happen a lot when MAX_RUN_TIME is decreased.

The only plausible explanation I can think of is that the CI boxes spend a lot of time in self.run, more than 0.2s for some reason, and I think the reason is that so many threads run at once (each test function's thread, plus the threads spawned from inside the test functions/code) that the self.run thread sometimes starves and doesn't finish quickly. It might also be network issues, but I'm not sure those can actually take that much time without timing out.

I'm not totally sure whether these 2 explanations are actually the reasons for this failure (or even make much sense 😕), but I am sure the test is logically correct at this point, and experimentation says the same.

@mariocynicys
Collaborator

It might also be network issues, but I'm not sure those can actually take that much time without timing out.

#[tokio::main]
async fn main() {
    let start = tokio::time::Instant::now();
    let res = reqwest::Client::new()
        .get("http://unreachable.tower")
        .send()
        .await;
    println!("Response: {:?}", res);
    println!("Took: {:?}", tokio::time::Instant::now() - start);
}

(screenshot of the program output: the request errors out with a DNS error and the elapsed time is printed)

Looks like it's possible for a request to take that much time before getting a DNS error (at least on my slow network). So network issues might be one valid reason.

Note that it only takes that much time (0.34s) on the first run; subsequent runs finish in less than 50ms. Probably the OS caches DNS results or something.
