
Defines Retrier polling waiting time as a constant and fixes tests #184

Merged
merged 1 commit on Feb 6, 2023

Conversation

sr-gi
Member

@sr-gi sr-gi commented Jan 26, 2023

The polling time for the Retrier was hardcoded to 1; let's at least use a constant for that.

Also, retrier::tests::test_manage_retry_while_idle was randomly failing (on Ubuntu) when checking whether the Retrier was idle after giving up on a retry. This was due to the time it takes to run a round not being taken into account.

~~Turns out this is super brittle, so I'm going to modify the timers as little as possible for now.~~
Many timers have been modified to make actual sense.

Supersedes #182 (Closes #182)

Edit:

@mariocynicys found why this was brittle in #184 (comment), looks like it isn't anymore 😄

Collaborator

@mariocynicys mariocynicys left a comment

I saw some instances in the tests that wait on API_DELAY instead of POLLING_TIME.

Funny though, the instance linked above actually waits zero seconds, yet the test still succeeds every time (without ever waiting for the polling time duration).
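
As a side note on why that wait is zero: the sleep is built from (API_DELAY / 2.0) as u64, and the cast truncates toward zero. A tiny illustration (the API_DELAY value below is assumed purely for the example, not the real constant):

fn main() {
    // Assumed value for illustration only; any value below 2.0 truncates to 0 after the cast.
    const API_DELAY: f64 = 0.5;
    let secs = (API_DELAY / 2.0) as u64; // 0.25_f64 as u64 == 0
    assert_eq!(secs, 0); // so Duration::from_secs(secs) sleeps for zero seconds
}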

@@ -16,6 +16,8 @@ use crate::net::http::{self, AddAppointmentError};
use crate::wt_client::{RevocationData, WTClient};
use crate::{MisbehaviorProof, TowerStatus};

const POOLING_TIME: u64 = 1;
Collaborator

typo: I guess you meant POLLING here.

watchtower-plugin/src/retrier.rs
@sr-gi
Member Author

sr-gi commented Jan 26, 2023

I saw some instances in the tests that wait on API_DELAY instead of POLLING_TIME.

Funny though, the instance linked above actually waits zero seconds, yet the test still succeeds every time (without ever waiting for the polling time duration).

Oh yeah, I corrected that on the other PR and forgot about it here, let me see if I can patch it without breaking it 😅

@mariocynicys
Collaborator

More on why the instances waiting on API_DELAY never failed in the CI:

// Init the manager
let task = tokio::spawn(async move {
    RetryManager::new(
        wt_client_clone,
        rx,
        MAX_ELAPSED_TIME,
        LONG_AUTO_RETRY_DELAY,
        MAX_INTERVAL_TIME,
    )
    .manage_retry()
    .await
});
// Send the data
tx.send((tower_id, RevocationData::Fresh(appointment.locator)))
    .unwrap();

// Wait for the elapsed time and check how the tower status changed
tokio::time::sleep(Duration::from_secs((API_DELAY / 2.0) as u64)).await;
// Check
assert!(wt_client
    .lock()
    .unwrap()
    .get_retrier_status(&tower_id)
    .unwrap()
    .is_running());

It looks like tokio::spawn always starts the async future late enough after tx.send is performed that the retry manager always hits the Ok(...) branch when try_recv()ing from the channel.
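
A minimal sketch of the polling pattern being described (names and structure are assumed for illustration, not the actual manage_retry code):

use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

const POLLING_TIME: u64 = 1;

async fn manage(mut rx: mpsc::UnboundedReceiver<u32>) {
    loop {
        match rx.try_recv() {
            // Data was already queued when we polled: process it right away.
            Ok(data) => println!("processing {data}"),
            // Nothing queued yet: sleep one polling cycle and try again.
            Err(_) => sleep(Duration::from_secs(POLLING_TIME)).await,
        }
    }
}

If tx.send happens before the task's first try_recv, the Ok branch fires immediately and the POLLING_TIME sleep is never hit, which is why the test passes even with a zero-second wait.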

To make the test fail, you can try adding a little sleep before tx.send (e.g. tokio::time::sleep(Duration::from_secs_f64(0.01)).await;). This delays tx.send enough that the async future starts first, fails the try_recv(), goes to the Err(...) branch, and waits POLLING_TIME. We then wait (API_DELAY / 2.0) as u64 seconds, which is actually zero, and the assertion fails since POLLING_TIME isn't over yet.

We can reduce the number of cases where we sleep POLLING_TIME if we always do tx.send before tokio::spawning the retry manager task. If we send before spawning we will be sure that the spawned task won't wait POLLING_TIME before processing our sent data. Not really important for correctness, but will only lower the testing time by a few seconds or so.
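
Concretely, the suggestion amounts to reordering the snippet above roughly like this (a sketch, not the exact test code):

// Queue the data first so the manager's very first try_recv() already finds it...
tx.send((tower_id, RevocationData::Fresh(appointment.locator)))
    .unwrap();

// ...then spawn the manager, which never has to sleep POLLING_TIME before processing it.
let task = tokio::spawn(async move {
    RetryManager::new(
        wt_client_clone,
        rx,
        MAX_ELAPSED_TIME,
        LONG_AUTO_RETRY_DELAY,
        MAX_INTERVAL_TIME,
    )
    .manage_retry()
    .await
});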

@sr-gi
Member Author

sr-gi commented Jan 26, 2023

We can reduce the number of cases where we sleep POLLING_TIME if we always do tx.send before tokio::spawning the retry manager task. If we send before spawning we will be sure that the spawned task won't wait POLLING_TIME before processing our sent data. Not really important for correctness, but will only lower the testing time by a few seconds or so.

Yes! That's it!

Looks like I can finally make the timings make proper sense.

Awesome catch 🎉 🎉 🎉

Collaborator

@mariocynicys mariocynicys left a comment

Some nits, LGTM otherwise :)

Comment on lines 1003 to 1007
// Wait for the remaining time and re-check (given this is a failure we need to check for the next polling cycle)
tokio::time::sleep(Duration::from_secs_f64(
    HALF_API_DELAY + MAX_RUN_TIME + POLLING_TIME as f64,
))
.await;
Collaborator

Why do we need to wait POLLING_TIME in failure cases?

Collaborator

Okaay, so assert!(!wt_client.lock().unwrap().retriers.contains_key(&tower_id)); will fail without it, since we process retriers every POLLING_TIME.

nit: We can split this sleep into two for clarity. One here with HALF_API_DELAY + MAX_RUN_TIME and another one just above assert!(!wt_client.lock().unwrap().retriers.contains_key(&tower_id)); with only POLLING_TIME.
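
A hedged sketch of that split (reusing the constants and assertion from this test; the surrounding code is assumed):

// Wait for the retry round itself to finish...
tokio::time::sleep(Duration::from_secs_f64(HALF_API_DELAY + MAX_RUN_TIME)).await;
// ...then wait one polling cycle so the manager gets the chance to drop the idle retrier.
tokio::time::sleep(Duration::from_secs(POLLING_TIME)).await;
assert!(!wt_client.lock().unwrap().retriers.contains_key(&tower_id));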

Member Author

Agreed, I think that may be less confusing for newcomers

@@ -1212,7 +1230,7 @@ mod tests {
.unwrap();

{
tokio::time::sleep(Duration::from_secs(2)).await;
tokio::time::sleep(Duration::from_secs(POLLING_TIME)).await;
Collaborator

I think this should be 2 * POLLING_TIME to ensure correctness.

Collaborator

or it could actually be brought down to POLLING_TIME + MAX_RUN_TIME (see the sketch right after these points):

  • POLLING_TIME: accounts for the worst case of tx.send happening exactly at the start of the sleep in the retrier manager.
  • MAX_RUN_TIME: gives the retrier the chance to try and send out the fresh appointment and change the tower status. The retrier won't actually do this since it's idle, but we should give it that time anyway to assert that it really doesn't.
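
A minimal sketch of that alternative (assuming, as elsewhere in these tests, that POLLING_TIME is an integer number of seconds and MAX_RUN_TIME an f64):

// Worst case of one full polling cycle before the message is seen,
// plus one run round for the (idle) retrier to have had the chance to act.
tokio::time::sleep(Duration::from_secs_f64(POLLING_TIME as f64 + MAX_RUN_TIME)).await;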

Comment on lines +753 to +754
// Wait for one retry round and check the tower status
tokio::time::sleep(Duration::from_secs_f64(MAX_RUN_TIME)).await;
Collaborator

// Wait for one retry round and check the tower status
754|tokio::time::sleep(Duration::from_secs_f64(MAX_RUN_TIME)).await;
assert!(temp_unreachable);
assert!(running);
// Wait until the task gives up and check again (this gives up due to accumulation of transient errors,
// so the retriers will be idle).
770|tokio::time::sleep(Duration::from_secs(MAX_ELAPSED_TIME as u64)).await;
assert!(unreachable);
assert!(idle);

nit: We can make the second sleep (line 770) be MAX_ELAPSED_TIME + MAX_RUN_TIME instead of MAX_ELAPSED_TIME for clarity.
As this accounts for an extra MAX_RUN_TIME (the last self.run round) after MAX_ELAPSED_TIME has passed. But it would be correct either way since L754 already waits MAX_RUN_TIME.
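
For illustration, the clearer version of line 770 being suggested would look roughly like this (types assumed):

// MAX_ELAPSED_TIME for the manager to give up, plus one last self.run round.
tokio::time::sleep(Duration::from_secs_f64(MAX_ELAPSED_TIME as f64 + MAX_RUN_TIME)).await;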

Member Author

// Wait for one retry round and check the tower status
754|tokio::time::sleep(Duration::from_secs_f64(MAX_RUN_TIME)).await;
assert!(temp_unreachable);
assert!(running);
// Wait until the task gives up and check again (this gives up due to accumulation of transient errors,
// so the retriers will be idle).
770|tokio::time::sleep(Duration::from_secs(MAX_ELAPSED_TIME as u64)).await;
assert!(unreachable);
assert!(idle);

nit: We can make the second sleep (line 770) be MAX_ELAPSED_TIME + MAX_RUN_TIME instead of MAX_ELAPSED_TIME for clarity. As this accounts for an extra MAX_RUN_TIME (the last self.run round) after MAX_ELAPSED_TIME has passed. But it would be correct either way since L754 already waits MAX_RUN_TIME.

I think I'll add a comment there specifying that we are not adding it because it has already been accounted for right before, so we don't have to add an unnecessary wait (not that they make much sense haha)

The polling time for the Retrier was hardcoded to 1, let's at least use a constant for that.

Also, `retrier::tests::test_manage_retry_while_idle` was randomly failing (for Ubuntu) when checking whether the Retrier was
idle after giving up on a retry. This is due to the time of running a round not being taken into account.
@sr-gi
Member Author

sr-gi commented Jan 27, 2023

Addressed the comments and slightly reduced MAX_RUN_TIME (from 0.5 to 0.2).

Tested it in a loop for both OSX and low-resource Ubuntu without issues.

@mariocynicys
Collaborator

I tried to repro the failed test in the CI and wasn't able to (running it in a loop for about 12h).
But I was able to repro it when I lowered MAX_RUN_TIME to 0.01 in just a couple of runs, so it should happen a lot when MAX_RUN_TIME is decreased.

The only plausible explanation I can think of is that the CI boxes spend a lot of time in self.run, more than 0.2s for some reason, and I think the reason is that so many threads run at once (each test function's thread, plus the threads spawned from inside the test functions/code) that the self.run thread sometimes starves and doesn't finish quickly. It might also be network issues, but I'm not sure those can actually take that much time without timing out.

I'm not totally sure whether these 2 explanations are actually the reasons for this failure (or even make much sense 😕), but I am sure the test is logically correct at this point, and experimentation says the same.

@mariocynicys
Collaborator

It might also be network issues, but I'm not sure those can actually take that much time without timing out.

#[tokio::main]
async fn main() {
    let start = tokio::time::Instant::now();
    let res = reqwest::Client::new()
        .get("http://unreachable.tower")
        .send()
        .await;
    println!("Response: {:?}", res);
    println!("Took: {:?}", tokio::time::Instant::now() - start);
}

(screenshot of the program output: the request errors out with a DNS error and the elapsed time is printed)

Looks like it's possible for a request to take that much time before getting a DNS error (at least on my slow network). So network issues might be one valid reason.

Note that it only takes that much time (0.34s) on the first run; subsequent runs finish in less than 50ms. Probably the OS caches DNS results or something.
