Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 19 additions & 5 deletions installinator/src/write.rs
Original file line number Diff line number Diff line change
Expand Up @@ -281,6 +281,10 @@ impl<'a> ArtifactWriter<'a> {

// How many drives did we finish writing during the previous iteration?
let mut success_prev_iter = 0;
// Check if we've had the same number of successes as the previous
// iteraiton which can be a sign that something in the write path
// has a permanent error
let mut same_successes = false;

loop {
// How many drives did we finish writing during this iteration?
Expand Down Expand Up @@ -350,7 +354,13 @@ impl<'a> ArtifactWriter<'a> {
// 2. At least one drive was successfully written on a previous
// iteration, which implies all other drives got to retry during
// this iteration.
if success_this_iter == self.drives.len() || success_prev_iter > 0 {
// 3. We had the same number of successes as the previous iteration,
// which implies that we seem to be permanetly stuck and unlikely
// to succeed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is a pretty significant behavior change to the real installinator, right? I think the intent was to loop forever if we're not having success, because there's no way to restart this on failure other than aborting the entire mupdate and starting over.

That's pretty terrible for tests, though. I wonder if we should have a cap on the number of attempts (which this code implicitly has, I think, with the cap set to "2"), and allow prod to pass a cap of "loop forever" while tests can pass something much smaller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we're getting into "timeouts timeouts always wrong" territory here. I don't actually mind looping forever for tests because that is a sign I need to fix something because tests should finish. Looping forever in production actually seems worse though if there's some kind of permanent error we'll just be stuck which seems bad. Maybe wicket will give more information than tests though? If we're actually okay with the current behavior I can also just close this.

if success_this_iter == self.drives.len()
|| success_prev_iter > 0
|| same_successes
{
break;
}

Expand All @@ -364,6 +374,10 @@ impl<'a> ArtifactWriter<'a> {
// Give it a short break, then keep trying.
tokio::time::sleep(Duration::from_secs(5)).await;

if success_this_iter == success_prev_iter {
same_successes = true;
}

success_prev_iter = success_this_iter;
}

Expand Down Expand Up @@ -1157,17 +1171,17 @@ mod tests {
// image, we return two concatenated lists of "fails then one success".
let success_strategy_host = prop::collection::vec(
partial_op_strategy(interrupted_would_block_strategy(), 1024),
0..16,
0..1,
);
let success_strategy_control_plane = prop::collection::vec(
partial_op_strategy(interrupted_would_block_strategy(), 1024),
0..16,
0..1,
);

(
0..16usize,
0..1usize,
success_strategy_host,
0..16usize,
0..1usize,
success_strategy_control_plane,
)
.prop_map(
Expand Down
Loading