CI Failure (Restart flag did not clear after restart) in ClusterConfigTest.test_restart
#8328
Comments
@jcsp there are a couple ways i can think of to make this more robust:
wdyt?
I think option 2 makes sense: it should only be a few lines of code to pass in a ref to group_manager, register for leadership notifications, and then replace the sleep_abortable with a condition variable wait with a timeout.
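For illustration, the idea in the comment above (replace a fixed backoff sleep with a timed wait that a leadership notification can cut short) can be sketched as follows. This is a minimal Python sketch, not Redpanda's C++ code: `LeadershipWaiter`, `reconcile`, and `send_status` are hypothetical names, and the 5 second backoff is taken from the issue description.

```python
import threading


class LeadershipWaiter:
    """Hypothetical helper: lets a retry loop wake early on a leadership change."""

    def __init__(self):
        self._cond = threading.Condition()
        self._generation = 0  # bumped on every leadership notification

    def snapshot(self):
        # Record the current generation before attempting a send, so a
        # notification that races with the send is not missed.
        with self._cond:
            return self._generation

    def notify_leadership_change(self):
        # Assumed to be invoked from a leadership-notification callback.
        with self._cond:
            self._generation += 1
            self._cond.notify_all()

    def wait_for_change(self, since, timeout):
        # True if a notification arrived after `since`, False on timeout.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._generation != since, timeout=timeout
            )


def reconcile(send_status, waiter, backoff_s=5.0, max_attempts=3):
    """Retry send_status(); back off with a timed wait instead of a fixed sleep."""
    for _ in range(max_attempts):
        since = waiter.snapshot()
        if send_status():
            return True
        # Wakes as soon as leadership changes, or after backoff_s at the latest.
        waiter.wait_for_change(since, backoff_s)
    return False
```

With this shape, a failed send that is followed shortly by a leadership change retries right away instead of waiting out the full backoff.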
@dotnwat why do you think it's
@rystsov this is not a bug in redpanda nor a problem with the test, it was bad luck on timing by a few milliseconds because of a leadership transfer. the fix could be considered more of an optimization.
updated the BK link to the most recent failure; the original one was old enough to have been deleted in BK, so PT was failing.
Not seen in at least two months, closing
https://buildkite.com/redpanda/redpanda/builds/46890#_
Notice above that the last config status query before timing out occurred at 02:20:04,625. Below, we can see that the restart flag was eventually flipped and the change was applied at 02:20:04,792, right after the last check.
The test will time out after 10 seconds. So what took so long? Here is where the polling starts in the ducktape test:
4 seconds later, sending the node status fails with a not_leader error. In this case the config manager will sleep for 5 seconds before trying again, so the initial delay plus the backoff has eaten up 9 out of 10 seconds.
Finally sending the update to the controller succeeds 5 seconds later, but this is only 25 milliseconds before the final timeout in the ducktape test, which does not leave much room for error.
And finally, why did the send that failed above not actually occur until 4 seconds into the start of the polling? Mostly just bad luck, I think. At startup there was no leader, which will lead to a 5 second sleep in the reconcile status loop:
That puts us 4 seconds into the start of the ducktape polling and, unfortunately, at that point it looks as if a leadership transfer (or something similar) happened, resulting in the failure to update the controller.
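The timing budget described above can be checked with a few lines of arithmetic. This is a rough sketch: the timestamps, the 10 second test timeout, the 4 second offset, and the 5 second backoff are copied from this issue, while the variable names are only illustrative.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S,%f"

# Timestamps quoted in this issue.
last_check = datetime.strptime("02:20:04,625", FMT)  # last config status query
applied = datetime.strptime("02:20:04,792", FMT)     # restart flag flipped

# How long after the final poll the change was applied, in whole milliseconds.
gap_ms = (applied - last_check) // timedelta(milliseconds=1)

# Budget of the ducktape polling window, in seconds.
test_timeout_s = 10
first_failure_offset_s = 4  # not_leader failure, seconds into polling
backoff_s = 5               # config manager sleep before retrying

consumed_s = first_failure_offset_s + backoff_s
remaining_s = test_timeout_s - consumed_s

print(f"change applied {gap_ms} ms after the last check")
print(f"{consumed_s} of {test_timeout_s} s consumed, {remaining_s} s left for the retry")
```

The remaining window has to cover the retried send and the test's next poll, which is why a small delay was enough to fail the test.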
JIRA Link: CORE-1146