Retry Environment Recovery #421
Conversation
Force-pushed from 9845459 to a6d8e77
Codecov Report
@@ Coverage Diff @@
## main #421 +/- ##
==========================================
+ Coverage 75.69% 75.94% +0.24%
==========================================
Files 37 37
Lines 3506 3509 +3
==========================================
+ Hits 2654 2665 +11
+ Misses 686 679 -7
+ Partials 166 165 -1
... and 1 file with indirect coverage changes
Nice, thank you! I successfully tested your changes locally (without starting Nomad in the first phase and then adding it later).
While testing, I noticed two smaller things:
- We not only log an error about recovering the environment, but also one about "Stopped updating the runners". Is this actually true, or should this process be blocked as well?
- Due to the exponential retry mechanism, it might take some time to reconnect (depending on when Poseidon was started). This is nothing to worry about (or change) right now, but we should keep it in mind (in case Poseidon's reconnect delay gets too long); see the backoff sketch below.
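For reference, here is a minimal sketch of such an exponential backoff loop. All names (`retryWithBackoff`, the one-second initial delay, the five-minute cap) are illustrative assumptions and not Poseidon's actual implementation:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// retryWithBackoff keeps calling attempt with exponentially growing pauses
// until it succeeds or the context is cancelled. (Illustrative sketch only.)
func retryWithBackoff(ctx context.Context, attempt func() error) error {
	backoff := time.Second
	const maxBackoff = 5 * time.Minute // assumed cap, not a Poseidon constant
	for {
		err := attempt()
		if err == nil {
			return nil
		}
		log.Printf("attempt failed, retrying in %v: %v", backoff, err)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	// Simulate a Nomad connection that only becomes available after a few tries.
	failures := 3
	err := retryWithBackoff(context.Background(), func() error {
		if failures > 0 {
			failures--
			return errors.New("Nomad not reachable yet")
		}
		return nil
	})
	log.Println("recovery finished, err:", err)
}
```

The doubling delay is what makes the worst-case reconnect time depend on when Poseidon was started relative to the Nomad outage.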
Yes, let's keep that in mind. Anyway, I assume it to be very unlikely. Triggering this condition was already unlikely back when we restarted all hosts at the same time. But now, Poseidon and two of the three Nomad servers would all have to reboot within a span of 30 seconds (while each restart is randomized over 30 minutes) for this error to even have a chance of occurring. Even then, the delay should not exceed a few seconds (until the Nomad election is done and the next retry follows).
Oh, I haven't checked the test execution (neither with the unit tests nor with the end-to-end tests). Rather, I was describing the observations I made while checking out the code on my machine in a regular "run" setting: I just simulated the error condition on my development machine (by starting Poseidon without starting Nomad) and observed these issues there. Many of them, such as
That's true :)
I see, thanks for clarifying. Currently, we track this issue with
Nice, thanks for clarifying the message and pointing to the Sentry issue. This makes sense now and resolves all open questions I had.
as a second criterion (next to the maximum number of attempts) for canceling the retrying. This is required because, with the previous commit, we started to retry the Nomad environment recovery. That always fails for unit tests (as they are not connected to a Nomad cluster). Before, we just ignored the one error, but the retrying leads to unit test timeouts. Additionally, we now stop retrying to create a runner when the environment has been deleted.
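A minimal sketch of what such a dual stop criterion could look like; the helper `retry`, the sentinel `ErrEnvironmentDeleted`, and the attempt count are hypothetical stand-ins, not the actual Poseidon code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrEnvironmentDeleted is a stand-in for the condition that the environment
// was removed while we were still retrying to create a runner for it.
var ErrEnvironmentDeleted = errors.New("environment deleted")

// retry stops either after maxAttempts failures or as soon as the context is
// cancelled, whichever comes first (the "second criterion" mentioned above).
func retry(ctx context.Context, maxAttempts int, attempt func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if ctx.Err() != nil {
			return ctx.Err()
		}
		if err = attempt(); err == nil {
			return nil
		}
		if errors.Is(err, ErrEnvironmentDeleted) {
			return err // no point in retrying against a deleted environment
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// In a unit test without a Nomad cluster, the surrounding code would cancel
	// this context so the recovery retry returns instead of running into a timeout.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println(retry(ctx, 5, func() error { return errors.New("no Nomad connection") }))
}
```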
Force-pushed from a8f745d to c94858f
Closes #408
The retry is blocking because (1) we require a proper Nomad connection for Poseidon to work, and (2) we would risk doubling the number of idle runners if an external request created an environment before we have recovered it.
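As an illustration of this blocking startup order (the function name, the placeholder body, and the listen address are assumptions, not Poseidon's real API):

```go
package main

import (
	"context"
	"log"
	"net/http"
)

// recoverEnvironments stands in for the actual recovery logic, including the
// retry loop sketched above. The point is only that it runs to completion
// before the HTTP API starts serving.
func recoverEnvironments(ctx context.Context) error {
	_ = ctx // placeholder: would talk to Nomad here
	return nil
}

func main() {
	// Blocking recovery: no external request can create an environment (and
	// thereby double the idle runners) while we are still recovering.
	if err := recoverEnvironments(context.Background()); err != nil {
		log.Fatalf("could not recover environments: %v", err)
	}
	// Only now start accepting API requests (address is illustrative).
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```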