
Retry Environment Recovery #421

Merged (3 commits into main, Aug 18, 2023)

Conversation

@mpass99 (Contributor) commented Aug 16, 2023

Closes #408

The retry is blocking because (1) we require a proper Nomad connection for Poseidon to work, and (2) we would risk doubling the number of idle runners if an external request created an environment before we have recovered it.
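
As a rough illustration of this blocking startup retry, here is a minimal Go sketch with exponential backoff; the function name recoverEnvironments and the backoff parameters are assumptions for this example, not Poseidon's actual code.

```go
package main

import (
	"log"
	"time"
)

// recoverEnvironments is a hypothetical placeholder for the Nomad environment
// recovery step (not Poseidon's real API).
func recoverEnvironments() error {
	// ... query Nomad and restore environments plus their idle runners ...
	return nil
}

func main() {
	delay := time.Second
	const maxDelay = time.Minute

	// Block the remaining startup until recovery succeeds: Poseidon needs a
	// working Nomad connection anyway, and serving requests earlier could let
	// an external caller create an environment that is recovered again later,
	// doubling its idle runners.
	for {
		err := recoverEnvironments()
		if err == nil {
			break
		}
		log.Printf("Error recovering the execution environments: %v; retrying in %v", err, delay)
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}

	// ... continue with the regular startup (runner manager, HTTP API) ...
}
```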

@mpass99 mpass99 requested a review from MrSerth August 16, 2023 10:54
@mpass99 mpass99 self-assigned this Aug 16, 2023
@mpass99 mpass99 force-pushed the fix/#408-reboot-race branch from 9845459 to a6d8e77 Compare August 16, 2023 16:35
@codecov bot commented Aug 16, 2023

Codecov Report

Merging #421 (c94858f) into main (89c18ad) will increase coverage by 0.24%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##             main     #421      +/-   ##
==========================================
+ Coverage   75.69%   75.94%   +0.24%     
==========================================
  Files          37       37              
  Lines        3506     3509       +3     
==========================================
+ Hits         2654     2665      +11     
+ Misses        686      679       -7     
+ Partials      166      165       -1     
Files Changed Coverage Δ
pkg/util/util.go 77.77% <70.00%> (+23.93%) ⬆️
cmd/poseidon/main.go 62.50% <100.00%> (ø)
internal/environment/nomad_environment.go 78.88% <100.00%> (-0.16%) ⬇️
internal/environment/nomad_manager.go 77.16% <100.00%> (+2.36%) ⬆️
internal/runner/nomad_manager.go 83.33% <100.00%> (ø)
internal/runner/nomad_runner.go 82.13% <100.00%> (ø)

... and 1 file with indirect coverage changes


@MrSerth (Member) left a comment

Nice, thank you! I successfully tested your changes locally (without starting Nomad in the first phase and then adding it later).

While testing, I noticed two smaller things:

  1. We not only log an error about recovering the environment, but also about "Stopped updating the runners". Is this actually true, or should this process be blocked as well?
  2. Due to the exponential retry mechanism, it might take some time to reconnect (depending on when Poseidon was started). This is nothing to worry about (or change) right now, but we should keep it in mind (in case Poseidon takes too long to reconnect).

@mpass99 (Contributor, Author) commented Aug 16, 2023

We not only log an error about recovering the environment, but also about "Stopped updating the runners". Is this actually true or should this process be blocked as well?

The error Error recovering the execution environments appears in all tests that create a Nomad Environment Manager. The error is expected, as we are not launching a Nomad cluster and are not mocking the connection attempt (instead, we just block/stop the connection attempts).

The error Stopped updating the runners appears in tests that create a Nomad Runner Manager, do not directly block/stop the runner synchronization, and stop the Nomad Event Stream. For me (and the latest CI run), this is only the case for the test TestUpdateRunnersAddsIdleRunner. It does not harm the test, but we can also avoid this error if you like.

Due to the exponential retry mechanism, it might take some time to reconnect (depending on the fact when Poseidon was started). This is nothing to worry (or change) right now, but we should keep it in mind (in case Poseidon has a too long delay to reconnect).

Yes, let's keep that in mind. Still, I consider it very unlikely. Triggering this condition was already unlikely back when we restarted all hosts at the same time. Now, Poseidon and two of the three Nomad servers would all have to reboot within a time span of 30 seconds (while each restart is randomized over 30 minutes) for this error to even have a chance of occurring. Even then, the delay should not exceed a few seconds (until the Nomad election is done and the next retry follows).

@MrSerth (Member) commented Aug 16, 2023

The error [...] appears in all tests

Oh, I haven't checked the test execution (neither with the unit tests nor with the end-to-end tests). Rather, I was describing the observations I made while checking out the code on my machine in a regular "run" setting: I simulated the error condition on my development machine (by starting Poseidon without starting Nomad) and observed these issues there. Many of them, such as Error recovering the execution environments, were expected in this scenario (since Nomad was not running...), but Stopped updating the runners was not. As Poseidon never had access to any Nomad server in my local scenario, I wondered why this error appeared (which runners were being updated there?).

Yes, let's keep that in mind.

That's true :)

@mpass99 (Contributor, Author) commented Aug 17, 2023

I see, thanks for clarifying.
The error Stopped updating the runners is thrown when the NomadRunnerManager cannot establish the Event Stream to Nomad (or when the connection drops). The NomadRunnerManager is started directly when Poseidon starts. Therefore, this issue is expected whenever Poseidon does not have a connection to Nomad.

Currently, we track this issue as POSEIDON-3G. Anyway, I agree that the message is misleading. In the latest commit, I tried to improve it.

@MrSerth (Member) commented Aug 17, 2023

Anyway, I agree that the message is misleading. In the recent commit, I tried to improve the message.

Nice, thanks for clarifying the message and pointing to the Sentry issue. This makes sense now and resolves all open questions I had.

as a second criterion (next to the maximum number of attempts) for canceling the retrying. This is required because, with the previous commit, we started retrying the Nomad environment recovery, which always fails in unit tests (as they are not connected to a Nomad cluster). Before, we ignored the single error, but the retrying now leads to unit test timeouts.
Additionally, we now stop retrying to create a runner when the environment got deleted.
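
As a hedged sketch of that cancellation idea (not the actual implementation in pkg/util/util.go): the retry loop stops on either of two criteria, the maximum number of attempts or a cancelled context (for example because the environment got deleted or a unit test finished). The helper name and signature below are illustrative assumptions.

```go
package util

import (
	"context"
	"time"
)

// RetryExponentialWithContext is a hypothetical helper for illustration only.
// It retries f with exponentially growing delays and stops on either of two
// criteria: the maximum number of attempts is reached, or ctx is cancelled.
func RetryExponentialWithContext(ctx context.Context, attempts int, initialDelay time.Duration, f func() error) error {
	delay := initialDelay
	var err error
	for i := 0; i < attempts; i++ {
		if err = f(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Second cancellation criterion: give up immediately instead of
			// exhausting the remaining attempts.
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2
		}
	}
	return err
}
```

A caller could then tie the context to the environment's lifetime, so deleting the environment cancels any pending runner-creation retries.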
@MrSerth MrSerth force-pushed the fix/#408-reboot-race branch from a8f745d to c94858f Compare August 17, 2023 09:50
@MrSerth MrSerth merged commit 13cd19e into main Aug 18, 2023
@MrSerth MrSerth deleted the fix/#408-reboot-race branch August 18, 2023 07:28
Development

Successfully merging this pull request may close these issues.

Poseidon looses an environment after restart