
Retry Environment Recovery #421

Merged (3 commits into main, Aug 18, 2023)

Conversation

@mpass99 (Contributor) commented Aug 16, 2023

Closes #408

The retry is blocking because (1) we require a proper Nomad connection for Poseidon to work, and (2) we would risk doubling the number of idle runners if an external request created an environment before we have recovered it.
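
As a rough illustration of this blocking startup retry, here is a minimal Go sketch with exponential backoff; the function name recoverEnvironments and the backoff parameters are assumptions for this example, not Poseidon's actual code.

```go
package main

import (
	"log"
	"time"
)

// recoverEnvironments is a hypothetical placeholder for the Nomad environment
// recovery step (not Poseidon's real API).
func recoverEnvironments() error {
	// ... query Nomad and restore environments plus their idle runners ...
	return nil
}

func main() {
	delay := time.Second
	const maxDelay = time.Minute

	// Block the remaining startup until recovery succeeds: Poseidon needs a
	// working Nomad connection anyway, and serving requests earlier could let
	// an external caller create an environment that is recovered again later,
	// doubling its idle runners.
	for {
		err := recoverEnvironments()
		if err == nil {
			break
		}
		log.Printf("Error recovering the execution environments: %v; retrying in %v", err, delay)
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}

	// ... continue with the regular startup (runner manager, HTTP API) ...
}
```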

@mpass99 mpass99 requested a review from MrSerth August 16, 2023 10:54
@mpass99 mpass99 self-assigned this Aug 16, 2023
@mpass99 mpass99 force-pushed the fix/#408-reboot-race branch from 9845459 to a6d8e77 Compare August 16, 2023 16:35
@codecov bot commented Aug 16, 2023

Codecov Report

Merging #421 (c94858f) into main (89c18ad) will increase coverage by 0.24%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##             main     #421      +/-   ##
==========================================
+ Coverage   75.69%   75.94%   +0.24%     
==========================================
  Files          37       37              
  Lines        3506     3509       +3     
==========================================
+ Hits         2654     2665      +11     
+ Misses        686      679       -7     
+ Partials      166      165       -1     
Files Changed Coverage Δ
pkg/util/util.go 77.77% <70.00%> (+23.93%) ⬆️
cmd/poseidon/main.go 62.50% <100.00%> (ø)
internal/environment/nomad_environment.go 78.88% <100.00%> (-0.16%) ⬇️
internal/environment/nomad_manager.go 77.16% <100.00%> (+2.36%) ⬆️
internal/runner/nomad_manager.go 83.33% <100.00%> (ø)
internal/runner/nomad_runner.go 82.13% <100.00%> (ø)

... and 1 file with indirect coverage changes


@MrSerth (Member) left a comment

Nice, thank you! I successfully tested your changes locally (without starting Nomad in the first phase and then adding it later).

While testing, I noticed two smaller things:

  1. We not only log an error about recovering the environment, but also about "Stopped updating the runners". Is this actually true, or should this process be blocked as well?
  2. Due to the exponential retry mechanism, it might take some time to reconnect (depending on when Poseidon was started). This is nothing to worry about (or change) right now, but we should keep it in mind (in case Poseidon takes too long to reconnect).

@mpass99 (Contributor, Author) commented Aug 16, 2023

We not only log an error about recovering the environment, but also about "Stopped updating the runners". Is this actually true or should this process be blocked as well?

The error Error recovering the execution environments appears in all tests that create a Nomad Environment Manager. The error is expected, as we are not launching a Nomad cluster and are not mocking the connection attempt (instead, we just block/stop the connection attempts).

The error Stopped updating the runners appears in tests that create a Nomad Runner Manager, do not directly block/stop the runner synchronization, and stop the Nomad Event Stream. For me (and the latest CI run), this is only the case for the test TestUpdateRunnersAddsIdleRunner. It does not harm the test, but we can also avoid this error if you like.

Due to the exponential retry mechanism, it might take some time to reconnect (depending on the fact when Poseidon was started). This is nothing to worry (or change) right now, but we should keep it in mind (in case Poseidon has a too long delay to reconnect).

Yes, let's keep that in mind. Still, I consider it very unlikely. Triggering this condition was already unlikely back when we restarted all hosts at the same time. Now, Poseidon and two of the three Nomad servers would all have to reboot within a time span of 30 seconds (while each restart is randomized over 30 minutes) for this error to even have a chance of occurring. Even then, the delay should not exceed a few seconds (until the Nomad election is done and the next retry follows).

@MrSerth (Member) commented Aug 16, 2023

The error [...] appears in all tests

Oh, I haven't checked the test execution (neither with the unit tests nor with the end-to-end tests). Rather, I was describing the observations I made while checking out the code on my machine in a regular "run" setting: I simulated the error condition on my development machine (by starting Poseidon without starting Nomad) and observed these issues there. Many of them, such as Error recovering the execution environments, were expected in this scenario (since Nomad was not running...), but Stopped updating the runners was not. As Poseidon never had access to any Nomad server in my local scenario, I wondered why this error appeared (which runners were being updated there?).

Yes, let's keep that in mind.

That's true :)

@mpass99 (Contributor, Author) commented Aug 17, 2023

I see, thanks for clarifying.
The error Stopped updating the runners is thrown when the NomadRunnerManager cannot establish the Event Stream to Nomad (or when the connection drops). The NomadRunnerManager is started directly when Poseidon starts. Therefore, this issue is expected whenever Poseidon does not have a connection to Nomad.

Currently, we track this issue as POSEIDON-3G. Anyway, I agree that the message is misleading. In the latest commit, I tried to improve it.

@MrSerth (Member) commented Aug 17, 2023

Anyway, I agree that the message is misleading. In the recent commit, I tried to improve the message.

Nice, thanks for clarifying the message and pointing to the Sentry issue. This makes sense now and resolves all open questions I had.

as a second criterion (next to the maximum number of attempts) for canceling the retrying. This is required because, with the previous commit, we started retrying the Nomad environment recovery, which always fails in unit tests (as they are not connected to a Nomad cluster). Before, we ignored the single error, but the retrying now leads to unit test timeouts.
Additionally, we now stop retrying to create a runner when the environment got deleted.
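
As a hedged sketch of that cancellation idea (not the actual implementation in pkg/util/util.go): the retry loop stops on either of two criteria, the maximum number of attempts or a cancelled context (for example because the environment got deleted or a unit test finished). The helper name and signature below are illustrative assumptions.

```go
package util

import (
	"context"
	"time"
)

// RetryExponentialWithContext is a hypothetical helper for illustration only.
// It retries f with exponentially growing delays and stops on either of two
// criteria: the maximum number of attempts is reached, or ctx is cancelled.
func RetryExponentialWithContext(ctx context.Context, attempts int, initialDelay time.Duration, f func() error) error {
	delay := initialDelay
	var err error
	for i := 0; i < attempts; i++ {
		if err = f(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Second cancellation criterion: give up immediately instead of
			// exhausting the remaining attempts.
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2
		}
	}
	return err
}
```

A caller could then tie the context to the environment's lifetime, so deleting the environment cancels any pending runner-creation retries.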
@MrSerth MrSerth force-pushed the fix/#408-reboot-race branch from a8f745d to c94858f Compare August 17, 2023 09:50
@MrSerth MrSerth merged commit 13cd19e into main Aug 18, 2023
@MrSerth MrSerth deleted the fix/#408-reboot-race branch August 18, 2023 07:28
Development

Successfully merging this pull request may close these issues.

Poseidon looses an environment after restart