-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix HSM tests #37241
Fix HSM tests #37241
Conversation
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
if err != nil { | ||
return trace.Wrap(err, "process unexpectedly exited while waiting for event %s", event) | ||
} | ||
return trace.Errorf("process unexpectedly exited while waiting for event %s", event) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't err
nil in this case? Why are we returning an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's nil, but the only thing written to this errorChannel
is the result of service.Run
which should not return until the service is closed. It should not even return when the process is just reloading, only when it actually exits. Even if it somehow returned a nil error, in this function we are waiting for an event, if the service just exited before we got the event then that's an error
t.Cleanup(func() { | ||
require.NoError(t, s.close(), "error while closing %s during test cleanup", name) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious why this cleanup is removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just because it doesn't really matter, slows down the test, doesn't usually work if the test failed, and lots of people in #14172 were reporting the cleanup errors instead of the actual/initial error
@nklaassen See the table below for backport results.
|
This PR fixes the current flakiness of TestHSMMigrate and TestHSMDualAuthRotation.
I finally managed consistently to reproduce the errors we were seeing in CI by hacking lib/utils.LoadBalancer to introduce intermittent network errors. I found that connection problems to auth at certain times during proxy startup can cause the proxy process to exit without recovering. This is fairly likely to happen when a proxy is connected to two auth servers behind a load balancer, and they all reload ~simultaneously due to a CA rotation. The rough timeline to trigger the error is:
The tests were relying on the proxy to always reload successfully during CA rotations. Whenever the tests flaked, the proxy was not reaching a ready state, and the logs match those I was able to reproduce.
This got markedly worse after #36549, now that an auth no longer needs to be removed from the load balancer during a migration and the test was updated to reflect that.
We could arguably call this a product bug because the proxy is usually tolerant of temporarily losing a connection to Auth, but it seems non-trivial to recover from errors during a proxy reload. At that point there is already an old and new proxy running, the new one has failed to start and the old one hasn't completely shut down, it's not obvious what should be done. Currently the proxy process will exit and the systemd service should reload the process at that point.
For now, I have modified the HSM integration tests to avoid relying on any non-auth Teleport service reloads in any tests that use 2 auths behind a load balancer.
The changes in lib/auth/init.go fix a bug I found after re-enabling TestHSMDualAuthRotation, which I introduced in #36780
Fixes #14172
Fixes #20217