Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix HSM tests #37241

Merged
merged 1 commit into from
Jan 25, 2024
Merged

Fix HSM tests #37241

merged 1 commit into from
Jan 25, 2024

Conversation

nklaassen
Copy link
Contributor

This PR fixes the current flakiness of TestHSMMigrate and TestHSMDualAuthRotation.

I finally managed consistently to reproduce the errors we were seeing in CI by hacking lib/utils.LoadBalancer to introduce intermittent network errors. I found that connection problems to auth at certain times during proxy startup can cause the proxy process to exit without recovering. This is fairly likely to happen when a proxy is connected to two auth servers behind a load balancer, and they all reload ~simultaneously due to a CA rotation. The rough timeline to trigger the error is:

  1. auth1 starts a rotation and reloads itself
  2. proxy detects the rotation and triggers a reload
  3. proxy connects to auth2 and gets a client
  4. proxy starts proxy.init with the auth2 client
  5. auth2 detects the rotation and triggers a reload
  6. proxy.init fails and the proxy exits without recovering

The tests were relying on the proxy to always reload successfully during CA rotations. Whenever the tests flaked, the proxy was not reaching a ready state, and the logs match those I was able to reproduce.

This got markedly worse after #36549, now that an auth no longer needs to be removed from the load balancer during a migration and the test was updated to reflect that.

We could arguably call this a product bug because the proxy is usually tolerant of temporarily losing a connection to Auth, but it seems non-trivial to recover from errors during a proxy reload. At that point there is already an old and new proxy running, the new one has failed to start and the old one hasn't completely shut down, it's not obvious what should be done. Currently the proxy process will exit and the systemd service should reload the process at that point.

For now, I have modified the HSM integration tests to avoid relying on any non-auth Teleport service reloads in any tests that use 2 auths behind a load balancer.

The changes in lib/auth/init.go fix a bug I found after re-enabling TestHSMDualAuthRotation, which I introduced in #36780

Fixes #14172
Fixes #20217

Copy link

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

@nklaassen nklaassen added the no-changelog Indicates that a PR does not require a changelog entry label Jan 25, 2024
if err != nil {
return trace.Wrap(err, "process unexpectedly exited while waiting for event %s", event)
}
return trace.Errorf("process unexpectedly exited while waiting for event %s", event)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't err nil in this case? Why are we returning an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's nil, but the only thing written to this errorChannel is the result of service.Run which should not return until the service is closed. It should not even return when the process is just reloading, only when it actually exits. Even if it somehow returned a nil error, in this function we are waiting for an event, if the service just exited before we got the event then that's an error

Comment on lines -63 to -65
t.Cleanup(func() {
require.NoError(t, s.close(), "error while closing %s during test cleanup", name)
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious why this cleanup is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just because it doesn't really matter, slows down the test, doesn't usually work if the test failed, and lots of people in #14172 were reporting the cleanup errors instead of the actual/initial error

@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from r0mant January 25, 2024 21:58
@nklaassen nklaassen enabled auto-merge January 25, 2024 22:01
@nklaassen nklaassen added this pull request to the merge queue Jan 25, 2024
Merged via the queue into master with commit 38ccec5 Jan 25, 2024
36 of 37 checks passed
@nklaassen nklaassen deleted the nklaassen/fix-hsm-tests branch January 25, 2024 22:20
@public-teleport-github-review-bot

@nklaassen See the table below for backport results.

Branch Result
branch/v15 Create PR

@nklaassen nklaassen mentioned this pull request Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/branch/v15 no-changelog Indicates that a PR does not require a changelog entry size/sm
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestHSMDualAuthRotation flakiness TestHSMMigrate flakiness
3 participants