Fix HSM tests #37241

nklaassen · 2024-01-25T03:01:57Z

This PR fixes the current flakiness of TestHSMMigrate and TestHSMDualAuthRotation.

I finally managed to reproduce the errors we were seeing in CI consistently by hacking lib/utils.LoadBalancer to introduce intermittent network errors. I found that connection problems to auth at certain points during proxy startup can cause the proxy process to exit without recovering. This is fairly likely to happen when a proxy is connected to two auth servers behind a load balancer and they all reload at roughly the same time due to a CA rotation. The rough timeline to trigger the error is:

1. auth1 starts a rotation and reloads itself
2. proxy detects the rotation and triggers a reload
3. proxy connects to auth2 and gets a client
4. proxy starts proxy.init with the auth2 client
5. auth2 detects the rotation and triggers a reload
6. proxy.init fails and the proxy exits without recovering
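
The kind of fault injection described above can be sketched roughly as follows. This is not the actual change made to lib/utils.LoadBalancer; `dropEveryN` and `flakyListener` are hypothetical names used only for illustration, and the policy is deterministic (every n-th connection is dropped) so that the behavior is easy to reason about.

```go
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

// dropEveryN is a hypothetical fault policy: shouldDrop reports true on
// every n-th call. A zero n disables dropping entirely.
type dropEveryN struct {
	n     uint64
	count atomic.Uint64
}

func (d *dropEveryN) shouldDrop() bool {
	if d.n == 0 {
		return false
	}
	return d.count.Add(1)%d.n == 0
}

// flakyListener is a hypothetical wrapper around a net.Listener that closes
// the connections selected by the policy right after accepting them, so the
// client sees the connection fail shortly after dialing.
type flakyListener struct {
	net.Listener
	policy *dropEveryN
}

func (l *flakyListener) Accept() (net.Conn, error) {
	for {
		conn, err := l.Listener.Accept()
		if err != nil {
			return nil, err
		}
		if l.policy.shouldDrop() {
			conn.Close() // simulate an intermittent network error
			continue
		}
		return conn, nil
	}
}

func main() {
	// Show which of the first six connections the policy would drop.
	policy := &dropEveryN{n: 3}
	for i := 1; i <= 6; i++ {
		fmt.Printf("connection %d dropped: %v\n", i, policy.shouldDrop())
	}
}
```

Wrapping the listener that a test load balancer accepts on would make a fraction of connections to auth fail, which is the condition that triggers step 6 in the timeline.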

The tests were relying on the proxy to always reload successfully during CA rotations. Whenever the tests flaked, the proxy was not reaching a ready state, and the logs match those I was able to reproduce.

This got markedly worse after #36549, now that an auth no longer needs to be removed from the load balancer during a migration and the test was updated to reflect that.

We could arguably call this a product bug because the proxy is usually tolerant of temporarily losing a connection to Auth, but it seems non-trivial to recover from errors during a proxy reload. At that point there is already an old and a new proxy running: the new one has failed to start and the old one hasn't completely shut down, so it's not obvious what should be done. Currently the proxy process exits, and the systemd service should restart the process at that point.

For now, I have modified the HSM integration tests to avoid relying on any non-auth Teleport service reloads in any tests that use 2 auths behind a load balancer.

The changes in lib/auth/init.go fix a bug that I introduced in #36780 and found after re-enabling TestHSMDualAuthRotation.

Fixes #14172
Fixes #20217

github-actions · 2024-01-25T03:02:32Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

zmb3 · 2024-01-25T15:08:21Z

integration/hsm/helpers.go

+		if err != nil {
+			return trace.Wrap(err, "process unexpectedly exited while waiting for event %s", event)
+		}
+		return trace.Errorf("process unexpectedly exited while waiting for event %s", event)


Isn't err nil in this case? Why are we returning an error?

Yes, it's nil, but the only thing written to this errorChannel is the result of service.Run, which should not return until the service is closed. It should not even return when the process is just reloading, only when it actually exits. Even if it somehow returned a nil error, in this function we are waiting for an event; if the service exited before we got the event, then that's an error.

greedy52 · 2024-01-25T21:54:26Z

integration/hsm/helpers.go

-	t.Cleanup(func() {
-		require.NoError(t, s.close(), "error while closing %s during test cleanup", name)
-	})


curious why this cleanup is removed?

just because it doesn't really matter, slows down the test, doesn't usually work if the test failed, and lots of people in #14172 were reporting the cleanup errors instead of the actual/initial error

public-teleport-github-review-bot · 2024-01-25T22:21:10Z

@nklaassen See the table below for backport results.

Branch	Result
branch/v15	Create PR

fix

a119908

nklaassen added the backport/branch/v15 label Jan 25, 2024

github-actions bot requested review from greedy52 and r0mant January 25, 2024 03:02

github-actions bot added the size/sm label Jan 25, 2024

nklaassen added the no-changelog Indicates that a PR does not require a changelog entry label Jan 25, 2024

zmb3 reviewed Jan 25, 2024

zmb3 approved these changes Jan 25, 2024

greedy52 approved these changes Jan 25, 2024

public-teleport-github-review-bot bot removed the request for review from r0mant January 25, 2024 21:58

nklaassen enabled auto-merge January 25, 2024 22:01

nklaassen added this pull request to the merge queue Jan 25, 2024

Merged via the queue into master with commit 38ccec5 Jan 25, 2024
36 of 37 checks passed

nklaassen deleted the nklaassen/fix-hsm-tests branch January 25, 2024 22:20

nklaassen mentioned this pull request Jan 25, 2024

[v15] Fix HSM tests #37293

Merged
