-
Notifications
You must be signed in to change notification settings - Fork 203
Cleanup available rollbacks #11562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Cleanup available rollbacks #11562
Conversation
|
This pull request does not have a backport label. Could you fix it @pchila? 🙏
|
4d4ccd2 to
3ac0d8a
Compare
|
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
blakerouse
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This overall looks good, but the part that really prevents me from giving this a +1 is an integration test. I think having this in an integration tests is critical. We should full observe in the test that the available rollback is removed once the upgrade is complete.
bee1173 to
3d6063a
Compare
@blakerouse have a look at 3d6063a and fea6eb1 |
…pired_rollback_after_upgrading_to_a_repackaged_version on windows
cmacknz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good after latest changes, just need CI to pass.
| } | ||
| }, | ||
| assertAfterUpgrade: func(t *testing.T, err error, installedFixture *atesting.Fixture, upgradeIndex int, upgrades []upgradeOperation) { | ||
| // Consecutive upgrades do not seem to do well on windows (we get a timeout waiting for the upgraded agent to be healthy), skip the test there for the moment if we have an error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have this error captured somewhere? There isn't a reason why this shouldn't be working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the common code used by the upgrade tests an error is returned when waiting for the expected status in upgrade details when running install + upgrade or upgrades back-to-back on windows machines.
An example of the errors:
- error during first try on windows 2022: failed to get am upgrade details state on the second upgrade back-to-back.
- error on retry of the same step: test failed to get an upgrade details state even earlier, during the first upgrade.
There's no good reason for this to happen only for this test, only on windows, I tried to intercept the error and skip the test temporarily in the assertion function but it's too late and the test fails anyways.
If we skip the test for windows CI should be green but the root cause should be investigated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK thanks, looking at the test logic, I don't think it's correctly telling you why the previous executions failed. In the block below, the for loop and the ExecStatus call share the same context so once the context expires the last error is always context deadline exceeded even if that's not why the previous executions did not succeed.
elastic-agent/testing/upgradetest/upgrader.go
Lines 596 to 616 in 5f92818
| ctx, cancel := context.WithTimeout(ctx, timeout) | |
| defer cancel() | |
| t := time.NewTicker(interval) | |
| defer t.Stop() | |
| var lastErr error | |
| for { | |
| select { | |
| case <-ctx.Done(): | |
| if lastErr != nil { | |
| return fmt.Errorf("failed waiting for status: %w", errors.Join(ctx.Err(), lastErr)) | |
| } | |
| return ctx.Err() | |
| case <-t.C: | |
| status, err := f.ExecStatus(ctx) | |
| if err != nil && status.IsZero() { | |
| lastErr = err | |
| continue | |
| } | |
At the time diagnostics were collected, I see the following in state.yaml:
components: []
fleet_message: Not enrolled into Fleet
fleet_state: 6
log_level: debug
message: Running
state: 2I think we are probably hitting this UgradeDetails == nil case where we never observe the watching state during the test:
elastic-agent/testing/upgradetest/upgrader.go
Lines 627 to 630 in 5f92818
| if status.UpgradeDetails == nil { | |
| lastErr = fmt.Errorf("upgrade details not found in status but expected upgrade details state was [%s]", expectedState) | |
| continue | |
| } |
If I look at the timestamps of when the condition started checking for upgrade details in https://buildkite.com/elastic/elastic-agent/builds/32495#019b37ff-d188-4959-84d1-229c450202ce/L4685:
{"Time":"2025-12-19T19:53:57.9774844Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":" upgrader.go:393: upgrade watcher started\n"}
{"Time":"2025-12-19T19:53:57.9801252Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":" upgrader.go:398: Checking upgrade details state while Upgrade Watcher is running\n"}
{"Time":"2025-12-19T19:53:57.9801252Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":" fixture.go:918: \u003e\u003e running binary with: [C:\\Program Files\\Elastic\\Agent\\elastic-agent.exe version --binary-only --yaml]\n"}
{"Time":"2025-12-19T19:54:08.0308306Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":" fixture.go:869: \u003e\u003e running binary with: [C:\\Program Files\\Elastic\\Agent\\elastic-agent.exe status --output json]\n"}
{"Time":"2025-12-19T19:54:18.0315372Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":" fixture.go:869: \u003e\u003e running binary with: [C:\\Program Files\\Elastic\\Agent\\elastic-agent.exe status --output json]\n"}Then compare to the agent logs where I can see at 2025-12-19T19:54:01.022Z upgrade details are set to null shortly after the version command completes at 19T19:53:57.9801252Z so it's possible the test never observed the upgrade details in the state it wants:
{"log.level":"info","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).logUpgradeDetails","file.name":"coordinator/coordinator.go","file.line":899},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":null,"ecs.version":"1.6.0"}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect the null upgrade details is getting set when the upgrade marker gets removed
elastic-agent/internal/pkg/agent/application/upgrade/marker_watcher.go
Lines 104 to 113 in 5f92818
| case e.Op&(fsnotify.Remove) != 0: | |
| // Upgrade marker file was removed. | |
| // - Upgrade could've been rolled back | |
| // - Upgrade could've been successful | |
| // If last known Upgrade Details state is not `UPG_ROLLBACK`, assume | |
| // upgrade was successful | |
| if mfw.lastMarker != nil && mfw.lastMarker.Details != nil && mfw.lastMarker.Details.State != details.StateRollback { | |
| mfw.lastMarker.Details = nil | |
| mfw.updateCh <- *mfw.lastMarker | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timestamp when upgrade details is null at 2025-12-19T19:54:01.022Z:
{"log.level":"info","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).logUpgradeDetails","file.name":"coordinator/coordinator.go","file.line":899},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":null,"ecs.version":"1.6.0"}Matches exactly with the timestamp of when the watcher starts upgrade cleanup and claims to not remove the marker which is peculiar:
{"log.level":"debug","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.appendAvailableRollbacks","file.name":"cmd/watch.go","file.line":293},"message":"Adding available rollback data\\elastic-agent-9.3.0-SNAPSHOT-c89881:{Version:9.3.0-SNAPSHOT Hash:c89881cbf4e712a82d42319a3fb09249e70aaa2c ValidUntil:2025-12-19 19:54:31.0116898 +0000 UTC} to the directories to keep during cleanup","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade.cleanup","file.name":"upgrade/rollback.go","file.line":140},"message":"Cleaning up upgrade","remove_marker":false,"ecs.version":"1.6.0"}…lear_expired_rollback_after_upgrading_to_a_repackaged_version on windows
💔 Build Failed
Failed CI Steps
History
cc @pchila |
What does this PR do?
This PR will cleanup available rollbacks when:
Why is it important?
To avoid having to clean up manually when upgrading again an agent still within the rollback window and not to wait until the agent restart to clean up obsolete installs
Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files[ ] I have added an entry in./changelog/fragmentsusing the changelog tool[ ] I have added an integration test or an E2E testDisruptive User Impact
How to test this PR locally
Related issues
Questions to ask yourself