Conversation

@pchila
Member

@pchila pchila commented Dec 3, 2025

What does this PR do?

This PR cleans up available rollbacks:

  • when initiating a new upgrade, to avoid growing the disk space needed to 3x the size of an agent installation
  • when an available rollback expires, via a goroutine that periodically checks for and removes expired rollbacks (a rough sketch of such a loop is shown below).
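
A hypothetical sketch of what the periodic expiry check could look like (the type, field, and function names below are illustrative assumptions, not this PR's actual code):

package rollbackcleanup

import (
	"context"
	"log"
	"os"
	"time"
)

// rollbackInfo mirrors the Version/Hash/ValidUntil data visible in the watcher
// logs; the real type in the agent codebase may differ.
type rollbackInfo struct {
	Version    string
	Hash       string
	ValidUntil time.Time
	Path       string // directory holding the saved install (illustrative field)
}

// runPeriodicCleanup removes saved installs whose rollback window has expired.
// listRollbacks is a stand-in for however the agent enumerates available rollbacks.
func runPeriodicCleanup(ctx context.Context, interval time.Duration, listRollbacks func() []rollbackInfo) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			for _, rb := range listRollbacks() {
				if now.After(rb.ValidUntil) {
					if err := os.RemoveAll(rb.Path); err != nil {
						// Log and retry on the next tick rather than stopping the loop.
						log.Printf("failed to clean up expired rollback %s (%s): %v", rb.Version, rb.Hash, err)
					}
				}
			}
		}
	}
}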

Why is it important?

It avoids having to clean up manually when upgrading an agent that is still within the rollback window, and removes obsolete installs without waiting for the agent to restart.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@pchila pchila self-assigned this Dec 3, 2025
@pchila pchila added the Team:Elastic-Agent-Control-Plane and Team:Elastic-Agent labels Dec 3, 2025
@mergify
Contributor

mergify bot commented Dec 3, 2025

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fix this pull request, add the backport labels for the needed branches, such as:

  • backport-\d.\d is the label that automatically backports to the \d.\d branch (\d is a digit, e.g. backport-9.3)
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@pchila pchila added the enhancement label Dec 11, 2025
@pchila pchila marked this pull request as ready for review December 11, 2025 12:31
@pchila pchila requested a review from a team as a code owner December 11, 2025 12:31
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Contributor

@blakerouse blakerouse left a comment

This overall looks good, but the part that really prevents me from giving this a +1 is the lack of an integration test. I think covering this in an integration test is critical. We should fully observe in the test that the available rollback is removed once the upgrade is complete.

@pchila
Member Author

pchila commented Dec 12, 2025

This overall looks good, but the part that really prevents me from giving this a +1 is the lack of an integration test. I think covering this in an integration test is critical. We should fully observe in the test that the available rollback is removed once the upgrade is complete.

@blakerouse have a look at 3d6063a and fea6eb1

Member

@cmacknz cmacknz left a comment

Generally looks good after the latest changes; just needs CI to pass.

}
},
assertAfterUpgrade: func(t *testing.T, err error, installedFixture *atesting.Fixture, upgradeIndex int, upgrades []upgradeOperation) {
// Consecutive upgrades do not seem to do well on windows (we get a timeout waiting for the upgraded agent to be healthy), skip the test there for the moment if we have an error
Member

Do we have this error captured somewhere? There isn't a reason why this shouldn't be working.

Member Author

In the common code used by the upgrade tests, an error is returned while waiting for the expected state in the upgrade details when running install + upgrade or back-to-back upgrades on Windows machines.
An example of the errors:

There's no good reason for this to happen only for this test and only on Windows. I tried to intercept the error and skip the test temporarily in the assertion function, but by then it's too late and the test fails anyway.
If we skip the test on Windows, CI should be green, but the root cause should still be investigated.

Member

OK, thanks. Looking at the test logic, I don't think it's correctly telling you why the previous executions failed. In the block below, the for loop and the ExecStatus call share the same context, so once the context expires the last error is always context deadline exceeded, even if that's not why the earlier attempts failed.

ctx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
t := time.NewTicker(interval)
defer t.Stop()
var lastErr error
for {
	select {
	case <-ctx.Done():
		if lastErr != nil {
			return fmt.Errorf("failed waiting for status: %w", errors.Join(ctx.Err(), lastErr))
		}
		return ctx.Err()
	case <-t.C:
		status, err := f.ExecStatus(ctx)
		if err != nil && status.IsZero() {
			lastErr = err
			continue
		}

At the time diagnostics were collected, I see the following in state.yaml:

components: []
fleet_message: Not enrolled into Fleet
fleet_state: 6
log_level: debug
message: Running
state: 2

I think we are probably hitting this UpgradeDetails == nil case, where we never observe the watching state during the test:

if status.UpgradeDetails == nil {
	lastErr = fmt.Errorf("upgrade details not found in status but expected upgrade details state was [%s]", expectedState)
	continue
}

If I look at the timestamps of when the condition started checking for upgrade details in https://buildkite.com/elastic/elastic-agent/builds/32495#019b37ff-d188-4959-84d1-229c450202ce/L4685:

{"Time":"2025-12-19T19:53:57.9774844Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":"    upgrader.go:393: upgrade watcher started\n"}
{"Time":"2025-12-19T19:53:57.9801252Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":"    upgrader.go:398: Checking upgrade details state while Upgrade Watcher is running\n"}
{"Time":"2025-12-19T19:53:57.9801252Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":"    fixture.go:918: \u003e\u003e running binary with: [C:\\Program Files\\Elastic\\Agent\\elastic-agent.exe version --binary-only --yaml]\n"}
{"Time":"2025-12-19T19:54:08.0308306Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":"    fixture.go:869: \u003e\u003e running binary with: [C:\\Program Files\\Elastic\\Agent\\elastic-agent.exe status --output json]\n"}
{"Time":"2025-12-19T19:54:18.0315372Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":"    fixture.go:869: \u003e\u003e running binary with: [C:\\Program Files\\Elastic\\Agent\\elastic-agent.exe status --output json]\n"}

Then compare to the agent logs, where I can see that upgrade details are set to null at 2025-12-19T19:54:01.022Z, shortly after the version command completes at 2025-12-19T19:53:57.9801252Z, so it's possible the test never observed the upgrade details in the state it wants:

{"log.level":"info","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).logUpgradeDetails","file.name":"coordinator/coordinator.go","file.line":899},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":null,"ecs.version":"1.6.0"}

Member

I suspect the null upgrade details are getting set when the upgrade marker is removed:

case e.Op&(fsnotify.Remove) != 0:
	// Upgrade marker file was removed.
	// - Upgrade could've been rolled back
	// - Upgrade could've been successful
	// If last known Upgrade Details state is not `UPG_ROLLBACK`, assume
	// upgrade was successful
	if mfw.lastMarker != nil && mfw.lastMarker.Details != nil && mfw.lastMarker.Details.State != details.StateRollback {
		mfw.lastMarker.Details = nil
		mfw.updateCh <- *mfw.lastMarker
	}

Member

The timestamp at which upgrade details become null, 2025-12-19T19:54:01.022Z:

{"log.level":"info","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).logUpgradeDetails","file.name":"coordinator/coordinator.go","file.line":899},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":null,"ecs.version":"1.6.0"}

This matches exactly the timestamp at which the watcher starts the upgrade cleanup and claims not to remove the marker, which is peculiar:

{"log.level":"debug","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.appendAvailableRollbacks","file.name":"cmd/watch.go","file.line":293},"message":"Adding available rollback data\\elastic-agent-9.3.0-SNAPSHOT-c89881:{Version:9.3.0-SNAPSHOT Hash:c89881cbf4e712a82d42319a3fb09249e70aaa2c ValidUntil:2025-12-19 19:54:31.0116898 +0000 UTC} to the directories to keep during cleanup","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade.cleanup","file.name":"upgrade/rollback.go","file.line":140},"message":"Cleaning up upgrade","remove_marker":false,"ecs.version":"1.6.0"}

…lear_expired_rollback_after_upgrading_to_a_repackaged_version on windows
@ebeahan ebeahan added the backport-9.3 label and removed the backport-skip label Jan 5, 2026
@elasticmachine
Contributor

elasticmachine commented Jan 7, 2026
