Cleanup available rollbacks #11562

pchila · 2025-12-03T15:19:01Z

What does this PR do?

This PR cleans up available rollbacks in two cases:

when initiating a new upgrade, to avoid growing the required disk space to 3x the size of an agent installation;
when an available rollback expires, by running a goroutine that periodically checks for and removes expired rollbacks.

Why is it important?

To avoid having to clean up manually when upgrading an agent again while it is still within the rollback window, and to avoid waiting until the agent restarts to clean up obsolete installs.

Checklist

I have read and understood the pull request guidelines of this project.
My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
~~[ ] I have added an entry in ./changelog/fragments using the changelog tool~~
~~[ ] I have added an integration test or an E2E test~~

Disruptive User Impact

How to test this PR locally

Related issues

Questions to ask yourself

How are we going to support this in production?
How are we going to measure its adoption?
How are we going to debug this?
What are the metrics I should take care of?
...

mergify · 2025-12-03T15:19:36Z

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
backport-active-all is the label that automatically backports to all active branches.
backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

elasticmachine · 2025-12-11T12:31:57Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

blakerouse

This overall looks good, but the part that really prevents me from giving this a +1 is the missing integration test. I think having this in an integration test is critical. We should fully observe in the test that the available rollback is removed once the upgrade is complete.

internal/pkg/agent/application/upgrade/manual_rollback.go

internal/pkg/agent/cmd/run.go

testing/integration/ess/upgrade_rollback_test.go

pchila · 2025-12-12T18:08:05Z

This overall looks good, but the part that really prevents me from giving this a +1 is the missing integration test. I think having this in an integration test is critical. We should fully observe in the test that the available rollback is removed once the upgrade is complete.

@blakerouse have a look at 3d6063a and fea6eb1

cmacknz

Generally looks good after latest changes, just need CI to pass.

cmacknz · 2025-12-29T21:52:30Z

testing/integration/ess/upgrade_rollback_test.go

+				}
+			},
+			assertAfterUpgrade: func(t *testing.T, err error, installedFixture *atesting.Fixture, upgradeIndex int, upgrades []upgradeOperation) {
+				// Consecutive upgrades do not seem to do well on windows (we get a timeout waiting for the upgraded agent to be healthy), skip the test there for the moment if we have an error


Do we have this error captured somewhere? There isn't a reason why this shouldn't be working.

In the common code used by the upgrade tests, an error is returned while waiting for the expected upgrade details state when running install + upgrade, or upgrades back-to-back, on Windows machines.
Examples of the errors:

error during the first try on Windows 2022: failed to get an upgrade details state on the second of the back-to-back upgrades.

error on retry of the same step: the test failed to get an upgrade details state even earlier, during the first upgrade.

There's no good reason for this to happen only for this test and only on Windows. I tried to intercept the error and skip the test temporarily in the assertion function, but by then it's too late and the test fails anyway.
If we skip the test on Windows, CI should be green, but the root cause should still be investigated.

OK thanks. Looking at the test logic, I don't think it's correctly telling you why the previous executions failed. In the block below, the for loop and the ExecStatus call share the same context, so once the context expires the last error is always context deadline exceeded, even if that's not why the previous executions did not succeed.

elastic-agent/testing/upgradetest/upgrader.go

Lines 596 to 616 in 5f92818

	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	t := time.NewTicker(interval)
	defer t.Stop()

	var lastErr error
	for {
		select {
		case <-ctx.Done():
			if lastErr != nil {
				return fmt.Errorf("failed waiting for status: %w", errors.Join(ctx.Err(), lastErr))
			}
			return ctx.Err()
		case <-t.C:
			status, err := f.ExecStatus(ctx)
			if err != nil && status.IsZero() {
				lastErr = err
				continue
			}

At the time diagnostics were collected, I see the following in state.yaml:

components: []
fleet_message: Not enrolled into Fleet
fleet_state: 6
log_level: debug
message: Running
state: 2

I think we are probably hitting this UpgradeDetails == nil case, where we never observe the watching state during the test:

elastic-agent/testing/upgradetest/upgrader.go

Lines 627 to 630 in 5f92818

	if status.UpgradeDetails == nil {
		lastErr = fmt.Errorf("upgrade details not found in status but expected upgrade details state was [%s]", expectedState)
		continue
	}

If I look at the timestamps of when the condition started checking for upgrade details in https://buildkite.com/elastic/elastic-agent/builds/32495#019b37ff-d188-4959-84d1-229c450202ce/L4685:

{"Time":"2025-12-19T19:53:57.9774844Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":" upgrader.go:393: upgrade watcher started\n"}
{"Time":"2025-12-19T19:53:57.9801252Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":" upgrader.go:398: Checking upgrade details state while Upgrade Watcher is running\n"}
{"Time":"2025-12-19T19:53:57.9801252Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":" fixture.go:918: \u003e\u003e running binary with: [C:\\Program Files\\Elastic\\Agent\\elastic-agent.exe version --binary-only --yaml]\n"}
{"Time":"2025-12-19T19:54:08.0308306Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":" fixture.go:869: \u003e\u003e running binary with: [C:\\Program Files\\Elastic\\Agent\\elastic-agent.exe status --output json]\n"}
{"Time":"2025-12-19T19:54:18.0315372Z","Action":"output","Package":"github.com/elastic/elastic-agent/testing/integration/ess","Test":"TestCleanupRollbacks/agent_should_clear_expired_rollback_after_upgrading_to_a_repackaged_version","Output":" fixture.go:869: \u003e\u003e running binary with: [C:\\Program Files\\Elastic\\Agent\\elastic-agent.exe status --output json]\n"}

Then compare with the agent logs, where I can see that at 2025-12-19T19:54:01.022Z upgrade details are set to null, shortly after the version command completes at 2025-12-19T19:53:57.9801252Z. So it's possible the test never observed the upgrade details in the state it wants:

{"log.level":"info","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).logUpgradeDetails","file.name":"coordinator/coordinator.go","file.line":899},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":null,"ecs.version":"1.6.0"}

I suspect the null upgrade details are set when the upgrade marker gets removed:

elastic-agent/internal/pkg/agent/application/upgrade/marker_watcher.go

Lines 104 to 113 in 5f92818

	case e.Op&(fsnotify.Remove) != 0:
		// Upgrade marker file was removed.
		// - Upgrade could've been rolled back
		// - Upgrade could've been successful
		// If last known Upgrade Details state is not `UPG_ROLLBACK`, assume
		// upgrade was successful
		if mfw.lastMarker != nil && mfw.lastMarker.Details != nil && mfw.lastMarker.Details.State != details.StateRollback {
			mfw.lastMarker.Details = nil
			mfw.updateCh <- *mfw.lastMarker
		}
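
The effect of that branch can be reproduced with a small stdlib-only model. Everything in it is a stand-in: `op`, `marker`, `details`, and `onEvent` are hypothetical types and names that imitate the shape of the excerpt, and the `UPG_WATCHING` state string is assumed for illustration.

```go
package main

import "fmt"

// op is a hypothetical event bitmask, imitating the style of fsnotify's Op.
type op uint32

const (
	opCreate op = 1 << iota
	opWrite
	opRemove
)

// stateRollback matches the state name mentioned in the quoted comment.
const stateRollback = "UPG_ROLLBACK"

// details and marker are simplified stand-ins for the upgrade marker data.
type details struct{ State string }

type marker struct {
	Version string
	Details *details
}

// onEvent imitates the removal branch: when the marker file is removed and the
// last known state is not a rollback, the details are cleared and published.
func onEvent(e op, last *marker, updates chan<- marker) {
	if e&opRemove == 0 {
		return
	}
	if last != nil && last.Details != nil && last.Details.State != stateRollback {
		last.Details = nil
		updates <- *last
	}
}

func main() {
	updates := make(chan marker, 1)
	last := &marker{Version: "9.3.0-SNAPSHOT", Details: &details{State: "UPG_WATCHING"}}

	onEvent(opRemove, last, updates)
	published := <-updates
	fmt.Println("details published after marker removal are nil:", published.Details == nil)
}
```

In this model any marker removal outside a rollback publishes a marker with nil details, which is consistent with the `"upgrade_details":null` log line if the marker file was removed at that moment.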

The timestamp at which upgrade details become null is 2025-12-19T19:54:01.022Z:

{"log.level":"info","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).logUpgradeDetails","file.name":"coordinator/coordinator.go","file.line":899},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":null,"ecs.version":"1.6.0"}

This matches exactly the timestamp at which the watcher starts the upgrade cleanup and claims not to remove the marker, which is peculiar:

{"log.level":"debug","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.appendAvailableRollbacks","file.name":"cmd/watch.go","file.line":293},"message":"Adding available rollback data\\elastic-agent-9.3.0-SNAPSHOT-c89881:{Version:9.3.0-SNAPSHOT Hash:c89881cbf4e712a82d42319a3fb09249e70aaa2c ValidUntil:2025-12-19 19:54:31.0116898 +0000 UTC} to the directories to keep during cleanup","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-12-19T19:54:01.022Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/upgrade.cleanup","file.name":"upgrade/rollback.go","file.line":140},"message":"Cleaning up upgrade","remove_marker":false,"ecs.version":"1.6.0"}

elasticmachine · 2026-01-07T10:12:33Z

💔 Build Failed

Buildkite Build
Commit: 8ae737a

Failed CI Steps

History

💔 Build #32914 failed 8279d1b
💔 Build #32885 failed 0966eaa
💔 Build #32769 failed 5f92818
💔 Build #32495 failed c89881c

cc @pchila

pchila self-assigned this Dec 3, 2025

pchila added the Team:Elastic-Agent-Control-Plane and Team:Elastic-Agent labels Dec 3, 2025

pchila added backport-skip skip-changelog labels Dec 3, 2025

pchila force-pushed the cleanup-rollbacks branch from 4d4ccd2 to 3ac0d8a December 11, 2025 09:41

pchila added the enhancement label Dec 11, 2025

pchila marked this pull request as ready for review December 11, 2025 12:31

pchila requested a review from a team as a code owner December 11, 2025 12:31

pchila requested review from blakerouse and swiatekm December 11, 2025 12:31

blakerouse reviewed Dec 11, 2025

internal/pkg/agent/application/upgrade/manual_rollback.go (outdated)

cmacknz reviewed Dec 11, 2025

internal/pkg/agent/cmd/run.go (outdated)

pchila added 11 commits December 12, 2025 15:28

Cleanup rollbacks when triggering a new upgrade

5944572

refactor available rollback normalization at startup

abb57f8

WIP - scheduled rollback cleanup

762c7b1

WIP - wire scheduled rollback cleanup at agent startup

77ba5c1

remove appDone handling from PeriodicallyCleanRollbacks

18eb035

Pass the correct relative path to the rollback cleanup goroutine

87de1ce

refactor from commit and repackaged fixtures for rollback tests

6720e81

Add Hash() to agent integration test fixture

6ac1bd7

Define a minimum cleanup interval for available rollbacks

f6867fe

create ttl marker files without world-readable permissions

589b42a

introduce integration test for automatic cleanup of expired rollbacks

3d6063a

pchila force-pushed the cleanup-rollbacks branch from bee1173 to 3d6063a December 12, 2025 14:29

fixup! Define a minimum cleanup interval for available rollbacks

d447b9e

cmacknz reviewed Dec 12, 2025

testing/integration/ess/upgrade_rollback_test.go

fixup! fixup! Define a minimum cleanup interval for available rollbacks

6e1a96c

pchila added 2 commits December 12, 2025 18:11

fixup! Add Hash() to agent integration test fixture

d918a5c

Add cleanup rollback test for multiple upgrades within the window

fea6eb1

pchila and others added 3 commits December 12, 2025 19:11

Use an additional subcontext for cleanup goroutine

4bb9432

add debug logging and skip TestCleanupRollbacks/agent_should_clear_ex…

30f8d3e

…pired_rollback_after_upgrading_to_a_repackaged_version on windows

Merge branch 'main' into cleanup-rollbacks

c89881c

This was referenced Dec 19, 2025

Allow elastic-agent to start a new upgrade by cleaning up the available rollbacks #6892

Open

Wait for elastic-agent watcher process to complete at startup and schedule cleanup #6882

Open

Set manual rollback default window to 7 days #11955

Draft

cmacknz reviewed Dec 29, 2025

fixup! add debug logging and skip TestCleanupRollbacks/agent_should_c…

5f92818

…lear_expired_rollback_after_upgrading_to_a_repackaged_version on windows

ebeahan added the backport-9.3 label and removed the backport-skip label Jan 5, 2026

ebeahan added 2 commits January 5, 2026 09:35

Merge branch 'main' into cleanup-rollbacks

0966eaa

Merge branch 'main' into cleanup-rollbacks

8279d1b

ebeahan mentioned this pull request Jan 5, 2026

Add rollback_window to elastic agent configuration with a default value of 7d #6881

Open

Merge branch 'main' into cleanup-rollbacks

8ae737a
