Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(vacuum): add vacuum management for cleaning expired packages #3442

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

NikitaCOEUR
Copy link

@NikitaCOEUR NikitaCOEUR commented Jan 7, 2025

Check List

Introduce Vacuum Feature for Expired Packages

Addresses #2942 and Discussion #3086.

This pull request introduces a new feature for vacuuming expired packages.

Workflow

  1. Environment Variable Requirement
    The feature activates only when the environment variable AQUA_VACUUM_DAYS is set with a positive integer greater than 0.

  2. Implementation in InstallPackage

    • During the execution of commands through exec or install (but not aqua update or aqua cp, which currently use provideNilVacuumController in their controller initialization), package-related information is added or updated in a BoltDB (vacuum.db) located in ROOT_DIR.
    • The stored information includes:
      • Package Type
      • Package Name
      • Version
      • Timestamp (LastUsage)
      • PkgPaths
  3. Expiration Criteria

    • Packages that are not used do not have their timestamp updated.
    • A package is considered expired if its last usage exceeds the number of days defined in AQUA_VACUUM_DAYS. Expired packages become eligible for "vacuum."
  4. Execution of aqua vacuum Command

    • The vacuum command performs the following operations:
      • Deletes the package and its version from ROOT_DIR/pkgs/type.../pkg/version.
      • Removes the corresponding entry from the BoltDB.
  5. Optional - Not Implemented: Automatic Vacuum Execution

  • Currently, the vacuum process must be executed manually. However, it is possible to implement a step during the aqua install command to trigger vacuum expiration automatically, enabling "auto-cleaning."

Visualizing Managed Packages

The aqua vacuum command includes two flags, -list and -expired:

  • -list lists all packages.
  • -expired lists only expired packages.

These flags enable the use of a fuzzy finder in the console to explore the vacuum.db database and retrieve package-related information:
Fuzzy Finder Example

Coverage

cmdx c pkg/controller/vacuum
+ bash scripts/coverage.sh pkg/controller/vacuum
ok      github.com/aquaproj/aqua/v2/pkg/controller/vacuum       9.178s  coverage: 80.6% of statements

A high latence is observed for test du to tests of failling access to vacuum.db

Performance Considerations

Significant effort has been made to minimize the overhead introduced by this feature on Aqua commands. Benchmark tests in the exec controller attest to this.

goos: linux
goarch: amd64
pkg: github.com/aquaproj/aqua/v2/pkg/controller/exec
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
Benchmark_controller_Exec/normal-20         	    2608	    482220 ns/op	  706238 B/op	     732 allocs/op
Benchmark_controller_Exec/vacuumEnabled-20  	    2740	    462216 ns/op	  706194 B/op	     731 allocs/op
PASS
ok  	github.com/aquaproj/aqua/v2/pkg/controller/exec	3.639s

Three methods for adding packages to the database were implemented (storePackage, storesPackages, and asyncStorePackage). However, the asynchronous method (asyncStorePackage) shows the least overhead and is preferred. I think i'll remove all others.

Benchmark with adding only one package
=== RUN   BenchmarkVacuum_OnlyOneStorePackage
BenchmarkVacuum_OnlyOneStorePackage
=== RUN   BenchmarkVacuum_OnlyOneStorePackage/Sync
BenchmarkVacuum_OnlyOneStorePackage/Sync
BenchmarkVacuum_OnlyOneStorePackage/Sync-20                  398           3323185 ns/op           26262 B/op        120 allocs/op
=== RUN   BenchmarkVacuum_OnlyOneStorePackage/SyncMultipleSameTime
BenchmarkVacuum_OnlyOneStorePackage/SyncMultipleSameTime
BenchmarkVacuum_OnlyOneStorePackage/SyncMultipleSameTime-20                  375           3227867 ns/op           26312 B/op         120 allocs/op
=== RUN   BenchmarkVacuum_OnlyOneStorePackage/Async
BenchmarkVacuum_OnlyOneStorePackage/Async
BenchmarkVacuum_OnlyOneStorePackage/Async-20                              103395             15436 ns/op             360 B/op           7 allocs/op
=== RUN   BenchmarkVacuum_OnlyOneStorePackage/AsyncMultiple
BenchmarkVacuum_OnlyOneStorePackage/AsyncMultiple
BenchmarkVacuum_OnlyOneStorePackage/AsyncMultiple-20                       76678             15373 ns/op             360 B/op           7 allocs/op
PASS
ok      github.com/aquaproj/aqua/v2/pkg/controller/vacuum       6.106s

Benchmark when adding 100 packages at the same time :

=== RUN   BenchmarkVacuum_StorePackages
BenchmarkVacuum_StorePackages
=== RUN   BenchmarkVacuum_StorePackages/Sync
BenchmarkVacuum_StorePackages/Sync
BenchmarkVacuum_StorePackages/Sync-20                  3         348535122 ns/op         3619378 B/op      13535 allocs/op
=== RUN   BenchmarkVacuum_StorePackages/SyncMultipleSameTime
BenchmarkVacuum_StorePackages/SyncMultipleSameTime
BenchmarkVacuum_StorePackages/SyncMultipleSameTime-20                386           4835653 ns/op           62134 B/op         714 allocs/op
=== RUN   BenchmarkVacuum_StorePackages/Async
BenchmarkVacuum_StorePackages/Async
BenchmarkVacuum_StorePackages/Async-20                               254           3973761 ns/op           50124 B/op         750 allocs/op
=== RUN   BenchmarkVacuum_StorePackages/AsyncMultiple
BenchmarkVacuum_StorePackages/AsyncMultiple
BenchmarkVacuum_StorePackages/AsyncMultiple-20                      1050           1135216 ns/op           36244 B/op         602 allocs/op
PASS

The asynchronous method is the best option in all cases, especially when adding one package at a time (the typical use case for Aqua).

Benchmark of exec command via hyperfine :

❯ AQUA_VACUUM_DAYS=5 hyperfine -N --warmup 3 '/home/nikitac/Github/aqua/dist/aqua exec -- cmdx -v' '/home/nikitac/.local/share/aquaproj-aqua/bin/aqua exec -- cmdx -v'
Benchmark 1: /home/nikitac/Github/aqua/dist/aqua exec -- cmdx -v
  Time (mean ± σ):      47.3 ms ±   2.8 ms    [User: 38.9 ms, System: 10.2 ms]
  Range (min … max):    42.5 ms …  57.9 ms    67 runs
 
Benchmark 2: /home/nikitac/.local/share/aquaproj-aqua/bin/aqua exec -- cmdx -v
  Time (mean ± σ):      46.0 ms ±   3.2 ms    [User: 40.2 ms, System: 7.3 ms]
  Range (min … max):    41.9 ms …  60.2 ms    49 runs
 
Summary
  /home/nikitac/.local/share/aquaproj-aqua/bin/aqua exec -- cmdx -v ran
    1.03 ± 0.09 times faster than /home/nikitac/Github/aqua/dist/aqua exec -- cmdx -v

Main impact of storing packages is the method PackageInfo.PkgPaths, but i think the best choice to have all Paths concerned by each package. There are used to remove package from system during vacuum operation.

Comment on lines +316 to +322
// Optionally stores the package information in vacuum DB if vacuum controler is instantied and vacuum Enabled
if is.VacuumCtrl != nil && is.VacuumCtrl.IsVacuumEnabled(logE) {
logE.Debug("store package in vacuum")
if err := is.VacuumCtrl.Vacuum(ctx, logE, vacuum.AsyncStorePackage, []*config.Package{pkg}); err != nil {
logE.WithError(err).Error("store package in vacuum during install")
}
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is important in context of exec
Async operations are dropped if the target to exec is launched.

(Maybe i need to add this to Install controller too ?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 046ab6e (#3442)

@@ -770,6 +778,7 @@ func InitializeCopyCommandController(ctx context.Context, param *config.Param, h
installpackage.NewCargoPackageInstallerImpl,
wire.Bind(new(installpackage.CargoPackageInstaller), new(*installpackage.CargoPackageInstallerImpl)),
),
provideNilVacuumController, // vacuum controller is not used so we provide nil but it is required by installer and installpackage
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This permit to not enable vacuum in context of Copy command

Comment on lines +146 to +170
// handleStorePackage processes a list of configuration packages and stores the first package in the list.
func (vc *Controller) handleStorePackage(logE *logrus.Entry, vacuumPkg []*ConfigPackage) error {
if len(vacuumPkg) < 1 {
return errors.New("StorePackage requires at least one configPackage")
}
defer func() {
if err := vc.close(logE); err != nil {
logE.WithError(err).Error("Failed to close vacuum DB after storing package")
}
}()
return vc.storePackageInternal(logE, []*ConfigPackage{vacuumPkg[0]})
}

// handleStorePackages processes a list of configuration packages and stores them.
func (vc *Controller) handleStorePackages(logE *logrus.Entry, vacuumPkg []*ConfigPackage) error {
if len(vacuumPkg) < 1 {
return errors.New("StorePackages requires at least one configPackage")
}
defer func() {
if err := vc.close(logE); err != nil {
logE.WithError(err).Error("Failed to close vacuum DB after storing multiple packages")
}
}()
return vc.storePackageInternal(logE, vacuumPkg)
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theses functions have been implemented to benchmark and evaluate which are better to use with aqua.

I think async is the best one but we need to use Close function in parent controller to ensure process are terminated like in Exec controller.

@NikitaCOEUR NikitaCOEUR marked this pull request as ready for review January 8, 2025 07:25
@suzuki-shunsuke
Copy link
Member

Thank you for your contribution!

@suzuki-shunsuke
Copy link
Member

suzuki-shunsuke commented Jan 8, 2025

Thank you for your great work!
Detailed explanation and high test coverage, and benchmark test are very helpful.

According to the benchmark test, it looks like the overhead is really small.

Command interface

I don't think --list and --expired options are intuitive.
These options change the behavior completely.
Without these options, the command removes packages, but with these options the command shows the fuzzy finder.
I think it's better to separate the commands.

e.g.

aqua vacuum show [--expired]
aqua vacuum run

The command aqua vacuum run is a bit long, but we don't run this command often, so I think we can accept it.

DB schema

  • key: {pkg.Type},{pkg.Name}@{pkg.Version}
  • value:
{
  "LastUsageTime": "2025 ...",
  "PkgPath": [""]
}

The key is not good because we can change package name freely.
So we should use package install paths as a key because they must be unique and they are not changed even if package names are changed.

I think PkgPath should be a string, not an array.

I think we can add metadata such as type, name, etc to the value.
Do you have any reason not to add them?
For performance or file size?

Code review

  • I think VacuumDays should be an integer, not a pointer.
  • I think the vacuum method should be separated by method, not by mode.
  • The variable name should be plural if it is an array. (e.g. vacuumPkg -> vacuumPkgs)
  • The log field should be snake case. (e.g. PkgPath -> package_path)
    • name -> package_name
    • version -> package_version
  • You should use logerr.WithError when outputting logs
  • failed to is unnecessary in the log message

When we return an error or output error or warn log, definitely any operation fails.
So failed to has no meaning.
In Go, we wrap error at many times. So if we add failed to to error message,
the error message would become failed to ...: failed to ....: failed to ...: ....
failed to is noisy.

  • The log should not start with a capital letter
  • Variable names in functions should not start with a capital letter
    • ConfigPackageToRemove -> configPackageToRemove
  • Other packages should not depend on vacuum.Controller directly. It should depend on an interface.
    • Then we can replace it with a mock when vacuum is disabled.
  • Returning a list of errors is not common and looks strange.
  • errCh looks like a channel, but it is actually an array.
  • The function returns a list of errors, but it only checks the length and does not look at them.
  • If any error occurs, no data is removed from DB. This is not good.
    • Even if any error occurs, packages that can be removed from file system should be removed from DB.

Check

We should confirm if the treat of locale has no problem.

func (vc *Controller) isPackageExpired(pkg *PackageVacuumEntry) bool {
	const secondsInADay = 24 * 60 * 60
	threshold := int64(*vc.Param.VacuumDays) * secondsInADay
	return time.Since(pkg.PackageEntry.LastUsageTime).Seconds() > float64(threshold)
}

@@ -294,6 +294,7 @@ type Param struct {
Installed bool
PolicyConfigFilePaths []string
Commands []string
VacuumDays *int // When defined, vacuuming is enabled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
VacuumDays *int // When defined, vacuuming is enabled
VacuumDays int // When defined, vacuuming is enabled

I don't think this should be a pointer.

@@ -26,13 +27,14 @@ type Controller struct {
fs afero.Fs
policyReader PolicyReader
enabledXSysExec bool
vacuumCtrl *vacuum.Controller
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an interface.
Then we can replace it with a mock when vacuum is disabled.

// If the closing vacuum db failed, we should not stop the process
// so we log the error and continue the process.
// Updating vacuum db will be retried next time.
logE.WithError(err).Error("close the vacuum db failed")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logE.WithError(err).Error("close the vacuum db failed")
logerr.WithError(logE, err).Error("close the vacuum db failed")

logerr.WithError extracts logrus.Fields from err properly.

@NikitaCOEUR
Copy link
Author

I’ll revisit my draft and get back to you, taking your feedback into account!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants