Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] Old package assets are not removed from ES storage when package is upgraded #94340

Closed
jen-huang opened this issue Mar 10, 2021 · 6 comments · Fixed by #112644
Closed
Assignees
Labels
Team:Fleet Team label for Observability Data Collection Fleet team technical debt Improvement of the software architecture and operational architecture

Comments

@jen-huang
Copy link
Contributor

Package assets are not removed from ES storage when a package is upgraded. Example:

  1. Install older version of any package, for example apache
POST <KBN>/api/fleet/epm/packages/apache-0.3.3
{ "force": true }
  1. Check assets installed in ES storage:
GET /.kibana/_search?q=epm-packages-assets.package_name:apache%20AND%20epm-packages-assets.package_version:0.3.3
  1. Navigate to Integrations > Apache > Settings, and upgrade it to the latest version.
  2. Check that ES storage contains assets for both the old version and new version:

Old

GET /.kibana/_search?q=epm-packages-assets.package_name:apache%20AND%20epm-packages-assets.package_version:0.3.3

New

GET /.kibana/_search?q=epm-packages-assets.package_name:apache%20AND%20epm-packages-assets.package_version:0.3.4

On the same Settings page, if you trigger a delete of the package, the assets for the new version are removed, but the assets for the old version are still present. So this leads me to believe that the storage behavior on package upgrade is inconsistent, and we should be wiping the old package assets on successful upgrade.

We've had issues with large amount of SOs before, so not wiping the old assets will blow up the amount of epm-packages-assets pretty quickly, especially for default packages that are auto-upgraded.

@jen-huang jen-huang added bug Fixes for quality problems that affect the customer experience Team:Fleet Team label for Observability Data Collection Fleet team labels Mar 10, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@jen-huang jen-huang added the technical debt Improvement of the software architecture and operational architecture label Apr 28, 2021
@jfsiii
Copy link
Contributor

jfsiii commented May 10, 2021

I don't know if this is a bug, but it is intentional. Perhaps we still want to change the behavior anyway.

From the design doc:

When are documents removed from the index?

When a package is uninstalled. After all the other steps required to delete an installation (deleting index templates, updating the epm-packages saved object, etc) have completed. Today, that happens a) in response to a DELETE /api/fleet/epm/packages/name-version API call or b) when rolling back a failed (re-)installation

Why?
Similar rationale to why they are added as late as possible. To ensure everything else has gone to plan or recovered as planned and avoid deleting data before we’re certain we don’t need it.

The rationale being, it's better to keep too many and let the user/system explicitly choose to deal with them (via the API, etc) than delete them too soon and not support a use case.

As I said, perhaps we do want to change this behavior. I just think we should be confident that there's agreement with Product on the desired scenarios and eng/QA to confirm the actual behavior before changing it.

@jen-huang
Copy link
Contributor Author

jen-huang commented May 25, 2021

Thanks for the background @jfsiii, I understand the rationale behind avoiding deleting too early:

Similar rationale to why they are added as late as possible. To ensure everything else has gone to plan or recovered as planned and avoid deleting data before we’re certain we don’t need it.

Right now, the only case that I see where we will need to keep an older package around is in case of upgrade failure. We should always be keeping a "last known working version" around for failover.

My proposal is to delete the older version after an upgrade is 100% successful. I do not see any other cases where keeping old assets around is useful. As default packages are auto-upgraded, keeping every old version around will blow up the storage really quickly as packages get released faster.

Thoughts from other team members?

@jen-huang
Copy link
Contributor Author

Oops, I forgot one other (major) case where we need to have older packages around: currently package policies are not upgraded automatically when a package is upgraded (#92612). This means that when editing those package policies, we retrieve the package information for the older package version. Currently we retrieve it from ES storage, which is nice since it mitigates risk of if the older version is somehow not available on the registry.

I'm assigning myself to do some research and proposal for #92612, so I will think about how ES storage of packages affects that. In the meantime I'm putting this in the backlog until a consensus is reached.

@jen-huang jen-huang removed the bug Fixes for quality problems that affect the customer experience label May 25, 2021
@joshdover
Copy link
Contributor

joshdover commented Aug 23, 2021

After getting up to speed with how the Fleet plugin uses these Saved Object types, I agree we should begin cleaning these old epm-package-assets objects. On a setup where the system, elastic-agent, and fleet-server packages are installed, 396 of these objects will be created. Upgrading each of these packages once results in doubling to 792 and so on.

Large number of Saved Objects can cause Kibana's Saved Object migration system to take longer on each Stack upgrade (including patch upgrades). One problem with delaying any changes here is that in the meantime, this problem will continue to grow so that upgrading to a future version where this is fixed won't be possible (or at least, not easy) since the migration system runs before Fleet would have a chance to delete these old objects.

The Saved Object migration system does have a basic mechanism for filtering out unneeded objects (such as SO types that are no longer used at all). The problem is that the basic mechansim only has the ability to provide simple filter, whereas in this case we cannot easily create a filter without first querying Elasticsearch. Migrations also have a workaround for this scenario, but it may be removed in a future version due to the risk it adds to upgrades (see #106991).

Since we Fleet was fairly recently released as GA, I think we should try to solve this problem without introducing further risk to upgrades by:

  • Adding cleanup logic to the package upgrade process that deletes the epm-packages-assets and epm-packages objects after a package successfully upgrades.
  • Add logic that runs on Kibana startup that checks for any existing packages that were already upgraded and deletes the old objects. This logic can likely be removed in a future release.

In both of these scenarios we should be pretty defensive about making the decision to delete an older package version. At minimum we need to check that no ingest-package-policies objects are referring to the epm-packages that is a candidate for deletion. This will ensure that any package upgrades (auto or manual) that could not proceed due to conflicts will not be deleted.

@kpollich Any other considerations that we need to make based on how the upgrade policy API works today?

@jen-huang
Copy link
Contributor Author

In both of these scenarios we should be pretty defensive about making the decision to delete an older package version. At minimum we need to check that no ingest-package-policies objects are referring to the epm-packages that is a candidate for deletion. This will ensure that any package upgrades (auto or manual) that could not proceed due to conflicts will not be deleted.

Once the policy conflicts are resolved, when do we check epm-package-assets for deletion eligibility again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Fleet Team label for Observability Data Collection Fleet team technical debt Improvement of the software architecture and operational architecture
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants