Fix leaks with metadata processors #16349

jsoriano · 2020-02-14T20:38:18Z

What does this PR do?

Add a closer interface for processors so their resources can be released when the processor is not needed anymore.

Explicitly close publisher pipelines so their processors are closed.

Add closers for add_docker_metadata, add_kubernetes_metadata and add_process_metadata.

Why is it important?

Processors can be defined in dynamic configurations, as in autodiscover templates. Some processors open connections, files, or start goroutines that need to be closed or stopped.

Processors can be dynamically created at least when using autodiscover and when used in the script processor. When this issue appears it can be usually solved by configuration (as described in #14389). So this case is probably not so serious.

Processors can include an on-disk cache since #20775, files open by this cache should be properly closed on shutdown.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~I have made corresponding changes to the documentation~~
~~I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works

Author's Checklist

Check that resources are released in:

Filebeat
Metricbeat
- ~~Processors at the module level are open multiple times?~~ (One per metricset, to be reviewed as a different issue, only a real problem when a processor cannnot be instantiated twice with the same config, what only happens with add_cloudfoundry_metadata)
Autodiscover
Config reload
Check if there are other different pipelines that use processors in other beats (Everything using libbeat publisher pipeline should be covered, to open new issues if something that is not covered by this assumption should be covered).
Processors used in the script processor. Options to solve this:
- ~~Somehow call close on all the instantiated processors once the script is finished.~~
- Avoid registering processors implementing Closer for the script processor.

Add tests:

Native processors with closer in script processor.
Processing pipelines created by supporters don't close the global processors.

Related issues

Fixes Filebeat too many open files using Docker autodiscover #14389
Fixes kubernetes watch connection don't release when reload create runner failed #12096
Fixes a pending issue from Add a persistent cache for cloudfoundry metadata based on badger #20775

Use cases

Add processors in dynamic configuration and release its resources when not needed anymore.
Use processors that make use of files that should be closed (as cache on disk used by add_cloudfoundry_metadata).

jsoriano · 2020-02-21T14:36:19Z

@andrewkroh @exekias I would like to have your opinions on this.

Summarizing, I think this issue would only affect add_docker_metadata and add_kubernetes_metadata, as they start watchers that need their own logistics.

The issue can be reproduced by using one of these processors in an autodiscover template, or in a script processor. But actually I cannot think on a use case for this. Metadata is added in any case in autodiscover, and even if needed these processors could be placed in the main static configuration. Regarding scripts, I don't think it is useful at all to call these processors from scripts (but we are exposing them at the moment, and they can be used as in the test case added here).

I started this PR to add closers to the processors that need them, and call these closers from all the places where they are instantiated. The problem I see with this approach:

We allow processors in many places, so a lot of things need to be touched.
Does it worth it? It only happens with some special processors, and when happens is because they are being used incorrectly. Calling the closers on scripts can be tricky, we would be creating and destroying heavy processors for every event.

I am thinking on continuing adding the closers support in common cases, as it may be also useful for other processors in the future, but adding a check in the script processor so processors that define a closer cannot be registered (this would imply to stop registering these processors for the script processor, what would be a breaking change, but I guess nobody is using them).

An alternative would be to do nothing, as this is something that can be solved with a proper configuration. Though it'd be nice to don't allow the problem to happen.

Opinions?

andrewkroh

I think its a good idea build in support for Close()ing processors. We have processors holding resources/state that need to be released so we need to release them. The script processor should disallow constructing stateful processors (more on this below).

For the script processor, I would consider it an error to instantiate any processor during the process() invocation. Any processors that are needed should be constructed once at initialization. So I wouldn't try to handle releasing resources that are allocated for every invocation of the script processor in the process() function. But we could release the resources associated with the processors constructed at initialization. We control their constructors so we can track what needs to be Close()ed in a list.

BUT I made an assumption that all processors were stateless (hey, that's how their interface is designed). So the script processor uses a sync.Pool containing javascript VMs to enable concurrent usage of the processor (a VM instance supports only one invocation at a time). Because it uses a sync.Pool, instances can be garbage collected any time they are not being used. This means we don't have a way to invoke Close() on the processors when an instance in GC'ed. You'd have to use something other than sync.Pool to make this all work. So it's easier to just disallow stateful processors from scripts.

libbeat/publisher/processing/processors.go

jsoriano · 2020-02-21T17:11:51Z

Any processors that are needed should be constructed once at initialization.

Oh, I don't know why I was thinking that the whole script was executed every time the processor is run. Then I think it would be ok to implement a closer for the script processor that closes the processors that need it.

instances can be garbage collected any time they are not being used.

Do you mean that the instance can be garbage collected even when the processor is still in use? Would that mean that if the processor is run again, it will run the initialization again?

jsoriano · 2020-02-21T17:31:46Z

instances can be garbage collected any time they are not being used.

Do you mean that the instance can be garbage collected even when the processor is still in use? Would that mean that if the processor is run again, it will run the initialization again?

Oh, ok, I understand it now looking at the code. Then I think that I will remove by now stateful processors from the registry and we can try in the future to look for an alternative to sync.Pool.

Thanks!

andrewkroh · 2020-02-21T21:02:29Z

I was still thinking about this issue for some reason 😆 and have another consideration that I haven't vetted. This comes into play if you plan on closing the processors associated with publisher clients. It's about this sharing of global processors in publisher pipelines. Anecdotally, I think global processors get shared by each publisher client's processor pipeline. If that is the case then Close()ing the processors from one client could affect another client. The publisher client would need to have two lists of processors (global, local) instead of one list to be able to free the right resources.

jsoriano · 2020-02-24T11:44:41Z

I was still thinking about this issue for some reason 😆

Thanks! 😂

I think global processors get shared by each publisher client's processor pipeline. If that is the case then Close()ing the processors from one client could affect another client.

Yes, I am afraid that you are right and in some places processors are reused, but I am not sure if only in static configurations, I don't see it happening in autodiscover for example.

I was also taking a look to the code and there are several places where we create processors, I think that supporting stateful processors properly is going to be a bigger refactor than I originally expected. As you mentioned, their interface is designed to be stateless (only New and Run methods), so actually just adding a Closer interface may be tricky as it wouldn't match with such interface. Also it is going to be difficult to control when to call these closers if we are reusing them and using them in contexts where we cannot control their finalization. I was thinking on using some kind of context or done channel that we can propagate, so processors are closed when their creator is closed, but we would need to pass it to New, what would still be a bit counterintuitive in this interface, we would need some kind of Open/Start method.

Another option for refactor would be to re-think the processors with state, maybe they should be a different kind of enrichers. But this would open other different rabbit holes for sure.

I have been talking with @exekias too about this, and we think that we might by now try to address this issue by avoiding the cases where this can be more problematic, and leave further refactors for the future. The solutions to implement by now would be:

Stop registering add_docker_metadata and add_kubernetes_metadata for the script processor.
Reject add_docker_metadata and add_kubernetes_metadata processors when validating configs in autodiscover. This would be hacky, but would prevent issues like Filebeat too many open files using Docker autodiscover #14389 to happen.

elasticmachine · 2020-03-09T11:42:32Z

Pinging @elastic/integrations-platforms (Team:Platforms)

elasticmachine · 2020-09-23T19:33:45Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Build Cause: [Pull request #16349 updated]
Start Time: 2020-10-13T09:11:34.058+0000
Duration: 84 min 56 sec

Test stats 🧪

Test	Results
Failed	0
Passed	16325
Skipped	1343
Total	17668

…sors-leak

…l" processor

jsoriano · 2020-10-13T09:12:19Z

add_process_metadata also has a cache that needs to be stopped, I have added a closer that is only used when the cache is used. In the script processor the cache is disabled so the processor can still be used there.

exekias · 2020-10-13T10:41:48Z

Nice work here! LGTM

jsoriano · 2020-10-13T11:06:59Z

Thanks for the feedback and the reviews!

Add a closer interface for processors so their resources can be released when the processor is not needed anymore. Explicitly close publisher pipelines so their processors are closed. Add closers for add_docker_metadata, add_kubernetes_metadata and add_process_metadata. Script processor will fail if a processor that needs to be closed is used. (cherry picked from commit a3fe796)

* upstream/master: (127 commits) Update obs app links (elastic#21682) fix: update fleet test suite name (elastic#21738) Remove dot from file.extension value in Auditbeat FIM (elastic#21644) Fix leaks with metadata processors (elastic#16349) Add istiod metricset (elastic#21519) [Ingest Manager] Change Sync/Close call order (elastic#21735) [Ingest Manager] Syncing unpacked files (elastic#21706) Fix concurrent map read and write in socket dataset (elastic#21690) Fix conditional coding to remove seccomp info from Winlogbeat (elastic#21652) [Elastic Agent] Fix issue where inputs without processors defined would panic (elastic#21628) Add configuration of filestream input (elastic#21565) libbeat/logp: introduce Logger.WithOptions (elastic#21671) Make o365audit input cancellable (elastic#21647) fix: remove extra curly brace in script (elastic#21692) [Winlogbeat] Remove brittle configuration validation from wineventlog (elastic#21593) Fix function that parses from/to/contact headers (elastic#21672) [CI] Support Windows-2016 in pipeline 2.0 (elastic#21337) Skip publisher flaky tests (elastic#21657) backport: add 7.10 branch (elastic#21635) [CI: Packaging] fix: push ubi8 images too (elastic#21621) ...

Add a closer interface for processors so their resources can be released when the processor is not needed anymore. Explicitly close publisher pipelines so their processors are closed. Add closers for add_docker_metadata, add_kubernetes_metadata and add_process_metadata. Script processor will fail if a processor that needs to be closed is used. (cherry picked from commit a3fe796)

…dependencies-goals * upstream/master: (46 commits) Use badger code from upstream repository (elastic#21705) Disable writes sync in persistent cache (elastic#21754) Make API address and Shard ID required in Cloud Foundry settings (elastic#21759) [CI] Support skip-ci label (elastic#21377) Increase recommended memory when deploying in Cloud foundry (elastic#21755) typofix for dns timeout configuration option (elastic#21069) chore: create CI artifacts for DEV usage (elastic#21645) [Ingest Manager] Atomic installed forgotten changelog elastic#21780 [Ingest Manager] Agent atomic installer (elastic#21745) Add missing configuration annotations (elastic#21736) [Filebeat] Add check for context.DeadlineExceeded error (elastic#21732) Remove kafka partition ISR from Metricbeat docs (elastic#21709) Skip flaky test with oauth2 config in httpjson input (elastic#21749) Fix for azure retrieve resource by ids (elastic#21711) Update obs app links (elastic#21682) fix: update fleet test suite name (elastic#21738) Remove dot from file.extension value in Auditbeat FIM (elastic#21644) Fix leaks with metadata processors (elastic#16349) Add istiod metricset (elastic#21519) [Ingest Manager] Change Sync/Close call order (elastic#21735) ...

…ne-2.0-arm * upstream/master: (93 commits) Fix non-windows fields on system/filesystem (elastic#21758) disable TestReceiveEventsAndMetadata/TestSocketCleanup/TestReceiveNewEventsConcurrently in Windows (elastic#21750) Use badger code from upstream repository (elastic#21705) Disable writes sync in persistent cache (elastic#21754) Make API address and Shard ID required in Cloud Foundry settings (elastic#21759) [CI] Support skip-ci label (elastic#21377) Increase recommended memory when deploying in Cloud foundry (elastic#21755) typofix for dns timeout configuration option (elastic#21069) chore: create CI artifacts for DEV usage (elastic#21645) [Ingest Manager] Atomic installed forgotten changelog elastic#21780 [Ingest Manager] Agent atomic installer (elastic#21745) Add missing configuration annotations (elastic#21736) [Filebeat] Add check for context.DeadlineExceeded error (elastic#21732) Remove kafka partition ISR from Metricbeat docs (elastic#21709) Skip flaky test with oauth2 config in httpjson input (elastic#21749) Fix for azure retrieve resource by ids (elastic#21711) Update obs app links (elastic#21682) fix: update fleet test suite name (elastic#21738) Remove dot from file.extension value in Auditbeat FIM (elastic#21644) Fix leaks with metadata processors (elastic#16349) ...

jsoriano added bug [zube]: In Progress Team:Services (Deprecated) Label for the former Integrations-Services team labels Feb 14, 2020

jsoriano self-assigned this Feb 14, 2020

jsoriano mentioned this pull request Feb 14, 2020

Filebeat too many open files using Docker autodiscover #14389

Closed

jsoriano force-pushed the autodiscover-processors-leak branch 2 times, most recently from 315a8d6 to faaa97f Compare February 20, 2020 19:30

jsoriano added the discuss Issue needs further discussion. label Feb 21, 2020

jsoriano force-pushed the autodiscover-processors-leak branch 2 times, most recently from b2ca802 to a4860e2 Compare February 21, 2020 14:00

andrewkroh reviewed Feb 21, 2020

View reviewed changes

libbeat/publisher/processing/processors.go Outdated Show resolved Hide resolved

This was referenced Feb 24, 2020

Remove stateful processors from the script processor #16514

Merged

Cherry-pick #16514 to 7.x: Remove stateful processors from the script processor #16544

Merged

Allow stateful processors only in global config #16653

Closed

exekias mentioned this pull request Feb 27, 2020

Add Cloud Foundry Processor #16621

Merged

4 tasks

jsoriano mentioned this pull request Feb 28, 2020

[libbeat] Fix / clarify the module reload logic #16440

Merged

5 tasks

andresrc added the Team:Platforms Label for the Integrations - Platforms team label Mar 9, 2020

andresrc added [zube]: Inbox [zube]: In Progress and removed [zube]: In Progress [zube]: Inbox labels Mar 9, 2020

jsoriano mentioned this pull request Sep 16, 2020

Proposal: Cache metadata to local disk/db #19511

Closed

jsoriano force-pushed the autodiscover-processors-leak branch from 80fbda4 to b55a656 Compare September 23, 2020 18:24

jsoriano added 3 commits October 12, 2020 21:07

Merge remote-tracking branch 'origin/master' into autodiscover-proces…

27c5063

…sors-leak

Add a comment about using Close() to indicate that this is a "statefu…

b7e9d1d

…l" processor

Add closer to add_process_metadata

ab008f6

urso approved these changes Oct 13, 2020

View reviewed changes

jsoriano merged commit a3fe796 into elastic:master Oct 13, 2020

jsoriano deleted the autodiscover-processors-leak branch October 13, 2020 11:08

zube bot added [zube]: Done and removed [zube]: In Progress labels Oct 13, 2020

jsoriano mentioned this pull request Oct 13, 2020

Cherry-pick #16349 to 7.x: Fix leaks with metadata processors #21739

Merged

16 tasks

jsoriano removed the needs_backport PR is waiting to be backported to other branches. label Oct 13, 2020

jsoriano mentioned this pull request Oct 13, 2020

Cherry-pick #16349 to 7.10: Fix leaks with metadata processors #21740

Merged

16 tasks

jsoriano mentioned this pull request Oct 13, 2020

Disable writes sync in persistent cache #21754

Merged

6 tasks

This was referenced Oct 14, 2020

Cherry-pick #21754 to 7.x: Disable writes sync in persistent cache #21788

Merged

Cherry-pick #21754 to 7.10: Disable writes sync in persistent cache #21789

Merged

jsoriano mentioned this pull request Dec 10, 2020

Rate limit processor #22883

Merged

6 tasks

zube bot removed the [zube]: Done label Jan 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix leaks with metadata processors #16349

Fix leaks with metadata processors #16349

jsoriano commented Feb 14, 2020 •

edited by zube bot

Loading

jsoriano commented Feb 21, 2020

andrewkroh left a comment •

edited

Loading

jsoriano commented Feb 21, 2020

jsoriano commented Feb 21, 2020

andrewkroh commented Feb 21, 2020

jsoriano commented Feb 24, 2020

elasticmachine commented Mar 9, 2020

elasticmachine commented Sep 23, 2020 •

edited by jenkins-beats-ci bot

Loading

Build stats

Test stats 🧪

jsoriano commented Oct 13, 2020 •

edited

Loading

exekias commented Oct 13, 2020

jsoriano commented Oct 13, 2020

Fix leaks with metadata processors #16349

Fix leaks with metadata processors #16349

Conversation

jsoriano commented Feb 14, 2020 • edited by zube bot Loading

What does this PR do?

Why is it important?

Checklist

Author's Checklist

Related issues

Use cases

jsoriano commented Feb 21, 2020

andrewkroh left a comment • edited Loading

Choose a reason for hiding this comment

jsoriano commented Feb 21, 2020

jsoriano commented Feb 21, 2020

andrewkroh commented Feb 21, 2020

jsoriano commented Feb 24, 2020

elasticmachine commented Mar 9, 2020

elasticmachine commented Sep 23, 2020 • edited by jenkins-beats-ci bot Loading

💚 Build Succeeded

Build stats

Test stats 🧪

jsoriano commented Oct 13, 2020 • edited Loading

exekias commented Oct 13, 2020

jsoriano commented Oct 13, 2020

jsoriano commented Feb 14, 2020 •

edited by zube bot

Loading

andrewkroh left a comment •

edited

Loading

elasticmachine commented Sep 23, 2020 •

edited by jenkins-beats-ci bot

Loading

jsoriano commented Oct 13, 2020 •

edited

Loading