
Filebeat should fail to start when multiple filestream inputs have the same input ID #40540

Closed
2 tasks done
cmacknz opened this issue Aug 15, 2024 · 14 comments
Labels: breaking change, Team:Elastic-Agent-Data-Plane

Comments

cmacknz (Member) commented Aug 15, 2024

When multiple filestream inputs exist in a Beat configuration without unique input IDs for each input, Filebeat should exit with an error.

Filestream inputs require unique IDs so that they can correctly track their state in the registry. Failing to provide unique IDs can lead to data loss and duplication.

https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html

> Each filestream input must have a unique ID. Omitting or changing the filestream ID may cause data duplication. Without a unique ID, filestream is unable to correctly track the state of files.

Elastic Agent already requires each input in the policy to have a unique ID. We should mirror that behavior into Filebeat for filestream inputs.

  • Failure at Beats startup should log an error message telling users exactly why, including the specific duplicated input IDs.
  • Parse the whole configuration and report all duplicates at once, rather than failing on just the first one (see the example below).
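
For illustration, a minimal filebeat.yml showing the shape that should be rejected — the IDs and paths here are placeholders:

```yaml
filebeat.inputs:
  - type: filestream
    id: my-filestream-id
    paths:
      - /var/log/app/*.log
  - type: filestream
    id: my-filestream-id   # same ID again — startup should fail with an error naming it
    paths:
      - /var/log/other/*.log
```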
cmacknz added the breaking change and Team:Elastic-Agent-Data-Plane labels on Aug 15, 2024
elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

AndersonQ (Member) commented:

@cmacknz is Filebeat also supposed to fail if it detects a duplicated input ID at runtime, for example when using autodiscover?

cmacknz (Member, Author) commented Nov 14, 2024

I would lean towards yes, but in general we should mirror what Elastic Agent does. If you use a variable provider in Elastic Agent to create two inputs where the ID is populated dynamically by a provider (like the local dynamic provider) and is not unique, what happens?

I would expect the agent to refuse to create the inputs in an obvious way, but I haven't specifically tested this in a while.

AndersonQ (Member) commented:

I was looking at the code, and filestream does not receive the whole configuration with all inputs at once. So having filestream check all its inputs and report all duplicated IDs might require more refactoring than it's worth.

Making filestream not start an input if it has a duplicated ID is an easy fix. Another alternative that seems reasonable would be to impose this restriction for all inputs (I'm not sure whether that's desired) on a higher layer that has access to all inputs and can therefore check the IDs.

What I believe we need to avoid is having filestream-specific code on a layer that shouldn't contain input-specific code.
It could even be a validation run just after loading the config from disk, but as I said, I'd rather do it for all inputs than just for filestream. A sketch of what I mean is below.
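
A minimal sketch of what that higher-layer check could look like, run once over the fully loaded config so it reports every duplicate in a single error. All names here are hypothetical stand-ins, not the real Beats types:

```go
package main

import (
	"fmt"
	"strings"
)

// inputConfig is a stand-in for the parsed input configuration;
// the real type lives in the Beats configuration layer.
type inputConfig struct {
	Type string
	ID   string
}

// checkDuplicateIDs is a hypothetical validation pass that runs once,
// after the whole config is loaded, and reports every duplicated ID
// in a single error instead of failing on the first one.
func checkDuplicateIDs(inputs []inputConfig) error {
	seen := map[string][]int{} // ID -> indexes of the inputs using it
	for i, in := range inputs {
		seen[in.ID] = append(seen[in.ID], i)
	}

	var dups []string
	for id, idxs := range seen {
		if len(idxs) > 1 {
			dups = append(dups, fmt.Sprintf("ID %q used by inputs %v", id, idxs))
		}
	}
	if len(dups) > 0 {
		return fmt.Errorf("duplicated input IDs: %s", strings.Join(dups, "; "))
	}
	return nil
}

func main() {
	inputs := []inputConfig{
		{Type: "filestream", ID: "my-filestream-id"},
		{Type: "filestream", ID: "my-filestream-id"}, // duplicate
	}
	if err := checkDuplicateIDs(inputs); err != nil {
		fmt.Println("refusing to start:", err)
	}
}
```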

What do you think, @cmacknz?

cmacknz (Member, Author) commented Nov 20, 2024

> Another alternative that seems reasonable would be to impose this restriction for all inputs (I'm not sure whether that's desired) on a higher layer that has access to all inputs and can therefore check the IDs.

Is putting the logic at a higher level but applying it only to filestream inputs an option? There is no requirement that the logic must live in filestream, only the goal that there can't be two filestream inputs with the same ID.
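
Sticking with the earlier sketch's hypothetical types, applying the check only to filestream would just mean filtering before counting:

```go
// checkFilestreamIDs applies the duplicate check from the previous
// sketch only to filestream inputs, leaving other input types alone.
func checkFilestreamIDs(inputs []inputConfig) error {
	var fs []inputConfig
	for _, in := range inputs {
		if in.Type == "filestream" {
			fs = append(fs, in)
		}
	}
	return checkDuplicateIDs(fs)
}
```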

AndersonQ (Member) commented:

> Is putting the logic at a higher level but applying it only to filestream inputs an option?

It's always an option; I just want to avoid adding input-specific logic on a layer that shouldn't have it.

cmacknz (Member, Author) commented Nov 21, 2024

It is possible other inputs may want to enforce uniqueness in the future, so from an extensibility perspective putting this outside of filestream could be better.

Had we taken a wider view on this originally, we probably would have required that all filebeat inputs have a unique ID.

AndersonQ (Member) commented:

Agreed. Also, we already do what I was trying to avoid, so there is no reason to try to solve it in a more generic way.

cmacknz (Member, Author) commented Dec 3, 2024

I created #41881 as a follow-up from #41731, which doesn't address autodiscover. It's not clear yet how much work it would be to fix this for autodiscover, so let's investigate what happens today and whether it can be improved.

AndersonQ (Member) commented:

@cmacknz one thing still isn't clear to me. The core issue is to prevent filestream inputs with duplicated IDs from running. The check at startup and any possible change to autodiscover only prevent filestream from receiving duplicated IDs; they do not prevent filestream itself from running inputs with duplicated IDs.

My take is that ideally there should be a change in filestream itself so that it refuses to run inputs with duplicated IDs.

No amount of work in other components will completely prevent it from happening. While it seems logical and reasonable that anything sending inputs to filestream (the standalone config, the managed config, or autodiscover) should not send duplicated IDs, ultimately the definitive prevention needs to live in filestream. So should we tackle both: ensure autodiscover does not send duplicated IDs, and make filestream reject duplicated IDs?

If memory serves, Beats autodiscover already does not send duplicated IDs to filestream. The last issue was a false positive due to the way the config is validated: currently this is done by creating an input without running it, and the validation happens before autodiscover ensures each new config is unique. That's why there were false positives regarding duplicated input IDs.

cmacknz (Member, Author) commented Dec 4, 2024

> My take is that ideally there should be a change in filestream itself so that it refuses to run inputs with duplicated IDs.

This makes sense to me. We flag it as an error where we can, and on paths where we can't, we don't let more than one instance of filestream run for the same ID.
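
A sketch of the kind of guard filestream itself could hold to enforce this last line of defense — every input would acquire its ID before running and release it when it stops, so a second instance with the same ID is rejected no matter which path delivered the config. The names here are hypothetical, not the actual Beats API:

```go
package filestreamguard

import (
	"fmt"
	"sync"
)

// idTracker is a process-wide registry of running filestream input IDs.
type idTracker struct {
	mu      sync.Mutex
	running map[string]struct{}
}

func newIDTracker() *idTracker {
	return &idTracker{running: make(map[string]struct{})}
}

// Acquire reserves the ID, failing if an input with it is already running.
func (t *idTracker) Acquire(id string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.running[id]; ok {
		return fmt.Errorf("filestream input with ID %q is already running", id)
	}
	t.running[id] = struct{}{}
	return nil
}

// Release frees the ID so a restarted input may reuse it.
func (t *idTracker) Release(id string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.running, id)
}
```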

AndersonQ (Member) commented:

> My take is that ideally there should be a change in filestream itself so that it refuses to run inputs with duplicated IDs.

> This makes sense to me. We flag it as an error where we can, and on paths where we can't, we don't let more than one instance of filestream run for the same ID.

OK, so shall we have another sub-issue to tackle that? I believe it should come before #41881. What do you think?

cmacknz (Member, Author) commented Dec 6, 2024

Makes sense to me, create the issue for tracking and link it here.

jlind23 assigned belimawr and unassigned AndersonQ on Dec 9, 2024
jlind23 (Collaborator) commented Jan 4, 2025

Both sub-issues have been tackled, closing this as done.

jlind23 closed this as completed on Jan 4, 2025