
Add the ability to require an ingest pipeline #46847

Merged
merged 6 commits into from
Sep 19, 2019

Conversation

jasontedor
Member

This commit adds the ability to require an ingest pipeline on an index. Today we can have a default pipeline, but it can be overridden by a request `pipeline` parameter. This commit introduces a new index setting, `index.required_pipeline`, that acts similarly to `index.default_pipeline` except that it cannot be overridden by a request `pipeline` parameter. Additionally, a default pipeline and a request pipeline cannot both be set. The required pipeline can be set to `_none` to ensure that no pipeline ever runs for index requests on that index.
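A minimal sketch of how the new setting might be used, in Elasticsearch console syntax (the index and pipeline names here are hypothetical):

```console
PUT my-index
{
  "settings": {
    "index.required_pipeline": "my-pipeline"
  }
}

# With the required pipeline configured, a request that tries to
# override it via the pipeline parameter would be rejected:
PUT my-index/_doc/1?pipeline=some-other-pipeline
{ "message": "hello" }
```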

@jasontedor jasontedor added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.0.0 v7.5.0 labels Sep 18, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@jakelandis
Contributor

@jasontedor could you add a near clone of https://github.com/elastic/elasticsearch/blob/master/modules/ingest-common/src/test/resources/rest-api-spec/test/ingest/200_default_pipeline.yml to help test required pipelines across the various ways to issue an index request?

Member

@martijnvg martijnvg left a comment


Looks good. I left a few comments.

@jasontedor
Member Author

@elasticmachine run elasticsearch-ci/2

Contributor

@jakelandis jakelandis left a comment


LGTM - did some manual testing with multi-node forwarding and all works as expected.

Member

@martijnvg martijnvg left a comment


LGTM

@jasontedor jasontedor merged commit 19b710a into elastic:master Sep 19, 2019
jasontedor added a commit that referenced this pull request Sep 19, 2019
@jasontedor jasontedor deleted the required-pipeline branch September 19, 2019 20:38
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Sep 24, 2019
Extracted ingest pipeline resolution logic into a static method
and added unit tests for pipeline resolution logic.

Followup from elastic#46847
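The resolution precedence described in this PR can be sketched as follows. This is a simplified Python model for illustration, not the actual Java implementation that was extracted into a static method:

```python
def resolve_pipeline(request_pipeline=None, default_pipeline=None, required_pipeline=None):
    """Return the pipeline id to run for an index request, or None.

    Simplified model of the precedence rules described in the PR:
    - a required pipeline cannot be overridden by a request pipeline
      parameter (such requests are rejected);
    - otherwise a request pipeline wins over the index default;
    - "_none" means no pipeline runs.
    """
    if required_pipeline is not None:
        if request_pipeline is not None:
            # Mirrors the PR's behavior of rejecting the override attempt.
            raise ValueError(
                "request pipeline cannot be set when a required pipeline is configured"
            )
        chosen = required_pipeline
    elif request_pipeline is not None:
        chosen = request_pipeline
    else:
        chosen = default_pipeline
    return None if chosen in (None, "_none") else chosen
```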
@cwurm
Contributor

cwurm commented Oct 8, 2019

Being able to force data to run through a pipeline before indexing is really useful. However, I worry about making the required pipeline the only one that executes when set.

As a concrete example, we've been talking about using the ingest timestamp instead of the event timestamp for running detection rules (searches/queries, but also ML jobs) in SIEM. This would make sure that we always consider all newly arrived data and are never fooled, e.g. by an attacker manipulating system time to send data pretending to be from the distant past or future.

For that, a required ingest pipeline with a `set` processor setting `event.ingested` to `{{_ingest.timestamp}}` would work. We could have all Beats set up such a pipeline on all the Beats indices. However, we would also still need the ability to have another pipeline execute before the required one, e.g. the data-parsing pipelines of any enabled Filebeat modules.
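Such a pipeline could look roughly like this (the pipeline id is illustrative, and the field name is still under discussion):

```console
PUT _ingest/pipeline/record-ingest-time
{
  "description": "Record when each event was ingested",
  "processors": [
    {
      "set": {
        "field": "event.ingested",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
```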

The `required_pipeline` could always execute last, but allow any other pipeline (the `default_pipeline` or one specified in the request) to run before it. What do you think?

/cc @tsg @MikePaquette @randomuserid

@randomuserid

I am not sure I follow - is this a trade-off between regular field parsing and having an ingest timestamp with some sort of alternative parsing scheme? For signal purposes, if we are going to use ingest timestamps, we would need to follow the general convention of giving each event both timestamps: ingest time and event time (the latter being the timestamp in the original event being ingested). We would also need the events to be parsed into ECS fields as they are now in order for signals to evaluate them.

Independent of the ingest time idea, we always need the original event timestamps in order to create forensic timelines as analysts work on cases. If we had to make a Sophie’s choice, the original timestamp is more critical, even if it means running long and expensive searches in order to ensure solving for correctness.

@cwurm
Contributor

cwurm commented Oct 8, 2019

@randomuserid In my example, the ingest timestamp would not replace the event timestamp - the event timestamp would still be in `@timestamp`, the ingest timestamp in `event.ingested` (field name tbd). Let's continue the discussion about this concrete use case off this PR.

@randomuserid

OK, what are the applications for the ingest pipeline - maybe the ability to make a signal on an event tagged as important or special by an endpoint agent, without consideration of time? Or as an alternative method of ensuring important events with weird timestamps are made into signals?

@cwurm
Contributor

cwurm commented Oct 8, 2019

@randomuserid It could be that, yeah. More straightforward, maybe, is adding an ingestion timestamp, or dropping a field (e.g. for privacy reasons), and probably many other things. Today, this kind of central control is often exercised in centrally managed Logstash pipelines, but `required_pipeline` would allow doing it in Elasticsearch as well.
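As a sketch of the field-dropping case, a one-processor pipeline with a `remove` processor would do (the pipeline id and field name here are hypothetical):

```console
PUT _ingest/pipeline/privacy-scrub
{
  "description": "Drop a sensitive field before indexing",
  "processors": [
    {
      "remove": {
        "field": "user.ip",
        "ignore_missing": true
      }
    }
  ]
}
```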

ywangd added a commit that referenced this pull request Feb 10, 2020
The changes add more granularity for identifying the data ingestion user.
The ingest pipeline can now be configured to record the authentication realm
and type. It can also record the API key name and ID when one is in use.
This improves traceability when data are being ingested from multiple agents
and will become more relevant with the upcoming support of required
pipelines (#46847)

Resolves: #49106
joegallo added a commit to joegallo/elasticsearch that referenced this pull request Dec 19, 2022
It's about *final* pipelines (not *required* pipelines) -- see elastic#46847
and elastic#49470 for the history here; 'required' pipelines were renamed to
'final' pipelines.