5.0 Default pipelines #21101

Closed
niemyjski opened this issue Oct 24, 2016 · 34 comments · Fixed by #32286
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >feature help wanted adoptme

Comments

@niemyjski
Contributor

The more I use pipelines, the more useful they would become if I could specify a list of pipelines that automatically get run at the type or index level.

This would also save the overhead of specifying which pipeline to use, since in a huge percentage of pipeline use cases the pipeline never changes. It would also make the API easier to use.

One question: if I specify a list of pipelines on an index and then also specify a pipeline on the request, would the result be a merged list, or would only the specified pipeline run?

@clintongormley
Contributor

The more I use pipelines, the more useful they would become if I could specify a list of pipelines that automatically get run at the type or index level.

I've been against default pipelines because pipelines should only be run on the first ingestion. When you update or overwrite a document, you may not want the default to run. For this reason I prefer pipelines to be manually specified.

@clintongormley clintongormley added discuss :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Nov 5, 2016
@djschny
Contributor

djschny commented Nov 12, 2016

With _timestamp removed, if a user wants to add their own timestamp field, a pipeline processor is really the only way to do it. Having to force all clients to specify the same pipeline (or include it in theirs) is problematic. I reached for this within the first 30 minutes of using pipelines and feel it would be very helpful.
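
For illustration, here is a minimal sketch of the kind of pipeline described above, built as a plain Python dict so it can be inspected without a cluster. The pipeline id `add-ingest-timestamp` and the field name `ingested_at` are placeholders, not names from this thread. The sketch assumes the `set` processor and the `{{_ingest.timestamp}}` template value behave as documented for ingest nodes.

```python
import json

# Hypothetical names: the pipeline id and the target field are placeholders.
PIPELINE_ID = "add-ingest-timestamp"
TIMESTAMP_FIELD = "ingested_at"


def build_timestamp_pipeline(field: str) -> dict:
    """Return a pipeline body with one `set` processor that stamps `field`."""
    return {
        "description": "Stamp each document with the time it was ingested",
        "processors": [
            {"set": {"field": field, "value": "{{_ingest.timestamp}}"}}
        ],
    }


pipeline_body = build_timestamp_pipeline(TIMESTAMP_FIELD)

# The body would be sent as: PUT _ingest/pipeline/<PIPELINE_ID>
print(f"PUT _ingest/pipeline/{PIPELINE_ID}")
print(json.dumps(pipeline_body, indent=2))
```

Without a default pipeline, every client would still have to add `?pipeline=add-ingest-timestamp` to its index requests, which is the problem described above.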

@inqueue
Member

inqueue commented Feb 9, 2017

When you update or overwrite a document, you may not want the default to run.

I agree for a non-logging use case. The ability to enable default pipelines for a logging use case would be very helpful where document updates are non-existent. Further, I would only want to enable default pipelines for certain indices or have the capability to do so. Could it be an index setting?

Example uses for logging might be password stripping with the set processor or field truncation using a script processor.

@tipuban

tipuban commented Mar 7, 2017

+1 to allowing default pipelines for indices.
Ideally you would also be able to specify on what type of operation the pipeline would be run: insert-time, update or both.

@cristimagda

Ideally you would also be able to specify on what type of operation the pipeline would be run: insert-time, update or both.

+1 It would be very useful.

@marius-dr
Member

+1 for this. It helps users who were affected by the removal of the _timestamp field.

@zoellner

Another argument in favor of being able to specify a default pipeline:
There are scenarios where the PUT command (and some other document preprocessing) is outside the control of the ES operator.

AWS lets you feed an Elasticsearch instance from an Amazon Kinesis Firehose stream. However, the document _id is set by the Firehose stream. Firehose also controls the command used to send the data to the Elasticsearch instance, so it is not possible to add the ?pipeline query parameter.
With a default ingest pipeline (based on index/type, ideally specified altogether in the index template) one could set the _id through a preprocessor based on the document _source.
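
A rough sketch of the idea, assuming a `set` processor is allowed to write the `_id` metadata field and that each document carries its own unique key. The field name `event_id` is hypothetical, and `expected_id` is only a local stand-in for the intended result, not Elasticsearch code.

```python
import json

# Hypothetical field name: assumes each document carries its own unique key.
SOURCE_ID_FIELD = "event_id"


def build_id_pipeline(source_field: str) -> dict:
    """Pipeline body whose `set` processor copies a source field into `_id`."""
    return {
        "description": "Derive the document _id from a field in _source",
        "processors": [
            {"set": {"field": "_id", "value": "{{" + source_field + "}}"}}
        ],
    }


def expected_id(source: dict, source_field: str) -> str:
    """Local stand-in showing the intended result; this is not Elasticsearch code."""
    return str(source[source_field])


id_pipeline = build_id_pipeline(SOURCE_ID_FIELD)
print(json.dumps(id_pipeline, indent=2))
print(expected_id({"event_id": "abc-123", "message": "hello"}, SOURCE_ID_FIELD))
```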

@titzi

titzi commented May 31, 2017

+1 especially as an option block in index templates, I guess this would be a perfect spot for it.

@whiteboardmonk

+1 having this in the index template will be very useful

AWS lets you feed an Elasticsearch instance from an Amazon Kinesis Firehose stream. However, the document _id is set by the Firehose stream. Firehose also controls the command used to send the data to the Elasticsearch instance, so it is not possible to add the ?pipeline query parameter.
With a default ingest pipeline (based on index/type, ideally specified altogether in the index template) one could set the _id through a preprocessor based on the document _source.

@zoellner I stumbled upon this issue while looking for the exact same option. Did you happen to figure out a way around this?

@zoellner

zoellner commented Jun 8, 2017

@whiteboardmonk no, I've since stopped using Firehose Streams because of this issue.

@wpongra

wpongra commented Aug 7, 2017

+1, _timestamp-replacement as the use-case

@niemyjski
Contributor Author

Are there any updates on this?

@mr-mos

mr-mos commented Aug 22, 2017

+1, for the default pipeline (regarding the use-case for _timestamp)

@redx177

redx177 commented Aug 29, 2017

+1
I do understand @clintongormley's argument against it, and it should be documented that a default pipeline will be executed on first ingestion as well as on updates. But having the choice between specifying it at the index level or per request gives the flexibility to use whatever is more appropriate for the current job.

@chs-bnet

chs-bnet commented Sep 6, 2017

Being able to specify a default pipeline, perhaps in an index template, would be extremely useful for our case, where we don't have control over the bulk put. We are using fluentd and its elasticsearch plugin, and I don't believe there is a way for us to specify a pipeline using its output language.

@zfanswer

zfanswer commented Sep 8, 2017

+1 for default pipeline.
And the setting should probably live on the index side, not on the pipeline.
Also add an option to skip the pipeline, like ?skip_pipeline=true, for some interfaces (e.g. reindex) as a special case. That may avoid the case @clintongormley mentioned at the beginning.

@de-robat

+1, to add another real-world use case: we have a tracing implementation that persists to Elasticsearch. It keeps track of timestamps in microseconds instead of milliseconds. Adding an additional field that does the conversion while indexing would be extremely helpful. There is no way to control the trace collector's PUT requests to ES and therefore no way to configure a pipeline via query params :/
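
A sketch of what such a conversion pipeline might look like, using a `script` processor. The field names `timestamp_us` and `timestamp_ms` are hypothetical. The sketch assumes the source field arrives as an integer and that the script body goes under the `source` key (older versions may expect `inline` instead). The Python helper only mirrors the arithmetic locally.

```python
import json

# Hypothetical field names for the tracer's timestamp and the derived field.
MICROS_FIELD = "timestamp_us"
MILLIS_FIELD = "timestamp_ms"


def build_conversion_pipeline(micros_field: str, millis_field: str) -> dict:
    """Pipeline body with a Painless script that derives milliseconds."""
    source = f"ctx.{millis_field} = ctx.{micros_field} / 1000"
    return {
        "description": "Add a millisecond timestamp next to the microsecond one",
        "processors": [{"script": {"lang": "painless", "source": source}}],
    }


def micros_to_millis(micros: int) -> int:
    """Local mirror of the script's integer division; not Elasticsearch code."""
    return micros // 1000


conversion_pipeline = build_conversion_pipeline(MICROS_FIELD, MILLIS_FIELD)
print(json.dumps(conversion_pipeline, indent=2))
print(micros_to_millis(1_507_500_000_123_456))
```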

@dandrestor

@clintongormley Good point. Perhaps a good idea would be having an index-wide setting for a default pipeline, with some parameters controlling for which operations the default applies? (By operations I mean index or update or whatever.)

@clintongormley
Contributor

I think I could get behind the following:

  • an index setting which specifies the default pipeline to use for index or create operations only
  • update operations would not use the default pipeline
  • specifying ?pipeline=foo in an index request would result in the foo pipeline being applied instead of the default pipeline
  • specifying ?pipeline= in an index request would result in no pipeline being applied
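
The four rules above can be sketched as a small resolution function. This only models the proposal as written in this comment, not the eventual implementation. The function name and arguments are made up for illustration, and the sketch assumes update operations ignore pipelines entirely, since the proposal only covers explicit pipelines on index requests.

```python
from typing import Optional


def resolve_pipeline(
    operation: str,
    requested: Optional[str],
    index_default: Optional[str],
) -> Optional[str]:
    """Pick the pipeline to run, or None for no pipeline.

    operation:     "index", "create", or "update"
    requested:     value of ?pipeline= (None if absent, "" if given but empty)
    index_default: the index-level default pipeline, if one is configured
    """
    if operation not in ("index", "create"):
        # Rule 2: updates never pick up the default (assumed: no pipeline at all).
        return None
    if requested is None:
        # Rule 1: no parameter on the request, so fall back to the index default.
        return index_default
    # Rule 3: an explicit name replaces the default.
    # Rule 4: an empty value means "run nothing".
    return requested or None


print(resolve_pipeline("index", None, "default-pipeline"))
print(resolve_pipeline("index", "foo", "default-pipeline"))
print(resolve_pipeline("index", "", "default-pipeline"))
print(resolve_pipeline("update", None, "default-pipeline"))
```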

@rjernst
Member

rjernst commented Oct 9, 2017

specifying ?pipeline= in an index request would result in no pipeline being applied

It seems like this could easily be a malformed request. For this corner case (that someone wants to get around the default pipeline), one could create a dummy pipeline that does nothing and specify that explicitly here? Then specifying pipeline with an empty string can return an error?

@rjernst
Member

rjernst commented Oct 9, 2017

Or, have a specially named value called _none?

@stevenwall

+1

We sure could use this functionality as well.

Has there been an update from ES yet on whether this is on the roadmap? I'm not finding one.

@zfanswer

Is this supported in ES 6.0?

@kafis

kafis commented Mar 1, 2018

+1

I like the ingest pipeline, as it decouples me from any pre-processing of my logs at the source.
But if I can't enable it by default, I am still forced to manipulate my sources (which I don't necessarily control) to use a specific ingest pipeline.

@prasadkhandagale

+1

@sergii-sakharov

One more use case with default pipelines could be e.g. custom validation/postprocessing of Kibana objects in case of introduction of a pipeline on .kibana index.

@SebC99

SebC99 commented Mar 15, 2018

@clintongormley Why do you want to restrict something optional?
I mean, letting the default pipeline be used in updates could be very useful for calculated fields (in our case, suggest fields for completion), while those who want to use a pipeline only at index time could still do it manually with index parameters.
Or we could specify a default one for index and a default one for update?
Again, as using a default pipeline would not be mandatory, I believe there's no point in making it restricted.
My 2 cents :)

@talevy talevy added the help wanted adoptme label Mar 15, 2018
@talevy talevy added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP and removed :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP discuss labels Mar 15, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@kunna-ujet

+1

@vanntomm

vanntomm commented May 4, 2018

+1

@trippd6

trippd6 commented May 10, 2018

+1

@lukeplausin

+1

@romanpierson

+1

@djptek
Contributor

djptek commented Jul 11, 2018

+1

Requested by a student in Engineer II training. Use case: data validation.

Q. This ^^discussion considers adding a pipeline to index settings. As an alternative, could a default pipeline be specified in an alias, which could be exposed for first ingest while allowing subsequent update or overwrite directly via the index or via an alternative alias using no (or a different) pipeline?

@original-brownbear original-brownbear self-assigned this Jul 20, 2018
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 23, 2018
* Add `default_pipeline` index setting
* Empty string pipeline argument is interpreted as no pipeline
* closes elastic#21101
original-brownbear added a commit that referenced this issue Aug 2, 2018
* INGEST: Enable default pipelines

* Add `default_pipeline` index setting
* `_none` is interpreted as no pipeline
* closes #21101
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 2, 2018
* INGEST: Enable default pipelines

* Add `default_pipeline` index setting
* `_none` is interpreted as no pipeline
* closes elastic#21101
original-brownbear added a commit that referenced this issue Aug 2, 2018
* INGEST: Enable default pipelines

* Add `default_pipeline` index setting
* `_none` is interpreted as no pipeline
* closes #21101
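
As a rough illustration of what the commits above describe, the sketch below builds an index settings body with a `default_pipeline` entry and models how a per-request pipeline and the `_none` value might interact. It assumes the setting lives under the `index` namespace and that an index with no default behaves as if it were set to `_none`. The helper names and the pipeline id are hypothetical, and none of this is taken from the merged code.

```python
import json
from typing import Optional

# Reserved value named in the commit message: means "run no pipeline".
NO_PIPELINE = "_none"


def default_pipeline_settings(pipeline_id: str) -> dict:
    """Settings body for an index or index template (assumed `index.` namespace)."""
    return {"index": {"default_pipeline": pipeline_id}}


def effective_pipeline(
    requested: Optional[str], index_default: Optional[str]
) -> Optional[str]:
    """Model of the lookup: the request value wins, `_none` disables."""
    chosen = requested if requested is not None else index_default
    if chosen is None or chosen == NO_PIPELINE:
        return None
    return chosen


# Hypothetical pipeline id; the body would be sent as: PUT <index>/_settings
print(json.dumps(default_pipeline_settings("add-ingest-timestamp"), indent=2))
print(effective_pipeline(None, "add-ingest-timestamp"))
print(effective_pipeline("_none", "add-ingest-timestamp"))
```

A client that cannot add `?pipeline=` to its requests (Firehose, fluentd, a trace collector) would get the default, while a reindex or overwrite that should skip it could pass `pipeline=_none`.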