
enhancement(elasticsearch sink): add support for data streams #5126

Merged: 17 commits merged into vectordotdev:master on Dec 18, 2020
Conversation

spencergilbert
Contributor

@spencergilbert spencergilbert commented Nov 19, 2020

Adds a configuration option to specify 'index' (default) or 'create' (for data streams)

Related: #4939

Ref fluent/fluent-bit#1670

Signed-off-by: Spencer Gilbert spencer.gilbert@gmail.com
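The new option can be exercised with a sink configuration along these lines (a sketch only: the `bulk_action` name comes from the review discussion below, while the surrounding field names and values are illustrative, not taken from this PR):

```toml
[sinks.es]
  type = "elasticsearch"
  inputs = ["in"]
  host = "http://localhost:9200"
  # "index" is the default action; "create" is required when writing
  # to an Elasticsearch data stream
  bulk_action = "create"
  index = "logs-generic-default"
```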

@spencergilbert
Contributor Author

I'll look at the test failures later today

@spencergilbert spencergilbert changed the title enhancement(elasticsearch sink): add support for data streams WIP: enhancement(elasticsearch sink): add support for data streams Nov 19, 2020
@spencergilbert spencergilbert changed the title WIP: enhancement(elasticsearch sink): add support for data streams DRAFT: enhancement(elasticsearch sink): add support for data streams Nov 19, 2020
Member

@jszwedko jszwedko left a comment


This seems like a reasonable implementation to me!

We'll just want to update the docs and probably add a test, as you noted in Discord.

src/sinks/elasticsearch.rs (outdated review thread, resolved)
@jszwedko
Member

You can ignore the windows test failure, we are working on that one.

@spencergilbert
Contributor Author

spencergilbert commented Nov 20, 2020

@jszwedko somewhat confused by what's needed for the test failures - I might have misunderstood your comment in Discord, but the compiler doesn't like adding serde(default)

I can add the field to the tests/config.rs file, but that seems wrong

@jszwedko
Member

jszwedko commented Nov 20, 2020

@spencergilbert Ah, I meant I think you need #[serde(default)] on https://github.com/timberio/vector/pull/5126/files#diff-767cdc952613316aab8665fa7c9ae09a5126bde478de907bcfa3e8010e6ec31eR64 (see some of the other defaulted fields in that struct for examples).
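The `#[serde(default)]` attribute falls back to the field type's `Default` impl (or a `default = "fn"` path) when the key is absent from the config. A minimal self-contained sketch of that defaulting behavior, with hypothetical names rather than the PR's actual struct:

```rust
// Sketch: how a `bulk_action` field can default to "index" when the key is
// missing. In the PR this is achieved with serde's `#[serde(default)]` on the
// config struct field; here we show the `Default` impls it delegates to.

#[derive(Debug, Clone, Copy, PartialEq)]
enum BulkAction {
    Index,
    Create,
}

impl Default for BulkAction {
    // "index" is the sink's default action
    fn default() -> Self {
        BulkAction::Index
    }
}

#[derive(Debug, Default)]
struct ElasticSearchConfigSketch {
    bulk_action: BulkAction,
}

fn main() {
    // A config deserialized without the `bulk_action` key would get this.
    let cfg = ElasticSearchConfigSketch::default();
    assert_eq!(cfg.bulk_action, BulkAction::Index);
    println!("default bulk_action = {:?}", cfg.bulk_action);
}
```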

@spencergilbert spencergilbert changed the title DRAFT: enhancement(elasticsearch sink): add support for data streams enhancement(elasticsearch sink): add support for data streams Nov 20, 2020
@portante

How does this code handle 409 conflicts when create is used and the object is already present?

@portante

FWIW, we try to gracefully handle the Elasticsearch responses in pyesbulk.

@spencergilbert
Contributor Author

spencergilbert commented Nov 20, 2020

How does this code handle 409 conflicts when create is used and the object is already present?

🤔 I didn't add any specific handling so it would end up using the existing retry logic here - https://github.com/timberio/vector/blob/master/src/sinks/elasticsearch.rs#L289

EDIT: it wouldn't retry, looks like it'd match here? https://github.com/timberio/vector/blob/master/src/sinks/elasticsearch.rs#L310

@portante

https://github.com/timberio/vector/blob/master/src/sinks/elasticsearch.rs#L310

That looks good, just might be a bit noisy for duplicates, if the payload contains pre-generated IDs ('_id').

If the client is not generating their own IDs for the Elasticsearch records, then create will not get 409s, and depending on how efficient Vector ends up being at sending _bulk requests, it could generate many duplicates in certain cases.

Does the request code time out on long requests? If so, that will be another problem when the Elasticsearch instance gets overloaded.

The logic looks like it retries on any 5xx server error. We only retry on 500, 503, and 504 with a backoff.

If the client is using index, or using create without generating their own IDs, then retries on 5xx errors can overwhelm an Elasticsearch instance.
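The retry policy sketched above can be made concrete as a status-code classifier. This is a hypothetical illustration of the behavior being discussed, not Vector's actual code (which lives in src/sinks/elasticsearch.rs): retry the transient 5xx statuses with backoff, and treat a 409 conflict from `create` as non-retriable, since retrying would just hit the same conflict.

```rust
// Sketch of the retry classification discussed above (hypothetical names).

#[derive(Debug, PartialEq)]
enum RetryAction {
    Retry,
    DontRetry,
}

fn classify_status(status: u16) -> RetryAction {
    match status {
        // Transient server-side failures: retry with backoff.
        500 | 503 | 504 => RetryAction::Retry,
        // 409: a document with this `_id` already exists; a retry would
        // produce the same conflict, so drop the record instead.
        409 => RetryAction::DontRetry,
        // Other statuses (including remaining 4xx) are not retried here.
        _ => RetryAction::DontRetry,
    }
}

fn main() {
    assert_eq!(classify_status(503), RetryAction::Retry);
    assert_eq!(classify_status(409), RetryAction::DontRetry);
    assert_eq!(classify_status(400), RetryAction::DontRetry);
    println!("classification ok");
}
```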

@spencergilbert
Contributor Author


Awesome feedback @portante, thanks! I'll bother the timber folks to see if they have a preferred handling of this, I know 0.11.x is introducing some automated backoff behaviour which should alleviate some of the overwhelming aspect

@jamtur01 added the labels `provider: elastic` (anything `elastic` service provider related) and `sink: elasticsearch` (anything `elasticsearch` sink related) on Dec 3, 2020
Member

@jszwedko jszwedko left a comment


I'm actually curious to get @binarylogic's thoughts here given his familiarity with ES.

Right now we don't have any handling of partial bulk insert failures (see #140 for discussion of improving that) so records rejected as duplicates should just be dropped as the request is not retried unless a 500-level error occurs. Assuming the 500 represents a partially processed request, the retried request should just see individual errors for the duplicate records (as I understand it) but still process the rest of the records.

In both the index and create case there is a chance for there to be duplicate records if _id is not set.

src/sinks/elasticsearch.rs (outdated review thread, resolved)
Member

@jszwedko jszwedko left a comment


Meant to leave that review as a "Comment". We still need to update the docs and, if we could, add a test here (I think just testing encode_event is fine rather than a full integration test).

@spencergilbert
Contributor Author

Meant to leave that review as a "Comment". We still need to update the docs and, if we could, add a test here (I think just testing encode_event is fine rather than a full integration test).

Yep - If the implementation is a good enough option as is, I'll get the branch updated, tested, and documented. I'll keep an eye out in case @binarylogic has a different opinion

@spencergilbert
Contributor Author

Meant to leave that review as a "Comment". We still need to update the docs and, if we could, add a test here (I think just testing encode_event is fine rather than a full integration test).

I believe the integration tests would need to use a more recent version of Elasticsearch to test data streams - I can open an issue around upgrading ES, if we're interested in that scenario.

@jamtur01
Contributor

jamtur01 commented Dec 5, 2020

@spencergilbert Yes please open that issue.

@jamtur01
Contributor

jamtur01 commented Dec 5, 2020

@spencergilbert Or if you want to include in this PR that'd be fine too.

@spencergilbert
Contributor Author

@spencergilbert Or if you want to include in this PR that'd be fine too.

I'll keep it separate - there might be some breaking changes between 6.6.x (I think this is what integration testing is using) and the introduction of data streams at ~7.9. Besides, technically you can use create actions regardless of data stream usage.

@spencergilbert
Contributor Author

I've added a test (which probably needs tweaking to not use exclude_fields), docs, and improved the default code per @jszwedko's suggestion

@binarylogic
Contributor

Thanks, @spencergilbert!

How does this code handle 409 conflicts when create is used and the object is already present?

To confirm, Vector currently does not retry partial failures since Elasticsearch returns a 201 status code. The rationale is explained here. This is not to say we don't want to, but there are open questions around how to do this properly. There are competing scenarios where the inverse behavior would be desirable. That said, I think we should address this separately as its own project. I've scheduled #140 for the next sprint.

Yep - If the implementation is a good enough option as is, I'll get the branch updated, tested, and documented. I'll keep an eye out in case @binarylogic has a different opinion

Adding a new bulk_action option is a great first step, and as @jszwedko pointed out, we will need to improve the documentation so users understand the behavioral differences.


Finally, we should consider first-class support for data streams as proposed in elastic/logstash#12178. This PR should not close #4939; let's leave that open for further discussion. I've commented on the issue with my thoughts.

@spencergilbert
Contributor Author

Adding a new bulk_action option is a great first step, and as @jszwedko pointed out, we will need to improve the documentation so users understand the behavioral differences.

@binarylogic how much detail do we want in Vector's docs versus linking to the Elasticsearch documentation?

@binarylogic
Contributor

Definitely link out. I just want to point out that the create action must be used with data streams. I would add something to the requirements attribute about that, as well as a how_it_works section with a "Data streams" title.

@spencergilbert
Contributor Author

spencergilbert commented Dec 11, 2020

Definitely link out. I just want to point out that the create action must be used with data streams. I would add something to the requirements attribute about that, as well as a how_it_works section with a "Data streams" title.

Are requirements being displayed somewhere today? I checked the Clickhouse sink and WASM transform that have requirements specified in their cue file, but don't see it on the doc site? Wanted to check against existing examples before writing something up. EDIT: chatted with @lucperkins about this. Will add a quick note about the requirement.

Adds a configuration option to specify 'index' (default) or 'create' (for data streams)

Signed-off-by: Spencer Gilbert <spencer.gilbert@gmail.com>
docs/reference/components/sinks/elasticsearch.cue (outdated review thread, resolved)
Docs diff under review:
  via the `index` action. In the case of an conflict, such as a document with the
- same `id`, Vector will add or _replace_ the document as necessary.
+ same `id`, Vector will add or _replace_ the document as necessary. If `bulk_action` is


Is it Vector that will add or replace or Elasticsearch?

With index, if an action entry already has an _id field specified, won't Elasticsearch perform an "update" (causing segment merges due to the implicit delete that happens)? And if an action entry does not have an _id field, won't Elasticsearch generate a unique _id value itself, creating a new document in the target index?

So with the index action there is really no notion of "conflict".

The notion of a conflict only comes with the create action and the action entry has an _id field specified. In that case, if a document exists with the same _id, Elasticsearch returns a 409 conflict, else the new document is added. If the create action is used without an _id field specified, no conflict will occur because a unique ID will be assigned by Elasticsearch and inserted.
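The distinction described above shows up directly in the `_bulk` request body, where each document is preceded by an NDJSON action line. A small sketch of building those action lines (hypothetical helper, illustrative index and id values; not Vector's actual `encode_event`):

```rust
// Builds the NDJSON action line for a `_bulk` entry. With "index", an
// existing document with the same `_id` is replaced; with "create",
// Elasticsearch returns a 409 conflict instead. Without an `_id`,
// Elasticsearch assigns one, so "create" can never conflict.
fn action_line(action: &str, index: &str, id: Option<&str>) -> String {
    match id {
        Some(id) => format!(
            r#"{{"{a}":{{"_index":"{i}","_id":"{id}"}}}}"#,
            a = action, i = index, id = id
        ),
        None => format!(r#"{{"{a}":{{"_index":"{i}"}}}}"#, a = action, i = index),
    }
}

fn main() {
    let with_id = action_line("create", "logs", Some("abc"));
    let without_id = action_line("index", "logs", None);
    assert_eq!(with_id, r#"{"create":{"_index":"logs","_id":"abc"}}"#);
    assert_eq!(without_id, r#"{"index":{"_index":"logs"}}"#);
    println!("{}\n{}", with_id, without_id);
}
```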

Contributor Author

@spencergilbert spencergilbert Dec 12, 2020


If you specify an _id and "index" a document, and there is an existing document with that _id, it should be replaced upon indexing. Copying from the ES docs:

index

    (Optional, string) Indexes the specified document. If the document exists, replaces the document and increments the version.

I believe the original documentation is accurate if we consider the "conflict" being the same as replacing an existing document.


Hopefully we can somehow communicate our definition of "conflict", which is different from how Elasticsearch uses it (Elasticsearch only mentions conflicts for "create").

And it would be great to make sure the wording here places the ownership of the behaviors taken on Elasticsearch and not on Vector (replace "Vector" with "Elasticsearch" in "Vector will add or replace the document as necessary").

Contributor Author


@binarylogic let me know your thoughts on this - I had intentionally left the original descriptions "as is" as possible.

Member


I tend to agree that "conflict" is a bad word for what happens for index actions. I'd reword this as:

Vector [batches](#buffers--batches) data and flushes it to Elasticsearch's [`_bulk` API endpoint][urls.elasticsearch_bulk]. By default, all events are inserted via the `index` action, which will update documents if an existing one has the same `id`. If `bulk_action` is configured with `create`, Elasticsearch will _not_ replace an existing document and will instead return a conflict error.

Member

@jszwedko jszwedko left a comment


This looks good to me, thanks @spencergilbert !

I agree with @binarylogic that it shouldn't close #4939, so we can think about any other changes that might improve data streams support, but this should unblock people who want to use it now.

Member

@jszwedko jszwedko left a comment


A couple of documentation notes

docs/reference/components/sinks/elasticsearch.cue (two outdated review threads, resolved)

@spencergilbert
Contributor Author

Thanks @jszwedko, I merged all your suggestions for the docs - they definitely look better 😄 How do we look now?

Member

@jszwedko jszwedko left a comment


Looks good! Thank you.

@jszwedko
Member

Noting that the test failures have been fixed on master.

@jszwedko jszwedko merged commit c62b515 into vectordotdev:master Dec 18, 2020
@jszwedko
Member

Thanks for all of your work on this @spencergilbert !

@spencergilbert spencergilbert deleted the support-data-streams branch December 18, 2020 21:23