
enhancement(elasticsearch sink): add support for data streams #5126

Merged: 17 commits merged into vectordotdev:master on Dec 18, 2020
Conversation

spencergilbert
Contributor

@spencergilbert spencergilbert commented Nov 19, 2020

Adds a configuration option to specify 'index' (default) or 'create' (for data streams)

Related: #4939

Ref fluent/fluent-bit#1670

Signed-off-by: Spencer Gilbert spencer.gilbert@gmail.com
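The new option can be exercised with a sink configuration along these lines (a sketch only: the `bulk_action` name comes from the review discussion below, while the surrounding field names and values are illustrative, not taken from this PR):

```toml
[sinks.es]
  type = "elasticsearch"
  inputs = ["in"]
  host = "http://localhost:9200"
  # "index" is the default action; "create" is required when writing
  # to an Elasticsearch data stream
  bulk_action = "create"
  index = "logs-generic-default"
```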

@spencergilbert
Contributor Author

I'll look at the test failures later today

@spencergilbert spencergilbert changed the title enhancement(elasticsearch sink): add support for data streams WIP: enhancement(elasticsearch sink): add support for data streams Nov 19, 2020
@spencergilbert spencergilbert changed the title WIP: enhancement(elasticsearch sink): add support for data streams DRAFT: enhancement(elasticsearch sink): add support for data streams Nov 19, 2020
Member

@jszwedko jszwedko left a comment


This seems like a reasonable implementation to me!

We'll just want to update the docs and probably add a test, as you noted in Discord.

src/sinks/elasticsearch.rs (outdated review thread, resolved)
@jszwedko
Member

You can ignore the windows test failure, we are working on that one.

@spencergilbert
Contributor Author

spencergilbert commented Nov 20, 2020

@jszwedko somewhat confused by what's needed for the test failures - I might have misunderstood your comment in Discord, but the compiler doesn't like adding serde(default)

I can add the field to the tests/config.rs file, but that seems wrong

@jszwedko
Member

jszwedko commented Nov 20, 2020

@spencergilbert Ah, I meant I think you need #[serde(default)] on https://github.com/timberio/vector/pull/5126/files#diff-767cdc952613316aab8665fa7c9ae09a5126bde478de907bcfa3e8010e6ec31eR64 (see some of the other defaulted fields in that struct for examples).
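The `#[serde(default)]` attribute falls back to the field type's `Default` impl (or a `default = "fn"` path) when the key is absent from the config. A minimal self-contained sketch of that defaulting behavior, with hypothetical names rather than the PR's actual struct:

```rust
// Sketch: how a `bulk_action` field can default to "index" when the key is
// missing. In the PR this is achieved with serde's `#[serde(default)]` on the
// config struct field; here we show the `Default` impls it delegates to.

#[derive(Debug, Clone, Copy, PartialEq)]
enum BulkAction {
    Index,
    Create,
}

impl Default for BulkAction {
    // "index" is the sink's default action
    fn default() -> Self {
        BulkAction::Index
    }
}

#[derive(Debug, Default)]
struct ElasticSearchConfigSketch {
    bulk_action: BulkAction,
}

fn main() {
    // A config deserialized without the `bulk_action` key would get this.
    let cfg = ElasticSearchConfigSketch::default();
    assert_eq!(cfg.bulk_action, BulkAction::Index);
    println!("default bulk_action = {:?}", cfg.bulk_action);
}
```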

@spencergilbert spencergilbert changed the title DRAFT: enhancement(elasticsearch sink): add support for data streams enhancement(elasticsearch sink): add support for data streams Nov 20, 2020
@portante

How does this code handle 409 conflicts when create is used and the object is already present?

@portante

FWIW, we try to gracefully handle the Elasticsearch responses in pyesbulk.

@spencergilbert
Contributor Author

spencergilbert commented Nov 20, 2020

How does this code handle 409 conflicts when create is used and the object is already present?

🤔 I didn't add any specific handling so it would end up using the existing retry logic here - https://github.com/timberio/vector/blob/master/src/sinks/elasticsearch.rs#L289

EDIT: it wouldn't retry, looks like it'd match here? https://github.com/timberio/vector/blob/master/src/sinks/elasticsearch.rs#L310

@portante

https://github.com/timberio/vector/blob/master/src/sinks/elasticsearch.rs#L310

That looks good, just might be a bit noisy for duplicates, if the payload contains pre-generated IDs ('_id').

If the client is not generating their own IDs for the Elasticsearch records, then create will not get 409s, and depending on how efficient Vector ends up being at sending _bulk requests, it could generate many duplicates in certain cases.

Does the request code time out on long requests? If so, that will be another problem when the Elasticsearch instance gets overloaded.

The logic looks like it retries on any 5xx server error. We only retry on 500, 503, and 504 with a backoff.

If the client is using index, or using create without generating their own IDs, then retries on 5xx errors can overwhelm an Elasticsearch instance.
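The retry policy sketched above can be made concrete as a status-code classifier. This is a hypothetical illustration of the behavior being discussed, not Vector's actual code (which lives in src/sinks/elasticsearch.rs): retry the transient 5xx statuses with backoff, and treat a 409 conflict from `create` as non-retriable, since retrying would just hit the same conflict.

```rust
// Sketch of the retry classification discussed above (hypothetical names).

#[derive(Debug, PartialEq)]
enum RetryAction {
    Retry,
    DontRetry,
}

fn classify_status(status: u16) -> RetryAction {
    match status {
        // Transient server-side failures: retry with backoff.
        500 | 503 | 504 => RetryAction::Retry,
        // 409: a document with this `_id` already exists; a retry would
        // produce the same conflict, so drop the record instead.
        409 => RetryAction::DontRetry,
        // Other statuses (including remaining 4xx) are not retried here.
        _ => RetryAction::DontRetry,
    }
}

fn main() {
    assert_eq!(classify_status(503), RetryAction::Retry);
    assert_eq!(classify_status(409), RetryAction::DontRetry);
    assert_eq!(classify_status(400), RetryAction::DontRetry);
    println!("classification ok");
}
```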

@spencergilbert
Contributor Author


Awesome feedback @portante, thanks! I'll bother the timber folks to see if they have a preferred handling of this, I know 0.11.x is introducing some automated backoff behaviour which should alleviate some of the overwhelming aspect

@jamtur01 added the labels `provider: elastic` (anything `elastic` service provider related) and `sink: elasticsearch` (anything `elasticsearch` sink related) on Dec 3, 2020
Member

@jszwedko jszwedko left a comment


I'm actually curious to get @binarylogic's thoughts here given his familiarity with ES.

Right now we don't have any handling of partial bulk insert failures (see #140 for discussion of improving that) so records rejected as duplicates should just be dropped as the request is not retried unless a 500-level error occurs. Assuming the 500 represents a partially processed request, the retried request should just see individual errors for the duplicate records (as I understand it) but still process the rest of the records.

In both the index and create case there is a chance for there to be duplicate records if _id is not set.

src/sinks/elasticsearch.rs (outdated review thread, resolved)
Member

@jszwedko jszwedko left a comment


Meant to leave that review as a "Comment". We still need to update the docs and, if we could, add a test here (I think just testing encode_event is fine rather than a full integration test).

@spencergilbert
Contributor Author

Meant to leave that review as a "Comment". We still need to update the docs and, if we could, add a test here (I think just testing encode_event is fine rather than a full integration test).

Yep - If the implementation is a good enough option as is, I'll get the branch updated, tested, and documented. I'll keep an eye out in case @binarylogic has a different opinion

@spencergilbert
Contributor Author

Meant to leave that review as a "Comment". We still need to update the docs and, if we could, add a test here (I think just testing encode_event is fine rather than a full integration test).

I believe the integration tests would need to use a more recent version of Elasticsearch to test data streams - I can open an issue around upgrading ES, if we're interested in that scenario.

@jamtur01
Contributor

jamtur01 commented Dec 5, 2020

@spencergilbert Yes please open that issue.

@jamtur01
Contributor

jamtur01 commented Dec 5, 2020

@spencergilbert Or if you want to include in this PR that'd be fine too.

@spencergilbert
Contributor Author

@spencergilbert Or if you want to include in this PR that'd be fine too.

I'll keep it separate - there might be some breaking changes between 6.6.x (I think this is what integration testing is using) and the introduction of data streams at ~7.9. Besides, technically you can use create actions regardless of data stream usage.

@spencergilbert
Contributor Author

I've added a test (which probably needs tweaking to not use exclude_fields), docs, and improved the default code per @jszwedko's suggestion

@binarylogic
Contributor

Thanks, @spencergilbert!

How does this code handle 409 conflicts when create is used and the object is already present?

To confirm, Vector currently does not retry partial failures since Elasticsearch returns a 201 status code. The rationale is explained here. This is not to say we don't want to, but there are open questions around how to do this properly. There are competing scenarios where the inverse behavior would be desirable. That said, I think we should address this separately as its own project. I've scheduled #140 for the next sprint.

Yep - If the implementation is a good enough option as is, I'll get the branch updated, tested, and documented. I'll keep an eye out in case @binarylogic has a different opinion

Adding a new bulk_action option is a great first step, and as @jszwedko pointed out, we will need to improve the documentation so users understand the behavioral differences.


Finally, we should consider first-class support for data streams as proposed in elastic/logstash#12178. This PR should not close #4939; let's leave that open for further discussion. I've commented on the issue with my thoughts.

@spencergilbert
Contributor Author

Adding a new bulk_action option is a great first step, and as @jszwedko pointed out, we will need to improve the documentation so users understand the behavioral differences.

@binarylogic how much detail do we want in Vector's docs versus linking to the Elasticsearch documentation?

@binarylogic
Contributor

Definitely link out. I just want to point out that the create action must be used with data streams. I would add something to the requirements attribute about that, as well as a how_it_works section with a "Data streams" title.

@spencergilbert
Contributor Author

spencergilbert commented Dec 11, 2020

Definitely link out. I just want to point out that the create action must be used with data streams. I would add something to the requirements attribute about that, as well as a how_it_works section with a "Data streams" title.

Are requirements being displayed somewhere today? I checked the Clickhouse sink and WASM transform that have requirements specified in their cue file, but don't see it on the doc site? Wanted to check against existing examples before writing something up. EDIT: chatted with @lucperkins about this. Will add a quick note about the requirement.

Adds a configuration option to specify 'index' (default) or 'create' (for data streams)

Signed-off-by: Spencer Gilbert <spencer.gilbert@gmail.com>
docs/reference/components/sinks/elasticsearch.cue (outdated review thread, resolved)
Docs diff under review:
  via the `index` action. In the case of an conflict, such as a document with the
- same `id`, Vector will add or _replace_ the document as necessary.
+ same `id`, Vector will add or _replace_ the document as necessary. If `bulk_action` is


Is it Vector that will add or replace or Elasticsearch?

With index, if an action entry already has an _id field specified, won't Elasticsearch perform an "update" (causing segment merges due to the implicit delete that happens)? And if an action entry does not have an _id field, won't Elasticsearch generate a unique _id value itself, creating a new document in the target index?

So with the index action there is really no notion of "conflict".

The notion of a conflict only comes with the create action and the action entry has an _id field specified. In that case, if a document exists with the same _id, Elasticsearch returns a 409 conflict, else the new document is added. If the create action is used without an _id field specified, no conflict will occur because a unique ID will be assigned by Elasticsearch and inserted.
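The distinction described above shows up directly in the `_bulk` request body, where each document is preceded by an NDJSON action line. A small sketch of building those action lines (hypothetical helper, illustrative index and id values; not Vector's actual `encode_event`):

```rust
// Builds the NDJSON action line for a `_bulk` entry. With "index", an
// existing document with the same `_id` is replaced; with "create",
// Elasticsearch returns a 409 conflict instead. Without an `_id`,
// Elasticsearch assigns one, so "create" can never conflict.
fn action_line(action: &str, index: &str, id: Option<&str>) -> String {
    match id {
        Some(id) => format!(
            r#"{{"{a}":{{"_index":"{i}","_id":"{id}"}}}}"#,
            a = action, i = index, id = id
        ),
        None => format!(r#"{{"{a}":{{"_index":"{i}"}}}}"#, a = action, i = index),
    }
}

fn main() {
    let with_id = action_line("create", "logs", Some("abc"));
    let without_id = action_line("index", "logs", None);
    assert_eq!(with_id, r#"{"create":{"_index":"logs","_id":"abc"}}"#);
    assert_eq!(without_id, r#"{"index":{"_index":"logs"}}"#);
    println!("{}\n{}", with_id, without_id);
}
```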

Contributor Author

@spencergilbert spencergilbert Dec 12, 2020


If you specify an _id and "index" a document, and there is an existing document with that _id, it should be replaced upon indexing. Copying from the ES docs:

index

    (Optional, string) Indexes the specified document. If the document exists, replaces the document and increments the version.

I believe the original documentation is accurate if we consider the "conflict" being the same as replacing an existing document.


Hopefully we can somehow communicate our definition of "conflict", which is different from how Elasticsearch uses it (Elasticsearch only mentions conflicts for "create").

And it would be great to make sure the wording here places the ownership of the behaviors taken on Elasticsearch and not on Vector (replace "Vector" with "Elasticsearch" in "Vector will add or replace the document as necessary").

Contributor Author


@binarylogic let me know your thoughts on this - I had intentionally left the original descriptions "as is" as possible.

Member


I tend to agree that "conflict" is a bad word for what happens for index actions. I'd reword this as:

Vector [batches](#buffers--batches) data and flushes it to Elasticsearch's [`_bulk` API endpoint][urls.elasticsearch_bulk]. By default, all events are inserted via the `index` action, which will update documents if an existing one has the same `id`. If `bulk_action` is configured with `create`, Elasticsearch will _not_ replace an existing document and will instead return a conflict error.

Member

@jszwedko jszwedko left a comment


This looks good to me, thanks @spencergilbert !

I agree with @binarylogic that it shouldn't close #4939, so we can think about any other changes that might improve data streams support, but this should unblock people who want to use it now.

Member

@jszwedko jszwedko left a comment


A couple of documentation notes

docs/reference/components/sinks/elasticsearch.cue (two outdated review threads, resolved)

@spencergilbert
Contributor Author

Thanks @jszwedko, I merged all your suggestions for the docs - they definitely look better 😄 How do we look now?

Member

@jszwedko jszwedko left a comment


Looks good! Thank you.

@jszwedko
Member

Noting that the test failures have been fixed on master.

@jszwedko jszwedko merged commit c62b515 into vectordotdev:master Dec 18, 2020
@jszwedko
Member

Thanks for all of your work on this @spencergilbert !

@spencergilbert spencergilbert deleted the support-data-streams branch December 18, 2020 21:23