enhancement(elasticsearch sink): add support for data streams #5126
Conversation
I'll look at the test failures later today
This seems like a reasonable implementation to me!
We'll just want to update the docs and probably add a test as you noted in discord.
You can ignore the windows test failure, we are working on that one.
@jszwedko somewhat confused by what's needed for the test failures - I might have misunderstood your comment in discord, but the compiler doesn't like adding serde(default). I can add the field to the tests/config.rs file, but that seems wrong.
@spencergilbert Ah, I meant I think you need
How does this code handle 409 conflicts when
FWIW, we try to gracefully handle the Elasticsearch responses in pyesbulk.
🤔 I didn't add any specific handling, so it would end up using the existing retry logic here: https://github.com/timberio/vector/blob/master/src/sinks/elasticsearch.rs#L289 EDIT: it wouldn't retry; looks like it'd match here: https://github.com/timberio/vector/blob/master/src/sinks/elasticsearch.rs#L310
That looks good, just might be a bit noisy for duplicates if the payload contains pre-generated IDs (`_id`). If the client is not generating their own IDs for the Elasticsearch records, then

Does the request code time out on long requests? If so, that will be another problem when the Elasticsearch instance gets overloaded.

The logic looks like it retries on server errors for any 500 error. We only retry on 500, 503, and 504 with a backoff. If the client is using
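For illustration, the pyesbulk-style policy described above (retry only on 500, 503, and 504, with a backoff, rather than on any 5xx) could be sketched as below. The function names and the backoff cap are hypothetical, not Vector's actual API:

```rust
use std::time::Duration;

// Retry only the status codes pyesbulk treats as transient server errors.
fn should_retry(status: u16) -> bool {
    matches!(status, 500 | 503 | 504)
}

// Exponential backoff between attempts, capped at 30 seconds (cap is an assumption).
fn backoff(attempt: u32) -> Duration {
    let secs = 2u64.saturating_pow(attempt).min(30);
    Duration::from_secs(secs)
}

fn main() {
    assert!(should_retry(503));
    assert!(!should_retry(409)); // duplicate-ID conflicts are not retried
    assert!(!should_retry(501)); // unlike a blanket retry-on-any-5xx policy
    println!("backoff after attempt 3: {:?}", backoff(3));
}
```

A retried bulk request containing pre-generated `_id` values would then surface per-record 409s for the duplicates while the rest of the batch is processed.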
Awesome feedback @portante, thanks! I'll bother the timber folks to see if they have a preferred handling of this; I know 0.11.x is introducing some automated backoff behaviour which should alleviate some of the overwhelming aspect.
I'm actually curious to get @binarylogic's thoughts here given his familiarity with ES.
Right now we don't have any handling of partial bulk insert failures (see #140 for discussion of improving that) so records rejected as duplicates should just be dropped as the request is not retried unless a 500-level error occurs. Assuming the 500 represents a partially processed request, the retried request should just see individual errors for the duplicate records (as I understand it) but still process the rest of the records.
In both the `index` and `create` cases there is a chance of duplicate records if `_id` is not set.
Meant to leave that review as a "Comment". We still need to update the docs and, if we could, add a test here (I think just testing `encode_event` is fine rather than a full integration test).
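A unit test along those lines might look like the following sketch. `encode_action_line` is a hypothetical, simplified stand-in for the sink's real `encode_event`; it only builds the bulk action header line, and the `_type` field is an assumption based on ES 6.x-era bulk requests:

```rust
// Hypothetical simplified encoder: builds only the bulk action header line.
fn encode_action_line(bulk_action: &str, index: &str) -> String {
    format!(
        r#"{{"{}":{{"_index":"{}","_type":"_doc"}}}}"#,
        bulk_action, index
    )
}

fn main() {
    // Default behavior uses the `index` action...
    assert_eq!(
        encode_action_line("index", "vector"),
        r#"{"index":{"_index":"vector","_type":"_doc"}}"#
    );
    // ...while data streams require `create`.
    assert_eq!(
        encode_action_line("create", "vector"),
        r#"{"create":{"_index":"vector","_type":"_doc"}}"#
    );
}
```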
Yep - If the implementation is a good enough option as is, I'll get the branch updated, tested, and documented. I'll keep an eye out in case @binarylogic has a different opinion.
I believe the integration tests would need to use a more recent version of Elasticsearch to test data streams - I can open an issue around upgrading ES if we're interested in that scenario.
@spencergilbert Yes please open that issue.
@spencergilbert Or if you want to include it in this PR that'd be fine too.
I'll keep it separate; there might be some breaking changes between 6.6.x (I think this is what integration testing is using) and the introduction of data streams at ~7.9. Besides, technically you can use `create` actions regardless of data stream usage.
I've added a test (which probably needs tweaking to not use exclude_fields), docs, and improved the default code per @jszwedko's suggestion.
Thanks, @spencergilbert!
To confirm, Vector currently does not retry partial failures since Elasticsearch returns a 201 status code. The rationale is explained here. This is not to say we don't want to, but there are open questions around how to do this properly. There are competing scenarios where inverse behavior would be desirable. That said, I think we should address this separately as its own project. I've scheduled #140 for the next sprint.
Adding a new

Finally, we should consider first-class support for data streams as proposed in elastic/logstash#12178. This PR should not close #4939; let's leave that open for discussion. I've commented on the issue with my thoughts.
@binarylogic how detailed do we want to get in Vector's docs versus linking to the Elasticsearch documentation?
Definitely link out. I just want to point out that the
> via the `index` action. In the case of a conflict, such as a document with the
> same `id`, Vector will add or _replace_ the document as necessary. If `bulk_action` is
Is it Vector that will add or replace or Elasticsearch?
With `index`, if an action entry already has an `_id` field specified, won't Elasticsearch perform an "update" (causing segment merges due to the implicit delete that happens)? And if an action entry does not have an `_id` field, won't Elasticsearch generate a unique `_id` value itself, creating a new document in the target index?

So with the `index` action there is really no notion of "conflict".

The notion of a conflict only comes with the `create` action when the action entry has an `_id` field specified. In that case, if a document exists with the same `_id`, Elasticsearch returns a 409 conflict; otherwise the new document is added. If the `create` action is used without an `_id` field specified, no conflict will occur because a unique ID will be assigned by Elasticsearch and inserted.
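To make the distinction concrete, here is a hypothetical `_bulk` request body (NDJSON: each action line followed by its document); the index name, IDs, and document contents are made up:

```json
{ "index": { "_index": "logs-app", "_id": "doc-1" } }
{ "message": "replaces doc-1 if it exists, otherwise creates it" }
{ "create": { "_index": "logs-app", "_id": "doc-1" } }
{ "message": "returns a 409 conflict if doc-1 already exists" }
{ "create": { "_index": "logs-app" } }
{ "message": "no _id, so Elasticsearch assigns one and no conflict can occur" }
```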
If you specify an `_id` and "index" a document, and there is an existing document with that `_id`, it should be replaced upon indexing. Copying from the ES docs:

> `index` (Optional, string): Indexes the specified document. If the document exists, replaces the document and increments the version.
I believe the original documentation is accurate if we consider the "conflict" being the same as replacing an existing document.
Hopefully we have somehow communicated our definition of "conflict", which is different from how Elasticsearch uses it (conflict only mentioned with "create").
And it would be great to make sure the wording here places ownership of the behavior on Elasticsearch rather than on Vector (replace "Vector" with "Elasticsearch" in "Vector will add or replace the document as necessary").
@binarylogic let me know your thoughts on this - I had intentionally kept the original descriptions as close to "as is" as possible.
I tend to agree that "conflict" is a bad word for what happens for `index` actions. I'd reword this as:
Vector [batches](#buffers--batches) data and flushes it to Elasticsearch's [`_bulk` API endpoint][urls.elasticsearch_bulk]. By default, all events are inserted via the `index` action, which will update documents if an existing one has the same `id`. If `bulk_action` is configured with `create`, Elasticsearch will _not_ replace an existing document and will instead return a conflict error.
This looks good to me, thanks @spencergilbert !
I agree with @binarylogic that it shouldn't close #4939 so we can think about any other changes that might improve data streams support, but this should unblock people who want to use it now.
A couple of documentation notes
Thanks @jszwedko, I merged all your suggestions for the docs, they definitely look better 😄 How do we look now?
Looks good! Thank you.
Noting that the test failures have been fixed on master.
Thanks for all of your work on this @spencergilbert !
Adds a configuration option to specify 'index' (default) or 'create' (for data streams)
Related: #4939
Ref fluent/fluent-bit#1670
Signed-off-by: Spencer Gilbert spencer.gilbert@gmail.com