[Filebeat] aws-s3 - Document _id generation behavior #42127

101 changes: 94 additions & 7 deletions x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
@@ -37,15 +37,26 @@ the message doesn't return to the queue before processing is complete.
If an error occurs during the processing of the S3 object, the processing will
be stopped, and the SQS message will be returned to the queue for reprocessing.

[float]
=== Configuration Examples

[float]
==== SQS with JSON files

This example reads s3:ObjectCreated notifications from SQS, and assumes that
all the S3 objects have a `Content-Type` of `application/json`.
It splits the `Records` array in the JSON into separate events.

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: aws-s3
  queue_url: https://sqs.ap-southeast-1.amazonaws.com/1234/test-s3-queue
  credential_profile_name: elastic-beats
  expand_event_list_from_field: Records
----

[float]
==== S3 bucket listing

When directly polling the list of S3 objects in an S3 bucket,
the number of workers that will process the listed S3 objects must be set
@@ -64,6 +75,9 @@ Listing of the S3 bucket will be polled according the time interval defined by
expand_event_list_from_field: Records
----

[float]
==== S3-compatible services

The `aws-s3` input can also poll third-party S3-compatible services such as
MinIO. Using non-AWS S3-compatible buckets requires the use of
`access_key_id` and `secret_access_key` for authentication. To specify the S3
@@ -88,6 +102,79 @@ that require a different endpoint.
expand_event_list_from_field: Records
----

[float]
=== Document ID Generation

The `aws-s3` input prevents duplicate events in Elasticsearch by generating a
custom document `_id` for each event instead of relying on Elasticsearch to
generate one automatically. Each document in an Elasticsearch index must have a
unique `_id`, and {beatname_uc} uses this property to avoid ingesting duplicate
events.

The custom `_id` is based on several pieces of information from the S3 object:
the Last-Modified timestamp, the bucket ARN, the object key, and the byte
offset of the data in the event.
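
As a rough illustration of why such an `_id` stays the same across retries, the
following Python sketch derives an identifier from the same four inputs. The
helper name `make_document_id`, the use of SHA-256, and the output layout are
assumptions made for illustration only; they are not the exact format that
{beatname_uc} produces.

```python
import hashlib
from datetime import datetime, timezone


def make_document_id(last_modified: datetime, bucket_arn: str,
                     object_key: str, offset: int) -> str:
    """Hypothetical helper: build a deterministic ID from S3 object metadata."""
    # Hash the values that identify one version of one S3 object.
    h = hashlib.sha256()
    h.update(last_modified.astimezone(timezone.utc).isoformat().encode())
    h.update(b"\0")
    h.update(bucket_arn.encode())
    h.update(b"\0")
    h.update(object_key.encode())
    # Append the zero-padded byte offset so each event in the object is distinct.
    return f"{h.hexdigest()[:20]}-{offset:012d}"


# Example values below are made up.
modified = datetime(2024, 12, 20, 12, 0, 0, tzinfo=timezone.utc)
first = make_document_id(modified, "arn:aws:s3:::example-bucket", "logs/a.json", 0)
retry = make_document_id(modified, "arn:aws:s3:::example-bucket", "logs/a.json", 0)
second = make_document_id(modified, "arn:aws:s3:::example-bucket", "logs/a.json", 512)

print(first == retry)   # the same inputs give the same ID
print(first == second)  # a different offset gives a different ID
```

Because the identifier depends only on the object's metadata and the event's
offset, reprocessing the same object after a failure yields the same `_id` for
each event.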

Duplicate prevention is particularly useful in scenarios where {beatname_uc}
needs to retry an operation. {beatname_uc} guarantees at-least-once delivery,
meaning it will retry any failed or incomplete operations. These retries may be
triggered by issues with the host, {beatname_uc}, network connectivity, or
services such as Elasticsearch, SQS, or S3.

[float]
==== Limitations of `_id`-Based Deduplication

There are some limitations to consider when using `_id`-based deduplication in
Elasticsearch:

* Deduplication works only within a single index. The same `_id` can exist in
different indices, which matters if you're using data streams or index
aliases. After the backing index rolls over, a retried event may be ingested
again as a duplicate in the new index.

* Indexing operations in Elasticsearch may take longer when an `_id` is
specified. Elasticsearch needs to check if the ID already exists before
writing, which can increase the time required for indexing.
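
The first limitation can be pictured with a toy in-memory model, sketched below
in Python. The `ToyDataStream` class and its methods are hypothetical and only
mimic the behavior described above; this is not an Elasticsearch client and
makes no claim about how Elasticsearch is implemented.

```python
class ToyDataStream:
    """Toy in-memory model of a data stream; not an Elasticsearch client."""

    def __init__(self):
        # Each backing index maps _id -> document; writes go to the last one.
        self.backing_indices = [{}]

    def create(self, doc_id, doc):
        """Store doc unless the current write index already holds this _id."""
        write_index = self.backing_indices[-1]
        if doc_id in write_index:
            return False  # rejected as a duplicate
        write_index[doc_id] = doc
        return True

    def rollover(self):
        """Start a new, empty write index."""
        self.backing_indices.append({})

    def count(self, doc_id):
        """Count copies of doc_id across all backing indices."""
        return sum(doc_id in index for index in self.backing_indices)


stream = ToyDataStream()
event_id = "example-id-000000000000"  # made-up _id

accepted_first = stream.create(event_id, {"message": "hello"})
rejected_retry = stream.create(event_id, {"message": "hello"})
stream.rollover()
accepted_after_rollover = stream.create(event_id, {"message": "hello"})

print(accepted_first, rejected_retry, accepted_after_rollover)
print(stream.count(event_id))
```

In this model the retry is rejected while the original write index is still
active, but the same event is accepted again once a rollover has happened,
leaving two copies across the backing indices.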

[float]
==== Disabling Duplicate Prevention

If you want to disable the `_id`-based deduplication, you can remove the
document `_id` using the <<drop-fields,`drop_fields`>> processor in
{beatname_uc}.

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
- type: aws-s3
  queue_url: https://queue.amazonaws.com/80398EXAMPLE/MyQueue
  processors:
    - drop_fields:
        fields:
          - '@metadata._id'
        ignore_missing: true
----

Alternatively, you can remove the `_id` field using an Elasticsearch ingest
pipeline.

["source","json",subs="attributes"]
----
{
  "processors": [
    {
      "remove": {
        "if": "ctx.input?.type == \"aws-s3\"",
        "field": "_id",
        "ignore_missing": true
      }
    }
  ]
}
----

[float]
=== Configuration

The `aws-s3` input supports the following configuration options plus the
<<{beatname_lc}-input-{type}-common-options>> described later.

@@ -600,6 +687,9 @@ Controls whether fully processed files will be deleted from the bucket.

This option can only be used together with the backup functionality.

[id="{beatname_lc}-input-{type}-common-options"]
include::../../../../filebeat/docs/inputs/input-common-options.asciidoc[]

[float]
=== AWS Permissions

@@ -1003,6 +1093,9 @@ Will produce the following output:

|===

[id="aws-credentials-config"]
include::{libbeat-xpack-dir}/docs/aws-credentials-config.asciidoc[]

[float]
=== Metrics

@@ -1032,10 +1125,4 @@ observe the activity of the input.
| `s3_object_processing_time` | Histogram of the elapsed S3 object processing times in nanoseconds (start of download to completion of parsing).
|=======

[id="{beatname_lc}-input-{type}-common-options"]
include::../../../../filebeat/docs/inputs/input-common-options.asciidoc[]

[id="aws-credentials-config"]
include::{libbeat-xpack-dir}/docs/aws-credentials-config.asciidoc[]

:type!: