[filebeat][GCS] - Improved documentation #41143

Merged
merged 4 commits into from Oct 8, 2024
Changes from 2 commits
1 change: 1 addition & 0 deletions CHANGELOG.next.asciidoc
@@ -315,6 +315,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
- Add support to CEL for reading host environment variables. {issue}40762[40762] {pull}40779[40779]
- Add CSV decoder to awss3 input. {pull}40896[40896]
- Change request trace logging to include headers instead of complete request. {pull}41072[41072]
- Improved GCS input documentation. {pull}41143[41143]

*Auditbeat*

14 changes: 11 additions & 3 deletions x-pack/filebeat/docs/inputs/input-gcs.asciidoc
@@ -216,14 +216,16 @@ It can be defined in the following formats: `{{x}}s`, `{{x}}m`, `{{x}}h`, here
If no value is specified, it defaults to `50 seconds`. This attribute can be specified both at the root level of the configuration as well as at the bucket level.
The bucket level values will always take priority and override the root level values if both are specified.

NOTE: The `bucket_timeout` value should depend on the size of the files and the network speed. If the timeout is too low, the input will not be able to read the file completely and `context_deadline_exceeded` errors will appear in the logs. If the timeout is too high, the input will wait a long time for a file to be read, which can make the input slow. The ratio between `bucket_timeout` and `poll_interval` should be considered when setting both values: a low `poll_interval` combined with a very high `bucket_timeout` can cause resource utilization issues, because a new schedule operation is spawned on every poll iteration and, if previous poll operations are still running, this can become a bottleneck over time.
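
A minimal sketch of how `bucket_timeout` might be set at both levels (the input `id`, `project_id`, credential path, and bucket names below are hypothetical):

["source","yaml"]
----
filebeat.inputs:
- type: gcs
  id: my-gcs-input
  project_id: my_project_id
  auth.credentials_file.path: "/path/to/credentials.json"
  bucket_timeout: 60s            # root level value, inherited by all buckets
  buckets:
  - name: gcs-logs-small         # uses the root level bucket_timeout of 60s
  - name: gcs-logs-large
    bucket_timeout: 300s         # bucket level override for larger objects
----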

[id="attrib-max_workers-gcs"]
[float]
==== `max_workers`

This attribute defines the maximum number of workers (goroutines / lightweight threads) that are allocated in the worker pool (thread pool) for processing jobs which read the contents of files. This attribute can be specified both at the root level of the configuration as well as at the bucket level. The bucket level values will always take priority and override the root level values if both are specified. A higher number of workers does not always result in greater concurrency; this value should be carefully tuned based on the number of files, the size of the files being processed, and the resources available. Increasing `max_workers` to very high values can cause resource utilization issues and can lead to a bottleneck in processing. A maximum cap of `2000` workers is usually recommended.
ShourieG marked this conversation as resolved.
Show resolved Hide resolved

NOTE: The value of `max_workers` is currently tied to `batch_size` internally. This `batch_size` determines how many objects are fetched in a single call. The `max_workers` value should be set based on the number of files to be processed and the network speed. A very low `max_workers` count will drastically increase the number of network calls required to fetch the objects, which can cause a bottleneck in processing. The `max_workers` value is tied to `batch_size` to ensure an even distribution of workloads across all goroutines, so that the input can process the files efficiently.
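
The following sketch is illustrative only, not a tuning recommendation; the worker counts, bucket names, and credential path are hypothetical:

["source","yaml"]
----
filebeat.inputs:
- type: gcs
  id: my-gcs-input
  project_id: my_project_id
  auth.credentials_file.path: "/path/to/credentials.json"
  max_workers: 10                # root level value, inherited by all buckets
  buckets:
  - name: gcs-audit-logs         # uses the root level max_workers of 10
  - name: gcs-flow-logs
    max_workers: 100             # bucket level override for a high volume bucket
----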

[id="attrib-poll-gcs"]
[float]
@@ -243,6 +245,8 @@ Example: `10s` would mean we would like the polling to occur every 10 seconds.
This attribute can be specified both at the root level of the configuration as well as at the bucket level. The bucket level values will always take priority
and override the root level values if both are specified.

NOTE: In an ideal scenario the `poll_interval` should be set to a value equal to the `bucket_timeout`. This ensures that another schedule operation is not started before the current buckets have all been processed. If the `poll_interval` is set to a value less than the `bucket_timeout`, the input will start another schedule operation before the current one has finished, which can cause a bottleneck over time. A lower `poll_interval` can make the input faster, at the cost of higher resource utilization.
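
As a sketch of the ideal scenario described above, `poll_interval` can be set equal to `bucket_timeout` (the values, bucket name, and credential path here are illustrative assumptions):

["source","yaml"]
----
filebeat.inputs:
- type: gcs
  id: my-gcs-input
  project_id: my_project_id
  auth.credentials_file.path: "/path/to/credentials.json"
  buckets:
  - name: gcs-test-bucket
    poll: true
    poll_interval: 120s          # equal to bucket_timeout, so a new schedule op
    bucket_timeout: 120s         # does not start before the previous one finishes
----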

[id="attrib-parse_json"]
[float]
==== `parse_json`
@@ -276,6 +280,8 @@ filebeat.inputs:
- regex: '/Security-Logs/'
----

NOTE: The `file_selectors` operation is performed locally within the agent, which scales vertically, so using this option causes the agent to download all the files and then filter them. This can cause a bottleneck in processing if the number of files is very high. It is recommended to use this attribute only when the number of files is limited or ample resources are available.

[id="attrib-expand_event_list_from_field-gcs"]
[float]
==== `expand_event_list_from_field`
@@ -341,6 +347,8 @@ filebeat.inputs:
timestamp_epoch: 1630444800
----

NOTE: The GCS APIs don't provide a direct way to filter files based on timestamp, so the input downloads all the files and then filters them based on the timestamp. This can cause a bottleneck in processing if the number of files is very high. It is recommended to use this attribute only when the number of files is limited or ample resources are available. This option scales vertically and not horizontally.

[id="bucket-overrides"]
*The sample configs below explain the bucket level overriding of attributes in more detail:*
