[libbeat] Add configurable exponential backoff for disk queue write errors #21493
Pinging @elastic/integrations (Team:Integrations)
LGTM.
💔 Build Failed
…rrors (elastic#21493) (cherry picked from commit b0236ee)
…-matches-found * upstream/master: (21 commits) Skip filestream flaky tests (elastic#21490) Ignore unsupported metrics in the azure module (elastic#21486) Do not run symlink tests on Windows (elastic#21472) Map `cloud.account.id` to azure sub id (elastic#21483) Add support for app_state metricset (elastic#20639) Include original error when metricbeat fails to connect with Kafka (elastic#21484) Prompt only when agent is already enrolled (elastic#21473) Fix leftover delpoyment example (elastic#21474) Bump version to ECS 1.6 in modules without ECS updates (elastic#21455) Clarify input type configuration options (elastic#19284) Increase index pattern size check to 10MiB (elastic#21487) Migrate S3 Input to Filebeat Input V2 (elastic#20005) [libbeat] Add configurable exponential backoff for disk queue write errors (elastic#21493) Revert "Revert "[JJBB] Set shallow cloning to 10 (elastic#21409)" (elastic#21447)" (elastic#21467) Fix format of debug messages in tlscommon (elastic#21482) [CI] Change x-pack/auditbeat build events (comments, labels) (elastic#21463) [CI] changeset from elastic#20603 was not added to CI2.0 (elastic#21464) Add new log file reader for filestream input (elastic#21450) [CI] Send slack message with build status (elastic#21428) Remove duplicated sources url in dependencies report (elastic#21462) ...
* upstream/master: (26 commits) [Ingest Manager] Send updating state (elastic#21461) [Filebeat][New Fileset] Cisco Umbrella support (elastic#21504) [Ingest Manager] Download asc from artifact store specified in spec (elastic#21488) Implementation of fileProspector (elastic#21479) [Metricbeat] Add latency config option into aws module (elastic#20875) Skip filestream flaky tests (elastic#21490) Ignore unsupported metrics in the azure module (elastic#21486) Do not run symlink tests on Windows (elastic#21472) Map `cloud.account.id` to azure sub id (elastic#21483) Add support for app_state metricset (elastic#20639) Include original error when metricbeat fails to connect with Kafka (elastic#21484) Prompt only when agent is already enrolled (elastic#21473) Fix leftover delpoyment example (elastic#21474) Bump version to ECS 1.6 in modules without ECS updates (elastic#21455) Clarify input type configuration options (elastic#19284) Increase index pattern size check to 10MiB (elastic#21487) Migrate S3 Input to Filebeat Input V2 (elastic#20005) [libbeat] Add configurable exponential backoff for disk queue write errors (elastic#21493) Revert "Revert "[JJBB] Set shallow cloning to 10 (elastic#21409)" (elastic#21447)" (elastic#21467) Fix format of debug messages in tlscommon (elastic#21482) ...
What does this PR do?
This PR adds user-configurable fields `retry_interval` and `max_retry_interval` to the disk queue, and uses them to perform exponential backoff when encountering fatal errors writing to disk.

I'm aware that there are some existing helper wrappers for this functionality, e.g. `ExpBackoff` in `libbeat/common/backoff`. Unfortunately they didn't fit the cancellation or error handling model in the queue, so the backoff here is done "by hand." I've tried to restrict the moving parts to self-contained helper functions.

- I have made corresponding changes to the documentation
- I have made corresponding changes to the default configuration files
- I have added tests that prove my fix is effective or that my feature works
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

How to test this PR locally
Enable the disk queue (with e.g. `queue.disk.max_size: 1GB` in the beat config) and start the beat. While it's running, remove write permissions to `data/diskqueue`. This should log errors for the writer and deleter (if applicable).

By default, the interval between such errors should start at 1 second and double after each failure, up to a maximum of 30 seconds. This default can be changed by setting `queue.disk.retry_interval` and `queue.disk.max_retry_interval`.