From 67f4e6273c2c62b72edd4d5675e559f4cd464b91 Mon Sep 17 00:00:00 2001
From: Dane Strandboge <136023093+DStrand1@users.noreply.github.com>
Date: Mon, 4 Mar 2024 11:43:10 -0600
Subject: [PATCH 1/3] docs(specs): Add specification for output buffer persistence strategy

---
 docs/specs/tsd-003-output-buffer-strategy.md | 72 ++++++++++++++++++++
 1 file changed, 72 insertions(+)
 create mode 100644 docs/specs/tsd-003-output-buffer-strategy.md

diff --git a/docs/specs/tsd-003-output-buffer-strategy.md b/docs/specs/tsd-003-output-buffer-strategy.md
new file mode 100644
index 0000000000000..e9ed3a8d7d6d8
--- /dev/null
+++ b/docs/specs/tsd-003-output-buffer-strategy.md
@@ -0,0 +1,72 @@
+# Telegraf Output Buffer Strategy
+
+## Objective
+
+Introduce a new agent-level config option to choose a disk buffer strategy for
+output plugin metric queues.
+
+## Overview
+
+Currently when a Telegraf output metric queue fills, either due to incoming
+metrics being too fast or various issues with writing to the output, such as
+connection failures or rate limiting, new metrics are dropped and never
+written to the output. This specification defines a set of options to make
+this output queue more durable by persisting pending metrics to disk rather
+than only an in-memory limited size queue.
+
+## Keywords
+
+output plugins, agent configuration, persist to disk
+
+## Agent Configuration
+
+The configuration is at the agent-level, with options for:
+- **Memory**, the current implementation, with no persistance to disk
+- **Write-through**, all metrics are also written to disk using a
+  Write Ahead Log (WAL) file
+- **Disk-overflow**, when the memory buffer fills, metrics are flushed to a
+  WAL file to avoid dropping overflow metrics
+
+As well as an option to specify a directory to store the WAL files on disk,
+with a default value. These configurations are global, and no change means
+memory only mode, retaining current behavior.
+
+## Metric Ordering and Tracking
+
+Tracking metrics will be accepted either on a successful write to the output
+source like currently, or on write to the WAL file in the case of the
+disk-overflow option. Metrics will be written to their appropriate output in
+the order they are received in the buffer still no matter which buffer
+strategy is chosen.
+
+## Disk Utilization and File Handling
+
+Each output plugin has its own in-memory output buffer, and therefore will
+each have their own WAL file (or potentially files) for buffer persistence.
+Telegraf will not make any attempt to limit the size on disk taken by these
+files, beyond cleaning up WAL files for metrics that have successfully been
+flushed to their output source. It is the user's responsibility to ensure
+these files do not entirely fill the disk, both during Telegraf uptime and
+with lingering files from previous instances of the program.
+
+Telegraf should provide a way to easily flush WAL files from previous
+instances of the program in the event that a crash or system failure
+happens. Telegraf makes no guarantee that in these cases, all metrics will
+be kept. This may be as simple as a plugin which can read these WAL files
+as an input. The file names should be clear to the user what order they are
+in so that if metric order for writing to output is crucial, it can be
+retained. This plugin should not be required for use to allow the buffer
+strategy to work at all, but as a backup option for the user in the event
+that files linger across multiple runs of Telegraf.
+
+## Is/Is-not
+- Is a way to increase the durability of metrics and reduce the potential
+  for metrics to be dropped due to a full in-memory buffer
+- Is not a way to guarantee data safety in the event of a crash or system failure
+- Is not a way to manage file system allocation size, file space will be used
+  until the disk is full
+
+## Prior art
+
+[Initial issue](https://github.com/influxdata/telegraf/issues/802)
+[Loose specification issue](https://github.com/influxdata/telegraf/issues/14805)

From 40f5852482b1d9560256f56caa0ce60148e928cb Mon Sep 17 00:00:00 2001
From: Dane Strandboge <136023093+DStrand1@users.noreply.github.com>
Date: Thu, 14 Mar 2024 11:16:15 -0500
Subject: [PATCH 2/3] chore: rename to tsd-005 and fix linter issues

---
 ...put-buffer-strategy.md => tsd-005-output-buffer-strategy.md} | 2 ++
 1 file changed, 2 insertions(+)
 rename docs/specs/{tsd-003-output-buffer-strategy.md => tsd-005-output-buffer-strategy.md} (99%)

diff --git a/docs/specs/tsd-003-output-buffer-strategy.md b/docs/specs/tsd-005-output-buffer-strategy.md
similarity index 99%
rename from docs/specs/tsd-003-output-buffer-strategy.md
rename to docs/specs/tsd-005-output-buffer-strategy.md
index e9ed3a8d7d6d8..9ac184c5f83b3 100644
--- a/docs/specs/tsd-003-output-buffer-strategy.md
+++ b/docs/specs/tsd-005-output-buffer-strategy.md
@@ -21,6 +21,7 @@ output plugins, agent configuration, persist to disk
 ## Agent Configuration
 
 The configuration is at the agent-level, with options for:
+
 - **Memory**, the current implementation, with no persistance to disk
 - **Write-through**, all metrics are also written to disk using a
   Write Ahead Log (WAL) file
@@ -60,6 +61,7 @@ strategy to work at all, but as a backup option for the user in the event
 that files linger across multiple runs of Telegraf.
 
 ## Is/Is-not
+
 - Is a way to increase the durability of metrics and reduce the potential
   for metrics to be dropped due to a full in-memory buffer
 - Is not a way to guarantee data safety in the event of a crash or system failure

From ec6a0df488e0e95ac6d2c241074eaa13f2214a49 Mon Sep 17 00:00:00 2001
From: Dane Strandboge <136023093+DStrand1@users.noreply.github.com>
Date: Thu, 14 Mar 2024 11:30:57 -0500
Subject: [PATCH 3/3] chore: reviews

---
 docs/specs/tsd-005-output-buffer-strategy.md | 52 +++++++++++---------
 1 file changed, 28 insertions(+), 24 deletions(-)

diff --git a/docs/specs/tsd-005-output-buffer-strategy.md b/docs/specs/tsd-005-output-buffer-strategy.md
index 9ac184c5f83b3..c2307d2a10a78 100644
--- a/docs/specs/tsd-005-output-buffer-strategy.md
+++ b/docs/specs/tsd-005-output-buffer-strategy.md
@@ -7,12 +7,11 @@ output plugin metric queues.
 
 ## Overview
 
-Currently when a Telegraf output metric queue fills, either due to incoming
-metrics being too fast or various issues with writing to the output, such as
-connection failures or rate limiting, new metrics are dropped and never
-written to the output. This specification defines a set of options to make
-this output queue more durable by persisting pending metrics to disk rather
-than only an in-memory limited size queue.
+Currently, when a Telegraf output metric queue fills, either due to incoming
+metrics being too fast or various issues with writing to the output, new
+metrics are dropped and never written to the output. This specification
+defines a set of options to make this output queue more durable by persisting
+pending metrics to disk rather than only an in-memory limited size queue.
 
 ## Keywords
 
@@ -22,7 +21,7 @@ output plugins, agent configuration, persist to disk
 
 The configuration is at the agent-level, with options for:
 
-- **Memory**, the current implementation, with no persistance to disk
+- **Memory**, the current implementation, with no persistence to disk
 - **Write-through**, all metrics are also written to disk using a
   Write Ahead Log (WAL) file
 - **Disk-overflow**, when the memory buffer fills, metrics are flushed to a
@@ -35,35 +34,40 @@ memory only mode, retaining current behavior.
 ## Metric Ordering and Tracking
 
 Tracking metrics will be accepted either on a successful write to the output
-source like currently, or on write to the WAL file in the case of the
-disk-overflow option. Metrics will be written to their appropriate output in
-the order they are received in the buffer still no matter which buffer
-strategy is chosen.
+source like currently, or on write to the WAL file. Metrics will be written
+to their appropriate output in the order they are received in the buffer
+regardless of which buffer strategy is chosen.
 
 ## Disk Utilization and File Handling
 
 Each output plugin has its own in-memory output buffer, and therefore will
-each have their own WAL file (or potentially files) for buffer persistence.
+each have their own WAL file for buffer persistence. This file may not exist
+if Telegraf is successfully able to write all of its metrics without filling
+the in-memory buffer in disk-overflow mode, or not at all in memory mode.
+Telegraf should use one file per output plugin, and remove entries from the
+WAL file as they are written to the output.
+
 Telegraf will not make any attempt to limit the size on disk taken by these
-files, beyond cleaning up WAL files for metrics that have successfully been
+files beyond cleaning up WAL files for metrics that have successfully been
 flushed to their output source. It is the user's responsibility to ensure
 these files do not entirely fill the disk, both during Telegraf uptime and
 with lingering files from previous instances of the program.
 
-Telegraf should provide a way to easily flush WAL files from previous
-instances of the program in the event that a crash or system failure
-happens. Telegraf makes no guarantee that in these cases, all metrics will
-be kept. This may be as simple as a plugin which can read these WAL files
-as an input. The file names should be clear to the user what order they are
-in so that if metric order for writing to output is crucial, it can be
-retained. This plugin should not be required for use to allow the buffer
-strategy to work at all, but as a backup option for the user in the event
-that files linger across multiple runs of Telegraf.
+If WAL files exist for an output plugin from previous instances of Telegraf,
+they will be picked up and flushed before any new metrics that are written
+to the output. This is to ensure that these metrics are not lost, and to
+ensure that output write order remains consistent.
+
+Telegraf must additionally provide a way to manually flush WAL files via
+some separate plugin or similar. This could be used as a way to ensure that
+WAL files are properly written in the event that the output plugin changes
+and the WAL file is unable to be detected by a new instance of Telegraf.
+This plugin should not be required for use to allow the buffer strategy to
+work.
 
 ## Is/Is-not
 
-- Is a way to increase the durability of metrics and reduce the potential
-  for metrics to be dropped due to a full in-memory buffer
+- Is a way to prevent metrics from being dropped due to a full memory buffer
 - Is not a way to guarantee data safety in the event of a crash or system failure
 - Is not a way to manage file system allocation size, file space will be used
   until the disk is full
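
Review note: the disk-overflow strategy and the ordering guarantee described in the spec can be illustrated with a small Go sketch. It is only an illustration under stated assumptions, not Telegraf's implementation: `Metric`, `OverflowBuffer`, `NewOverflowBuffer`, `Add`, `Take`, and `Pending` are hypothetical names, and a plain slice stands in for the per-output WAL file on disk. The sketch shows one way to keep output order equal to arrival order: once anything has spilled to the WAL, new metrics keep going to the WAL until it has drained back into memory.

```go
package main

// Metric is a hypothetical stand-in for a Telegraf metric.
type Metric string

// OverflowBuffer sketches the "disk-overflow" strategy for one output plugin.
// All names here are assumptions made for illustration only.
type OverflowBuffer struct {
	capacity int      // maximum number of metrics held in memory (should be > 0)
	mem      []Metric // in-memory queue, always holds the oldest pending metrics
	wal      []Metric // stand-in for the on-disk WAL file, holds newer metrics
}

// NewOverflowBuffer returns an empty buffer with the given memory capacity.
func NewOverflowBuffer(capacity int) *OverflowBuffer {
	return &OverflowBuffer{capacity: capacity}
}

// Add queues a metric. It goes to memory only while memory has room and the
// WAL is empty; otherwise it is appended to the WAL so that nothing is
// dropped and newer metrics never overtake older ones.
func (b *OverflowBuffer) Add(m Metric) {
	if len(b.wal) == 0 && len(b.mem) < b.capacity {
		b.mem = append(b.mem, m)
		return
	}
	b.wal = append(b.wal, m)
}

// Take removes and returns up to n of the oldest metrics, then refills the
// memory queue from the front of the WAL, mirroring "remove entries from the
// WAL file as they are written to the output".
func (b *OverflowBuffer) Take(n int) []Metric {
	if n < 0 {
		n = 0
	}
	if n > len(b.mem) {
		n = len(b.mem)
	}
	// Copy the batch so later appends cannot alias the returned slice.
	out := append([]Metric(nil), b.mem[:n]...)
	b.mem = b.mem[n:]

	// Move as many WAL entries as fit back into memory, oldest first.
	free := b.capacity - len(b.mem)
	if free > len(b.wal) {
		free = len(b.wal)
	}
	b.mem = append(b.mem, b.wal[:free]...)
	b.wal = b.wal[free:]
	return out
}

// Pending reports how many metrics are waiting in memory and in the WAL.
func (b *OverflowBuffer) Pending() int {
	return len(b.mem) + len(b.wal)
}
```

With a capacity of 2, adding `a`, `b`, `c`, `d` leaves `a` and `b` in memory and `c` and `d` in the WAL stand-in; successive `Take` calls then return the metrics in arrival order. A real implementation would also have to handle file I/O errors, WAL files left over from a previous run, and tracking-metric acceptance, none of which this sketch attempts.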