docs(specs): Add specification for output buffer persistence strategy #14928

Merged on Mar 15, 2024 (3 commits)
72 changes: 72 additions & 0 deletions docs/specs/tsd-003-output-buffer-strategy.md
@@ -0,0 +1,72 @@
# Telegraf Output Buffer Strategy

## Objective

Introduce a new agent-level config option to choose a disk buffer strategy for
output plugin metric queues.

## Overview

Currently, when a Telegraf output metric queue fills, either because metrics
arrive too quickly or because writes to the output fail (for example, due to
connection failures or rate limiting), new metrics are dropped and never
written to the output. This specification defines a set of options to make
this output queue more durable by persisting pending metrics to disk rather
than holding them only in a size-limited in-memory queue.

## Keywords

output plugins, agent configuration, persist to disk

## Agent Configuration

The configuration is at the agent level, with the following options:
- **Memory**, the current implementation, with no persistence to disk
- **Write-through**, all metrics are also written to disk using a
Write Ahead Log (WAL) file
- **Disk-overflow**, when the memory buffer fills, metrics are flushed to a
WAL file to avoid dropping overflow metrics

A further option specifies the directory in which the WAL files are stored on
disk, with a default value. These settings are global, and leaving them
unchanged selects memory-only mode, retaining current behavior.
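
The selection logic described above could be sketched in Go roughly as follows. This is only an illustration: the type `BufferStrategy`, the constant names, the string values, and `ParseBufferStrategy` are assumptions made for this example and are not defined by this specification. The one behavior taken from the text is that an unset value falls back to memory-only mode.

```go
package main

import (
	"fmt"
	"strings"
)

// BufferStrategy is a hypothetical type naming the three strategies listed
// in this spec. The string values are assumptions, not settled option names.
type BufferStrategy string

const (
	StrategyMemory       BufferStrategy = "memory"
	StrategyWriteThrough BufferStrategy = "write-through"
	StrategyDiskOverflow BufferStrategy = "disk-overflow"
)

// ParseBufferStrategy maps a config value to a strategy. An empty value
// selects memory, so an unchanged config keeps the current behavior.
func ParseBufferStrategy(s string) (BufferStrategy, error) {
	switch v := BufferStrategy(strings.ToLower(strings.TrimSpace(s))); v {
	case "":
		return StrategyMemory, nil
	case StrategyMemory, StrategyWriteThrough, StrategyDiskOverflow:
		return v, nil
	default:
		return "", fmt.Errorf("unknown buffer strategy %q", s)
	}
}

func main() {
	for _, in := range []string{"", "disk-overflow", "bogus"} {
		strategy, err := ParseBufferStrategy(in)
		fmt.Printf("%q -> %q (err: %v)\n", in, strategy, err)
	}
}
```

Rejecting unknown values rather than silently falling back is a design choice of this sketch; it keeps a typo in the config from quietly disabling persistence.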

## Metric Ordering and Tracking

Tracking metrics will be accepted either on a successful write to the output,
as is done currently, or on a write to the WAL file when the disk-overflow
option is used. Regardless of which buffer strategy is chosen, metrics will
still be written to their output in the order in which the buffer received
them.
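
One way to keep arrival order under the disk-overflow strategy is sketched below. It is a simplified model rather than the planned implementation: the `overflowBuffer` type and its methods are hypothetical, and a plain slice stands in for the WAL file. The idea it illustrates is that once any metric has overflowed, later metrics must also go to the overflow area so they cannot overtake older ones.

```go
package main

import "fmt"

// overflowBuffer is a hypothetical, simplified model of the disk-overflow
// strategy. The wal slice stands in for a WAL file on disk.
type overflowBuffer struct {
	limit  int      // capacity of the in-memory queue
	memory []string // metrics held in memory, oldest first
	wal    []string // overflowed metrics, oldest first
}

// Add never drops a metric. Once anything has overflowed, later metrics are
// also appended to the WAL so that they cannot overtake older ones.
func (b *overflowBuffer) Add(metric string) {
	if len(b.wal) == 0 && len(b.memory) < b.limit {
		b.memory = append(b.memory, metric)
		return
	}
	b.wal = append(b.wal, metric)
}

// Batch removes and returns up to n of the oldest metrics, then refills the
// in-memory queue from the front of the WAL.
func (b *overflowBuffer) Batch(n int) []string {
	if n > len(b.memory) {
		n = len(b.memory)
	}
	out := append([]string(nil), b.memory[:n]...)
	b.memory = b.memory[n:]

	free := b.limit - len(b.memory)
	if free > len(b.wal) {
		free = len(b.wal)
	}
	b.memory = append(b.memory, b.wal[:free]...)
	b.wal = b.wal[free:]
	return out
}

func main() {
	b := &overflowBuffer{limit: 2}
	for _, m := range []string{"m1", "m2", "m3", "m4", "m5"} {
		b.Add(m)
	}
	for len(b.memory) > 0 {
		fmt.Println(b.Batch(2))
	}
}
```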

## Disk Utilization and File Handling

Each output plugin has its own in-memory output buffer, and will therefore
have its own WAL file (or potentially files) for buffer persistence.
Telegraf will not make any attempt to limit the size on disk taken by these
files, beyond cleaning up WAL files for metrics that have successfully been
flushed to their output. It is the user's responsibility to ensure
these files do not entirely fill the disk, both during Telegraf uptime and
with lingering files from previous instances of the program.

Telegraf should provide a way to easily flush WAL files left over from
previous instances of the program after a crash or system failure. Telegraf
makes no guarantee that all metrics will be kept in these cases. The
mechanism may be as simple as a plugin which can read these WAL files as an
input. The file names should make their order clear to the user, so that if
the order in which metrics are written to the output is crucial, it can be
retained. This plugin should not be required for the buffer strategy to work
at all; it is a backup option for the user in the event that files linger
across multiple runs of Telegraf.
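
As an illustration of file names whose order is obvious to the user, the sketch below zero-pads a sequence number so that a plain lexical sort matches write order. The naming scheme, the `.wal` extension, and the `walFileName` helper are assumptions for this example only; the specification does not fix a format.

```go
package main

import (
	"fmt"
	"sort"
)

// walFileName builds a hypothetical WAL file name for one output plugin.
// The sequence number is zero-padded to 20 digits (the width of the largest
// uint64), so sorting names as strings also sorts them by sequence.
func walFileName(outputID string, seq uint64) string {
	return fmt.Sprintf("%s-%020d.wal", outputID, seq)
}

func main() {
	names := []string{
		walFileName("influxdb", 10),
		walFileName("influxdb", 2),
		walFileName("influxdb", 1),
	}
	sort.Strings(names) // a directory listing sorted by name gives the same order
	for _, n := range names {
		fmt.Println(n)
	}
}
```

Without the padding, a name ending in `10` would sort before one ending in `2`, which is exactly the ambiguity the text asks to avoid.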

## Is/Is-not
- Is a way to increase the durability of metrics and reduce the potential
for metrics to be dropped due to a full in-memory buffer
- Is not a way to guarantee data safety in the event of a crash or system failure
Review comment (Contributor):

Excuse me if I am imposing on an internal review; I saw the ping over at #802.

The write-through feature seems able to support improved data safety in cases of crash or system failure, as far as I have read about it above. Is the emphasis here on "guarantee"? That is, no guarantees are provided, but a design goal of the feature is to improve data safety in these cases too?

Reply (Member Author):

The goal of this spec is not to ensure data safety, but to avoid dropping metrics when the memory buffer fills. Inherently this will improve data safety, especially with write-through, but since that is not the goal of this spec, it cannot be completely guaranteed.

- Is not a way to manage file system allocation size; file space will be used
until the disk is full

## Prior art

- [Initial issue](https://github.com/influxdata/telegraf/issues/802)
- [Loose specification issue](https://github.com/influxdata/telegraf/issues/14805)