Data loss in the remote write when the remote endpoint has an outage #400
We had a relatively short Cortex outage (10m) on the write path, during which the server returned 5xx errors, and we noticed missing samples while the agent was catching up once the outage had been resolved.
Suspected root cause
After some investigation and discussion with @rfratto, it looks like the cause is the aggressive checkpointing done by the agent.
The agent tries to truncate the WAL every `wal_truncate_frequency` period (defaults to 1 minute): https://github.com/grafana/agent/blob/master/pkg/prom/instance/instance.go#L658
`Storage.Truncate()` starts a new WAL segment and then creates a new checkpoint if the number of WAL segments is >= 3 (since a segment is created every 1m by default, there will be at least 3 segments every 2 minutes): https://github.com/grafana/agent/blob/master/pkg/prom/wal/wal.go#L414
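To make that cadence concrete, here is a minimal, self-contained Go sketch of the truncate/checkpoint loop. It is not the agent's actual code: `truncateFrequency`, `segmentsBeforeCheckpoint`, and the "keep the newest two segments" split are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// wal_truncate_frequency defaults to 1 minute; shortened here so the
	// sketch runs in a few seconds.
	truncateFrequency        = time.Second
	segmentsBeforeCheckpoint = 3
)

func main() {
	segments := []int{0}   // segment IDs still under wal/
	var checkpointed []int // segment IDs moved under wal/checkpoint.xxxxx/
	next := 1

	ticker := time.NewTicker(truncateFrequency)
	defer ticker.Stop()

	for i := 0; i < 6; i++ {
		<-ticker.C

		// Storage.Truncate() first cuts a new segment...
		segments = append(segments, next)
		next++

		// ...then checkpoints once "enough" segments have accumulated.
		// The "keep the newest two" split is an assumption for this sketch.
		if len(segments) >= segmentsBeforeCheckpoint {
			moved := append([]int(nil), segments[:len(segments)-2]...)
			checkpointed = append(checkpointed, moved...)
			segments = append([]int(nil), segments[len(segments)-2:]...)
			fmt.Printf("tick %d: moved %v into the checkpoint, left in wal/: %v\n", i, moved, segments)
		} else {
			fmt.Printf("tick %d: segments in wal/: %v\n", i, segments)
		}
	}

	fmt.Println("segments no longer replayable by remote write:", checkpointed)
}
```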
`wal.Checkpoint()` moves older segments to the checkpoint. This means that segments are removed from the `wal/` directory and moved inside the `wal/checkpoint.xxxxx/` directory.

The remote write replays samples only from the WAL segments (not the checkpoint, because `readSegment()` is called with `tail=false`), so when an outage occurs on the remote endpoint (e.g. Cortex), all samples contained in segments which have been moved to the checkpoint will be skipped (not remote-written) once the remote endpoint recovers from the outage.
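The following is a drastically simplified model of that asymmetry, not the real `wal.Watcher`/`readSegment()` code (`Record` and the `send` callback are made up for the example): series records are always forwarded, while samples are forwarded only when read with `tail=true`, which is never the case for the checkpoint.

```go
package main

import "fmt"

// Record is a stand-in for a WAL record: either a series record or a sample.
type Record struct {
	IsSeries  bool
	Timestamp int64 // only meaningful for sample records
}

// readSegment forwards series records unconditionally, but forwards samples
// only when tail is true. The checkpoint is read with tail=false, so its
// samples never reach the remote-write queue.
func readSegment(records []Record, tail bool, send func(Record)) {
	for _, r := range records {
		if r.IsSeries || tail {
			send(r)
		}
	}
}

func main() {
	checkpoint := []Record{{IsSeries: true}, {Timestamp: 1000}, {Timestamp: 2000}}
	segment := []Record{{Timestamp: 3000}, {Timestamp: 4000}}

	var sent []int64
	send := func(r Record) {
		if !r.IsSeries {
			sent = append(sent, r.Timestamp)
		}
	}

	readSegment(checkpoint, false, send) // checkpoint: series only, samples dropped
	readSegment(segment, true, send)     // live segment: samples forwarded

	// Prints [3000 4000]: the samples at 1000 and 2000 are never re-sent.
	fmt.Println("samples actually remote-written:", sent)
}
```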
Reproduced in the local env
I set up a local Cortex cluster and simulated the outage we had in production, and I've verified that the agent effectively loses data once Cortex gets back online. To easily show it, I added a log to the Cortex distributor to print the min/max timestamp of the samples within each remote write, simulated the outage, and observed how the agent behaves:
https://gist.github.com/pracucci/b83c192d253e730b2cf59adeb0fc9e50
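For reference, the distributor-side check amounts to something like the sketch below. This is not the actual Cortex patch (the distributor operates on its own request type), but the idea is the same as computing min/max over a Prometheus `prompb.WriteRequest`.

```go
package main

import (
	"fmt"
	"math"

	"github.com/prometheus/prometheus/prompb"
)

// minMaxTimestamps returns the smallest and largest sample timestamp
// (milliseconds since epoch) contained in one remote-write request.
func minMaxTimestamps(req *prompb.WriteRequest) (minTS, maxTS int64) {
	minTS, maxTS = math.MaxInt64, math.MinInt64
	for _, ts := range req.Timeseries {
		for _, s := range ts.Samples {
			if s.Timestamp < minTS {
				minTS = s.Timestamp
			}
			if s.Timestamp > maxTS {
				maxTS = s.Timestamp
			}
		}
	}
	return minTS, maxTS
}

func main() {
	req := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{
			{Samples: []prompb.Sample{{Timestamp: 1600000000000}, {Timestamp: 1600000060000}}},
		},
	}
	lo, hi := minMaxTimestamps(req)
	fmt.Printf("remote write: min=%d max=%d\n", lo, hi)
}
```

Gaps between the max timestamp of one request and the min timestamp of the next make the skipped samples visible.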
Once Cortex is back online, the TSDB `wal.Watcher` (which was paused because all remote-write shard queues were full due to the outage) continues to replay the WAL and fails as soon as it tries to read from the current segment, which has already been moved to the checkpoint.

This leads the `wal.Watcher` to replay the WAL again. It replays the checkpoint first (but it skips samples there; it only reads series) and then starts replaying segments. The samples which have been moved to the checkpoint are skipped, and this leads to the data loss.
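Putting the pieces together, here is a toy end-to-end timeline of that sequence (all numbers are made up; samples are identified by 1..totalSamples in append order). Everything between the last sample shipped before the outage and the checkpoint boundary is never remote-written.

```go
package main

import "fmt"

func main() {
	const totalSamples = 12
	const sentBeforeOutage = 4 // the watcher had shipped samples 1..4 before the shard queues filled up
	const checkpointedUpTo = 8 // during the outage, truncation moved samples 1..8 into the checkpoint

	delivered := map[int]bool{}
	for s := 1; s <= sentBeforeOutage; s++ {
		delivered[s] = true
	}

	// After the outage: the watcher errors on the moved segment and replays
	// from scratch. Checkpoint replay forwards series only, so sample
	// shipping effectively resumes at the first sample still sitting in a
	// plain wal/ segment.
	for s := checkpointedUpTo + 1; s <= totalSamples; s++ {
		delivered[s] = true
	}

	var lost []int
	for s := 1; s <= totalSamples; s++ {
		if !delivered[s] {
			lost = append(lost, s)
		}
	}
	fmt.Println("samples never remote-written:", lost) // [5 6 7 8]
}
```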