Time series database TSDB causing Logstash to crash #104839

mbudge · 2024-01-27T23:53:45Z

Elasticsearch Version

8.11.4

Installed Plugins

No response

Java Version

bundled

OS Version

Na

Problem Description

Hi

If something goes wrong, for example a Logstash server hangs unnoticed (the process is still running, but it has stopped processing), and the Fleet-managed metrics TSDB index moves to frozen (read-only) after 20 days, a downstream Logstash server can get stuck trying to write to the frozen index. Elasticsearch returns a 403 Forbidden, which Logstash retries indefinitely until the Elastic Agent Logstash pipeline gets stuck, which leads to data loss. We work in a regulated environment, so we can’t lose data because of TSDS. This happened recently, and I need to provide auditors with screenshots showing how long we retain data; however, there’s a gap because it has happened again.

elastic/logstash#15832

Can TSDB return a 400 status code to prevent Logstash getting stuck, or could you talk to the Logstash team about a max-retries setting or a way to stop retrying 403 Forbidden responses?
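To make the request concrete, here is a minimal Python sketch of the behaviour being asked for: a sender that stops at once on a non-retryable status (such as the 403 from a read-only index) and gives up after a bounded number of retries otherwise. It is not Logstash code and does not reflect any existing Logstash setting; `send_with_bounded_retries`, `RETRYABLE_STATUSES`, `max_retries`, and `base_delay` are all hypothetical names chosen for illustration, and the choice of which statuses count as retryable is an assumption.

```python
import time
from typing import Callable, Iterable, Tuple

# Hypothetical policy for this sketch only: which statuses are worth
# retrying is an assumption, not Logstash's actual configuration.
RETRYABLE_STATUSES = frozenset({429, 503})


def send_with_bounded_retries(
    send: Callable[[], int],
    max_retries: int = 3,
    retryable: Iterable[int] = RETRYABLE_STATUSES,
    base_delay: float = 0.0,
) -> Tuple[str, int, int]:
    """Call send() until it succeeds, is rejected, or retries run out.

    send() must return an HTTP status code. The result is
    (outcome, last_status, attempts), where outcome is "ok",
    "rejected" (non-retryable status), or "gave_up" (retries exhausted).
    """
    retryable = frozenset(retryable)
    attempts = 0
    while True:
        status = send()
        attempts += 1
        if 200 <= status < 300:
            return ("ok", status, attempts)
        if status not in retryable:
            # For example a 403 from a read-only index: retrying cannot help,
            # so report it to the caller instead of blocking the pipeline.
            return ("rejected", status, attempts)
        if attempts > max_retries:
            # The first attempt plus max_retries retries have all failed.
            return ("gave_up", status, attempts)
        # Exponential backoff between retries (no wait when base_delay is 0).
        time.sleep(base_delay * 2 ** (attempts - 1))
```

With this policy a caller can log the "rejected" and "gave_up" outcomes together with the status code and move on to the next event, rather than blocking on a write that will never succeed.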

We don’t want to use the DLQ (dead letter queue), as it means other status codes aren’t logged in the Logstash log. We use these logs to fix other issues.

thanks

Steps to Reproduce

Write to a TSDB index that is in frozen (read-only).

Logs (if relevant)

Na

DaveCTurner · 2024-01-28T11:07:30Z

Hi @mbudge,

> Can TSDB return a 400 status code to prevent Logstash getting stuck OR talk to the Logstash team about a Max retries setting or way to stop retrying 403 Forbidden responses.

Changing the response status code in this situation would count as a breaking change so I'm pretty sure we won't be doing that. We'll use elastic/logstash#15832 and the associated support case to work out what needs to change in Logstash to avoid this situation. If we identify any changes on the Elasticsearch side needed to support that work then we'll open issues or PRs here, but for now it'd be best to avoid fragmenting the discussion any further. I'm going to close this as there's no specific action for the Elasticsearch dev team to take yet.

mbudge · 2024-02-13T15:37:47Z

A better option might be to add a step in the metrics ingest pipelines to drop late metrics data.

For example, drop metrics data that arrives more than 24 hours or 7 days late.

This would massively reduce the risk of Logstash getting into a muddle.

The late metrics data would be dropped instead of Elasticsearch returning a 403 Forbidden error.
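The cutoff check suggested above could look roughly like the following Python sketch. Elasticsearch ingest pipelines are not written in Python, so this only illustrates the logic; `is_too_late` and `max_lateness` are hypothetical names, and it assumes the event carries an ISO 8601 timestamp that is treated as UTC when no offset is given.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_too_late(
    event_timestamp: str,
    max_lateness: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if the event is older than max_lateness and should be dropped."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Accept a trailing "Z" as UTC; older fromisoformat versions reject it.
    ts = datetime.fromisoformat(event_timestamp.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return now - ts > max_lateness


# Usage: evaluate two events against a fixed "now".
now = datetime(2024, 2, 13, 15, 37, 47, tzinfo=timezone.utc)
drop_old = is_too_late("2024-01-27T23:53:45Z", timedelta(days=7), now)
drop_recent = is_too_late("2024-02-13T10:00:00Z", timedelta(hours=24), now)
```

Here `drop_old` is true because the event is more than 7 days old, while `drop_recent` is false because the event is only a few hours old.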

mbudge added the >bug and needs:triage labels Jan 27, 2024

mbudge mentioned this issue Jan 27, 2024

TSDB followups #98877

Open

14 tasks

DaveCTurner closed this as not planned Jan 28, 2024
