Time series database TSDB causing Logstash to crash #104839

Closed
mbudge opened this issue Jan 27, 2024 · 2 comments
Labels
>bug needs:triage Requires assignment of a team area label

Comments

@mbudge
mbudge commented Jan 27, 2024

Elasticsearch Version

8.11.4

Installed Plugins

No response

Java Version

bundled

OS Version

Na

Problem Description

Hi

If a Logstash server hangs and it goes unnoticed (the process is still running, it’s just not making progress), and the Fleet-managed metrics TSDB index rolls over to Frozen (read-only) after 20 days, a downstream Logstash server can get stuck trying to write to the frozen index. Elasticsearch returns a 403 Forbidden, which Logstash retries indefinitely until the Elastic Agent Logstash pipeline gets stuck, and that leads to data loss.

We work in a regulated environment, so we can’t lose data because of TSDS. This has happened recently: I need to provide auditors with screenshots showing how long we retain data, but there is a gap because this happened again.

elastic/logstash#15832

Can TSDB return a 400 status code to prevent Logstash getting stuck, or can you talk to the Logstash team about a max-retries setting or a way to stop retrying 403 Forbidden responses?
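To make the request concrete, here is a minimal Python sketch of the retry behaviour being asked for. It is not Logstash's implementation or configuration: `send_with_bounded_retries`, `max_retries`, `SendResult`, and the `NON_RETRYABLE` set are all hypothetical names chosen for illustration, and a real client would also add backoff between attempts.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

# Hypothetical: status codes the sender gives up on immediately.
# This is an assumption for illustration, not Logstash's actual list.
NON_RETRYABLE: FrozenSet[int] = frozenset({400, 403})


@dataclass
class SendResult:
    delivered: bool
    attempts: int
    last_status: int


def send_with_bounded_retries(
    send: Callable[[], int],
    max_retries: int = 3,
    non_retryable: FrozenSet[int] = NON_RETRYABLE,
) -> SendResult:
    """Call send() until it succeeds, hits a non-retryable status,
    or has been retried max_retries times after the first attempt."""
    attempts = 0
    while True:
        status = send()  # send() returns the HTTP status code
        attempts += 1
        if 200 <= status < 300:
            return SendResult(True, attempts, status)
        # Stop on statuses that will never succeed, or when the
        # retry budget is used up, so the pipeline cannot block forever.
        if status in non_retryable or attempts > max_retries:
            return SendResult(False, attempts, status)
```

With this shape, a write to a read-only index that answers 403 fails after one attempt and can be logged, instead of blocking the events queued behind it.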

We don’t want to use the DLQ because it means other status codes aren’t logged in the Logstash log, and we use those logs to fix other issues.

thanks

Steps to Reproduce

Write to a TSDB index that has rolled over to Frozen (read-only).

Logs (if relevant)

Na

@mbudge mbudge added the >bug and needs:triage labels Jan 27, 2024
@mbudge mbudge mentioned this issue Jan 27, 2024
14 tasks
@DaveCTurner
Contributor

Hi @mbudge,

> Can TSDB return a 400 status code to prevent Logstash getting stuck, or can you talk to the Logstash team about a max-retries setting or a way to stop retrying 403 Forbidden responses?

Changing the response status code in this situation would count as a breaking change so I'm pretty sure we won't be doing that. We'll use elastic/logstash#15832 and the associated support case to work out what needs to change in Logstash to avoid this situation. If we identify any changes on the Elasticsearch side needed to support that work then we'll open issues or PRs here, but for now it'd be best to avoid fragmenting the discussion any further. I'm going to close this as there's no specific action for the Elasticsearch dev team to take yet.

@DaveCTurner DaveCTurner closed this as not planned Jan 28, 2024
@mbudge
Author

mbudge commented Feb 13, 2024

A better option might be to add a step in the metrics ingest pipelines to drop late metrics data.

For example, drop metrics data that arrives more than 24 hours or 7 days late.

This would massively reduce the risk of Logstash getting stuck.

The late metrics data would be dropped instead of Elasticsearch returning a 403 Forbidden error.
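As a rough illustration of the lateness check, here is a small Python sketch. It only models the decision logic. In a real ingest pipeline this would presumably be a drop processor with a condition, which is not shown here. The function names, the use of `@timestamp` as the event time field, and the 24-hour default are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def is_too_late(
    event_timestamp: str,
    now: datetime,
    max_lateness: timedelta = timedelta(hours=24),
) -> bool:
    """Return True if the event is older than max_lateness relative to now."""
    # Accept a trailing "Z" (UTC), which older fromisoformat() versions reject.
    ts = datetime.fromisoformat(event_timestamp.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return now - ts > max_lateness


def split_late_events(
    events: List[Dict[str, str]],
    now: datetime,
    max_lateness: timedelta = timedelta(hours=24),
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split events into (kept, dropped) based on their @timestamp field."""
    kept, dropped = [], []
    for event in events:
        late = is_too_late(event["@timestamp"], now, max_lateness)
        (dropped if late else kept).append(event)
    return kept, dropped
```

Passing `max_lateness=timedelta(days=7)` gives the more lenient 7-day variant. Dropping the data is still data loss, so the threshold would need to fit the retention requirements described above.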
