Description
Is your feature request related to a problem? Please describe.
Discussions link: #10360
Referring to https://docs.fluentbit.io/manual/administration/scheduling-and-retries#configure-retries , the scheduler's Retry_Limit accepts three kinds of values: N, no_limits, and no_retries.
In our case the output is Loki and we set Retry_Limit to no_limits so that no logs are lost when Loki is down.
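For reference, our Loki output is configured roughly like this (host, port, and labels are illustrative; only Retry_Limit matters for this request):

```
[OUTPUT]
    Name        loki
    Match       *
    Host        loki.example.internal
    Port        3100
    Labels      job=fluent-bit
    # never give up on a chunk; keep retrying until it is delivered
    Retry_Limit no_limits
```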
On the Loki side, we allow old logs to be accepted for up to 2 days by setting a suitable value for ingester.max_chunk_age.
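The corresponding Loki setting is a sketch like the following (the exact file layout depends on the deployment):

```yaml
ingester:
  # keep chunks open long enough to accept entries up to 2 days old
  max_chunk_age: 48h
```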
Our scenario is a failure case: we manually take Loki offline, let Fluent Bit buffer/cache local chunks/logs for more than 2 days, and then bring Loki back online.
With the retry scheduler, the local chunks are flushed to Loki one by one, each with its own dedicated task_id.
However, for records whose timestamps are older than 48h, Loki refuses to accept them and reports a "write operation failed, older acceptable timestamp is xxxx" error. For Fluent Bit the flush never succeeds, so those tasks loop in endless retries.
Describe the solution you'd like
The above is the actual issue for us: when Loki comes back within 48h, all locally buffered chunks are flushed to Loki successfully under no_limits, but when the outage exceeds 48h, some logs are sent to Loki endlessly and Loki always rejects them.
Therefore, I would like to propose an additional kind of value for Retry_Limit: a timeout-based one.
For example, with Retry_Limit set to 24h, a given retry task (perhaps identified by its task_id) would behave the same as no_limits at first, but once its retries have been running for 24h, the retrying would stop.
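A hypothetical configuration with the proposed value could look like this (Retry_Limit 24h does not exist today; it is exactly what is being proposed here):

```
[OUTPUT]
    Name        loki
    Match       *
    Host        loki.example.internal
    Port        3100
    # proposed: retry like no_limits, but give up on a task once its
    # retries have been running for 24 hours
    Retry_Limit 24h
```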
Describe alternatives you've considered
Additional context
Any comments are appreciated.