Restoring buffer metadata crashes worker with empty buffer and/or corrupt metadata #1760
Comments
It seems good.
I just hit this issue as part of a Kubernetes deployment. It would be really great if fluentd could handle this kind of failure without going into a crash backoff loop until I manually SSH in and delete the corrupted buffers.
Ran into the same issue, also in a Kubernetes Deployment.
I wrote a patch to avoid this problem: #1874. I'd like to know which storage you use for the file buffer: local storage, or network/distributed storage like NFS?
@repeatedly I use local storage for the fluentd buffer before sending to elasticsearch.
Local storage
Google Persistent Disk storage mounted into pods via Persistent Volume Claims, closer in behavior to local storage than to distributed storage.
buf_file: Skip and delete broken file chunks during resume. fix #1760
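A minimal sketch of the skip-and-delete idea named in that PR title, not the actual code in #1874. `restore_chunk` is a hypothetical helper standing in for whatever parses a chunk file and its metadata, and the `buffer.*.log` / `*.log.meta` naming is assumed for illustration:

```ruby
# Sketch: "skip and delete broken file chunks during resume".
# `restore_chunk` is a hypothetical helper that builds a chunk object from its
# on-disk file and is assumed to raise if the data or metadata is corrupt.
require 'fileutils'

def resume_chunks(buffer_dir, logger)
  chunks = []
  Dir.glob(File.join(buffer_dir, 'buffer.*.log')).sort.each do |path|
    begin
      chunks << restore_chunk(path)
    rescue => e
      # Instead of letting the exception escape and crash-loop the worker,
      # log it, drop the broken chunk and its metadata, and keep resuming.
      logger.warn("skipping broken buffer chunk #{path}: #{e.class}: #{e.message}")
      FileUtils.rm_f(path)
      FileUtils.rm_f("#{path}.meta")
    end
  end
  chunks
end
```

The point is only the rescue-log-delete pattern during resume rather than letting a single corrupt file take down the whole worker.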
A bit of context first: I'm deploying fluentd on Kubernetes using the fluent/fluentd:v0.14.23 Docker image, as three pods managed by ReplicationControllers with PersistentVolumes for storing buffers (I cannot use a StatefulSet here, but that's another story). Since I upgraded from v0.14.22 to v0.14.23, only two of those pods are running fine.
The third one's worker crash-loops as soon as it starts, apparently when it tries to read its buffer metadata:
This happens at https://github.com/fluent/fluentd/blob/v0.14.23/lib/fluent/plugin/buffer/file_chunk.rb#L219
The buffers of this pod look corrupt:
I've deleted those buffers to work around this issue for now, and the pod is back to normal.
I'm not sure the upgrade is the culprit; perhaps it was just a kill that happened at the wrong time.
Maybe restore_metadata could add checks to avoid this issue.
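For reference, a small standalone script in the same spirit as the manual cleanup described above, assuming the default file buffer layout where each chunk file has a companion `<chunk file>.meta` serialized with MessagePack. The directory path and the delete-on-failure policy are assumptions for illustration, not fluentd's actual behaviour:

```ruby
# Scan a buffer directory and remove chunk/metadata pairs whose metadata is
# empty or not valid MessagePack (the symptom reported in this issue).
require 'msgpack'
require 'fileutils'

buffer_dir = ARGV.fetch(0, '/fluentd/buffer')  # assumed path, pass yours as the first argument

Dir.glob(File.join(buffer_dir, '*.meta')).sort.each do |meta_path|
  data = File.binread(meta_path)
  begin
    raise 'empty metadata file' if data.empty?
    MessagePack.unpack(data)                     # raises if the bytes are not valid msgpack
    puts "OK      #{meta_path}"
  rescue => e
    warn "BROKEN  #{meta_path} (#{e.message}), removing chunk"
    FileUtils.rm_f(meta_path)
    FileUtils.rm_f(meta_path.sub(/\.meta\z/, ''))  # the data file next to the metadata
  end
end
```

Run with the buffer directory as its only argument; it prints OK/BROKEN per metadata file and removes broken pairs, which is roughly what the linked patch automates at startup.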