Filebeat not closing deleted/rotated files (when output broken) #2395
Comments
Unfortunately this is kind of expected. Filebeat keeps the file handler open until it gets a confirmation that the file content was sent, which guarantees that all content is delivered. Otherwise it could happen that files are removed before being fully shipped and we lose log lines. As soon as the output becomes available again, the data is sent and the files are closed. We are aware that this can become an issue in some cases. The same behaviour exists if ES or LS cannot receive data fast enough and back pressure builds up. One of the solutions we are working on is #1926, which would allow setting a fixed time after which the file is closed. It is important to note that this can lead to data loss and breaks the "at least once" principle. BTW: the behaviour is the same in Filebeat 5.0. I suggest we use this issue for further discussion on this topic and potential solutions.
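The handle-until-acknowledged behaviour described above rests on a Unix property: deleting a file only removes its directory entry, and the data stays on disk until the last open descriptor is closed. A minimal Python sketch of that mechanism (file contents are illustrative):

```python
import os
import tempfile

# Sketch: why a held file handle keeps a deleted log file alive on Unix.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as w:
    w.write("log line 1\nlog line 2\n")

reader = open(path, "r")   # stands in for Filebeat's harvester
os.remove(path)            # stands in for logrotate deleting the file

# The directory entry is gone, but the inode survives while the fd is open.
assert not os.path.exists(path)
data = reader.read()       # the "deleted" content is still fully readable
reader.close()             # only now can the kernel reclaim the disk space
```

This is exactly why a broken output can pin rotated files and eat disk space: the harvester's descriptor keeps every deleted-but-unacknowledged file alive.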
For comparison, when using Redis as a buffer for a Logstash input, we have a similar problem when Elasticsearch goes down. By default, the queue fills up and consumes all memory, so setting an eviction policy remedies the situation. If the system stops consuming, then mustn't data loss occur somewhere, eventually?
As Beats currently doesn't have a queue, the question is after how much time data loss starts to happen. I kind of like the idea that this could be configurable, and that we would delegate it to the output instead of trying to do some magic in the harvesters. There could be a config option like drop_events_older.
+1 for drop_events_older, very intuitively named. BTW, I was considering the file itself to be the "buffer/queue".
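The drop_events_older idea discussed above could be sketched roughly like this (illustrative Python only, not Filebeat code; the class and method names are invented):

```python
import time
from collections import deque

class DroppingQueue:
    """Sketch of the proposed drop_events_older behaviour: when the output
    is blocked, events older than max_age seconds are discarded instead of
    pinning their source files open indefinitely."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self._events = deque()  # (timestamp, event) pairs, oldest first

    def push(self, event, now=None):
        self._events.append((time.time() if now is None else now, event))

    def drain_expired(self, now=None):
        """Remove and return all events older than max_age."""
        now = time.time() if now is None else now
        dropped = []
        while self._events and now - self._events[0][0] > self.max_age:
            dropped.append(self._events.popleft()[1])
        return dropped
```

The trade-off the maintainers point out applies here too: anything returned by drain_expired is lost for good, so this knowingly breaks at-least-once delivery in exchange for bounded resource usage.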
+1 for adding a queue/cache. Will you create the queue/cache in memory or on disk? If you keep it on disk, then setting a max cache disk size is another option, since we don't know how fast log messages are generated, but we do know the free disk space when Filebeat starts. Example: set the cache disk size to 200 MB, then keep adding log messages to the queue/cache and drop the oldest event when it is full. Is there any other workaround to close those files in Filebeat 1.2.3? Our co-workers are a little scared to deploy Filebeat to their production environment now after finding this file lock issue. If the disk fills up, it will affect their running applications. ELK does not have any kind of agent manager, so we have no idea about the current Filebeat status without logging in to the client machine to check. The output could be broken for many reasons: Logstash is down, network connection problems, an expired certificate, and so on. Thanks.
@fjiang212, we do not plan to implement any disk or memory queue, as the source files themselves already serve as a disk queue. The problem is when to release the lock on these files to allow deletion. The plan is to eventually close deleted files, even if the output is malfunctioning, after a certain time defined by a config option. We agree that not doing so can result in a full disk or hitting the max file descriptor limit.
@fjiang212 Can you please open a topic on Discuss for your Filebeat questions? https://discuss.elastic.co/c/beats/filebeat This will let us discuss the broader questions separately from this issue.
OK, drop_events_older is an acceptable solution for us too. Will it be fixed in the next 1.2.x release? When will it be available?
@fjiang212 we agree that this is a critical fix, as it directly affects the stability of the end system, but it's a complex change that cannot be integrated into the 1.x codebase. Please follow this thread for updates on which version will contain a fix.
We ran into this issue in production when an unreachable disk caused ES to spew 540 GB of logs over 12 hours and fill up the disk (see elastic/elasticsearch#18279 (comment) for context). @ruflin's idea of making this configurable sounds good. It was very easy to clear up what had happened with a restart, but it was a little unexpected at 11 pm on a Saturday. Would you guys be open to a PR to document the behavior and how to identify it? For us, a simple …
@EvanV A PR about the behaviour and how to resolve it would be more than welcome. One new feature that was added since this discussion started is …
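One option that did ship in Filebeat 5.x and addresses this scenario is close_timeout, which closes a harvester after a fixed duration even if not all events were acknowledged (and, per the closing comment below, #3511 extends this to the blocked-output case). A minimal prospector sketch, assuming 5.x configuration syntax; the path and duration are illustrative:

```yaml
filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/app/*.log
    # Release the file handle after 30 minutes, even if the output has
    # not acknowledged all events. Note: this can drop data and breaks
    # the "at least once" guarantee, as discussed in this thread.
    close_timeout: 30m
```

The knob is a deliberate trade: bounded file-descriptor and disk usage in exchange for possible data loss while the output is down.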
bump on this issue, as this particular problem can take down production applications writing logs. |
Closing this as it should be solved by #3511 @PhaedrusTheGreek It would be great to get some testing on this one. |
RHEL 6.7
Filebeat 1.2.3
The following utility will just generate logs and rotate them out: (redacted)
To reproduce:
./filebeat -c above-config.yml
java -Dcom.<redacted>.elk.loggenerator.logcount=1000000 -Dcom.<redacted>.elk.loggenerator.sleeptime=2 -jar loggenerator.jar
watch "lsof | grep filebeat | grep deleted "
This only seems to happen when the output is broken. Behaviour is normal when changed to a working output.
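The lsof check in the reproduction steps can also be scripted without lsof on Linux by reading /proc, which is handy for automated monitoring. A sketch (the function name is invented; Linux-only, since it relies on /proc fd symlinks):

```python
import os
import tempfile

def deleted_open_files(pid):
    """List files a process still holds open after they were deleted.

    Rough equivalent of `lsof -p <pid> | grep deleted`: on Linux the
    /proc/<pid>/fd symlinks of deleted files end in ' (deleted)'.
    """
    fd_dir = "/proc/%d/fd" % pid
    hits = []
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed between listdir and readlink
        if target.endswith(" (deleted)"):
            hits.append(target)
    return hits

# Demo: open a temp file, delete it, and detect the dangling handle
# in our own process.
fd, path = tempfile.mkstemp()
os.remove(path)
dangling = deleted_open_files(os.getpid())
os.close(fd)
```

Pointing this at the Filebeat PID (or alerting when the list grows) would surface the stuck-handle condition described in this issue without a manual login.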