
Filebeat not closing deleted/rotated files (when output broken) #2395

Closed
PhaedrusTheGreek opened this issue Aug 26, 2016 · 13 comments

@PhaedrusTheGreek (Contributor) commented Aug 26, 2016

RHEL 6.7
Filebeat 1.2.3

filebeat:
  # List of prospectors to fetch data.
  prospectors:
    -
      paths:
        - /tmp/logs/tester*.log
      input_type: log
      ignore_older: 168h
      close_older: 5m
      document_type: bw
      timeout: 5s
      backoff: 5s
  registry_file: /tmp/filebeat/registry

output:
  logstash:
    hosts: ["unknownhost:5045"]

The following utility will just generate logs and rotate them out: (redacted)
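
If you don't have the redacted generator handy, any rotate-while-writing loop reproduces the same pattern. A rough, hypothetical stand-in in shell (illustrative only, not the original tool):

mkdir -p /tmp/logs
while true; do
  # append a burst of lines, then rotate the files the way logrotate would
  seq 1 10000 | sed 's/^/log line /' >> /tmp/logs/tester11.log
  for n in 4 3 2 1; do
    [ -f /tmp/logs/tester11.log.$n ] && mv /tmp/logs/tester11.log.$n /tmp/logs/tester11.log.$((n+1))
  done
  mv /tmp/logs/tester11.log /tmp/logs/tester11.log.1
  rm -f /tmp/logs/tester11.log.5   # the delete that leaves filebeat holding dead handles
  sleep 2
done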

To reproduce:

  1. ./filebeat -c above-config.yml
  2. java -Dcom.<redacted>.elk.loggenerator.logcount=1000000 -Dcom.<redacted>.elk.loggenerator.sleeptime=2 -jar loggenerator.jar
  3. watch "lsof | grep filebeat | grep deleted "
Every 2.0s: lsof | grep filebeat | grep deleted                                                                                                              Fri Aug 26 14:48:01 2016

filebeat   9179      root   13r      REG              253,0 20971546    3014698 /tmp/logs/tester11.log.4 (deleted)
filebeat   9179      root   14u      REG              253,0 20971546    3014695 /tmp/logs/tester11.log.4 (deleted)
filebeat   9179      root   15r      REG              253,0 20971562    3014699 /tmp/logs/tester11.log.4 (deleted)
filebeat   9179      root   16u      REG              253,0 20971529    3014700 /tmp/logs/tester11.log.4 (deleted)

This only seems to happen when the output is broken. Behaviour returns to normal when switching to a working output:

output:
  file:
    path: "/tmp/filebeat"
    filename: filebeat
    rotate_every_kb: 1000
    number_of_files: 7

@ruflin (Member) commented Aug 29, 2016

Unfortunately, this is more or less expected. Filebeat keeps the file handle open until it gets confirmation that the file's content was sent; this guarantees that all content is delivered. Otherwise, a closed file could be deleted before its remaining lines were sent, and we would lose log lines. As soon as the output becomes available again, the pending data is sent and the files are closed.

We are aware that this can become an issue in some cases. The same behaviour exists if ES or LS cannot receive data fast enough and back pressure builds up. One of the solutions we are working on is #1926, which would allow a fixed time to be set after which the file is closed. It is important to note that this can lead to data loss and breaks the "at least once" principle. BTW: the behaviour is the same in filebeat 5.0.

I suggest we use this issue for further discussion on this topic and potential solutions.

@PhaedrusTheGreek (Contributor, author) commented Aug 29, 2016

For comparison, when using Redis as a buffer in front of a Logstash input, we have a similar problem when Elasticsearch goes down: by default the queue fills up and consumes all memory, so setting an eviction policy remedies the situation.
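
For reference, that remedy is plain redis.conf tuning; the values here are illustrative:

# redis.conf: cap the buffer's memory and evict old keys instead of
# erroring out when Elasticsearch stops consuming
maxmemory 200mb
maxmemory-policy allkeys-lru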

If the system stops consuming, mustn't data loss eventually occur somewhere?

@ruflin (Member) commented Aug 29, 2016

As beats currently don't have a queue, the question is after how much time data loss should start to happen. I like the idea that this could be configurable, and that we would delegate it to the output instead of trying to do some magic in the harvesters. There could be a config option output.drop_events_older: 1m which would start dropping events that the output has been trying to send for more than a minute. By default this would be disabled, keeping the current behaviour in place.
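
Against the config from the original report, the proposed option might look like this (a sketch only; drop_events_older is a proposal from this thread, not an existing setting, and its exact placement is assumed):

output:
  logstash:
    hosts: ["unknownhost:5045"]
  # proposed, not implemented: drop events the output has been failing to
  # deliver for longer than this, so harvesters can close handles on
  # rotated/deleted files
  drop_events_older: 1m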

@PhaedrusTheGreek (Contributor, author) commented Aug 29, 2016

+1 for drop_events_older, very intuitively named.

BTW, I was considering the file itself to be the "buffer/queue".

@fjiang212 commented

+1 for adding a queue/cache. Would you create the queue/cache in memory or on disk? If you keep it on disk, then setting a maximum cache size is another option: we don't know how fast log messages are generated, but we do know how much disk space is free when filebeat starts. For example: set the cache size to 200MB, keep adding log messages to the queue/cache, and drop the oldest events when it is full.

Is there any other workaround to close those files in Filebeat 1.2.3? Our co-workers are a little scared to deploy filebeat to their production environment after finding this file-lock issue; if the disk fills up, it will affect their running applications. The ELK stack does not have any kind of agent manager, so we have no idea about the current filebeat status without logging in to the client machine to check. The output could be broken for many reasons: Logstash is down, a network connection problem, an expired certificate...

Thanks,
Feng Jiang

@PhaedrusTheGreek (Contributor, author) commented

@fjiang212, we do not plan to implement any disk or memory queue, as the source files themselves already serve as a disk queue. The problem is deciding when to release the lock on these files so they can actually be deleted.

I think the plan is to eventually close deleted files after a certain time, defined by output.drop_events_older, even if the output is malfunctioning.

We agree that not doing so can result in a full disk or in hitting the maximum file descriptor limit.

@ruflin (Member) commented Aug 29, 2016

@fjiang212 Could you please open a topic on Discuss for your general filebeat questions? https://discuss.elastic.co/c/beats/filebeat That will allow us to discuss things unrelated to this issue on a broader scale.

@fjiang212 commented

OK, drop_events_older is an acceptable solution for us too. Will it be fixed in the next 1.2.x release? When will it be available?

@PhaedrusTheGreek (Contributor, author) commented

@fjiang212, we agree that this is a critical fix, as it directly affects the stability of the end system, but it's a complex change that cannot be integrated into the 1.x codebase. Please follow this thread for updates on which version will contain the fix.

@evanvolgas commented

We ran into this issue in production when an unreachable disk caused ES to spew 540GB of logs over 12 hours and fill up the disk (see elastic/elasticsearch#18279 (comment) for context). @ruflin's idea to add a drop_events_older flag seems very reasonable to us as well.

It was very easy to clear up what had happened with a restart, but it was a little unexpected at 11 pm on a Saturday. Would you be open to a PR documenting the behavior and how to identify it? For us, a simple lsof +L1 was all it took to show what was wrong and why space wasn't being freed after we deleted the massive log.
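
For anyone chasing the same symptom, the check is just the following (output will vary by system):

# list open files whose link count is below 1, i.e. deleted but still
# held open, and show which filebeat process owns the handles
lsof +L1 | grep filebeat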

@ruflin (Member) commented Oct 25, 2016

@EvanV A PR about the behaviour and how to resolve it would be more than welcome.

One new feature added since this discussion started is close_timeout: https://www.elastic.co/guide/en/beats/filebeat/5.0/configuration-filebeat-options.html#close-timeout It gives each harvester a predefined lifetime. This can mitigate the problem when an output is not fast enough, but it will NOT help when the output has stopped working altogether.
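
In filebeat 5.0 prospector syntax that looks roughly like this (paths taken from the repro above):

filebeat.prospectors:
- input_type: log
  paths:
    - /tmp/logs/tester*.log
  # close the harvester after this lifetime, even if the file has not been
  # fully read; per the caveat above, this does not help while the output
  # is down entirely
  close_timeout: 5m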

@PhaedrusTheGreek (Contributor, author) commented

Bumping this issue, as this particular problem can take down the production applications that write the logs.

@ruflin (Member) commented Feb 14, 2017

Closing this, as it should be solved by #3511.

@PhaedrusTheGreek It would be great to get some testing on this one.
