Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filebeat not closing deleted/rotated files (when output broken) #2395

Closed
PhaedrusTheGreek opened this issue Aug 26, 2016 · 13 comments
Closed

Comments

@PhaedrusTheGreek
Copy link
Contributor

PhaedrusTheGreek commented Aug 26, 2016

RHEL 6.7
Filebeat 1.2.3

filebeat:
  # List of prospectors to fetch data.
  prospectors:
    -
      paths:
        - /tmp/logs/tester*.log
      input_type: log
      ignore_older: 168h
      close_older: 5m
      document_type: bw
      timeout: 5s
      backoff: 5s
  registry_file: /tmp/filebeat/registry

output:
  logstash:
    hosts: ["unknownhost:5045"]

The following utility will just generate logs and rotate them out: (redacted)

To reproduce:

  1. ./filebeat -c above-config.yml
  2. java -Dcom.<redacted>.elk.loggenerator.logcount=1000000 -Dcom.<redacted>.elk.loggenerator.sleeptime=2 -jar loggenerator.jar
  3. watch "lsof | grep filebeat | grep deleted "
Every 2.0s: lsof | grep filebeat | grep deleted                                                                                                              Fri Aug 26 14:48:01 2016

filebeat   9179      root   13r      REG              253,0 20971546    3014698 /tmp/logs/tester11.log.4 (deleted)
filebeat   9179      root   14u      REG              253,0 20971546    3014695 /tmp/logs/tester11.log.4 (deleted)
filebeat   9179      root   15r      REG              253,0 20971562    3014699 /tmp/logs/tester11.log.4 (deleted)
filebeat   9179      root   16u      REG              253,0 20971529    3014700 /tmp/logs/tester11.log.4 (deleted)

This only seems to happen when the output is broken. Behaviour is normal when changed to a working output:

output:
  file:
    path: "/tmp/filebeat"
    filename: filebeat
    rotate_every_kb: 1000
    number_of_files: 7
@ruflin
Copy link
Contributor

ruflin commented Aug 29, 2016

Unfortunately this is kind of expected. The reason is that filebeat keeps the file handler open until it gets a confirmation that the file content was sent. This guarantees that all file content is sent. Otherwise it could happen that the close files are removed and we loose log lines. As soon as the output becomes available again, the data will be sent and the files closed.

We are aware that this is can become an issue in some cases. The same behaviour exist if ES or LS cannot receive data fast enough and some back pressure is created. One of the solutions we are working on is #1926 that a fixed time can be set until the file is closed. It is important to note that this can lead to data loss and breaks the "at least once" principle. BTW: The behaviour is the same in filebeat 5.0

I suggest we use this issue for further discussion on this topic and potential solutions.

@PhaedrusTheGreek
Copy link
Contributor Author

PhaedrusTheGreek commented Aug 29, 2016

For comparison, when using Redis as a buffer of a Logstash Input, we have a similar problem in the case where Elasticsearch goes down. By default, the queue will fill up and consume all memory, and so setting an eviction policy remedies the situation.

If the system stops consuming, then mustn't data loss have to occur somewhere, eventually?

@ruflin
Copy link
Contributor

ruflin commented Aug 29, 2016

As beats currently doesn't have a queue, the question would be after which time data loss starts to happen. I kind of like the idea that this could be configurable and we would delegate this to the output instead of trying to do some magic in the harvesters. There could be a config option output.drop_events_older: 1m which will start dropping events which it tried to send for more then a minute. By default this would be disabled and the same behaviour as now would be in place.

@PhaedrusTheGreek
Copy link
Contributor Author

PhaedrusTheGreek commented Aug 29, 2016

+1 drop_events_older - very intuitively named.

BTW I was considering the file itself to the "buffer/queue"

@fjiang212
Copy link

+1 for adding a queue/cache. Will you create queue/cache in the memory or disk? If you keep it in the disk then setting the max cache disk size is another choice since we don't know how fast the log message is generated, but we do know what is free disk space when we start filebeat. Ex: Set the cache disk size as 200MB, then we can keep adding the log message to queue/cache, drop the oldest log event if the cache/queue is full.

Any other workaround to close those files in Filebeat 1.2.3? Our co-worker is a little bit scared to deploy filebeat to their production environment now after finding this file lock issue. If the disk is full it will affect their running application. ELK does not have some kind agent manager, so we have no idea about the current filebeat status without login to the client machine to check it. Output could be broken for many reasons: logstash is down, network connection problem, certificate is expired.....

Thanks,
Feng Jiang

@PhaedrusTheGreek
Copy link
Contributor Author

@fjiang212 , we do not plan to implement any disk or memory queue, as the source files themselves already serve as a disk queue. The problem is when to release the lock on these files to allow delete.

I think the plan will be to eventually close deleted files, even if the output is malfunctioning, after a certain time, defined by output.drop_events_older.

We agree that not doing so can result in a full disk or max file descriptor limit.

@ruflin
Copy link
Contributor

ruflin commented Aug 29, 2016

@fjiang212 Can you please open a topic on discuss for your filebeat questions? https://discuss.elastic.co/c/beats/filebeat This will allow us to discuss on a broader scale unrelated to this issue.

@fjiang212
Copy link

k, drop_events_older is acceptable solution for us too. Will it fix in the next 1.2.x release ? When will it be available?

@PhaedrusTheGreek
Copy link
Contributor Author

@fjiang212 we agree that this is a critical fix, as it directly affects the stability of the end system, but it's a complex change that cannot be integrated with the 1.x codebase. Please follow this thread for updates on which version will contain a fix.

@evanvolgas
Copy link

We ran into this issue in production when an unreachable disk caused ES to spew 540gb of logs over 12 hours and fill up the disk (see elastic/elasticsearch#18279 (comment) for context). @ruflin's idea to add a drop_events_older flag seems very reasonable to us as well.

It was very easy to clear up what had happened with a restart, but it was a little unexpected at 11 pm on a Sat. Would you guys be open to a PR to document the behavior and how to identify it? For us, a simple lsof +L1 was all it took to highlight what was wrong and why we weren't freeing up space after we deleted the massive log.

@ruflin
Copy link
Contributor

ruflin commented Oct 25, 2016

@EvanV A PR about the behaviour and how to resolve it would be more then welcome.

One new feature that was added since this discussion started is close_timeout: https://www.elastic.co/guide/en/beats/filebeat/5.0/configuration-filebeat-options.html#close-timeout This gives each harvester a predefined lifetime. This can mitigate the problem in case an output is not fast enough, but it will NOT help in case the output is not working anymore.

@PhaedrusTheGreek
Copy link
Contributor Author

bump on this issue, as this particular problem can take down production applications writing logs.

@ruflin
Copy link
Contributor

ruflin commented Feb 14, 2017

Closing this as it should be solved by #3511

@PhaedrusTheGreek It would be great to get some testing on this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants