
filebeat preventing freeing of disk space on deleted files #896

Closed
nheath opened this issue Feb 1, 2016 · 13 comments

@nheath

nheath commented Feb 1, 2016

We have a bunch of Windows services that generate large amounts of logs using the Enterprise Library Logging Application Block. I've installed the latest filebeat 1.0.1 as a Windows service that forwards to a cluster of Linux logstash servers, with output to a cluster of Linux elasticsearch servers. We have scheduled tasks running daily that delete log files older than a given time interval.

This weekend, we received some alerts about server disks filling up. Our on-call staff tried to review and delete log files, but were getting odd permissions issues suggesting they had no access. At a loss for what to do next, they restarted the server, and a bunch of log files disappeared. This morning I came in, caught up, and realized what had happened: the scheduled task had deleted the files, but they were still showing in Explorer, and the disk had not released the space. I confirmed on another server by restarting the filebeat service and watching a number of files disappear.

My filebeat config looks like this:

############################# Filebeat ######################################

filebeat:
  registry_file: C:\logstash-filebeat\.filebeat
  prospectors:
    -
      paths:
        - C:\Logs\*\*.log
      input_type: log
      encoding: plain
      document_type: application
      # due to enterprise library not updating modified dates,
      # some servers/log files may need longer harvest times
      # ignore_older: 720h # 30 days
      ignore_older: 168h # 7 days
      force_close_files: true
    -
      paths:
        - C:\inetpub\logs\LogFiles\*\*.log
      input_type: log
      encoding: utf8
      document_type: iis
      ignore_older: 24h

############################# Output ##########################################

output:
  logstash:
    enabled: true
    hosts:
      - sac-logstash1:5044
      - sac-logstash2:5044
    loadbalance: true

############################# Logging #########################################

logging:
  to_files: true
  level: info
  files:
    path: C:\logstash-filebeat\logs\
    name: filebeat.log
    rotateeverybytes: 10485760 # = 10MB
    keepfiles: 10

Some more context:

Our servers' disks are undersized and of various capacities, and different services generate varying amounts of logs, some up to 300MB/hr. This has generally been troublesome for our management of free space and regularly causes issues. Our IT dept has installed a bunch of one-off scheduled tasks to keep disks from filling up, deleting files by last modified date, some retaining a week's worth of logs, some a day's.

Our logging library does not release file handles on actively logging files, and modified dates do not get updated until a file handle is released. Files are set to roll over at 10MB, so depending on the rate of logging a single 10MB file can last anywhere from a few minutes to a month. This leads to a poor scenario where files can be seen as 'older' than the ignore_older setting and are not harvested. Consequently, we've bumped the setting up from 1 day to 7 days to try to pick up these files. This setting is a compromise and will probably need to be adjusted on a per-server basis.

Lastly, I realize that this ignore_older setting is likely at play here and that filebeat will continue to monitor files up until this interval. Is it expected behavior that filebeat will keep an open file handle and prevent deletion/release of files? Second, while I have not seen it directly, our on-call team reports some files were up to a month old, which suggests there is a bug where filebeat is monitoring and holding files older than the specified interval.

@ruflin
Member

ruflin commented Feb 2, 2016

Unfortunately, in 1.0.1 ignore_older has multiple "features" which overlap. That is the reason we are introducing close_older in addition (#718). In the future you will be able to set close_older to a low value to release file handles and ignore_older to a high value to make sure all files are still picked up.
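Once close_older is available, that combination would look roughly like this in a prospector (a sketch only; the values are illustrative):

filebeat:
  prospectors:
    -
      paths:
        - C:\Logs\*\*.log
      input_type: log
      close_older: 1h    # release the file handle after 1h without new lines
      ignore_older: 168h # but still pick the file up again if it changes within 7 days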

Filebeat should definitely not keep file handles open longer than ignore_older, but it is expected that it keeps files open until ignore_older is reached (which blocks file deletion). There are a lot of improvements related to this handling of files in the upcoming 1.1; unfortunately close_older is not part of 1.1 but of the next major release.

There is an additional flag which I don't like to recommend but which can sometimes help: force_close_files (https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-configuration-details.html#_force_close_files). Have a look at whether this could work in your setup.

Another option is to try our nightly builds, which already include the close_older feature: https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/

@nheath
Author

nheath commented Feb 2, 2016

Thanks for the feedback. The close_older setting may suit our particular combination of factors quite well, though ultimately the root problem is our logging framework not updating timestamps. I'll keep an eye out for it and try it when it becomes available in a release.

I've dropped the ignore_older setting significantly (to 12 hours) on one of our problematic servers that generates high volumes. If everything is working as expected, we should see the older files being deleted tonight.

I am actually already using the force_close_files setting, as posted in my config above. If my understanding of that setting is correct, it should allow deletion of the files and release of the free space, since filebeat should release the handle when the deletion/move is detected. (Or does it close the file after each check interval? More technical details here would be awesome.) However, the observed behavior suggests this isn't working (or I am misunderstanding). I'm also confused by the file apparently being deleted but the space not being freed. Is it the force_close_files setting that allows the file to be deleted? To restate: the file appears to be deleted, since it cannot be opened, nor can you view or change its permissions, though it does show in Explorer. In this state, is the file really deleted? Is it possible for processes to append lines to it? If deleted, how will filebeat ever harvest new lines from the file?

@ruflin
Member

ruflin commented Feb 3, 2016

We just released 1.1. It would be nice if you could upgrade to 1.1 and see if you still have the same issues.

Regarding the timestamp not being updated: we sometimes see this issue on shared drives, as they seem to cache some information. Could this also apply in your case?

force_close_files works as follows: normally filebeat identifies a file based on the 3 identifiers listed here (https://github.com/elastic/beats/blob/master/filebeat/input/file_windows.go#L13). The filename itself is only secondary, as it can change over time (rotation). If a filename disappears, filebeat keeps the harvester open, as it assumes the file was rotated, and waits until the ids show up again under a different file name so that it can continue reading the file immediately. What force_close_files does is close the file handle as soon as the file is no longer found under the same file name (the r.config.forceClose check in the source). The disadvantage is that the file is only discovered / read again after scan_frequency.

So force_close_files should definitely close the file handle and make it possible to delete the file. If this is not the case, it sounds like a bug.
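A minimal prospector sketch of that combination (values are illustrative; lowering scan_frequency, which defaults to 10s, narrows the re-discovery gap that force_close_files introduces):

filebeat:
  prospectors:
    -
      paths:
        - C:\Logs\*\*.log
      input_type: log
      force_close_files: true
      scan_frequency: 5s  # rotated/renamed files are picked up again at the next scan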

@vjsamuel
Contributor

I have noticed the same issue on Atomic Linux, where files removed by logrotate are not fully deleted because filebeat is holding the file handle:

bash-4.3# lsof | grep delete | wc -l
469
bash-4.3# ps aux | grep filebeat
root     14049  0.0  0.0  11356  2236 pts/2    S+   18:01   0:00 grep filebeat
root     27539  5.4  0.0 1363460 31568 ?       Sl   May12 291:45 /filebeat/filebeat
bash-4.3# kill -9 27539
bash-4.3# ps aux | grep filebeat
root     14069  0.5  0.0 1028208 19852 ?       Sl   18:01   0:02 /filebeat/filebeat
root     15643  0.0  0.0  11356  2136 pts/2    S+   18:09   0:00 grep filebeat
bash-4.3# lsof | grep delete | wc -l
45

@ruflin
Member

ruflin commented May 17, 2016

@vjsamuel Which filebeat version are you using? Can you share your config?

@vjsamuel
Contributor

@ruflin I'm on commit e7b2a92

This is my config:

filebeat:
  # List of prospectors to fetch data.
  prospectors:
    - input_type: log
      document_type: beats
      paths:
        - /var/log/*beat.log
      #tail_files: false

    - input_type: log
      document_type: app-logs
      paths:
        - /var/lib/docker/containers/*/*.log
      # Decode JSON options. Enable this if your logs are structured in JSON.
      json:
        message_key: log
        keys_under_root: true
        #overwrite_keys: false
        #add_error_key: false

    - input_type: log
      document_type: system-messages
      paths:
        - /var/log/messages
      # Decode JSON options. Enable this if your logs are structured in JSON.
      json:
        message_key: MESSAGE
        keys_under_root: true
        #overwrite_keys: false
        #add_error_key: false

output:
  ### Logstash as output
  logstash:
    hosts: ["localhost:5044"]
    worker: 10
shipper:
  tags: ["log"]

logging:
  to_files: true
  to_syslog: false

  files:
    path: /var/log
    name: filebeat.log
    keepfiles: 3
  #selectors: [ ]
  level: info

The issue got resolved when I added:

      force_close_files: true
      close_older: 1h

@ruflin
Member

ruflin commented May 19, 2016

Hm, strange. You seem to be on alpha2. close_older should be set to 1h by default and not require force_close_files. Did you add the lines above to all prospectors?
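For reference, both settings are prospector-level options, so applying them to all prospectors would mean repeating them under each entry, e.g. (a sketch based on the config above):

filebeat:
  prospectors:
    - input_type: log
      document_type: beats
      paths:
        - /var/log/*beat.log
      close_older: 1h
      force_close_files: true

    - input_type: log
      document_type: app-logs
      paths:
        - /var/lib/docker/containers/*/*.log
      close_older: 1h
      force_close_files: true
      json:
        message_key: log
        keys_under_root: true

    # ... and likewise for the system-messages prospector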

@knap1930

knap1930 commented Jul 3, 2016

I am on version filebeat-1.2.3-1.x86_64 but still seem to have the same issue. If my logstash server goes down, filebeat will keep my logs open forever, causing my server to run out of hard drive space.

I have configured my application server to rotate logs every hour, or once they reach 100MB.

My config file, and a list showing all the open files filebeat is keeping open:

filebeat:
  prospectors:
    -
      paths:
        - /var/log/glassfish/server.log
      encoding: utf-8
      input_type: log
      close_older: 5m
      ignore_older: 2h
      fields_under_root: false

output:
  logstash:
    hosts: ["logstash.xxxxxxxxx.com:5044"]

shipper:
  tags: ["stream-service"]
  fields_under_root: false
  fields:
    ip: "stream-service"

logging:
  files:
    rotateeverybytes: 10485760
[root@stream-service-prod-01 4382]# ls -l /proc/4382/fd
total 0
lrwx------ 1 root root 64 Jul  3 01:33 0 -> socket:[15489]
lrwx------ 1 root root 64 Jul  3 01:33 1 -> anon_inode:[eventpoll]
lr-x------ 1 root root 64 Jul  3 01:33 10 -> /var/log/glassfish/server.log.2016-07-01-15.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 11 -> /var/log/glassfish/server.log.2016-07-01-19.2 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 12 -> /var/log/glassfish/server.log.2016-07-01-20.2 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 13 -> /var/log/glassfish/server.log.2016-07-01-18.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 14 -> /var/log/glassfish/server.log.2016-07-02-03.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 15 -> /var/log/glassfish/server.log.2016-07-01-21.2 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 16 -> /var/log/glassfish/server.log.2016-07-01-19.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 17 -> /var/log/glassfish/server.log.2016-07-01-21.5 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 18 -> /var/log/glassfish/server.log.2016-07-01-22.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 19 -> /var/log/glassfish/server.log.2016-07-01-20.0 (deleted)
lrwx------ 1 root root 64 Jul  3 01:33 2 -> socket:[15490]
lr-x------ 1 root root 64 Jul  3 01:33 20 -> /var/log/glassfish/server.log.2016-07-01-14.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 21 -> /var/log/glassfish/server.log.2016-07-01-22.3 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 22 -> /var/log/glassfish/server.log.2016-07-01-15.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 23 -> /var/log/glassfish/server.log.2016-07-01-22.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 24 -> /var/log/glassfish/server.log.2016-07-01-16.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 25 -> /var/log/glassfish/server.log.2016-07-01-16.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 26 -> /var/log/glassfish/server.log.2016-07-01-17.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 27 -> /var/log/glassfish/server.log.2016-07-01-17.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 28 -> /var/log/glassfish/server.log.2016-07-01-21.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 29 -> /var/log/glassfish/server.log.2016-07-01-18.0 (deleted)
lrwx------ 1 root root 64 Jul  3 01:33 3 -> socket:[15491]
lr-x------ 1 root root 64 Jul  3 01:33 30 -> /var/log/glassfish/server.log.2016-07-01-19.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 31 -> /var/log/glassfish/server.log.2016-07-01-20.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 32 -> /var/log/glassfish/server.log.2016-07-01-21.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 33 -> /var/log/glassfish/server.log.2016-07-01-21.3 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 34 -> /var/log/glassfish/server.log.2016-07-01-21.4 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 35 -> /var/log/glassfish/server.log.2016-07-01-22.2 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 36 -> /var/log/glassfish/server.log.2016-07-01-23.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 37 -> /var/log/glassfish/server.log.2016-07-01-22.4 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 38 -> /var/log/glassfish/server.log.2016-07-01-23.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 39 -> /var/log/glassfish/server.log.2016-07-01-23.2 (deleted)
lrwx------ 1 root root 64 Jul  3 01:33 4 -> socket:[15492]
lr-x------ 1 root root 64 Jul  3 01:33 40 -> /var/log/glassfish/server.log.2016-07-01-23.3 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 41 -> /var/log/glassfish/server.log.2016-07-02-00.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 42 -> /var/log/glassfish/server.log.2016-07-02-00.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 43 -> /var/log/glassfish/server.log.2016-07-02-00.2 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 44 -> /var/log/glassfish/server.log.2016-07-02-00.3 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 45 -> /var/log/glassfish/server.log.2016-07-02-00.4 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 46 -> /var/log/glassfish/server.log.2016-07-02-00.5 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 47 -> /var/log/glassfish/server.log.2016-07-02-00.6 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 48 -> /var/log/glassfish/server.log.2016-07-02-01.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 49 -> /var/log/glassfish/server.log.2016-07-02-01.1 (deleted)
lrwx------ 1 root root 64 Jul  3 01:33 5 -> socket:[15493]
lr-x------ 1 root root 64 Jul  3 01:33 50 -> /var/log/glassfish/server.log.2016-07-02-01.2 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 51 -> /var/log/glassfish/server.log.2016-07-02-01.3 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 52 -> /var/log/glassfish/server.log.2016-07-02-02.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 53 -> /var/log/glassfish/server.log.2016-07-02-01.4 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 54 -> /var/log/glassfish/server.log.2016-07-02-02.2 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 55 -> /var/log/glassfish/server.log.2016-07-02-02.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 56 -> /var/log/glassfish/server.log.2016-07-02-02.3 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 57 -> /var/log/glassfish/server.log.2016-07-02-03.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 58 -> /var/log/glassfish/server.log.2016-07-02-03.2 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 59 -> /var/log/glassfish/server.log.2016-07-02-04.0 (deleted)
lrwx------ 1 root root 64 Jul  3 01:33 6 -> socket:[15494]
lr-x------ 1 root root 64 Jul  3 01:33 60 -> /var/log/glassfish/server.log.2016-07-02-04.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 61 -> /var/log/glassfish/server.log.2016-07-02-04.2 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 62 -> /var/log/glassfish/server.log.2016-07-02-05.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 63 -> /var/log/glassfish/server.log.2016-07-02-06.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 64 -> /var/log/glassfish/server.log.2016-07-02-05.1 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 65 -> /var/log/glassfish/server.log.2016-07-02-07.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 66 -> /var/log/glassfish/server.log.2016-07-02-08.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 67 -> /var/log/glassfish/server.log.2016-07-02-09.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 68 -> /var/log/glassfish/server.log.2016-07-02-10.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 69 -> /var/log/glassfish/server.log.2016-07-02-11.0 (deleted)
lrwx------ 1 root root 64 Jul  3 01:33 7 -> socket:[15495]
lr-x------ 1 root root 64 Jul  3 01:33 70 -> /var/log/glassfish/server.log.2016-07-02-13.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 71 -> /var/log/glassfish/server.log.2016-07-02-12.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 72 -> /var/log/glassfish/server.log.2016-07-02-15.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 73 -> /var/log/glassfish/server.log.2016-07-02-14.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 74 -> /var/log/glassfish/server.log.2016-07-02-16.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 75 -> /var/log/glassfish/server.log.2016-07-02-17.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 76 -> /var/log/glassfish/server.log.2016-07-02-18.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 77 -> /var/log/glassfish/server.log
lr-x------ 1 root root 64 Jul  3 01:33 78 -> /var/log/glassfish/server.log.2016-07-02-20.0 (deleted)
lrwx------ 1 root root 64 Jul  3 01:33 79 -> socket:[1094486]
lrwx------ 1 root root 64 Jul  3 01:33 8 -> socket:[15496]
lr-x------ 1 root root 64 Jul  3 01:33 80 -> /var/log/glassfish/server.log.2016-07-03-00.0 (deleted)
lr-x------ 1 root root 64 Jul  3 01:33 9 -> /var/log/glassfish/server.log.2016-07-01-22.5 (deleted)

@ruflin
Member

ruflin commented Jul 5, 2016

@knap1930 So this problem also persists when logstash comes back online?

@knap1930

knap1930 commented Jul 5, 2016

@ruflin When logstash comes back online, things will recover.

The problem is if the host's hard drive fills up before logstash comes back online. I have a cron job that moves rotated logs over to s3 after the file is rotated. Ideally, if there is a problem with logstash, filebeat would give up eventually and allow the space to be reclaimed rather than keep trying forever. I thought that was the purpose of the close_older configuration option?

@ruflin
Member

ruflin commented Jul 6, 2016

Filebeat will not release the file handle until it has successfully sent the data, so close_older only applies after the data is sent.

The option you are looking for is probably close_ttl / close_timeout; see #1600 for more details. We are still discussing the correct naming, so happy to have your feedback there. The problem with close_ttl is that it can lead to data loss (from the filebeat perspective), but it seems like you are aware of that.
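A sketch of what that could look like for the config above, assuming the option ships under the close_timeout name discussed in #1600 (the name was not final at this point, and the values are illustrative):

filebeat:
  prospectors:
    -
      paths:
        - /var/log/glassfish/server.log
      input_type: log
      close_older: 5m
      ignore_older: 2h
      close_timeout: 1h  # hypothetical: release the handle after 1h even if the output is still down (accepting possible data loss)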

@inf2ravikumar

Any suggestions? I am trying to add:

close_older: 30m

My configuration follows:

filebeat.prospectors:

# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.

- input_type: log
  paths:
    - /var/log/teakd.log
  document_type: teakd

- input_type: log
  paths:
    - /var/log/puppet/puppet.log
  document_type: puppet

- input_type: log
  paths:
    - /var/log/messages
  document_type: messeges

- input_type: log
  paths:
    - /var/log/watchdog_teak.log
  document_type: watchdog

- input_type: log
  paths:
    - /var/log/xmt.log
  document_type: xmt

- input_type: log
  paths:
    - /var/log/drive.db
  document_type: drive
  exclude_lines: ['^#','^--']

- input_type: log
  paths:
    - /var/log/secure
  document_type: secure

- input_type: log
  paths:
    - /opt/trafficserver/var/log/trafficserver/manager.log
  document_type: manager

- input_type: log
  paths:
    - /opt/trafficserver/var/log/trafficserver/diags.log
  document_type: diags

- input_type: log
  paths:
    - /opt/trafficserver/var/log/trafficserver/error.log
  document_type: error

- input_type: log
  paths:
    - /opt/trafficserver/var/log/trafficserver/traffic.out
  document_type: traffic

- input_type: log
  paths:
    - /opt/trafficserver/var/log/trafficserver/custom_ats_2.log
  document_type: custom_ats_2

  exclude_lines: ["DEBUG"]
  include_lines: ["FATAL", "ERROR", "WARN", "WARNING"]
  filebeat.publish_async: true
  fields_under_root: true
  tail_files: true
  fields_under_root: true
  ignore_older: 1h
  close_older: 30m

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["10.146.134.15:5044","10.146.134.16:5044","10.146.134.17:5044"]
  loadbalance: true
  worker: 2
  bulk_max_size: 2048
  rotateeverybytes: 10485760 # = 10MB
  keepfiles: 2

@inf2ravikumar

Trying all formats, no luck. Any help you can give?

- input_type: log
  paths:
    - /var/log/teakd.log
  document_type: teakd
  close_older: 5m
  ignore_older: 2h
  fields_under_root: false
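One thing worth checking: the filebeat.prospectors / output.logstash syntax above is the 5.x-style config, and in filebeat 5.0 close_older was renamed to close_inactive, so on a 5.x binary the prospector would need something like this instead (a sketch; values are illustrative):

filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/teakd.log
  document_type: teakd
  close_inactive: 5m  # 5.x name for close_older
  ignore_older: 2h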
