Fluentd stops processing logs, but keeps running #1630
Comments
We continue to see this issue regularly. Is there any additional detail that would help with diagnosis? |
Does this mean file rotation happened but pos file is not updated? |
I don't know for sure. I definitely saw the index positions being incremented. I am not sure if new logs were added to the pos file. |
I see something similar as well on 0.14.20 for logs in /var/log/containers on a kubernetes cluster. Eventually fluentd will not pick up any new logs in the folder. I have not gotten into looking at the pos file yet. I can say that restarting fluentd will pick up the logs and catch up with past history, so I expect the pos file is not updating. |
We are also seeing this intermittently in our Kubernetes platform. |
I am seeing the same issue running
This happens on our relatively high-volume PostgreSQL servers, with 12 CSV log files (one for each hour) rotated every 12 hours, each about 1.2GB with 4120603 lines. Details here: https://gist.github.com/goakley/d0968cb9411841e33eda7ccb896cde36 |
Sorry, I missed several comments above. @goakley Could you get the gstack (a tool bundled with gdb) result of the hanging child process? |
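For reference, a minimal sketch of how such a backtrace can be captured (assuming gdb/gstack are installed on the host and <pid> is the fluentd child worker's PID; the pgrep pattern is only illustrative):

```
# Find the fluentd child (worker) process; the pattern depends on how
# fluentd/td-agent was started.
pgrep -af fluentd

# gstack ships with the gdb package and prints the native stack of a PID.
gstack <pid>

# Equivalent with gdb directly, dumping backtraces for all threads.
gdb -p <pid> -batch -ex 'thread apply all bt'
```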
@repeatedly I couldn't get
Will a https://gist.github.com/goakley/b410ad35ab2a10694088432149de06e1 |
Is there any hope that someone can pick this up? We'd like to know if we should investigate different logging solutions to see if this is a consistent behaviour or not. |
Sorry, I forgot to reply...
I'm not sure what the cause is yet, but how about disabling inotify? If you set enable_stat_watcher false, in_tail stops using inotify and relies on its timer-based watcher instead. |
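For later readers, a minimal sketch of that setting inside an in_tail source (the path, pos_file, tag, and format below are placeholders, not taken from any configuration in this thread):

```
<source>
  @type tail
  path /var/log/containers/*.log            # placeholder path
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  format json                               # placeholder parser
  # Disable the inotify-based stat watcher so in_tail falls back to
  # its timer-based watching only.
  enable_stat_watcher false
</source>
```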
@repeatedly I have applied that change and will follow up in a few days with the application's performance. |
@repeatedly I am hitting a similar issue, and strace, pstack, and a gdb thread backtrace all show the same thing; it looks like we are hitting the exact same issue, blocked in inotify. Can you please confirm?
I will try putting enable_stat_watcher false in the config and update if it resolves the issue. I also see my worker process die and start again just a few seconds after deployment; let me know if you have insights on that. I will share the configs if you want. |
@repeatedly since applying enable_stat_watcher false we have not seen the issue again. |
Sorry for the late reply here; that fixed the issue for me too. |
I just updated the in_tail documentation article for this setting.
I assume the problem is inotify scalability. |
Hi @repeatedly, in my case I also have fluentd-1.1.2 periodically hanging and stopping processing logs. Inside the log directory that is supposed to be tailed by fluentd, I see a file with timestamp Apr 18 19:08. Configuration of the tail source:
|
* fluentd-cloudwatch: Fix Fluentd hanging fluent/fluentd#1630 * Update Chart.yaml
Have the same problem here: the Pod crashed and was restarted, but Fluentd is failing to collect logs. Maybe a symlink issue? Use realpath instead of the first symlink? |
@mblaschke fluentd supports symlink. Is your problem reproducible with symlink? |
We've seen a similar issue where the systemd input plugin just stopped moving forward. |
The suggestion by @keir-rex does not work: the container is not able to come up because the liveness check keeps failing, which restarts the container continuously. Would really like to get a fix for this. |
I'm seeing a similar issue using a custom input plugin that polls from SQS. |
This is a stale issue, so I'm closing it. If you still have a problem, updating fluentd might help. |
We have seen this issue in our containerized Fluentd instance with the tail input. As the number of files and volume of logs increases, the issue seems to occur more often. We are currently running version
Does anyone know if this is resolved in a newer version? If so, which version? |
I'm also seeing this for high-volume logs, even on the newest version of td-agent. |
To provide more datapoints here:
|
We tried both settings, including enable_stat_watcher false. We are still missing log lines and pos file updates. As we are logging very fast, files get rotated and some of the logs go missing. |
Same issue here. I've increased the inotify watcher limits, but fluentd detects nothing. |
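(For anyone else trying the watcher-limit route, this is roughly what raising the inotify limits means on the host; the values below are only illustrative, not a recommendation:)

```
# Check the current inotify limits on the host.
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances

# Raise them (illustrative values); add entries under /etc/sysctl.d/ to persist.
sudo sysctl -w fs.inotify.max_user_watches=524288
sudo sysctl -w fs.inotify.max_user_instances=1024
```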
@repeatedly I'm facing the same issue too. I've tried enable_stat_watcher false |
We are facing this issue with /var/log/containers/*.log from kubernetes; only a few logs get picked up... |
Has this problem been solved? Fluentd 1.12.0 still has this problem |
We are seeing an issue where fluentd will stop processing logs after some time, but the parent and child processes seem to be running normally.
We are running fluentd in a docker container on a kubernetes cluster, mounting the docker log volume /var/log/containers on the host.
In a recent incident, we saw logs stop being forwarded to the sumologic output, but activity continued in the fluentd log for another 12 minutes, and at some point after that it stopped picking up new logs (e.g. "following tail of...") altogether.
containers.log.pos continued being updated for 1 hour 13 minutes after the first sign of problems, until it stopped being updated. Killing the fluentd child process gets everything going again.
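For reference, a rough sketch of how the pos file can be compared against the files it tracks (assumes GNU stat; in_tail pos files are tab-separated lines of path, offset in hex, and inode in hex):

```
# Print the recorded offset (hex, 2nd field) next to each file's current size.
# Entries for unwatched/rotated-away files may show offset ffffffffffffffff
# (printed as -1 here).
while read -r path offset inode; do
  size=$(stat -c%s "$path" 2>/dev/null || echo "missing")
  echo "$path pos=$((16#$offset)) size=$size"
done < containers.log.pos   # path to your pos_file
```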
Config, strace, lsof and sigdump included below.
Details:
* fluentd or td-agent version: fluentd 0.12.37
* Environment information, e.g. OS: host 4.9.9-coreos-r1; container Debian Jessie
* Your configuration: see attachments
Attachments:
* fluentd config
* lsof of child process
* sigdump of child process
* strace of child process
* fluentd log