Restoring buffer metadata crashes worker with empty buffer and/or corrupt metadata #1760

Closed
pdecat opened this issue Nov 23, 2017 · 7 comments


@pdecat

pdecat commented Nov 23, 2017

A bit of context first: I'm deploying fluentd on Kubernetes using the fluent/fluentd:v0.14.23 Docker image, as three pods managed by ReplicationControllers with PersistentVolumes for storing buffers (I cannot use a StatefulSet here, but that's another story).

Since I upgraded from v0.14.22 to v0.14.23, only two of those pods are running fine.
The third one's worker crash-loops as soon as it starts, apparently when it tries to read its buffer metadata:

I  2017-11-23 12:15:59 +0000 [info]: #0 starting fluentd worker pid=657 ppid=5 worker=0
I  2017-11-23 12:15:59 +0000 [error]: #0 unexpected error error_class=TypeError error="no implicit conversion of Symbol into Integer"
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:219:in `[]'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:219:in `restore_metadata'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:322:in `load_existing_staged_chunk'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer/file_chunk.rb:51:in `initialize'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:144:in `new'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:144:in `block in resume'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:133:in `glob'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:133:in `resume'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buffer.rb:171:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/buf_file.rb:120:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/plugin/output.rb:415:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:165:in `block in start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:154:in `block (2 levels) in lifecycle'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:153:in `each'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:153:in `block in lifecycle'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:140:in `each'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:140:in `lifecycle'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/root_agent.rb:164:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/engine.rb:274:in `start'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/engine.rb:219:in `run'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:774:in `run_engine'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:523:in `block in run_worker'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:699:in `main_process'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/supervisor.rb:518:in `run_worker'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/lib/fluent/command/fluentd.rb:316:in `<top (required)>'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.23/bin/fluentd:5:in `<top (required)>'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/bin/fluentd:22:in `load'
I    2017-11-23 12:15:59 +0000 [error]: #0 /usr/bin/fluentd:22:in `<main>'
I  2017-11-23 12:15:59 +0000 [error]: #0 unexpected error error_class=TypeError error="no implicit conversion of Symbol into Integer"
I    2017-11-23 12:15:59 +0000 [error]: #0 suppressed same stacktrace
I  2017-11-23 12:15:59 +0000 [info]: fluent/log.rb:316:info: Worker 0 finished unexpectedly with status 1 

This happens at https://github.com/fluent/fluentd/blob/v0.14.23/lib/fluent/plugin/buffer/file_chunk.rb#L219

The buffer files of this pod look corrupt:

# ls -latr /var/log/pods/                                 
total 168                                     
drwxrwS---    2 root     fluent       16384 Nov 15 11:46 lost+found                          
drwxrwsr-x    3 root     fluent      151552 Nov 16 14:40 .                                   
-rw-rw-r--    1 fluent   fluent          75 Nov 16 14:40 buffer.b55e1a98d6e429d3dda1b152fd10035da.log.meta                                                                                
-rw-rw-r--    1 fluent   fluent           0 Nov 16 14:40 buffer.b55e1a98d6e429d3dda1b152fd10035da.log                                                                                     
-rw-rw-r--    1 fluent   fluent          76 Nov 16 14:40 buffer.b55e1a98689733872d3b28dbb9c3483e4.log.meta                                                                                
-rw-rw-r--    1 fluent   fluent           0 Nov 16 14:40 buffer.b55e1a98689733872d3b28dbb9c3483e4.log                                                                                     
drwxr-xr-x    1 root     root          4096 Nov 23 12:12 ..                                  
# hexdump /var/log/pods/buffer.b55e1a98d6e429d3dda1b152fd10035da.log.meta                                                                              
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000040
# hexdump /var/log/pods/buffer.b55e1a98689733872d3b28dbb9c3483e4.log.meta
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000040
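
Both .meta files contain only zero bytes, which would explain the exact TypeError above: MessagePack decodes a 0x00 byte as the integer 0 rather than a Hash, and Ruby's Integer#[] expects a bit index, so indexing the decoded value with a Symbol raises "no implicit conversion of Symbol into Integer". A minimal repro, assuming just the msgpack gem:

require 'msgpack'

bindata = "\x00" * 64                                 # same content as the all-zero .meta files above
data = MessagePack::Unpacker.new.feed(bindata).read   # => 0 (0x00 is the msgpack encoding of the integer 0)
data[:id]                                             # TypeError: no implicit conversion of Symbol into Integer

Note that the decode itself succeeds, so a rescue around the unpack alone would not catch this case.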

I've deleted those buffer files to work around the issue for now, and the pod is back to normal.

I'm not sure the upgrade is the culprit; perhaps the pod was simply killed at the wrong time.

Maybe restore_metadata could add checks to avoid this issue.
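
For instance, a hypothetical guard along these lines (a sketch only, not the actual fluentd code; the real restore_metadata in lib/fluent/plugin/buffer/file_chunk.rb restores more fields, and the field names below are illustrative) could treat anything that does not decode to a Hash as missing metadata:

require 'msgpack'

def restore_metadata(bindata)
  data = MessagePack::Unpacker.new(symbolize_keys: true).feed(bindata).read rescue {}
  data = {} unless data.is_a?(Hash)  # an empty/corrupt file can decode to an integer, not a Hash

  @unique_id = data[:id]             # nil here would signal "rebuild from the chunk file"
  @size      = data[:s] || 0
  # ... restore the remaining fields the same way, with defaults when a key is absent ...
end

With such a guard, a chunk with unreadable metadata would be loaded with default metadata instead of crashing the worker.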

@repeatedly
Member

> Maybe restore_metadata could add checks to avoid this issue.

That seems good.
I'm considering adding a backup_dir parameter to <system>, defaulting to /tmp. If the problem happens during restore, broken files would be moved into backup_dir.
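
A rough sketch of that idea (backup_broken_chunk and its arguments are placeholders, not the actual patch):

require 'fileutils'

# Move a chunk whose metadata could not be restored, plus its .meta file,
# into backup_dir instead of raising and crashing the worker.
def backup_broken_chunk(chunk_path, backup_dir = '/tmp')
  FileUtils.mkdir_p(backup_dir)
  [chunk_path, "#{chunk_path}.meta"].each do |path|
    FileUtils.mv(path, backup_dir) if File.exist?(path)
  end
end

backup_broken_chunk('/var/log/pods/buffer.b55e1a98d6e429d3dda1b152fd10035da.log')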

@rhefner1

I just hit this issue as part of a Kubernetes deployment. It would be really great if fluentd could handle this kind of failure without going into a crash backoff loop until I manually SSH in and delete the corrupted buffers.

@f0

f0 commented Feb 26, 2018

Ran into the same issue, also in a Kubernetes Deployment.

@repeatedly
Member

I wrote a patch to avoid this problem: #1874

I'd like to know which storage you use for the file buffer: local storage, or network/distributed storage like NFS?

@rhefner1

@repeatedly I use local storage for the fluentd buffer before sending to Elasticsearch.

@f0

f0 commented Feb 28, 2018

Local storage

@pdecat
Author

pdecat commented Feb 28, 2018

Google Persistent Disk storage mounted into pods via PersistentVolumeClaims; it is closer in behavior to local storage than to distributed storage.
