
fluentd-sender (0.14.20) goes to 100% CPU after fluentd-server goes down #1665

Closed
bungoume opened this issue Aug 18, 2017 · 30 comments
Labels
moreinfo (Missing version, need reproducible steps, need to investigate more), v0.14

Comments

@bungoume

After the connection to the server is lost, the fluentd-sender's CPU usage becomes 100%.
Even after the server is restored, CPU usage stays at the maximum.

fluentd(sender) version: 0.14.20

[screenshot: CPU usage graph]
21:36 - server down
21:38 - server recovered

fluentd sender config

https://github.com/bungoume/log-sender/blob/master/fluent.tmpl#L147-L168
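
For reference, the relevant part is an out_forward section; a minimal sketch of that kind of configuration (host names and buffer settings below are illustrative, not copied from the linked template):

    <match **>
      @type forward
      require_ack_response        # wait for the application-level ack discussed later in this thread
      <server>
        host log-aggregator.local
        port 24224
      </server>
      <buffer>
        @type file
        path /var/log/fluentd/buffer
        flush_interval 1s
      </buffer>
    </match>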

@repeatedly (Member)

Do you have reproducible steps?
I tested shutting down the destination on my Mac, but the forwarder's CPU stayed low...

repeatedly added the moreinfo (Missing version, need reproducible steps, need to investigate more) and v0.14 labels on Aug 22, 2017
@GLStephen

I've seen this behavior as well. The target server for a forwarder has a blip, and then the sending server's CPU spikes and stays high.

@repeatedly (Member)

@GLStephen Could you tell me your settings and traffic volume?
I can't reproduce it in my environment, so I want to reproduce the problem first.

@bungoume (Author)

It was reproduced when an AWS CLB (ELB) was placed in front of the aggregator (server).

log sender -> ELB -> aggregator

When the ELB->aggregator connection was closed,
the sender seems to mistakenly conclude that the aggregator server has gone down and then recovered,
and the sender's CPU goes to 100%.

2017-08-28 10:03:15 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:16 +0000 [warn]: #0 [forward_any] recovered forwarding server '' host="log-aggregator.local" port=24224
2017-08-28 10:03:16 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:16 +0000 [warn]: #0 [forward_any] failed to flush the buffer. retry_time=1 next_retry_seconds=2017-08-28 10:03:17 +0000 chunk="557cd66b41ec2fe08ea752ca6ee2d50e" error_class=Fluent::Plugin::ForwardOutput::ConnectionClosedError error="failed to establish connection with node "
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:606:in `send_data_actual'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:633:in `send_data'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:285:in `block in try_write'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:304:in `block in select_a_healthy_node'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:298:in `times'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:298:in `select_a_healthy_node'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/out_forward.rb:285:in `try_write'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/output.rb:1034:in `try_flush'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/output.rb:1268:in `flush_thread_run'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin/output.rb:420:in `block (2 levels) in start'
  2017-08-28 10:03:16 +0000 [warn]: #0 /usr/lib/ruby/gems/2.4.0/gems/fluentd-0.14.20/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2017-08-28 10:03:16 +0000 [warn]: #0 [forward_any] recovered forwarding server '' host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:19 +0000 [warn]: #0 [forward_any] recovered forwarding server '' host="log-aggregator.local" port=24224
2017-08-28 10:03:20 +0000 [warn]: #0 [forward_any] destination node closed the connection. regard it as unavailable. host="log-aggregator.local" port=24224
2017-08-28 10:03:21 +0000 [warn]: #0 [forward_any] recovered forwarding server '' host="log-aggregator.local" port=24224

@GLStephen

@repeatedly there are some aspects of our config I would prefer not to make public. What info do you need exactly? The basics are: we are using flowcounter to forward message counts every 15 seconds, running about 1M messages a minute. We have also seen what appears to be deadlocking with 0.14 at these volumes.

@GLStephen

@repeatedly more info is here; we see this as related to the above, sending about 1M events/hour:

fluent-plugins-nursery/fluent-plugin-bigquery#143

@repeatedly (Member)

@GLStephen The above link is about the deadlock. Do you hit the 100% CPU problem together with it?

@repeatedly (Member)

@bungoume Does that mean the problem doesn't happen without AWS CLB/ELB?

@GLStephen

@repeatedly I've seen both happen together in a forwarding pair. Our aggregator, referenced above, will deadlock, and the downstream flowcounter metric server will show this 100% CPU behavior as described above.

@bungoume (Author)

@repeatedly Yes.
When the log-sender -> ELB connection is closed,
or when the log-aggregator goes down while the log-sender connects directly (without the ELB),
the problem does not happen.

@repeatedly (Member)

@GLStephen Thanks. I will fix the deadlock issue first.

@bungoume So the ELB returns a wrong response? I'm not familiar with ELB connection handling, so I want to know what is returned in this situation.

@bungoume (Author) commented Aug 29, 2017

We collected logs (fluentd log, Ruby profile, and TCP dump):
fluentd20170829.tar.gz

10.81.24.149: client (log-sender)
10.81.21.192: server (log-aggregator ELB)

Normal case

[screenshot: packet capture of the normal case]

  • the client sends the message (No. 7) and a TCP ACK (No. 10)
  • the server returns a software ACK (No. 12)
    ex: ackVX21QxVErW1j597Kn12TZQ==
  • FIN by the client (No. 16-18)

Problem case

[screenshot: packet capture of the problem case]

  • the client sends the message (No. 538) and a TCP ACK (No. 540)
  • no software ACK from the server
  • FIN by the server (No. 543-)

This may be because the TCP-level health check succeeds.
(But that is probably not the direct reason for the 100% CPU?)
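
For context, the "software ACK" here appears to be the forward protocol's ack option: with require_ack_response, the sender attaches a chunk id to every request and only considers the transfer complete when the receiver echoes that id back. A minimal sketch of that exchange in Ruby (using the msgpack gem that fluentd itself depends on; the tag, record, and id below are purely illustrative):

    require 'msgpack'
    require 'securerandom'
    require 'base64'

    # The sender attaches a unique chunk id to every forwarded request.
    chunk_id = Base64.strict_encode64(SecureRandom.random_bytes(16))

    request = [
      "app.log",                                    # tag
      [[Time.now.to_i, { "message" => "hello" }]],  # events
      { "chunk" => chunk_id }                       # option: request an ack for this chunk
    ].to_msgpack

    # A healthy aggregator answers with a map echoing the same id.
    expected_ack = { "ack" => chunk_id }.to_msgpack

In the problem case above, the ELB closes the connection before any such ack arrives, so the sender regards the node as unavailable and retries the chunk even though the TCP-level send succeeded.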

@repeatedly (Member)

@bungoume Thanks!

@GLStephen One question. Do you forward logs between fluentd nodes directly, without an ELB like bungoume uses?

@GLStephen

@repeatedly Correct, we connect directly from the forwarder to the aggregator

@repeatedly (Member)

@bungoume Could you get a sigdump result from the problematic fluentd instance?

@okkez (Contributor) commented Sep 6, 2017

@bungoume Could you try #1684?

@bungoume (Author) commented Sep 6, 2017

@repeatedly
I added require 'sigdump/setup' to /usr/bin/fluentd and attached the resulting dumps:
sigdump(normal)-1549.txt
sigdump(cpu100)-267.txt (sent SIGCONT twice)
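
For reference, the change is roughly the following (a sketch of the fluentd binstub; RubyGems may generate the actual file slightly differently on other systems):

    #!/usr/bin/env ruby
    # /usr/bin/fluentd
    require 'sigdump/setup'   # on SIGCONT, dumps all thread backtraces to /tmp/sigdump-<pid>.log
    require 'fluent/command/fluentd'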

I tried #1684, but it did not resolve the issue.
[screenshot]

@repeatedly (Member)

@bungoume Thanks.
I wrote a patch for this problem. Could you test this? > #1686

@bungoume (Author) commented Sep 7, 2017

@repeatedly Thank you!
The bug seems to have been fixed.

However, I get a lot of logs.
log:
20170907T012240.txt

2017-09-07 01:20:08 Blocked the ELB->log-agg connection
2017-09-07 01:21:40 Restored the ELB->log-agg connection
2017-09-07 01:22:38 Stopped fluentd

@repeatedly (Member)

If you have lots of chunks, that seems normal, because such logs are generated per chunk.

@repeatedly (Member)

I'm not sure whether this case is common.
If there are lots of servers behind the ELB, retried chunks are routed to live aggregators, right?

@bungoume (Author)

Yes, it does not occur often.
However, it may be triggered by a log-aggregator deployment error or a configuration bug.

It is fatal if all applications die because of fluentd.

@repeatedly (Member) commented Sep 11, 2017

it may be triggered by a log-aggregator deployment error or a configuration bug.

How could the fluentd forwarder side detect this?
We probably can't suppress these errors, because a temporary network failure causes the same error, right?

@bungoume
Copy link
Author

I think there is no need to distinguish it from a network error.
Could you please reduce the amount of logging, including for network errors?

@bungoume
Copy link
Author

@repeatedly What is the status of this issue?
This problem also occurred with AWS NLB and v0.14.23.

repeatedly added a commit that referenced this issue on Nov 29, 2017:
"Don't update retry state when failed to get ack response. fix #1665"
@bungoume (Author) commented Dec 5, 2017

@repeatedly Unfortunately, this problem has also reappeared on fluentd-1.0.0.rc1 (Ruby 2.4.2).

@bungoume (Author) commented Dec 6, 2017

fluent.txt
This is the sender log when CPU reaches 100%.

@repeatedly (Member)

Is this the same situation, or a different network sequence?

@bungoume (Author) commented Dec 8, 2017

The last report is with NLB.
It occurs when the log-aggregator stops due to an Elasticsearch failure.

@nathan-oakes commented Jun 1, 2018

I have noticed this same issue, but not on all nodes. The config uses 1.0 (td-agent 3):
Within the datacenter: [td-agent in_tail, out_forward on nodes] -> (td-agent in_forward, out_forward on the "Forwarder") -> within an AWS VPC over an IPSec tunnel (in_forward, out_elasticsearch, and out_s3 on the "Aggregator").
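
A rough sketch of the forwarder/aggregator hops of that pipeline in td-agent configuration (hosts, buckets, and ports below are illustrative, not taken from the actual setup):

    # "Forwarder" inside the datacenter
    <source>
      @type forward
      port 24224
    </source>
    <match **>
      @type forward
      <server>
        host aggregator.example.internal   # reached over the IPSec tunnel into the VPC
        port 24224
      </server>
    </match>

    # "Aggregator" inside the AWS VPC
    <source>
      @type forward
      port 24224
    </source>
    <match **>
      @type copy
      <store>
        @type elasticsearch
        host elasticsearch.example.internal
        port 9200
      </store>
      <store>
        @type s3
        s3_bucket example-log-archive
        s3_region us-east-1
        path logs/
      </store>
    </match>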
