
memory usage seems to be growing indefinitely with simple forward configuration #2926

Closed

sp0f opened this issue Apr 2, 2020 · 19 comments

Labels: bug, memory

sp0f commented Apr 2, 2020

Describe the bug
Memory usage is growing indefinitely with a very simple @type forward configuration. I'm shipping about 10-30G of logs daily, mostly during evening hours. At startup td-agent uses about 100M of RES memory, but after about 12h it's at 500M, and for half of that time it's mostly idle because there are not many logs during the night. Memory is never freed. When I switch to any 'local' output such as file or stdout, memory usage is very stable. I've seen the same behavior with the elasticsearch output, so I guess it may be something connected with network outputs ... or just my stupidity :)

There are no errors in the log file. I've tried setting RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 but it didn't fix the problem; memory usage is still growing.

I have two workers in my configuration. The problem only occurs on the one with a high amount of logs generated during day hours (worker 0). That worker is reading about 50 files at once. It may be relevant that I get a lot of 'pattern not matched' warnings - I'm in the middle of standardizing the log format for all apps.

To Reproduce
Run with a high volume of imperfectly formatted logs.

Expected behavior
Stable memory usage.

Your Environment

  • Fluentd or td-agent version: td-agent 1.9.2
  • Operating system: NAME="Amazon Linux" VERSION="2" ID="amzn" ID_LIKE="centos rhel fedora" VERSION_ID="2" PRETTY_NAME="Amazon Linux 2" ANSI_COLOR="0;33" CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2" HOME_URL="https://amazonlinux.com/"
  • Kernel version: 4.14.171-136.231.amzn2.x86_64

Your Configuration

<system>
  workers 2
  log_level warn
</system>

<worker 0>
  <source>
    @type tail
    path "/somepath/*/*.log"
    read_from_head true
    pos_file /var/log/td-agent/mpservice-es-pos-file
    tag mpservice-raw.*
    enable_stat_watcher false
    <parse>
      @type multiline
      time_key time
      time_format %Y-%m-%d %H:%M:%S,%L
      timeout 0.5
      format_firstline /^\d{4}-\d{2}-\d{2}/
      format1 /^(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \| (?<level>.*) \| (?<class>.*) \| (?<thread>.*) \| (?<message>.*)/

    </parse>
  </source>

<match *.**>
  @type forward
  <server>
    host ***
    port 24224
  </server>
  <buffer>
    @type memory
    flush_interval 2s
    flush_thread_count 2
  </buffer>
</match>

</worker>

# webapps ES
<worker 1>
  <source>
    @type tail
    path "/somepath/*/*[a-z].log"
    read_from_head true
    pos_file /var/log/td-agent/webapps-es-pos-file
    tag webapps-raw.*
    enable_stat_watcher false
    <parse>
      @type multiline
      format_firstline /^\d{4}-\d{2}-\d{2}/
      format1 /^(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \| (?<level>.*) \| (?<class>.*) \| (?<thread>.*) \| (?<message>.*)/
    </parse>
  </source>


<match *.**>
  @type forward
  <server>
    host ***
    port 24224
  </server>
  <buffer>
    @type memory
    flush_interval 2s
    flush_thread_count 2
  </buffer>
</match>
</worker>

Your Error Log

no errors in logs
aeber (Contributor) commented Apr 6, 2020

I am currently investigating a similar issue.

My setup:

  • td-agent version: 1.9.2 / 1.10.0
  • Operating system: Debian Buster / Debian Stretch
  • Kernel version: 4.19.0-8-amd64 / 4.9.0-12-amd64

I'm using the forward output with a file buffer to forward logs to an aggregator node:

    <buffer>
      chunk_limit_size 8m
      flush_at_shutdown true
      flush_interval 5s
      path /var/log/td-agent/buffer/forward.*.buf
      total_limit_size 128m
      @type file
    </buffer>

To reproduce the issue I created a dummy input with size=1000 and rate=150 and dropped the received dummy messages on the aggregator node without doing any processing.

By lowering chunk_limit_size to 1m and flush_interval to 1s I was able to make memory usage increase by about 0.5MB each second.
Setting RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=0.9 did not change this behaviour.
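
For reference, a minimal sender sketch of this repro, combining the dummy source described above with the buffer block from this comment; the aggregator host is only a placeholder, not from the thread:

<source>
  @type dummy
  tag test
  size 1000
  rate 150
</source>

<match test>
  @type forward
  <server>
    # placeholder aggregator address
    host aggregator.example
    port 24224
  </server>
  <buffer>
    @type file
    chunk_limit_size 8m
    flush_at_shutdown true
    flush_interval 5s
    path /var/log/td-agent/buffer/forward.*.buf
    total_limit_size 128m
  </buffer>
</match>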

aeber (Contributor) commented Apr 7, 2020

I just did two more tests:

Null output with buffer

Using the null output with a file buffer does not lead to an increasing memory consumption as seen with the forward output.

Affected fluentd versions

I could reproduce this issue with each fluentd version down to 1.7.0. Fluentd
1.6.3 seems to be the last version not affected by this behaviour.

repeatedly (Member) commented:

Thanks. We will check it.

repeatedly added the bug label Apr 7, 2020
aeber (Contributor) commented Apr 7, 2020

I managed to narrow it down even further: The problem first appeared after merging #2501.

Somehow the SocketCache does not seem to really cache sockets and creates a new one every few seconds.
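
A hedged side note, not something tested in this thread: since the socket cache in out_forward is tied to connection reuse, one way to probe whether it is involved might be to toggle the keepalive option and compare memory behaviour. A minimal sketch, with a placeholder destination host:

<match **>
  @type forward
  # keepalive controls connection reuse in out_forward;
  # whether disabling it avoids the growth is untested here
  keepalive false
  <server>
    host aggregator.example
    port 24224
  </server>
</match>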

ganmacs (Member) commented Apr 8, 2020

@aeber Can you give us the actual config to reproduce it?

Also, how long did you run fluentd?
I tried to reproduce it with the following configs. I did see memory increase, but after a few minutes the increase stopped and memory usage stabilized.

Am I missing something important, or does it depend on the OS?

Sender (fluentd v1.10.0 on macOS 10.15.3)

<source>
  @type dummy
  tag test
  rate 150
  size 1000
</source>

<match test>
  @type forward

  <server>
    host 127.0.0.1
    port 24224
  </server>

  <buffer>
    @type file
    chunk_limit_size 1m
    flush_at_shutdown true
    flush_interval 1s
    path ./tmp/buf
    total_limit_size 128m
  </buffer>
</match>

Receiver (fluentd v1.10.0 on macOS 10.15.3)

<source>
  @type forward
</source>

<match test>
  @type null

  <buffer>
    flush_interval 1s
    chunk_limit_size 5m
  </buffer>
</match>

aeber (Contributor) commented Apr 9, 2020

The problem only seems to appear when using the tls transport.

The configs I'm using to reproduce this:

Sender (docker image fluentd:v1.9-1)

<source>
  @type dummy
  size 1200
  rate 450
  tag test
</source>

<match **>
  @type forward
  <buffer>
    chunk_limit_size 1m
    flush_at_shutdown true
    flush_interval 1s
    path /fluentd/buffer/forward_graylog.*.buf
    total_limit_size 128m
    @type file
  </buffer>
  heartbeat_interval 1s
  keepalive true
  phi_threshold 16
  recover_wait 10s
  <security>
    self_hostname client
    shared_key sharedkeypassword
  </security>
  send_timeout 60s
  <server>
    host server
    password testpassword
    port 24284
    username testuser
  </server>
  tls_verify_hostname true
  tls_cert_path /fluentd/certs/rootCA.pem
  tls_insecure_mode false
  transport tls
</match>

Receiver (docker image fluentd:v1.9-1)

<source>
  @type forward
  bind 0.0.0.0
  port 24284
  <security>
    self_hostname server
    shared_key sharedkeypassword
    <user>
      password testpassword
      username testuser
    </user>
    user_auth true
  </security>
  <transport tls>
    cert_path /fluentd/certs/server.pem
    insecure false
    private_key_passphrase
    private_key_path /fluentd/certs/server-key.pem
  </transport>
</source>

<match test>
  @type null
</match>

For testing this I created a local CA and server certificate with mkcert and then used the following docker-compose.yml:

version: '2'
services:
  client:
    image: fluentd:v1.9-1
    command:
      - "-v"
    volumes:
      - "./client/fluent.conf:/fluentd/etc/fluent.conf"
      - "./client/certs:/fluentd/certs"
  server:
    image: fluentd:v1.9-1
    command:
      - "-v"
    volumes:
      - "./server/fluent.conf:/fluentd/etc/fluent.conf"
      - "./server/certs:/fluentd/certs"

After running for about 20 minutes on a MacBook I see a memory usage of around 200MB for the client container, and rising.
When I run this on a Linux server it only takes around 5-10 minutes to reach the same memory usage (I'm not sure why).
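
One way to separate buffer backlog from growth elsewhere (e.g. the socket/TLS layer) is to expose Fluentd's internal metrics on the sender; a minimal sketch, with bind and port values chosen arbitrarily:

<source>
  # expose plugin metrics over HTTP for inspection
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

Querying http://localhost:24220/api/plugins.json should then report per-plugin buffer metrics such as buffer_total_queued_size and retry_count; if those stay flat while the container's RSS keeps climbing, the growth is happening outside the buffer.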

4ndr4s commented Apr 12, 2020

I have a similar issue. After updating the fluentd container from 0.12.26 to 1.9.3 (alpine flavor), memory started growing some time after the container started and never stopped until OOM.

[screenshot: container memory usage graph, 2020-04-12]

This wasn't happening with the same configuration on the old version, so I can confirm @aeber's comment.

configuration:
fluentd version: 1.9.3

fluentd.conf:

<source>
  @type syslog
  port 514
  <transport udp>
  </transport>
  tag  system
  @log_level info
</source>

<source>
  @type syslog
  port 514
  <transport tcp>
  </transport>
  tag  system
</source>

<label @FLUENT_LOG>
  <match fluent.**>
    @type null
  </match>
</label>

<filter **>
  @type elasticsearch_genid
  hash_id_key alert_hash
</filter>

<match **>
  @type copy
  <store>
    @type file
    compress gzip
    path /var/log/fluent/service_logs
    <buffer time>
      timekey 86400
      timekey_use_utc true
    </buffer>
    append true
  </store>
  <store>
    @type elasticsearch
    host elasticsearch.example.com
    logstash_format true
    include_tag_key true
    tag_key key
    flush_interval 10s
    id_key alert_hash
    remove_keys alert_hash
  </store>
  <store>
    @type stdout
  </store>
</match>

<system>
  log_level warn
  emit_error_log_interval 300
  path /var/log/fluent/fluentd
</system>

repeatedly (Member) commented:
@4ndr4s Does this happen with the elasticsearch and file buffer combo?
Your graph shows it is not a memory leak. The issue author and you both use the memory buffer.
So if the incoming speed is faster than the output speed, memory usage grows.
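
To illustrate the point: with a memory buffer, unflushed chunks stay in RAM, so usage tracks the backlog. A sketch of how that backlog can be capped (the values are arbitrary examples, not a recommendation from this thread):

<buffer>
  @type memory
  flush_interval 2s
  flush_thread_count 2
  # cap how much data the buffer may hold in memory
  total_limit_size 64m
  # when the cap is hit: throw_exception (default), block, or drop_oldest_chunk
  overflow_action drop_oldest_chunk
</buffer>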

4ndr4s commented Apr 13, 2020

@repeatedly Yes, this happens with the elasticsearch and buffer combo configuration. I'm not sure whether the buffer configuration can influence this behavior; the only change I made besides the version was the buffer block, updated according to the fluentd documentation. This was the previous configuration:

  <store>
    @type file
    compress gzip
    path /var/log/fluent/al_service_logs
    time_slice_format %Y%m%d
    time_slice_wait 10m
    time_format %Y%m%dT%H%M%S%z
    append true
    utc
  </store>

4ndr4s commented Apr 13, 2020

@repeatedly On second thought, I'm not sure whether that applies to elasticsearch too. I'll enable logging, test the elasticsearch config alone, and let you know.

ganmacs removed their assignment Apr 14, 2020
aeber (Contributor) commented Apr 14, 2020

I can confirm that a current build from master @ cdf75b0 no longer leaks memory with the given configuration. It has been running for 30 minutes now and memory usage is stable at ~70MB.

Is there already a timeline for a fluentd / td-agent release covering this fix?

4ndr4s commented Apr 14, 2020

@repeatedly I already removed the buffer block and went back to the previous configuration, and the behavior is the same: memory starts growing until OOM. This wasn't happening in version 0.12.26 with the same configuration and the same incoming data.


<source>
  @type syslog
  port 514
  <transport udp>
  </transport>
  tag  system
  @log_level info
</source>

<source>
  @type syslog
  port 514
  <transport tcp>
  </transport>
  tag  system
</source>

<label @FLUENT_LOG>
  <match fluent.**>
    @type null
  </match>
</label>

<filter **>
  @type elasticsearch_genid
  hash_id_key alert_hash
</filter>

<match **>
  @type copy
  <store>
    @type file
    compress gzip
    path /var/log/fluent/al_service_logs
    time_slice_format %Y%m%d
    time_slice_wait 10m
    time_format %Y%m%dT%H%M%S%z
    append true
  </store>
  <store>
    @type elasticsearch
    host elasticsearch.example.com
    logstash_format true
    include_tag_key true
    tag_key key
    flush_interval 10s
    id_key alert_hash
    remove_keys alert_hash
  </store>
  <store>
    @type stdout
  </store>
</match>

<system>
  log_level warn
  emit_error_log_interval 300
  path /var/log/fluent/fluentd
</system>

4ndr4s commented Apr 15, 2020

@repeatedly I removed the @type file store block and kept only the elasticsearch and stdout stores; the issue is still happening, and memory usage continues growing until OOM.

<match **>
  @type copy
  <store>
    @type elasticsearch
    host elasticsearch.example.com
    logstash_format true
    include_tag_key true
    tag_key key
    flush_interval 10s
    id_key alert_hash
    remove_keys alert_hash
  </store>
  <store>
    @type stdout
  </store>
</match>

repeatedly (Member) commented:
@aeber v1.10.2 has been released.

@4ndr4s Your comment is unclear. You are using v0.12 with a file buffer, right?

4ndr4s commented Apr 15, 2020

@repeatedly I use v0.12 without a buffer and with the following configuration, using 3 output plugins.
I already tried these changes on version 1.9.3 (sketches of the buffer variants are shown below):

  • use the old configuration.
  • use a memory buffer with the file output plugin.
  • use a file buffer with the file output plugin.
  • remove the file output plugin.

I got OOM with all of those configurations.
I have already built a new container with version 1.10.2 and will let you know whether that helps with the current issue; I need to wait around 24 hours.
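
For clarity, a sketch of the two buffer variants listed above as applied to the file store; the values and the buffer path are illustrative only, not taken from the thread:

  <store>
    @type file
    compress gzip
    path /var/log/fluent/al_service_logs
    append true
    <buffer time>
      # variant 1: memory buffer
      @type memory
      # variant 2: file buffer instead (buffer path is a hypothetical example)
      # @type file
      # path /var/log/fluent/buffer/al_service_logs
      timekey 86400
      timekey_use_utc true
    </buffer>
  </store>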

fluentd v0.12 configuration:

<source>
  type syslog
  port 514
  protocol_type udp
  tag  system
  @log_level info
</source>

<source>
  type syslog
  port 514
  protocol_type tcp
  tag  system
</source>

<match fluent.**>
  @type null
</match>

<filter **>
  @type elasticsearch_genid
  hash_id_key alert_hash
</filter>

<match **>
  type copy
  <store>
    @type file
    compress gzip
    path /var/log/fluent/al_service_logs
    time_slice_format %Y%m%d
    time_slice_wait 10m
    time_format %Y%m%dT%H%M%S%z
    append true
    utc
  </store>
  <store>
    type elasticsearch
    host elasticsearch.example.com
    logstash_format true
    include_tag_key true
    tag_key key
    flush_interval 10s
    id_key alert_hash
    remove_keys alert_hash
  </store>
  <store>
    type stdout
  </store>
</match>

<system>
  log_level warn
  emit_error_log_interval 300
  path /var/log/fluent/fluentd
</system>

ganmacs (Member) commented Apr 16, 2020

@4ndr4s

I already made those changes using the 1.9.3 version:

Can you give us the complete config, then?

By the way, can you open a new issue? This issue is not related to yours. I'll close this issue later.

4ndr4s commented Apr 16, 2020

@ganmacs thanks for your answer. I just opened a new issue with all the details: #2955

ashie added the memory label Mar 15, 2022
douglasawh commented:
I am a bit confused reading this. It looks like #2945 is confirmed to fix the issue, and yet it remains open. Is this still an issue?

ashie (Member) commented May 11, 2022

I'm not sure, but I think there are 3 reports in this issue:

Although the above 3 reports are similar, their causes are probably all different.

The issue in the original report is fixed, so we should close this issue.

BTW, I recently investigated a similar issue to the second report, which is why I left this open:
fluent/fluent-package-builder#374
But now I've confirmed that it is also caused by a different root cause.

ashie closed this as completed May 11, 2022