promtail causing very high cpu load when running and stops #5350

Closed
waardd opened this issue Feb 9, 2022 · 32 comments · Fixed by #6135
Comments

@waardd

waardd commented Feb 9, 2022

On several of my servers I want to use Promtail like I did on a bunch of other machines.
But when I start Promtail on only the /var/log logs, the CPU load goes through the roof (like 15 to 20) and then Promtail stops.
It looks like kswapd is joining the party in being high on CPU when Promtail is running.

Is there anybody out there that has this same issue?

It's on RHEL7 by the way.

@dannykopping
Contributor

When reporting issues, please use the template we have in place so we can capture the relevant information about the bug.

This is the template:

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Started Loki (SHA or version)
2. Started Promtail (SHA or version) to tail '...'
3. Query: `{} term`

**Expected behavior**
A clear and concise description of what you expected to happen.

**Environment:**
 - Infrastructure: [e.g., Kubernetes, bare-metal, laptop]
 - Deployment tool: [e.g., helm, jsonnet]

**Screenshots, Promtail config, or terminal output**
If applicable, add any output to help explain your problem.

Please update your description using this template.

@liguozhong
Contributor

liguozhong commented Feb 10, 2022

I also have this same issue. In my production environment, the problem of too-high CPU is solved with this ‘limit_config’ configuration:

limit_config:
  readline_rate: 1000
  readline_burst: 2000
  readline_rate_enabled: true

See PR #5031.

@waardd
Author

waardd commented Feb 10, 2022

@dannykopping First, to fill in the template:

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki (SHA or version)
  2. Started Promtail (SHA or version) to tail '...'
  3. Query: {} term

Expected behavior
When starting Promtail on my RHEL7 machine, I was expecting a simple scan of the /var/log files, with the logs being sent to Loki.

Environment:

  • Infrastructure: 2 RHEL 7 machines, 1 with Promtail and 1 with Loki
    promtail, version 2.4.2 (branch: HEAD, revision: 525040a32) build user: root@5d9e7a4c92e6 build date: 2022-01-12T16:48:53Z go version: go1.16.2 platform: linux/amd64
    loki, version 2.2.1 (branch: HEAD, revision: babea82e) build user: root@e2d295b84e26 build date: 2021-04-06T00:52:41Z go version: go1.15.3 platform: linux/amd64
  • Deployment tool: binaries copied and run as service

Screenshots, Promtail config, or terminal output
promtail config:

# Promtail Server Config
server:
  http_listen_port: 9080
  grpc_listen_port: 0

# Positions
positions:
  filename: /tmp/positions.yaml

# Loki Server URL
clients:
  - url: http://MyIP:3100/loki/api/v1/push

scrape_configs:
- job_name: varlog
  static_configs:
    - targets:
        - localhost
      labels:
        job: varlog
        host: MyServer
        env: pat
        __path__: /var/log/*log

When running this, the load goes through the roof:
top - 07:11:33 up 9 days, 2:57, 3 users, load average: 20.30, 5.13, 2.08
Tasks: 423 total, 1 running, 422 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.1 us, 3.3 sy, 3.9 ni, 69.7 id, 19.8 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 15215400 total, 9242588 free, 5344188 used, 628624 buff/cache
KiB Swap: 4194296 total, 4812 free, 4189484 used. 9537196 avail Mem

And then Promtail stops.

As an answer to @liguozhong:
When I add

limit_config:
  readline_rate: 1000
  readline_burst: 2000
  readline_rate_enabled: true

I get the message:
[mca@MyServer:~] ./promtail-linux-amd64 -config.file promtail.yml
Unable to parse config: promtail.yml: yaml: unmarshal errors:
line 25: field limit_config not found in type main.Config

@dannykopping
Contributor

dannykopping commented Feb 10, 2022

Please include the version of Loki/Promtail you are using

@waardd
Author

waardd commented Feb 10, 2022

Added them to the message :)

@dannykopping
Contributor

Thanks 👍

Out of interest, why are you running Promtail 2.4.2 but with Loki 2.2.1?

@waardd
Author

waardd commented Feb 10, 2022

I tried several versions of Promtail to see if it made any difference, in case bugs or new code played a role.
This is the latest version I downloaded.

@dannykopping
Contributor

OK, I thought so. Thanks.

limit_config was not part of v2.4.2, I'm afraid. It'll be available in v2.5, which I'm sure will be out in the next couple of weeks.

Coming back to your original message:

It looks like kswapd is joining the party in being high on CPU when Promtail is running.

kswapd becomes active when your system is under memory pressure. If your system starts swapping, that can in turn put a lot of strain on the disk. When the disk is under pressure and Promtail needs to read from it, there can be a lot of I/O wait, which factors into the load.

%Cpu(s): 3.1 us, 3.3 sy, 3.9 ni, 69.7 id, 19.8 wa, 0.0 hi, 0.2 si, 0.0 st

The 19.8 wa seems to validate that.

wa, IO-wait : time waiting for I/O completion

Your system seems to have a lot of physical memory free, though:

KiB Mem : 15215400 total, 9242588 free, 5344188 used, 628624 buff/cache
KiB Swap: 4194296 total, 4812 free, 4189484 used. 9537196 avail Mem

So I'm not sure why it's swapping so much. You'll have to investigate that to find out why, or you can try disabling swap and re-enabling it, which will force the OS to move that swapped-out memory back into physical RAM.

I'm not yet convinced that this is an issue within promtail. Happy to re-evaluate if new data is presented.

@waardd
Author

waardd commented Feb 10, 2022

What I do notice is that other machines work fine with Promtail.
Indeed, the software that runs on these newly configured machines uses a lot of memory to do its work.
One would also expect that things would improve after a reboot, but they don't (it reboots about once every week or two).
The system I'm trying this on is a development machine which is under very low pressure from other software because it is barely used, so the stress added by other processes is about zero.

@dannykopping
Contributor

Please run `ps aux --sort=-%mem | head` and send the results while Promtail is in this state of high CPU and stalling.

@waardd
Author

waardd commented Feb 10, 2022

Just before Promtail stops, the result is:

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mca       33268 72.7 57.7 9843960 8785976 pts/2 Sl+  08:59   0:12 ./promtail-linux-amd64 -config.file promtail.yml
jboss    127126  0.5 14.6 8227324 2228144 ?     Ssl  Feb08  14:21 /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -D[Standalone] -verbose:gc -Xloggc:/<someapp>/standalone/log/gc.log
mca        8023  0.3  3.9 5840504 596492 ?      Sl   Feb01  52:50 /usr/lib/jvm/jre-11-openjdk/bin/java -Xms1990m -Xmx1990m -Dlog4j2.configurationFile=/<someapp>/log4j2.xml
mca      123392  0.5  2.1 4303564 325408 ?      Sl   Feb08  15:01 /usr/lib/jvm/jre-11-openjdk/bin/java -Xms512m -Xmx512m -Dlog4j2.configurationFile=/<someapp>/log4j2.xml -Dj
wsp        3888  0.1  1.5 3156444 236980 ?      SLl  Feb01  17:44 /opt/was85/as/java/bin/java -Dosgi.install.area=/opt/was85/as -Dwas.status.socket=42001 -Dosgi.configuration.area=/opt/
mca        9176  0.4  1.2 6149332 194592 ?      Sl   Feb01  58:43 /usr/lib/jvm/jre-11-openjdk/bin/java -Xms1990m -Xmx1990m -Dlog4j2.configurationFile=/<someapp>log4j2.xml -Dj
root       2241  0.9  1.1 1361644 177164 ?      Sl   Feb01 126:16 /opt/microsoft/mdatp/sbin/wdavdaemon edr 12 11 3
mca        1388  0.1  1.0 7086124 162480 ?      Sl   Feb01  14:32 /usr/bin/java -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts -jar swarm-client.jar -master https://jenkins.windows.ra
splunk     3041  0.3  0.8 275732 122168 ?       Sl   Feb01  45:13 splunkd -p 8089 start

@dannykopping
Contributor

9MB of RAM is basically nothing.

I don't think it's promtail that's causing your system to swap.
It's really hard to figure out what's going wrong here without access to your system.

I'm pretty sure this is a swap -> disk saturation -> I/O load problem.

@waardd
Author

waardd commented Feb 10, 2022

swapon -s gives the following:

Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       2097148 2074808 -2
/dev/dm-14                              partition       2097148 2091864 -3

@waardd
Author

waardd commented Feb 14, 2022

We are creating new RHEL8 virtual machines for our applications.
I'm trying to run Promtail on one of them with the minimal config shown above.
There is no other application running.
And when I start Promtail, I see it eating my memory until it gets killed.

[waardd1.eu@MyServer ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       462Mi        14Gi       0.0Ki       411Mi        14Gi
Swap:         4.0Gi       308Mi       3.7Gi

[waardd1.eu@MyServer ~]$ swapon -s
Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       4194300 315580  -2

Within 1 minute it uses up all the memory.
I also see that in the node-exporter metrics, the other exporter running on the machine.

@stale

stale bot commented Apr 17, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort closed issues that have a stale label by thumbs-up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

The stale bot added the stale label on Apr 17, 2022.
@waardd
Author

waardd commented Apr 17, 2022

The issue was solved with the latest version of promtail that I installed yesterday.
No more high CPU

The stale bot removed the stale label on Apr 17, 2022.
@raypet-cillco

This just happened to me as well (2.4.2), with *log being the only thing I was grabbing. Trying the upgrade advice, thanks for the tip @waardd

@misiek08

The issue was solved with the latest version of promtail that I installed yesterday. No more high CPU

From which version to which version did it help? We are observing 15 cores of CPU usage under high log traffic.

@raypet-cillco

I can also confirm that the high CPU went away with a newer version. Something was off with 2.4.2.

@waardd
Author

waardd commented Apr 25, 2022

The issue was solved with the latest version of promtail that I installed yesterday. No more high CPU

From which version to which version did it help? We are observing 15 cores of CPU usage under high log traffic.

The older version we tried was 2.4.2.
It got resolved when we used 2.5.0.

@tcolgate
Contributor

tcolgate commented May 5, 2022

I've been hitting this: very high CPU on one particular node, with no clear reason.
I checked what the process was actually doing with strace and there were a lot of restarted syscalls (mostly newfstatat, and some read). I thought I'd try running with GODEBUG=asyncpreemptoff=1, and so far this seems to have resolved the issue, though I'll need to monitor for a bit longer to confirm.
Unfortunately, the problem came back.

@epetrovich

epetrovich commented May 6, 2022

I have a similar issue. A bunch of EC2 machines in AWS sit at 100% CPU because of Promtail. Promtail is working and sending logs.
The machines are fresh, and the number of log files is 2-3.

We are getting rid of promtail/loki. Sorry. No production usage.

promtail --version promtail, version 2.5.0 (branch: HEAD, revision: 2d9d0ee23) build user: root@4779f4b48f3a build date: 2022-04-07T21:50:00Z go version: go1.17.6 platform: linux/amd64

[Screenshot attached: Screen Shot 2022-05-06 at 10 15 45 AM]

Here is my config:

server:
  http_listen_port: 3100
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://SERVER_HOSTNAME_XXX:3100/loki/api/v1/push

scrape_configs:
  - job_name: ec2-logs
    ec2_sd_configs:
      - region: us-east-1
        access_key: XXX
        secret_key: YYY
    relabel_configs:
      - action: replace
        replacement: /home/mistia/mistia/**.log
        target_label: __path__
      - source_labels: [__meta_ec2_tag_Name]
        regex: "(.*)"
        replacement: "$1.staging.goconsensus.internal"
        target_label: host
      - action: labelmap
        regex: "__meta_ec2_(.+)"
        replacement: "$1"
  - job_name: ec2-logs-core
    ec2_sd_configs:
      - region: us-east-1
        access_key: XXX
        secret_key: YYY
    relabel_configs:
      - action: replace
        replacement: /home/mistia-core/**.log
        target_label: __path__
      - source_labels: [__meta_ec2_tag_Name]
        regex: "(.*)"
        replacement: "$1.staging.goconsensus.internal"
        target_label: host
      - action: labelmap
        regex: "__meta_ec2_(.+)"
        replacement: "$1"

@tcolgate
Contributor

tcolgate commented May 6, 2022

My machine is a 4-core GCP instance, with promtail running in a container. I have about 200 such instances, and so far I am only seeing this on 1 machine. promtail is maxing out all cores. The profile suggests it's just a golang select. Could this possibly be a loop over a closed channel?
The only "special" thing about the node is that it runs our CI jobs. That said, at this time the node has only been up for 12 hours and has only ever run about 60 pods (it's a k8s cluster). We have been happily running promtail 2.3 for a while; this started with an update to 2.5.

[image attached]

@Mario-Hofstaetter
Contributor

Mario-Hofstaetter commented May 6, 2022

I am currently facing the same issue, running Promtail 2.5.0 on Windows machines.
The high CPU load starts suddenly and disappears after restarting Promtail.

@tcolgate
Contributor

tcolgate commented May 9, 2022

I had a look with perf-trace to see if I could get some more details on what is happening.

  • Initially the process starts, starts watching a few files, and all is well.
  • One of our workflows starts, promtail sees the new files, starts watching, all is well.
  • The new workflow finishes... at this point all the CPUs on the machine max out.
  • promtail logs that it is shutting down the watches and file tailing.

The machine runs at 100% until promtail is restarted. There's no large volume of logs. perf trace shows a continuous stream of newfstatat calls. Using perf trace -p NNNN --call-graph fp -e syscalls:sys_enter_newfstatat shows the following details for the calls.

   920.450 promtail/277361 syscalls:sys_enter_newfstatat:dfd: CWD, filename: "", statbuf: 0xc00253bbd8
                                       syscall.Syscall6.abi0 (/usr/bin/promtail)
                                       os.statNolog (/usr/bin/promtail)
                                       os.Stat (/usr/bin/promtail)
                                       github.com/hpcloud/tail/watch.(*PollingFileWatcher).ChangeEvents.func1 (/usr/bin/promtail)
                                       runtime.goexit.abi0 (/usr/bin/promtail)

Initially CWD was set to /; I tried updating it to /tmp but it hasn't changed any of the behaviour. The filename = "" is deeply suspicious, especially since it is relative to CWD.
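
For anyone reading along: judging by the stack above, in polling mode the tailer checks each tracked file by stat'ing it on a short interval, and on Linux os.Stat ends up as the newfstatat syscall, which is why the trace is a continuous stream of those calls. Below is a rough sketch of that polling pattern in Go (illustrative only, not hpcloud/tail's actual code; the path, interval, and function name are made up):

package main

import (
	"fmt"
	"os"
	"time"
)

// Rough sketch of a poll-based file watcher (illustrative only). Each tracked
// file is stat'ed on a fixed interval; on Linux, os.Stat is the newfstatat
// syscall, so a watcher stuck in this loop shows up in perf/strace as a
// steady stream of newfstatat calls.
func pollForChanges(path string, interval time.Duration, changed chan<- string) {
	var lastSize int64 = -1
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		info, err := os.Stat(path) // -> newfstatat on Linux
		if err != nil {
			continue // the file may not exist yet or may have been rotated away
		}
		if info.Size() != lastSize {
			lastSize = info.Size()
			changed <- path
		}
	}
}

func main() {
	changed := make(chan string)
	go pollForChanges("/var/log/messages", 250*time.Millisecond, changed)
	for path := range changed {
		fmt.Println("changed:", path)
	}
}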

@misiek08

misiek08 commented May 9, 2022

Maybe on Windows, macOS, and Linux we can set Poll to false? It looks like hpcloud/tail uses fsnotify, which supports push-based notifications instead of pull-based polling, so it should work much better.
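
To make that suggestion concrete, here is a minimal standalone sketch of hpcloud/tail with Poll set to false, so it relies on inotify/fsnotify events instead of periodic stat calls (this is only a sketch of the library's API, not how Promtail wires it up internally; the file path is just an example):

package main

import (
	"fmt"
	"log"

	"github.com/hpcloud/tail"
)

// Minimal standalone use of hpcloud/tail with polling disabled, so the
// library uses inotify/fsnotify events instead of periodic os.Stat calls.
// Illustrative only; not Promtail's actual wiring.
func main() {
	t, err := tail.TailFile("/var/log/messages", tail.Config{
		Follow: true,  // keep reading as the file grows
		ReOpen: true,  // reopen the file after log rotation
		Poll:   false, // push-based notifications instead of stat polling
	})
	if err != nil {
		log.Fatal(err)
	}
	for line := range t.Lines {
		fmt.Println(line.Text)
	}
}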

@rabbipigeon

Any updates on this? I have updated Promtail to the latest version and it is still maxing out CPU cores.

@Mario-Hofstaetter
Contributor

Promtail just nuked another one of our machines; this is an urgent issue 😢

@tcolgate
Contributor

tcolgate commented May 10, 2022

I'm getting a little closer. When one of our workflows stops, the file watcher begins to loop, continuously receiving nil events from the file-watcher channel. This code here:
https://github.com/grafana/loki/blob/main/clients/pkg/promtail/targets/file/filetarget.go#L171-L172

The problem is that the file events channel gets closed, so the loop repeatedly reads nil values from it (it's not a nil channel, it's a closed channel). It's not obvious to me why this only triggers in very specific circumstances. In our case I think it's the updating of annotations on a pod, causing it to be rediscovered during the shutdown process.

I suspect the following:

  1. promtail starts following a target path P with set of annotations A
  2. the task writing to path finishes, annotations get updated to B
  3. promtail stops the original watcher for P[A]
  4. promtail starts watcher for P[B]
  5. promtail closes file events channel for watchers[P], which the new watcher is listening on

The PR below (#6135) fixes the issue in our case.
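
To spell out why this pins the CPU: a receive from a closed channel never blocks, it completes immediately with the zero value, so a select loop that keeps reading from a closed channel spins flat out. A minimal self-contained reproduction of that pattern (illustrative only, not Promtail's actual code):

package main

import (
	"fmt"
	"time"
)

// Minimal reproduction of the busy-loop pattern described above (illustrative
// only, not Promtail's code). Once a channel is closed, every receive on it
// completes immediately with the zero value, so a select that keeps reading
// from it never blocks and pins a CPU core.
func main() {
	events := make(chan string) // stands in for the shared file-events channel
	close(events)               // e.g. the old watcher tears it down while another reader still selects on it

	stop := time.After(100 * time.Millisecond)
	spins := 0
	for {
		select {
		case ev := <-events: // buggy: no ok-check, so ev is always "" once the channel is closed
			_ = ev
			spins++
		case <-stop:
			fmt.Printf("looped %d times in 100 ms on a closed channel\n", spins)
			return
		}
	}
	// The fix is a comma-ok receive (ev, ok := <-events; if !ok { bail out or
	// re-create the watcher }) or never closing a channel that a live reader
	// still selects on.
}

On a typical machine this prints a count in the hundreds of thousands or more for just 100 ms of looping, which lines up with promtail suddenly maxing out cores once a watcher hits this state.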

@yongzhang

Any workaround before the new version is released?

@tcolgate
Contributor

tcolgate commented Jul 5, 2022

Not that I know of, I had to run a build with patches applied.

@tablatronix

tablatronix commented Sep 26, 2022

FYI, for anyone coming across this: the syntax has changed, it is now limits_config.

promtail           | Unable to parse config: /promtail/promtail-config.yaml: yaml: unmarshal errors:
promtail           |   line 58: field limit_config not found in type main.Config
