promtail causing very high cpu load when running and stops #5350

Closed
waardd opened this issue Feb 9, 2022 · 32 comments · Fixed by #6135
Comments

@waardd

waardd commented Feb 9, 2022

On several of my servers I want to use Promtail like I did on a bunch of other machines.
But when I start Promtail on only the /var/log logs, the CPU load goes through the roof (like 15 to 20) and then Promtail stops.
It looks like kswapd is joining the party in being high on CPU when Promtail is running.

Is there anybody out there that has this same issue?

It's on RHEL7 by the way.

@dannykopping
Contributor

When reporting issues, please use the template we have in place so we can capture the relevant information about the bug.

This is the template:

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Started Loki (SHA or version)
2. Started Promtail (SHA or version) to tail '...'
3. Query: `{} term`

**Expected behavior**
A clear and concise description of what you expected to happen.

**Environment:**
 - Infrastructure: [e.g., Kubernetes, bare-metal, laptop]
 - Deployment tool: [e.g., helm, jsonnet]

**Screenshots, Promtail config, or terminal output**
If applicable, add any output to help explain your problem.

Please update your description using this template.

@liguozhong
Contributor

liguozhong commented Feb 10, 2022

I also have this same issue. In my production environment, the problem of too-high CPU is solved with this ‘limit_config’ configuration:

limit_config:
  readline_rate: 1000
  readline_burst: 2000
  readline_rate_enabled: true

See PR #5031.

@waardd
Author

waardd commented Feb 10, 2022

@dannykopping First, to fill in the template:

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki (SHA or version)
  2. Started Promtail (SHA or version) to tail '...'
  3. Query: {} term

Expected behavior
When starting Promtail on my RHEL7 machine, I was expecting a simple scan of the /var/log files, with the logs being sent to Loki.

Environment:

  • Infrastructure: 2 RHEL 7 machines, 1 with Promtail and 1 with Loki
    promtail, version 2.4.2 (branch: HEAD, revision: 525040a32) build user: root@5d9e7a4c92e6 build date: 2022-01-12T16:48:53Z go version: go1.16.2 platform: linux/amd64
    loki, version 2.2.1 (branch: HEAD, revision: babea82e) build user: root@e2d295b84e26 build date: 2021-04-06T00:52:41Z go version: go1.15.3 platform: linux/amd64
  • Deployment tool: binaries copied and run as service

Screenshots, Promtail config, or terminal output
promtail config:

# Promtail Server Config
server:
  http_listen_port: 9080
  grpc_listen_port: 0

# Positions
positions:
  filename: /tmp/positions.yaml

# Loki Server URL
clients:
  - url: http://MyIP:3100/loki/api/v1/push

scrape_configs:
- job_name: varlog
  static_configs:
    - targets:
        - localhost
      labels:
        job: varlog
        host: MyServer
        env: pat
        __path__: /var/log/*log

When running this, the load goes through the roof:
top - 07:11:33 up 9 days, 2:57, 3 users, load average: 20.30, 5.13, 2.08
Tasks: 423 total, 1 running, 422 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.1 us, 3.3 sy, 3.9 ni, 69.7 id, 19.8 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 15215400 total, 9242588 free, 5344188 used, 628624 buff/cache
KiB Swap: 4194296 total, 4812 free, 4189484 used. 9537196 avail Mem

And then Promtail stops.

As an answer to @liguozhong:
When I add

limit_config:
  readline_rate: 1000
  readline_burst: 2000
  readline_rate_enabled: true

I get the message:
[mca@MyServer:~] ./promtail-linux-amd64 -config.file promtail.yml
Unable to parse config: promtail.yml: yaml: unmarshal errors:
line 25: field limit_config not found in type main.Config

@dannykopping
Contributor

dannykopping commented Feb 10, 2022

Please include the version of Loki/Promtail you are using

@waardd
Author

waardd commented Feb 10, 2022

Added them to the message :)

@dannykopping
Contributor

Thanks 👍

Out of interest, why are you running Promtail 2.4.2 but with Loki 2.2.1?

@waardd
Author

waardd commented Feb 10, 2022

I tried several versions of Promtail to see if it made any difference, in case bugs or new code played a role.
This is the latest version I downloaded.

@dannykopping
Contributor

OK, I thought so. Thanks.

limit_config was not part of v2.4.2, I'm afraid. It'll be available in v2.5, which I'm sure will be out in the next couple of weeks.

Coming back to your original message:

It looks like kswapd is joining the party in being high on CPU when Promtail is running.

kswapd becomes active when your system is under memory pressure. If your system starts swapping, that can in turn put a lot of strain on the disk. When the disk is under pressure and Promtail needs to read from it, there can be a lot of I/O wait, which factors into the load.

%Cpu(s): 3.1 us, 3.3 sy, 3.9 ni, 69.7 id, 19.8 wa, 0.0 hi, 0.2 si, 0.0 st

The 19.8 wa seems to validate that.

wa, IO-wait : time waiting for I/O completion

Your system seems to have a lot of physical memory free, though:

KiB Mem : 15215400 total, 9242588 free, 5344188 used, 628624 buff/cache
KiB Swap: 4194296 total, 4812 free, 4189484 used. 9537196 avail Mem

So I'm not sure why it's swapping so much. You'll have to investigate that to find out why, or you can try disabling swap and re-enabling it, which will force the OS to move that swapped-out memory back into physical RAM.

I'm not yet convinced that this is an issue within promtail. Happy to re-evaluate if new data is presented.

@waardd
Author

waardd commented Feb 10, 2022

What I do notice is that other machines work fine with Promtail.
Indeed, the software that runs on these newly configured machines uses a lot of memory to do its work.
One would also expect that things would improve after a reboot, but they don't (it reboots about once every week or two).
The system I'm trying this on is a development machine which is under very low pressure from other software because it is barely used, so the stress added by other processes is about zero.

@dannykopping
Contributor

Please run `ps aux --sort=-%mem | head` and send the results while Promtail is in this state of high CPU and stalling.

@waardd
Author

waardd commented Feb 10, 2022

Just before Promtail stops, the result is:

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mca       33268 72.7 57.7 9843960 8785976 pts/2 Sl+  08:59   0:12 ./promtail-linux-amd64 -config.file promtail.yml
jboss    127126  0.5 14.6 8227324 2228144 ?     Ssl  Feb08  14:21 /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -D[Standalone] -verbose:gc -Xloggc:/<someapp>/standalone/log/gc.log
mca        8023  0.3  3.9 5840504 596492 ?      Sl   Feb01  52:50 /usr/lib/jvm/jre-11-openjdk/bin/java -Xms1990m -Xmx1990m -Dlog4j2.configurationFile=/<someapp>/log4j2.xml
mca      123392  0.5  2.1 4303564 325408 ?      Sl   Feb08  15:01 /usr/lib/jvm/jre-11-openjdk/bin/java -Xms512m -Xmx512m -Dlog4j2.configurationFile=/<someapp>/log4j2.xml -Dj
wsp        3888  0.1  1.5 3156444 236980 ?      SLl  Feb01  17:44 /opt/was85/as/java/bin/java -Dosgi.install.area=/opt/was85/as -Dwas.status.socket=42001 -Dosgi.configuration.area=/opt/
mca        9176  0.4  1.2 6149332 194592 ?      Sl   Feb01  58:43 /usr/lib/jvm/jre-11-openjdk/bin/java -Xms1990m -Xmx1990m -Dlog4j2.configurationFile=/<someapp>log4j2.xml -Dj
root       2241  0.9  1.1 1361644 177164 ?      Sl   Feb01 126:16 /opt/microsoft/mdatp/sbin/wdavdaemon edr 12 11 3
mca        1388  0.1  1.0 7086124 162480 ?      Sl   Feb01  14:32 /usr/bin/java -Djavax.net.ssl.trustStore=/etc/pki/java/cacerts -jar swarm-client.jar -master https://jenkins.windows.ra
splunk     3041  0.3  0.8 275732 122168 ?       Sl   Feb01  45:13 splunkd -p 8089 start

@dannykopping
Contributor

9MB of RAM is basically nothing.

I don't think it's promtail that's causing your system to swap.
It's really hard to figure out what's going wrong here without access to your system.

I'm pretty sure this is a swap -> disk saturation -> I/O load problem.

@waardd
Author

waardd commented Feb 10, 2022

swapon -s gives the following:

Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       2097148 2074808 -2
/dev/dm-14                              partition       2097148 2091864 -3

@waardd
Author

waardd commented Feb 14, 2022

We are creating new RHEL8 virtual machines for our applications.
I'm trying to run Promtail on one of them with the minimal config shown above.
There is no other application running.
And when I start Promtail, I see it eating my memory until it gets killed.

[waardd1.eu@MyServer ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       462Mi        14Gi       0.0Ki       411Mi        14Gi
Swap:         4.0Gi       308Mi       3.7Gi

[waardd1.eu@MyServer ~]$ swapon -s
Filename                                Type            Size    Used    Priority
/dev/dm-1                               partition       4194300 315580  -2

Within 1 minute it uses up all the memory.
I also see that in the node-exporter metrics, the other exporter running on the machine.

@stale

stale bot commented Apr 17, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort closed issues that have a stale label by thumbs-up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

The stale bot added the stale label on Apr 17, 2022.
@waardd
Author

waardd commented Apr 17, 2022

The issue was solved with the latest version of promtail that I installed yesterday.
No more high CPU

The stale bot removed the stale label on Apr 17, 2022.
@raypet-cillco

This just happened to me as well (2.4.2), with *log being the only thing I was grabbing. Trying the upgrade advice, thanks for the tip @waardd

@misiek08

The issue was solved with the latest version of promtail that I installed yesterday. No more high CPU

From which version to which version did it help? We are observing 15 cores of CPU usage under high log traffic.

@raypet-cillco

I can also confirm that the high CPU went away with a newer version. Something was off with 2.4.2.

@waardd
Author

waardd commented Apr 25, 2022

The issue was solved with the latest version of promtail that I installed yesterday. No more high CPU

From which version to which version did it help? We are observing 15 cores of CPU usage under high log traffic.

The older version we tried was 2.4.2.
It got resolved when we used 2.5.0.

@tcolgate
Contributor

tcolgate commented May 5, 2022

I've been hitting this: very high CPU on one particular node, with no clear reason.
I checked what the process was actually doing with strace and there were a lot of restarted syscalls (mostly newfstatat, and some read). I thought I'd try running with GODEBUG=asyncpreemptoff=1, and so far this seems to have resolved the issue, though I'll need to monitor for a bit longer to confirm.
Unfortunately, the problem came back.

@epetrovich

epetrovich commented May 6, 2022

I have a similar issue. A bunch of EC2 machines in AWS sit at 100% CPU because of Promtail. Promtail is working and sending logs.
The machines are fresh, and the number of log files is 2-3.

We are getting rid of promtail/loki. Sorry. No production usage.

promtail --version promtail, version 2.5.0 (branch: HEAD, revision: 2d9d0ee23) build user: root@4779f4b48f3a build date: 2022-04-07T21:50:00Z go version: go1.17.6 platform: linux/amd64

[Screenshot attached: Screen Shot 2022-05-06 at 10 15 45 AM]

Here is my config:

server:
  http_listen_port: 3100
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://SERVER_HOSTNAME_XXX:3100/loki/api/v1/push

scrape_configs:
  - job_name: ec2-logs
    ec2_sd_configs:
      - region: us-east-1
        access_key: XXX
        secret_key: YYY
    relabel_configs:
      - action: replace
        replacement: /home/mistia/mistia/**.log
        target_label: __path__
      - source_labels: [__meta_ec2_tag_Name]
        regex: "(.*)"
        replacement: "$1.staging.goconsensus.internal"
        target_label: host
      - action: labelmap
        regex: "__meta_ec2_(.+)"
        replacement: "$1"
  - job_name: ec2-logs-core
    ec2_sd_configs:
      - region: us-east-1
        access_key: XXX
        secret_key: YYY
    relabel_configs:
      - action: replace
        replacement: /home/mistia-core/**.log
        target_label: __path__
      - source_labels: [__meta_ec2_tag_Name]
        regex: "(.*)"
        replacement: "$1.staging.goconsensus.internal"
        target_label: host
      - action: labelmap
        regex: "__meta_ec2_(.+)"
        replacement: "$1"

@tcolgate
Contributor

tcolgate commented May 6, 2022

My machine is a 4-core GCP instance, with promtail running in a container. I have about 200 such instances, and so far I am only seeing this on 1 machine. promtail is maxing out all cores. The profile suggests it's just a golang select. Could this possibly be a loop over a closed channel?
The only "special" thing about the node is that it runs our CI jobs. That said, at this time the node has only been up for 12 hours and has only ever run about 60 pods (it's a k8s cluster). We have been happily running promtail 2.3 for a while; this started with an update to 2.5.

[image attached]

@Mario-Hofstaetter
Contributor

Mario-Hofstaetter commented May 6, 2022

I am currently facing the same issue, running Promtail 2.5.0 on Windows machines.
The high CPU load starts suddenly and disappears after restarting Promtail.

@tcolgate
Contributor

tcolgate commented May 9, 2022

I had a look with perf-trace to see if I could get some more details on what is happening.

  • Initially the process starts, starts watching a few files, and all is well.
  • One of our workflows starts, promtail sees the new files, starts watching, all is well.
  • The new workflow finishes... at this point all the CPUs on the machine max out.
  • promtail logs that it is shutting down the watches and file tailing.

The machine runs at 100% until promtail is restarted. There's no large volume of logs. perf trace shows a continuous stream of newfstatat calls. Using perf trace -p NNNN --call-graph fp -e syscalls:sys_enter_newfstatat shows the following details for the calls.

   920.450 promtail/277361 syscalls:sys_enter_newfstatat:dfd: CWD, filename: "", statbuf: 0xc00253bbd8
                                       syscall.Syscall6.abi0 (/usr/bin/promtail)
                                       os.statNolog (/usr/bin/promtail)
                                       os.Stat (/usr/bin/promtail)
                                       github.com/hpcloud/tail/watch.(*PollingFileWatcher).ChangeEvents.func1 (/usr/bin/promtail)
                                       runtime.goexit.abi0 (/usr/bin/promtail)

Initially CWD was set to /; I tried updating it to /tmp but it hasn't changed any of the behaviour. The filename = "" is deeply suspicious, especially since it is relative to CWD.
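
For anyone reading along: judging by the stack above, in polling mode the tailer checks each tracked file by stat'ing it on a short interval, and on Linux os.Stat ends up as the newfstatat syscall, which is why the trace is a continuous stream of those calls. Below is a rough sketch of that polling pattern in Go (illustrative only, not hpcloud/tail's actual code; the path, interval, and function name are made up):

package main

import (
	"fmt"
	"os"
	"time"
)

// Rough sketch of a poll-based file watcher (illustrative only). Each tracked
// file is stat'ed on a fixed interval; on Linux, os.Stat is the newfstatat
// syscall, so a watcher stuck in this loop shows up in perf/strace as a
// steady stream of newfstatat calls.
func pollForChanges(path string, interval time.Duration, changed chan<- string) {
	var lastSize int64 = -1
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		info, err := os.Stat(path) // -> newfstatat on Linux
		if err != nil {
			continue // the file may not exist yet or may have been rotated away
		}
		if info.Size() != lastSize {
			lastSize = info.Size()
			changed <- path
		}
	}
}

func main() {
	changed := make(chan string)
	go pollForChanges("/var/log/messages", 250*time.Millisecond, changed)
	for path := range changed {
		fmt.Println("changed:", path)
	}
}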

@misiek08

misiek08 commented May 9, 2022

Maybe on Windows, macOS, and Linux we can set Poll to false? It looks like hpcloud/tail uses fsnotify, which supports push-based notifications instead of pull-based polling, so it should work much better.
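
To make that suggestion concrete, here is a minimal standalone sketch of hpcloud/tail with Poll set to false, so it relies on inotify/fsnotify events instead of periodic stat calls (this is only a sketch of the library's API, not how Promtail wires it up internally; the file path is just an example):

package main

import (
	"fmt"
	"log"

	"github.com/hpcloud/tail"
)

// Minimal standalone use of hpcloud/tail with polling disabled, so the
// library uses inotify/fsnotify events instead of periodic os.Stat calls.
// Illustrative only; not Promtail's actual wiring.
func main() {
	t, err := tail.TailFile("/var/log/messages", tail.Config{
		Follow: true,  // keep reading as the file grows
		ReOpen: true,  // reopen the file after log rotation
		Poll:   false, // push-based notifications instead of stat polling
	})
	if err != nil {
		log.Fatal(err)
	}
	for line := range t.Lines {
		fmt.Println(line.Text)
	}
}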

@rabbipigeon

Any updates on this? I have updated Promtail to the latest version and it is still maxing out CPU cores.

@Mario-Hofstaetter
Contributor

Promtail just nuked another one of our machines; this is an urgent issue 😢

@tcolgate
Contributor

tcolgate commented May 10, 2022

I'm getting a little closer. When one of our workflows stops, the file watcher begins to loop, continuously receiving nil events from the file-watcher channel. This code here:
https://github.com/grafana/loki/blob/main/clients/pkg/promtail/targets/file/filetarget.go#L171-L172

The problem is that the file events channel gets closed, so the loop repeatedly reads nil values from it (it's not a nil channel, it's a closed channel). It's not obvious to me why this only triggers in very specific circumstances. In our case I think it's the updating of annotations on a pod, causing it to be rediscovered during the shutdown process.

I suspect the following:

  1. promtail starts following a target path P with set of annotations A
  2. the task writing to path finishes, annotations get updated to B
  3. promtail stops the original watcher for P[A]
  4. promtail starts watcher for P[B]
  5. promtail closes file events channel for watchers[P], which the new watcher is listening on

The PR below (#6135) fixes the issue in our case.
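
To spell out why this pins the CPU: a receive from a closed channel never blocks, it completes immediately with the zero value, so a select loop that keeps reading from a closed channel spins flat out. A minimal self-contained reproduction of that pattern (illustrative only, not Promtail's actual code):

package main

import (
	"fmt"
	"time"
)

// Minimal reproduction of the busy-loop pattern described above (illustrative
// only, not Promtail's code). Once a channel is closed, every receive on it
// completes immediately with the zero value, so a select that keeps reading
// from it never blocks and pins a CPU core.
func main() {
	events := make(chan string) // stands in for the shared file-events channel
	close(events)               // e.g. the old watcher tears it down while another reader still selects on it

	stop := time.After(100 * time.Millisecond)
	spins := 0
	for {
		select {
		case ev := <-events: // buggy: no ok-check, so ev is always "" once the channel is closed
			_ = ev
			spins++
		case <-stop:
			fmt.Printf("looped %d times in 100 ms on a closed channel\n", spins)
			return
		}
	}
	// The fix is a comma-ok receive (ev, ok := <-events; if !ok { bail out or
	// re-create the watcher }) or never closing a channel that a live reader
	// still selects on.
}

On a typical machine this prints a count in the hundreds of thousands or more for just 100 ms of looping, which lines up with promtail suddenly maxing out cores once a watcher hits this state.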

@yongzhang

Any workaround before the new version is released?

@tcolgate
Contributor

tcolgate commented Jul 5, 2022

Not that I know of, I had to run a build with patches applied.

@tablatronix

tablatronix commented Sep 26, 2022

FYI, for anyone coming across this: the syntax has changed, it is now limits_config.

promtail           | Unable to parse config: /promtail/promtail-config.yaml: yaml: unmarshal errors:
promtail           |   line 58: field limit_config not found in type main.Config
