bug: high CPU usage after prometheus.lua reports 'no memory' #10000
Comments
from the log:
the service monitor config
So, if the response is too slow (maybe the metrics data is too huge?), it will time out and the line in the chart is lost. If the prometheus privileged process keeps receiving the …
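(The ServiceMonitor config block referenced above did not survive in this thread. Purely as an illustration of where the scrape timeout comes in, a Prometheus Operator ServiceMonitor for an APISIX service might look like the sketch below; every name, label, namespace, port, and both 30s values are assumptions, not the reporter's actual config.)

```yaml
# Hypothetical ServiceMonitor sketch (not the reporter's config).
# If /metrics takes longer than scrapeTimeout, the scrape fails and the
# corresponding points are missing from the chart.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: apisix                         # assumed name
  namespace: monitoring                # assumed namespace
spec:
  namespaceSelector:
    matchNames:
      - apisix                         # assumed namespace of the APISIX service
  selector:
    matchLabels:
      app.kubernetes.io/name: apisix   # assumed service label
  endpoints:
    - port: metrics                    # assumed service port name exposing /metrics
      path: /metrics
      interval: 30s                    # assumed scrape interval
      scrapeTimeout: 30s               # matches the "no response in 30 seconds" observation
```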
related etcd issue: #7353
@wklken This was identified as an etcd issue and was fixed in etcd release 3.5.5: etcd-io/etcd#14138
@Revolyssup
so I'm not sure whether the high CPU usage caused by the huge metrics is related to #7345
For the 'no memory' issue, can you increase the size of the shared_dict? It keeps retrying when out of memory, and the CPU usage might be correlated with that.
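A minimal sketch of what increasing that dict might look like, assuming the `conf/config.yaml` in use exposes the `prometheus-metrics` shared dict under `nginx_config.http.lua_shared_dict` (the 50m value is only an example, not a recommendation):

```yaml
# conf/config.yaml -- sketch only; assumes this key path is available in your
# APISIX version. The reporter's dict was 10m; 50m is an arbitrary example.
nginx_config:
  http:
    lua_shared_dict:
      prometheus-metrics: 50m
```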
It's not the memory issue either; we already changed the shared_dict size. You can reproduce it with a curl loop against /metrics. Could it be that the prometheus plugin needs a lot of CPU to dump (or calculate) the data from memory into the response? Currently, I have to increase the limits and patch a trigger to disable the official prometheus metrics.
@wklken The performance itself is not something we can fix. But, as many users have asked, the built-in metrics need to be configurable (not hardcoded), because some metrics define labels whose values vary greatly in some cases, which makes the number of metric variants grow a lot. We do not delete outdated metrics, so every time Prometheus pulls, the CPU usage increases a lot. We always advise users to customize the metrics to what they actually need when using the prometheus plugin. So in this case, the patch you have is the only solution. You can close this issue if this answers your questions.
OK, understood! #9673 makes it possible to decrease the number of metrics. Closing; thanks for your response.
Current Behavior

We set the `prometheus-metrics` shared dict to 10m. After being deployed online (1 deployment, 8 pods) for about 7 days, the memory of each pod was exhausted one by one. We took a look at each pod that was killed because the CPU hit the resource limit.

From the Grafana dashboard and the error log: when the metrics chart loses data, it means `/metrics` did not respond within 30 seconds; maybe the response is huge? The error log only has `no memory` entries, and after a few hours the APISIX container hits the CPU limit and is restarted.

Before the container hits the CPU limit, it has been reporting `no memory` for a few hours. We have many other environments; whenever an environment has restarts, we redeploy APISIX, and the restarts never happen before prometheus reports `no memory`.

Expected Behavior
No high CPU usage even when prometheus.lua reports 'no memory'.
Error Logs
[error] 76#76: *62508500 [lua] prometheus.lua:920: log_error(): Error while setting 'etcd_modify_indexes{key="x_etcd_index"}' to '98387': 'no memory', client: 1.1.1.1, server: , request: "GET /metrics HTTP/1.1", host: "0.0.0.0:6008"
Steps to Reproduce
1. set the `prometheus-metrics` shared dict to a limited size
2. use `/metrics` as a prometheus target, with the scrape duration set to 1 second (see the sketch below)
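A minimal Prometheus scrape config sketch matching step 2, assuming a plain static target; the target address below is a placeholder (only the 6008 port is taken from the error log), and `scrape_interval` stands in for the "scrape duration" mentioned above:

```yaml
# prometheus.yml -- reproduction sketch, not the reporter's actual setup.
scrape_configs:
  - job_name: apisix
    metrics_path: /metrics
    scrape_interval: 1s       # "scrape duration set to 1 second" from step 2
    scrape_timeout: 1s
    static_configs:
      - targets:
          - "127.0.0.1:6008"  # placeholder address; port taken from the error log
```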
Environment

- APISIX version (run `apisix version`): 3.2.0
- Operating system (run `uname -a`):
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`):
- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`): 3.5.4
- LuaRocks version, for installation issues (run `luarocks --version`):