bug: high CPU usage after prometheus.lua reports 'no memory' #10000
Comments
from the log:
the service monitor config
So, if the response is too slow (maybe the metrics data is too huge?), it will time out and the line in the chart is lost. If the prometheus privileged process keeps receiving the …
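(The ServiceMonitor config block referenced above did not survive in this thread. Purely as an illustration of where the scrape timeout comes in, a Prometheus Operator ServiceMonitor for an APISIX service might look like the sketch below; every name, label, namespace, port, and both 30s values are assumptions, not the reporter's actual config.)

```yaml
# Hypothetical ServiceMonitor sketch (not the reporter's config).
# If /metrics takes longer than scrapeTimeout, the scrape fails and the
# corresponding points are missing from the chart.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: apisix                         # assumed name
  namespace: monitoring                # assumed namespace
spec:
  namespaceSelector:
    matchNames:
      - apisix                         # assumed namespace of the APISIX service
  selector:
    matchLabels:
      app.kubernetes.io/name: apisix   # assumed service label
  endpoints:
    - port: metrics                    # assumed service port name exposing /metrics
      path: /metrics
      interval: 30s                    # assumed scrape interval
      scrapeTimeout: 30s               # matches the "no response in 30 seconds" observation
```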
related etcd issue: #7353
@wklken This was identified as an etcd issue and was fixed in etcd release 3.5.5: etcd-io/etcd#14138
@Revolyssup
so I'm not sure whether the high CPU usage caused by the huge metrics is related to #7345
For the 'no memory' issue, can you increase the size of the shared_dict? It keeps retrying when out of memory, and the CPU usage might be correlated with that.
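A minimal sketch of what increasing that dict might look like, assuming the `conf/config.yaml` in use exposes the `prometheus-metrics` shared dict under `nginx_config.http.lua_shared_dict` (the 50m value is only an example, not a recommendation):

```yaml
# conf/config.yaml -- sketch only; assumes this key path is available in your
# APISIX version. The reporter's dict was 10m; 50m is an arbitrary example.
nginx_config:
  http:
    lua_shared_dict:
      prometheus-metrics: 50m
```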
It's not the memory issue either; we already changed the shared_dict size. You can reproduce it with a curl loop against /metrics. Could it be that the prometheus plugin needs a lot of CPU to dump (or calculate) the data from memory into the response? Currently, I have to increase the limits and patch a trigger to disable the official prometheus metrics.
@wklken The performance itself is not something we can fix. But, as many users have asked, the built-in metrics need to be configurable (not hardcoded), because some metrics define labels whose values vary greatly in some cases, which makes the number of metric variants grow a lot. We do not delete outdated metrics, so every time Prometheus pulls, the CPU usage increases a lot. We always advise users to customize the metrics to what they actually need when using the prometheus plugin. So in this case, the patch you have is the only solution. You can close this issue if this answers your questions.
OK, understood! #9673 makes it possible to decrease the number of metrics. Closing; thanks for your response.
Current Behavior

We set the `prometheus-metrics` shared dict to 10m. After being deployed online (1 deployment, 8 pods) for about 7 days, the memory of each pod was exhausted one by one. We took a look at each pod that was killed because the CPU hit the resource limit.

From the Grafana dashboard and the error log: when the metrics chart loses data, it means `/metrics` did not respond within 30 seconds; maybe the response is huge? The error log only has `no memory` entries, and after a few hours the APISIX container hits the CPU limit and is restarted.

Before the container hits the CPU limit, it has been reporting `no memory` for a few hours. We have many other environments; whenever an environment has restarts, we redeploy APISIX, and the restarts never happen before prometheus reports `no memory`.

Expected Behavior
No high CPU usage even when prometheus.lua reports 'no memory'.
Error Logs
[error] 76#76: *62508500 [lua] prometheus.lua:920: log_error(): Error while setting 'etcd_modify_indexes{key="x_etcd_index"}' to '98387': 'no memory', client: 1.1.1.1, server: , request: "GET /metrics HTTP/1.1", host: "0.0.0.0:6008"
Steps to Reproduce
1. set the `prometheus-metrics` shared dict to a limited size
2. use `/metrics` as a prometheus target, with the scrape duration set to 1 second (see the sketch below)
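A minimal Prometheus scrape config sketch matching step 2, assuming a plain static target; the target address below is a placeholder (only the 6008 port is taken from the error log), and `scrape_interval` stands in for the "scrape duration" mentioned above:

```yaml
# prometheus.yml -- reproduction sketch, not the reporter's actual setup.
scrape_configs:
  - job_name: apisix
    metrics_path: /metrics
    scrape_interval: 1s       # "scrape duration set to 1 second" from step 2
    scrape_timeout: 1s
    static_configs:
      - targets:
          - "127.0.0.1:6008"  # placeholder address; port taken from the error log
```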
Environment

- APISIX version (run `apisix version`): 3.2.0
- Operating system (run `uname -a`):
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`):
- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`): 3.5.4
- LuaRocks version, for installation issues (run `luarocks --version`):