
[DISCUSS]: Improving the performance of the prometheus plugin #5755

Closed
tzssangglass opened this issue Dec 9, 2021 · 10 comments · Fixed by #6315
Labels
enhancement New feature or request

@tzssangglass
Member

Issue description

The community has recently received reports from users of APISIX generating long-tail requests, e.g.
#5604
#5500

After talking to them and testing in their environments, we found that when the prometheus plugin is enabled in APISIX, the Prometheus server's data collection (accessing /apisix/prometheus/metrics) causes APISIX to delay some requests, i.e. long-tail requests.

According to my analysis, this is because the collect function is computationally heavy and takes up too many CPU time slices.

function _M.collect()
    if not prometheus or not metrics then
        core.log.error("prometheus: plugin is not initialized, please make sure ",
                       " 'prometheus_metrics' shared dict is present in nginx template")
        return 500, {message = "An unexpected error occurred"}
    end

    -- across all services
    nginx_status()

I have tried some optimizations, which helped, but not as much as I would have liked.

Here are my optimizations:

diff --git a/apisix/plugins/prometheus/exporter.lua b/apisix/plugins/prometheus/exporter.lua
index d565b4c4..32f51ddb 100644
--- a/apisix/plugins/prometheus/exporter.lua
+++ b/apisix/plugins/prometheus/exporter.lua
@@ -34,7 +34,7 @@ local get_stream_routes = router.stream_routes
 local get_protos = require("apisix.plugins.grpc-transcode.proto").protos
 local service_fetch = require("apisix.http.service").get
 local latency_details = require("apisix.utils.log-util").latency_details_in_ms
-
+local ngx_sleep = ngx.sleep
 
 
 -- Default set of latency buckets, 1ms to 60s:
@@ -174,17 +174,18 @@ local function nginx_status()
         core.log.error("failed to fetch Nginx status")
         return
     end
-
     -- Active connections: 2
     -- server accepts handled requests
     --   26 26 84
     -- Reading: 0 Writing: 1 Waiting: 1
 
+    ngx_sleep(0)
     local iterator, err = re_gmatch(res.body, [[(\d+)]], "jmo")
     if not iterator then
         core.log.error("failed to re.gmatch Nginx status: ", err)
         return
     end
+    ngx_sleep(0)
 
     for _, name in ipairs(ngx_status_items) do
         local val = iterator()
@@ -196,6 +197,7 @@ local function nginx_status()
         metrics.connections:set(val[0], label_values)
 
     end
+    ngx_sleep(0)
 end
 
 
@@ -216,7 +218,7 @@ local function set_modify_index(key, items, items_ver, global_max_index)
 
     key_values[1] = key
     metrics.etcd_modify_indexes:set(max_idx, key_values)
-
+    ngx_sleep(0)
 
     global_max_index = max_idx > global_max_index and max_idx or global_max_index
 
@@ -231,19 +233,19 @@ local function etcd_modify_index()
     -- routes
     local routes, routes_ver = get_routes()
     global_max_idx = set_modify_index("routes", routes, routes_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- services
     local services, services_ver = get_services()
     global_max_idx = set_modify_index("services", services, services_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- ssls
     local ssls, ssls_ver = get_ssls()
     global_max_idx = set_modify_index("ssls", ssls, ssls_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- consumers
     local consumers, consumers_ver = get_consumers()
     global_max_idx = set_modify_index("consumers", consumers, consumers_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- global_rules
     local global_rules = router.global_rules
     if global_rules then
@@ -253,28 +255,29 @@ local function etcd_modify_index()
         -- prev_index
         key_values[1] = "prev_index"
         metrics.etcd_modify_indexes:set(global_rules.prev_index, key_values)
-
+        ngx_sleep(0)
     else
         global_max_idx = set_modify_index("global_rules", nil, nil, global_max_idx)
+        ngx_sleep(0)
     end
 
     -- upstreams
     local upstreams, upstreams_ver = get_upstreams()
     global_max_idx = set_modify_index("upstreams", upstreams, upstreams_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- stream_routes
     local stream_routes, stream_routes_ver = get_stream_routes()
     global_max_idx = set_modify_index("stream_routes", stream_routes,
         stream_routes_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- proto
     local protos, protos_ver = get_protos()
     global_max_idx = set_modify_index("protos", protos, protos_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- global max
     key_values[1] = "max_modify_index"
     metrics.etcd_modify_indexes:set(global_max_idx, key_values)
-
+    ngx_sleep(0)
 end
 
 
@@ -297,15 +300,17 @@ function _M.collect()
     if config.type == "etcd" then
         -- etcd modify index
         etcd_modify_index()
-
+        ngx_sleep(0)
         local version, err = config:server_version()
         if version then
             metrics.etcd_reachable:set(1)
+            ngx_sleep(0)
 
         else
             metrics.etcd_reachable:set(0)
             core.log.error("prometheus: failed to reach config server while ",
                            "processing metrics endpoint: ", err)
+            ngx_sleep(0)
         end
 
         -- Because request any key from etcd will return the "X-Etcd-Index".
@@ -317,13 +322,17 @@ function _M.collect()
             -- global max
             key_values[1] = "x_etcd_index"
             metrics.etcd_modify_indexes:set(res.headers["X-Etcd-Index"], key_values)
+            ngx_sleep(0)
         end
     end
 
     metrics.node_info:set(1, gen_arr(hostname))
+    ngx_sleep(0)
 
     core.response.set_header("content_type", "text/plain")
-    return 200, core.table.concat(prometheus:metric_data())
+    local metric_data = prometheus:metric_data()
+    ngx_sleep(0)
+    return 200, core.table.concat(metric_data)
 end
 
 

I've also done some local verification.

config

  1. The upstream server is an OpenResty instance; the nginx.conf is https://github.com/apache/apisix/tree/master/benchmark/server/conf

  2. The route config is:

curl --location --request PUT 'http://127.0.0.1:9080/apisix/admin/routes/1' \
--header 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' \
--header 'Content-Type: application/json' \
--data-raw '{
    "plugins": {
        "prometheus":{}
    },
    "upstream": {
        "nodes": {
            "127.0.0.1:1980": 1
        },
        "type": "roundrobin"
    },
    "uri": "/get"
}'
  3. The config.yaml is:
apisix:
  admin_key:
    - name: admin
      key: edd1c9f034335f136f87ad84b625c8f1  # using fixed API token has security risk, please update it when you deploy to production environment
      role: admin
nginx_config:
  http:
    enable_access_log: false
  worker_processes: 1

test

  1. Without the prometheus plugin:
wrk2 -t4 -c200 -d60s -R5000 --u_latency http://127.0.0.1:9080/get
  Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
 50.000%  669.00us
 75.000%    0.97ms
 90.000%    1.30ms
 99.000%   10.90ms
 99.900%   39.55ms
 99.990%   45.69ms
 99.999%   49.44ms
100.000%   49.47ms
  2. With the prometheus plugin:

Use wrk2 to trigger Prometheus data collection:

wrk2 -t4 -c100 -d6000s -R200 --u_latency http://127.0.0.1:9091/apisix/prometheus/metrics

test

wrk2 -t4 -c200 -d60s -R5000 --u_latency http://127.0.0.1:9080/get
  Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
 50.000%    2.23ms
 75.000%   13.78ms
 90.000%   28.03ms
 99.000%   46.59ms
 99.900%   64.54ms
 99.990%   90.69ms
 99.999%   99.71ms
100.000%  101.57ms
  3. With the prometheus plugin and the optimization:

The trigger is the same as above.

test

wrk2 -t4 -c200 -d60s -R5000 --u_latency http://127.0.0.1:9080/get
  Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
 50.000%    3.93ms
 75.000%    7.65ms
 90.000%   14.16ms
 99.000%   33.85ms
 99.900%   51.65ms
 99.990%   60.06ms
 99.999%   68.42ms
100.000%   68.48ms

For more information on how this optimization works, refer to: https://groups.google.com/g/openresty/c/fuY_vTS01eg

When collecting data in prometheus, we call ngx.sleep(0) to give up the CPU time slice, giving epoll a chance to process more requests.
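To illustrate the pattern, here is a minimal sketch, assuming an OpenResty context; the helper name process_with_yield and the batch size are hypothetical and not part of the patch above:

local ngx_sleep = ngx.sleep

-- Yield the CPU every `batch` iterations so that a long-running loop
-- does not monopolize the worker and delay other in-flight requests.
local function process_with_yield(items, batch, handler)
    for i, item in ipairs(items) do
        handler(item)
        if i % batch == 0 then
            -- ngx.sleep(0) suspends the current light thread and lets the
            -- Nginx event loop (epoll) handle other pending events first.
            ngx_sleep(0)
        end
    end
end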

Based on the results above, we can conclude that this optimization is effective.

But we need to make more optimizations to the prometheus plugin, or even modify nginx-lua-prometheus itself.

thanks @jagerzhang @sandy420

would you like to improve this optimization?

Environment

  • apisix version (cmd: apisix version): master
  • OS (cmd: uname -a):
  • OpenResty / Nginx version (cmd: nginx -V or openresty -V):
  • etcd version, if have (cmd: run curl http://127.0.0.1:9090/v1/server_info to get the info from server-info API):
  • apisix-dashboard version, if have:
  • the plugin runner version, if the issue is about a plugin runner (cmd: depended on the kind of runner):
  • luarocks version, if the issue is about installation (cmd: luarocks --version):
@spacewander spacewander added the enhancement New feature or request label Dec 9, 2021
@moonming
Member

Do we have other ways to yield some CPU time? So many hard-coded ngx_sleep(0) calls do not look good.

@tzssangglass
Member Author

Yes, it looks strange. But ngx_sleep(0) is the only way I know to do it.

@tokers
Contributor

tokers commented Dec 13, 2021

I think we may only need to focus on some critical paths, not all of them need to be yielded. Could you generate a flamegraph?

@spacewander
Member

I think this solution is similar to Kong/kong@3fc3961?

Maybe we can also use counter for it: https://github.com/Kong/kong/blob/d65101fe80fd7ac9870a84d34d81bda8bcb461ac/kong/tools/utils.lua#L1438
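For context, the Kong helper linked above is a counter-based yield. A paraphrased sketch (the names and the iteration threshold are illustrative, not Kong's exact code):

local ngx_sleep = ngx.sleep

local YIELD_ITERATIONS = 500
local counter = YIELD_ITERATIONS

-- Only actually yield once every YIELD_ITERATIONS calls, so the cost of
-- yielding is amortized while the worker still gets regular chances to
-- serve other requests.
local function yield(in_loop)
    if in_loop then
        counter = counter - 1
        if counter > 0 then
            return
        end
        counter = YIELD_ITERATIONS
    end
    ngx_sleep(0)
end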

@tzssangglass
Member Author

I think this solution is similar to Kong/kong@3fc3961?

Maybe we can also use counter for it: Kong/kong@d65101f/kong/tools/utils.lua#L1438

Is counter used to call yield in a loop?

@xianshun163

Dear all, is there a good solution for this issue?

@tzssangglass
Member Author

tzssangglass commented Jan 17, 2022

Dear all, is there a good solution for this issue?

I'm working on it.

ref: knyar/nginx-lua-prometheus#131

I will submit another PR to upstream.

@moonming moonming pinned this issue Jan 27, 2022
@moonming
Member

@tzssangglass knyar/nginx-lua-prometheus#131 merged, so any update for this issue? thx

@tzssangglass
Member Author

@tzssangglass knyar/nginx-lua-prometheus#131 merged, so any update for this issue? thx

I have asked the maintainer if there are plans to release a new version. From my previous observations, the maintainers usually wait for stability before releasing a new version, so let's wait a little longer.

@tzssangglass
Member Author

The new version has been released: https://github.com/knyar/nginx-lua-prometheus/releases/tag/0.20220127

In the following days I will submit a PR to fix this in APISIX.
