
[DISCUSS]: Improving the performance of the prometheus plugin #5755

Closed
tzssangglass opened this issue Dec 9, 2021 · 10 comments · Fixed by #6315
Labels
enhancement New feature or request

@tzssangglass
Member

Issue description

The community has recently received reports from users of APISIX generating long-tail requests, e.g.
#5604
#5500

After talking to them and testing in their environments, we found that when the prometheus plugin is enabled in APISIX, the Prometheus server's data collection (accessing /apisix/prometheus/metrics) causes APISIX to delay some requests, i.e. long-tail requests.

According to my analysis, this is because the collect function is computationally heavy and takes up too many CPU time slices.

function _M.collect()
    if not prometheus or not metrics then
        core.log.error("prometheus: plugin is not initialized, please make sure ",
                       " 'prometheus_metrics' shared dict is present in nginx template")
        return 500, {message = "An unexpected error occurred"}
    end

    -- across all services
    nginx_status()

I have tried some optimizations, which helped, but not as much as I would have liked.

Here are my optimizations:

diff --git a/apisix/plugins/prometheus/exporter.lua b/apisix/plugins/prometheus/exporter.lua
index d565b4c4..32f51ddb 100644
--- a/apisix/plugins/prometheus/exporter.lua
+++ b/apisix/plugins/prometheus/exporter.lua
@@ -34,7 +34,7 @@ local get_stream_routes = router.stream_routes
 local get_protos = require("apisix.plugins.grpc-transcode.proto").protos
 local service_fetch = require("apisix.http.service").get
 local latency_details = require("apisix.utils.log-util").latency_details_in_ms
-
+local ngx_sleep = ngx.sleep
 
 
 -- Default set of latency buckets, 1ms to 60s:
@@ -174,17 +174,18 @@ local function nginx_status()
         core.log.error("failed to fetch Nginx status")
         return
     end
-
     -- Active connections: 2
     -- server accepts handled requests
     --   26 26 84
     -- Reading: 0 Writing: 1 Waiting: 1
 
+    ngx_sleep(0)
     local iterator, err = re_gmatch(res.body, [[(\d+)]], "jmo")
     if not iterator then
         core.log.error("failed to re.gmatch Nginx status: ", err)
         return
     end
+    ngx_sleep(0)
 
     for _, name in ipairs(ngx_status_items) do
         local val = iterator()
@@ -196,6 +197,7 @@ local function nginx_status()
         metrics.connections:set(val[0], label_values)
 
     end
+    ngx_sleep(0)
 end
 
 
@@ -216,7 +218,7 @@ local function set_modify_index(key, items, items_ver, global_max_index)
 
     key_values[1] = key
     metrics.etcd_modify_indexes:set(max_idx, key_values)
-
+    ngx_sleep(0)
 
     global_max_index = max_idx > global_max_index and max_idx or global_max_index
 
@@ -231,19 +233,19 @@ local function etcd_modify_index()
     -- routes
     local routes, routes_ver = get_routes()
     global_max_idx = set_modify_index("routes", routes, routes_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- services
     local services, services_ver = get_services()
     global_max_idx = set_modify_index("services", services, services_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- ssls
     local ssls, ssls_ver = get_ssls()
     global_max_idx = set_modify_index("ssls", ssls, ssls_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- consumers
     local consumers, consumers_ver = get_consumers()
     global_max_idx = set_modify_index("consumers", consumers, consumers_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- global_rules
     local global_rules = router.global_rules
     if global_rules then
@@ -253,28 +255,29 @@ local function etcd_modify_index()
         -- prev_index
         key_values[1] = "prev_index"
         metrics.etcd_modify_indexes:set(global_rules.prev_index, key_values)
-
+        ngx_sleep(0)
     else
         global_max_idx = set_modify_index("global_rules", nil, nil, global_max_idx)
+        ngx_sleep(0)
     end
 
     -- upstreams
     local upstreams, upstreams_ver = get_upstreams()
     global_max_idx = set_modify_index("upstreams", upstreams, upstreams_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- stream_routes
     local stream_routes, stream_routes_ver = get_stream_routes()
     global_max_idx = set_modify_index("stream_routes", stream_routes,
         stream_routes_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- proto
     local protos, protos_ver = get_protos()
     global_max_idx = set_modify_index("protos", protos, protos_ver, global_max_idx)
-
+    ngx_sleep(0)
     -- global max
     key_values[1] = "max_modify_index"
     metrics.etcd_modify_indexes:set(global_max_idx, key_values)
-
+    ngx_sleep(0)
 end
 
 
@@ -297,15 +300,17 @@ function _M.collect()
     if config.type == "etcd" then
         -- etcd modify index
         etcd_modify_index()
-
+        ngx_sleep(0)
         local version, err = config:server_version()
         if version then
             metrics.etcd_reachable:set(1)
+            ngx_sleep(0)
 
         else
             metrics.etcd_reachable:set(0)
             core.log.error("prometheus: failed to reach config server while ",
                            "processing metrics endpoint: ", err)
+            ngx_sleep(0)
         end
 
         -- Because request any key from etcd will return the "X-Etcd-Index".
@@ -317,13 +322,17 @@ function _M.collect()
             -- global max
             key_values[1] = "x_etcd_index"
             metrics.etcd_modify_indexes:set(res.headers["X-Etcd-Index"], key_values)
+            ngx_sleep(0)
         end
     end
 
     metrics.node_info:set(1, gen_arr(hostname))
+    ngx_sleep(0)
 
     core.response.set_header("content_type", "text/plain")
-    return 200, core.table.concat(prometheus:metric_data())
+    local metric_data = prometheus:metric_data()
+    ngx_sleep(0)
+    return 200, core.table.concat(metric_data)
 end
 
 

I've also done some local verification.

config

  1. The upstream server is an OpenResty instance; the nginx.conf is https://github.com/apache/apisix/tree/master/benchmark/server/conf

  2. The route config is:

curl --location --request PUT 'http://127.0.0.1:9080/apisix/admin/routes/1' \
--header 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' \
--header 'Content-Type: application/json' \
--data-raw '{
    "plugins": {
        "prometheus":{}
    },
    "upstream": {
        "nodes": {
            "127.0.0.1:1980": 1
        },
        "type": "roundrobin"
    },
    "uri": "/get"
}'
  3. The config.yaml is:
apisix:
  admin_key:
    - name: admin
      key: edd1c9f034335f136f87ad84b625c8f1  # using fixed API token has security risk, please update it when you deploy to production environment
      role: admin
nginx_config:
  http:
    enable_access_log: false
  worker_processes: 1

test

  1. Without the prometheus plugin:
wrk2 -t4 -c200 -d60s -R5000 --u_latency http://127.0.0.1:9080/get
  Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
 50.000%  669.00us
 75.000%    0.97ms
 90.000%    1.30ms
 99.000%   10.90ms
 99.900%   39.55ms
 99.990%   45.69ms
 99.999%   49.44ms
100.000%   49.47ms
  2. With the prometheus plugin:

Use wrk2 to trigger Prometheus data collection:

wrk2 -t4 -c100 -d6000s -R200 --u_latency http://127.0.0.1:9091/apisix/prometheus/metrics

test

wrk2 -t4 -c200 -d60s -R5000 --u_latency http://127.0.0.1:9080/get
  Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
 50.000%    2.23ms
 75.000%   13.78ms
 90.000%   28.03ms
 99.000%   46.59ms
 99.900%   64.54ms
 99.990%   90.69ms
 99.999%   99.71ms
100.000%  101.57ms
  3. With the prometheus plugin and the optimization:

The trigger is the same as above.

test

wrk2 -t4 -c200 -d60s -R5000 --u_latency http://127.0.0.1:9080/get
  Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
 50.000%    3.93ms
 75.000%    7.65ms
 90.000%   14.16ms
 99.000%   33.85ms
 99.900%   51.65ms
 99.990%   60.06ms
 99.999%   68.42ms
100.000%   68.48ms

For more information on how this optimization works, refer to: https://groups.google.com/g/openresty/c/fuY_vTS01eg

When collecting data in prometheus, we call ngx.sleep(0) to give up the CPU time slice, giving epoll a chance to process more requests.
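To illustrate the pattern, here is a minimal sketch, assuming an OpenResty context; the helper name process_with_yield and the batch size are hypothetical and not part of the patch above:

local ngx_sleep = ngx.sleep

-- Yield the CPU every `batch` iterations so that a long-running loop
-- does not monopolize the worker and delay other in-flight requests.
local function process_with_yield(items, batch, handler)
    for i, item in ipairs(items) do
        handler(item)
        if i % batch == 0 then
            -- ngx.sleep(0) suspends the current light thread and lets the
            -- Nginx event loop (epoll) handle other pending events first.
            ngx_sleep(0)
        end
    end
end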

Based on the results above, we can conclude that this optimization is effective.

But we need to make more optimizations to the prometheus plugin, or even modify nginx-lua-prometheus itself.

thanks @jagerzhang @sandy420

would you like to improve this optimization?

Environment

  • apisix version (cmd: apisix version): master
  • OS (cmd: uname -a):
  • OpenResty / Nginx version (cmd: nginx -V or openresty -V):
  • etcd version, if have (cmd: run curl http://127.0.0.1:9090/v1/server_info to get the info from server-info API):
  • apisix-dashboard version, if have:
  • the plugin runner version, if the issue is about a plugin runner (cmd: depended on the kind of runner):
  • luarocks version, if the issue is about installation (cmd: luarocks --version):
@spacewander spacewander added the enhancement New feature or request label Dec 9, 2021
@moonming
Member

Do we have other ways to yield some CPU time? So many hard-coded ngx_sleep(0) calls do not look good.

@tzssangglass
Member Author

Yes, it looks strange. But ngx_sleep(0) is the only way I know to do it.

@tokers
Contributor

tokers commented Dec 13, 2021

I think we may only need to focus on some critical paths, not all of them need to be yielded. Could you generate a flamegraph?

@spacewander
Member

I think this solution is similar to Kong/kong@3fc3961?

Maybe we can also use counter for it: https://github.com/Kong/kong/blob/d65101fe80fd7ac9870a84d34d81bda8bcb461ac/kong/tools/utils.lua#L1438
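For context, the Kong helper linked above is a counter-based yield. A paraphrased sketch (the names and the iteration threshold are illustrative, not Kong's exact code):

local ngx_sleep = ngx.sleep

local YIELD_ITERATIONS = 500
local counter = YIELD_ITERATIONS

-- Only actually yield once every YIELD_ITERATIONS calls, so the cost of
-- yielding is amortized while the worker still gets regular chances to
-- serve other requests.
local function yield(in_loop)
    if in_loop then
        counter = counter - 1
        if counter > 0 then
            return
        end
        counter = YIELD_ITERATIONS
    end
    ngx_sleep(0)
end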

@tzssangglass
Member Author

I think this solution is similar to Kong/kong@3fc3961?

Maybe we can also use counter for it: Kong/kong@d65101f/kong/tools/utils.lua#L1438

Is counter used to call yield in a loop?

@xianshun163

Dear all, is there a good solution for this issue?

@tzssangglass
Member Author

tzssangglass commented Jan 17, 2022

Dear all, is there a good solution for this issue?

I'm working on it.

ref: knyar/nginx-lua-prometheus#131

I will submit another PR to upstream.

@moonming moonming pinned this issue Jan 27, 2022
@moonming
Member

@tzssangglass knyar/nginx-lua-prometheus#131 merged, so any update for this issue? thx

@tzssangglass
Member Author

@tzssangglass knyar/nginx-lua-prometheus#131 merged, so any update for this issue? thx

I have asked the maintainer if there are plans to release a new version. From my previous observations, the maintainers usually wait for stability before releasing a new version, so let's wait a little longer.

@tzssangglass
Member Author

The new version has been released: https://github.com/knyar/nginx-lua-prometheus/releases/tag/0.20220127

In the following days I will submit a PR to fix this in APISIX.
