
bug: P99 latency is too high #7919

Closed
ryan4yin opened this issue Sep 14, 2022 · 15 comments

Comments

@ryan4yin
Contributor

ryan4yin commented Sep 14, 2022

Current Behavior

APISIX's P99 latency is too high, while the upstream latency is very low.

apisix-p99-too-high

related issues:

Expected Behavior

APISIX's P99 latency should not be so high.

Error Logs

The CPU flame graph I captured using OpenResty XRay shows that the exporter/prometheus plugin consumes too much time in the request lifetime:

job-4400902444-Lua-Land-CPU-Flame-Graph

Steps to Reproduce

first, create an upstream using the following config:

{
    "id": "xxx-extractor",
    "desc": "xxx-extractor",
    "scheme": "http",
    "type":"roundrobin",
    "nodes": [
        { "host": "172.22.33.44", "port": 8106, "weight": 0, "priority": 0},
        { "host": "172.22.33.45", "port": 8106, "weight": 0, "priority": 0},
        { "host": "172.22.33.46", "port": 8106, "weight": 0, "priority": 0},
        { "host": "172.22.33.47", "port": 8106, "weight": 0, "priority": 0},
        { "host": "172.22.33.48", "port": 8106, "weight": 0, "priority": 0}
    ],
    "retries": 0,
    "timeout": {
        "connect":15,
        "send":15,
        "read":15
    },
    "checks": {
        "active": {
            "timeout": 3,
            "http_path": "/",
            "host": "xxx.xxx",
            "healthy": {
                "interval": 3,
                "successes": 3
            },
            "unhealthy": {
                "interval": 3,
                "http_failures": 3
            },
            "req_headers": ["User-Agent: curl/7.29.0"]
        },
        "passive": {
            "healthy": {
                "http_statuses": [200, 201],
                "successes": 3
            },
            "unhealthy": {
                "http_statuses": [500, 502, 503, 504],
                "http_failures": 3,
                "tcp_failures": 3
            }
        }
    }
}

then create a route with the following config:

{
  "id": "xxx-ingress",
  "uri": "/*",
  "hosts": [
    "xxx.xxx"
  ],
  "methods": [
    "PUT",
    "GET",
    "POST",
    "HEAD"
  ],
  "plugins": {
    "prometheus": {},
    "proxy-mirror": {
      "host": "http://my-app.test.com"
    }
  },
  "upstream_id": "xxx-extractor"
}

Then send about 30 QPS to the APISIX instance, and the high P99 latency problem will occur.
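For reference, a minimal sketch of how the two objects above could be applied through the APISIX Admin API, assuming APISIX 2.x defaults (Admin API at 127.0.0.1:9080) and that the configs are saved locally as upstream.json and route.json; the admin key placeholder is not from this issue:

import json
import urllib.request

ADMIN = "http://127.0.0.1:9080/apisix/admin"  # assumed default Admin API address
HEADERS = {"X-API-KEY": "<admin-key-from-config.yaml>", "Content-Type": "application/json"}

def put(path, obj):
    # PUT an object (upstream or route) to the Admin API under the given path.
    req = urllib.request.Request(
        ADMIN + "/" + path,
        data=json.dumps(obj).encode(),
        headers=HEADERS,
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(path, resp.status)

# The two JSON documents shown above, saved to local files.
put("upstreams/xxx-extractor", json.load(open("upstream.json")))
put("routes/xxx-ingress", json.load(open("route.json")))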

Environment

Running in Kubernetes, using the apisix/apisix Helm chart, version 0.11.0.

  • APISIX version (run apisix version): docker.io/apache/apisix:2.15.0-alpine
  • Operating system (run uname -a): Linux apisix-edge-765f88c49f-8tslz 5.4.190-107.353.amzn2.x86_64 #1 SMP Wed Apr 27 21:16:35 UTC 2022 x86_64 Linux
  • etcd version: docker.io/bitnami/etcd:3.5.4-debian-11-r22
  • The Grafana dashboard I'm using: https://grafana.com/grafana/dashboards/11719-apache-apisix/
@tokers
Contributor

tokers commented Sep 15, 2022

@tzssangglass If I remember correctly, you submitted a PR to the api7-prometheus lib to try to optimize its performance?

@tzssangglass
Member

  1. The sample is quite small; can you reproduce it?
  2. Is there a Prometheus exporter scraping http://127.0.0.1:9091/apisix/prometheus/metrics while the flame graph is being sampled?
  3. The latency monitoring shows that APISIX is causing the high latency. Can you remove the proxy-mirror plugin, keep only the prometheus plugin, and test again?

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

  1. The sample is quite small; can you reproduce it?
  2. Is there a Prometheus exporter scraping http://127.0.0.1:9091/apisix/prometheus/metrics while the flame graph is being sampled?
  3. The latency monitoring shows that APISIX is causing the high latency. Can you remove the proxy-mirror plugin, keep only the prometheus plugin, and test again?

@tzssangglass

  1. I'm trying to reproduce this in a test environment.
  2. I sampled the CPU flame graph several times; I've posted some of them below.
  3. I disabled the proxy-mirror plugin, but it didn't help; the P99 latency is still high.

The flame graph was sampled 16 minutes ago; all the CPU flame graphs look the same:

image

But the Lua execution flame graph shows some new info:

image

@tzssangglass
Member

3. I disabled the proxy-mirror plugin, but it didn't help; the P99 latency is still high.

Can you show the Grafana monitoring after removing the proxy-mirror plugin? I am interested in the average latency of about 5s; that is not normal.

Are there any error logs? It looks like you captured the exception stack.

The phenomenon of the prometheus plugin causing P99 long-tail requests is known. ref: #5755
Note: this usually occurs when the number of metrics exceeds 10,000.
Based on previous experience, the latency of long-tail requests is nearly a hundred times the average latency (an average latency of a few tens of milliseconds vs. a few seconds for long-tail requests).
But your Grafana monitoring shows long-tail requests that are only about four times slower than the average (an average latency of about 5 seconds vs. about 20 seconds for long-tail requests), which seems strange to me.

As a comparison test, you can remove the prometheus plugin and test again to see if the ~5s average latency and the high P99 latency still appear.
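Since the long-tail behaviour described above is tied to the number of exported metrics, here is a quick sketch for counting them, assuming the default export address already mentioned in this thread (127.0.0.1:9091/apisix/prometheus/metrics):

import urllib.request

URL = "http://127.0.0.1:9091/apisix/prometheus/metrics"  # default prometheus plugin export address

with urllib.request.urlopen(URL, timeout=5) as resp:
    body = resp.read().decode()

# Count sample lines only, skipping "# HELP" / "# TYPE" comment lines.
samples = [line for line in body.splitlines() if line and not line.startswith("#")]
print("exported samples:", len(samples))  # >10,000 is the range mentioned above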

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

@tzssangglass

after removing the proxy-mirror plugin:
image

and the http latency distribution by openresty-xray:

image

I'll try to remove the prometheus plugin to see what happens.

@ryan4yin
Contributor Author

Even though I disabled all the plugins, the latency is still high...
image

@ryan4yin
Contributor Author

And I noticed (before I disabled the prometheus plugin) that there are many requests with 408 and 499 status codes, with QPS up to about 0.5:

image

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

Are there any error logs? It looks like you captured the exception stack.

OpenResty XRay captured some Lua error messages:

bad argument #1 to '?' (string expected, got nil)

But I cannot find any error messages in the container's stdout; it only contains many warning messages like this one:

2022/09/15 06:08:26 [warn] 45#45: *479340 a client request body is buffered to a temporary file /usr/local/apisix/client_body_temp/0000002504, client: 1.2.3.4, server: _, request: "POST /sa?project=default HTTP/1.1", host: "xxx.xxx.xxx"

And this is the Lua exception flame graph I captured before:

image

Does this flame graph show any problem?

@ryan4yin
Contributor Author

This phenomenon is so weird, really confusing...

the latest statistics generated by openresty-xray:

image

image

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

@tzssangglass Does the APISIX latency include the whole request lifetime? Is it possible that the client's network is too bad, which causes the high P99 latency?

my network architecture:

[Android Client(at Brazil)] 
=> [AWS Network LoadBalancer(at AWS region us-east-1)] 
=> [APISIX Gateway Container running in AWS EKS] 
=> [Upstream Server]

@tzssangglass
Member

Even though I disabled all the plugins, the latency is still high...

This means that the latency is not related to the prometheus plugin, but to APISIX and the upstream.
What is your upstream? You can have requests go from the client to the upstream directly, without going through the APISIX proxy, and look at the latency distribution.
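A rough sketch of that comparison test, timing requests to one upstream node directly vs. through APISIX; the APISIX address is a placeholder, while the node address and Host header are taken from the configs above:

import time
import urllib.request

def p99_seconds(url, host, n=200):
    # Issue n requests and return an approximate P99 of the elapsed time.
    samples = []
    for _ in range(n):
        req = urllib.request.Request(url, headers={"Host": host})
        start = time.monotonic()
        try:
            urllib.request.urlopen(req, timeout=20).read()
        except Exception:
            pass  # failures still count with their elapsed time
        samples.append(time.monotonic() - start)
    samples.sort()
    return samples[int(0.99 * (len(samples) - 1))]

print("direct to upstream:", p99_seconds("http://172.22.33.44:8106/", "xxx.xxx"))
print("through APISIX    :", p99_seconds("http://<apisix-address>/", "xxx.xxx"))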

@tzssangglass
Member

Does the APISIX latency include the whole request lifetime?

APISIX latency = whole request latency - upstream latency; it indicates the time APISIX itself took to process the request.
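A sketch of how the three latency types can be compared at P99, assuming the apisix_http_latency histogram exposed by the prometheus plugin (as used by the dashboard linked above) and a Prometheus server whose address is a placeholder here:

import json
import urllib.parse
import urllib.request

PROM = "http://<prometheus-address>:9090"  # placeholder, not from this issue

def p99(latency_type):
    # type="request" is the whole request, "upstream" the upstream part,
    # and "apisix" the difference described above.
    query = ('histogram_quantile(0.99, '
             'sum(rate(apisix_http_latency_bucket{type="' + latency_type + '"}[1m])) by (le))')
    url = PROM + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)["data"]["result"]

for t in ("request", "apisix", "upstream"):
    print(t, p99(t))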

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

This means that the latency is not related to the prometheus plugin, but to APISIX and the upstream.

Thanks for the reply, but the upstream latency recorded by the prometheus plugin & OpenResty XRay is very low; does this indicate that the upstream server works fine?

The upstream is several sensorsdata data-extractor servers running in the same AWS VPC as APISIX.

@tzssangglass
Member

Is it possible that the client's network is too bad, which causes the high P99 latency?

The large number of 408 and 499 errors makes me lean toward this.

for 408:

client_body_timeout (default 60s): Defines a timeout for reading client request body. The timeout is set only for a period between two successive read operations, not for the transmission of the whole request body. If a client does not transmit anything within this time, the request is terminated with the 408 (Request Time-out) error.

client_header_timeout (default 60s): Defines a timeout for reading client request header. If a client does not transmit the entire header within this time, the request is terminated with the 408 (Request Time-out) error.

for 499:

HTTP 499 in Nginx means that the client closed the connection before the server answered the request. In my experience it is usually caused by a client-side timeout. As far as I know, it's an Nginx-specific error code.

2022/09/15 06:08:26 [warn] 45#45: *479340 a client request body is buffered to a temporary file /usr/local/apisix/client_body_temp/0000002504, client: 1.2.3.4, server: _, request: "POST /sa?project=default HTTP/1.1", host: "xxx.xxx.xxx"

And based on these logs, I think the possible reasons are:

  1. The network between the client and APISIX is poor, so APISIX cannot read the client's headers or bodies properly;
  2. The client request body is very large and APISIX times out while reading it (based on the available information, this is closer to the truth).

Can you check whether the high latency is associated with large client request bodies?
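To confirm the slow-client hypothesis, here is a small sketch that opens a connection, declares a body, and then never finishes sending it; against default timeouts (client_body_timeout 60s) the server should answer with 408 or simply close the connection. The address is a placeholder, and the path and Host are taken from the log line above:

import socket

HOST, PORT = "<apisix-address>", 80  # placeholder, not from this issue

sock = socket.create_connection((HOST, PORT))
sock.sendall(
    b"POST /sa?project=default HTTP/1.1\r\n"
    b"Host: xxx.xxx\r\n"
    b"Content-Length: 1000\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"\r\n"
)
# Send nothing further: once client_body_timeout expires, the server should
# reply with 408 Request Time-out (or just close the connection).
sock.settimeout(90)
try:
    data = sock.recv(4096)
    print(data.decode(errors="replace").splitlines()[0] if data else "connection closed")
finally:
    sock.close()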

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

And based on these logs, I think the possible reasons are:

1. The network between the client and APISIX is poor, so APISIX cannot read the client's headers or bodies properly;

2. The client request body is very large and APISIX times out while reading it (based on the available information, this is closer to the truth).

Can you check whether the high latency is associated with large client request bodies?

@tzssangglass Thanks very much! Now the problem is clear; it's not a problem with the APISIX gateway.

I'll talk with our Android team about this problem and close this issue later.
