
bug: P99 latency is too high #7919

Closed
ryan4yin opened this issue Sep 14, 2022 · 15 comments

Comments

@ryan4yin
Contributor

ryan4yin commented Sep 14, 2022

Current Behavior

APISIX's P99 latency is too high, while the upstream latency is very low.

apisix-p99-too-high

related issues:

Expected Behavior

APISIX's P99 latency should not be so high.

Error Logs

The CPU flame graph I captured using OpenResty XRay shows that the exporter/prometheus plugin consumes too much time in the request lifetime:

job-4400902444-Lua-Land-CPU-Flame-Graph

Steps to Reproduce

first, create an upstream using the following config:

{
    "id": "xxx-extractor",
    "desc": "xxx-extractor",
    "scheme": "http",
    "type":"roundrobin",
    "nodes": [
        { "host": "172.22.33.44", "port": 8106, "weight": 0, "priority": 0},
        { "host": "172.22.33.45", "port": 8106, "weight": 0, "priority": 0},
        { "host": "172.22.33.46", "port": 8106, "weight": 0, "priority": 0},
        { "host": "172.22.33.47", "port": 8106, "weight": 0, "priority": 0},
        { "host": "172.22.33.48", "port": 8106, "weight": 0, "priority": 0}
    ],
    "retries": 0,
    "timeout": {
        "connect":15,
        "send":15,
        "read":15
    },
    "checks": {
        "active": {
            "timeout": 3,
            "http_path": "/",
            "host": "xxx.xxx",
            "healthy": {
                "interval": 3,
                "successes": 3
            },
            "unhealthy": {
                "interval": 3,
                "http_failures": 3
            },
            "req_headers": ["User-Agent: curl/7.29.0"]
        },
        "passive": {
            "healthy": {
                "http_statuses": [200, 201],
                "successes": 3
            },
            "unhealthy": {
                "http_statuses": [500, 502, 503, 504],
                "http_failures": 3,
                "tcp_failures": 3
            }
        }
    }
}

then create a route with the following config:

{
  "id": "xxx-ingress",
  "uri": "/*",
  "hosts": [
    "xxx.xxx"
  ],
  "methods": [
    "PUT",
    "GET",
    "POST",
    "HEAD"
  ],
  "plugins": {
    "prometheus": {},
    "proxy-mirror": {
      "host": "http://my-app.test.com"
    }
  },
  "upstream_id": "xxx-extractor"
}

Then send about 30 QPS to the APISIX instance, and the high P99 latency problem will occur.
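For reference, a minimal sketch of how the two objects above could be applied through the APISIX Admin API, assuming APISIX 2.x defaults (Admin API at 127.0.0.1:9080) and that the configs are saved locally as upstream.json and route.json; the admin key placeholder is not from this issue:

import json
import urllib.request

ADMIN = "http://127.0.0.1:9080/apisix/admin"  # assumed default Admin API address
HEADERS = {"X-API-KEY": "<admin-key-from-config.yaml>", "Content-Type": "application/json"}

def put(path, obj):
    # PUT an object (upstream or route) to the Admin API under the given path.
    req = urllib.request.Request(
        ADMIN + "/" + path,
        data=json.dumps(obj).encode(),
        headers=HEADERS,
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(path, resp.status)

# The two JSON documents shown above, saved to local files.
put("upstreams/xxx-extractor", json.load(open("upstream.json")))
put("routes/xxx-ingress", json.load(open("route.json")))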

Environment

Running in Kubernetes, using the apisix/apisix Helm chart, version 0.11.0.

  • APISIX version (run apisix version): docker.io/apache/apisix:2.15.0-alpine
  • Operating system (run uname -a): Linux apisix-edge-765f88c49f-8tslz 5.4.190-107.353.amzn2.x86_64 #1 SMP Wed Apr 27 21:16:35 UTC 2022 x86_64 Linux
  • etcd version: docker.io/bitnami/etcd:3.5.4-debian-11-r22
  • The Grafana dashboard I'm using: https://grafana.com/grafana/dashboards/11719-apache-apisix/
@tokers
Contributor

tokers commented Sep 15, 2022

@tzssangglass If I remember correctly, you submitted a PR to the api7-prometheus lib to try to optimize its performance?

@tzssangglass
Member

  1. The sample is quite small; can you reproduce it?
  2. Is there a Prometheus exporter scraping http://127.0.0.1:9091/apisix/prometheus/metrics while the flame graph is being sampled?
  3. The latency monitoring shows that APISIX is causing the high latency. Can you remove the proxy-mirror plugin, keep only the prometheus plugin, and test again?

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

  1. The sample is quite small; can you reproduce it?
  2. Is there a Prometheus exporter scraping http://127.0.0.1:9091/apisix/prometheus/metrics while the flame graph is being sampled?
  3. The latency monitoring shows that APISIX is causing the high latency. Can you remove the proxy-mirror plugin, keep only the prometheus plugin, and test again?

@tzssangglass

  1. I'm trying to reproduce this in a test environment.
  2. I sampled the CPU flame graph several times; I've posted some of them below.
  3. I disabled the proxy-mirror plugin, but it didn't help; the P99 latency is still high.

The flame graph was sampled 16 minutes ago; all the CPU flame graphs look the same:

image

But the Lua execution flame graph shows some new info:

image

@tzssangglass
Member

3. I disabled the proxy-mirror plugin, but it didn't help; the P99 latency is still high.

Can you show the Grafana monitoring after removing the proxy-mirror plugin? I am interested in the average latency of about 5s; that is not normal.

Are there any error logs? It looks like you captured the exception stack.

The phenomenon of the prometheus plugin causing P99 long-tail requests is known. ref: #5755
Note: this usually occurs when the number of metrics exceeds 10,000.
Based on previous experience, the latency of long-tail requests is nearly a hundred times the average latency (an average latency of a few tens of milliseconds vs. a few seconds for long-tail requests).
But your Grafana monitoring shows long-tail requests that are only about four times slower than the average (an average latency of about 5 seconds vs. about 20 seconds for long-tail requests), which seems strange to me.

As a comparison test, you can remove the prometheus plugin and test again to see if the ~5s average latency and the high P99 latency still appear.
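Since the long-tail behaviour described above is tied to the number of exported metrics, here is a quick sketch for counting them, assuming the default export address already mentioned in this thread (127.0.0.1:9091/apisix/prometheus/metrics):

import urllib.request

URL = "http://127.0.0.1:9091/apisix/prometheus/metrics"  # default prometheus plugin export address

with urllib.request.urlopen(URL, timeout=5) as resp:
    body = resp.read().decode()

# Count sample lines only, skipping "# HELP" / "# TYPE" comment lines.
samples = [line for line in body.splitlines() if line and not line.startswith("#")]
print("exported samples:", len(samples))  # >10,000 is the range mentioned above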

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

@tzssangglass

after removing the proxy-mirror plugin:
image

and the http latency distribution by openresty-xray:

image

I'll try to remove the prometheus plugin to see what happens.

@ryan4yin
Contributor Author

Even though I disabled all the plugins, the latency is still high...
image

@ryan4yin
Contributor Author

And I noticed (before I disabled the prometheus plugin) that there are many requests with 408 and 499 status codes, with QPS up to about 0.5:

image

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

Are there any error logs? It looks like you captured the exception stack.

OpenResty XRay captured some Lua error messages:

bad argument #1 to '?' (string expected, got nil)

But I cannot find any error messages in the container's stdout; it only contains many warning messages like this one:

2022/09/15 06:08:26 [warn] 45#45: *479340 a client request body is buffered to a temporary file /usr/local/apisix/client_body_temp/0000002504, client: 1.2.3.4, server: _, request: "POST /sa?project=default HTTP/1.1", host: "xxx.xxx.xxx"

And this is the Lua exception flame graph I captured before:

image

Does this flame graph show any problem?

@ryan4yin
Contributor Author

This phenomenon is so weird, really confusing...

the latest statistics generated by openresty-xray:

image

image

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

@tzssangglass Does the APISIX latency include the whole request lifetime? Is it possible that the client's network is too bad, which causes the high P99 latency?

my network architecture:

[Android Client(at Brazil)] 
=> [AWS Network LoadBalancer(at AWS region us-east-1)] 
=> [APISIX Gateway Container running in AWS EKS] 
=> [Upstream Server]

@tzssangglass
Member

Even though I disabled all the plugins, the latency is still high...

This means that the latency is not related to the prometheus plugin, but to APISIX and the upstream.
What is your upstream? You can have requests go from the client to the upstream directly, without going through the APISIX proxy, and look at the latency distribution.
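A rough sketch of that comparison test, timing requests to one upstream node directly vs. through APISIX; the APISIX address is a placeholder, while the node address and Host header are taken from the configs above:

import time
import urllib.request

def p99_seconds(url, host, n=200):
    # Issue n requests and return an approximate P99 of the elapsed time.
    samples = []
    for _ in range(n):
        req = urllib.request.Request(url, headers={"Host": host})
        start = time.monotonic()
        try:
            urllib.request.urlopen(req, timeout=20).read()
        except Exception:
            pass  # failures still count with their elapsed time
        samples.append(time.monotonic() - start)
    samples.sort()
    return samples[int(0.99 * (len(samples) - 1))]

print("direct to upstream:", p99_seconds("http://172.22.33.44:8106/", "xxx.xxx"))
print("through APISIX    :", p99_seconds("http://<apisix-address>/", "xxx.xxx"))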

@tzssangglass
Member

Does the APISIX latency include the whole request lifetime?

APISIX latency = whole request latency - upstream latency; it indicates the time APISIX itself took to process the request.
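A sketch of how the three latency types can be compared at P99, assuming the apisix_http_latency histogram exposed by the prometheus plugin (as used by the dashboard linked above) and a Prometheus server whose address is a placeholder here:

import json
import urllib.parse
import urllib.request

PROM = "http://<prometheus-address>:9090"  # placeholder, not from this issue

def p99(latency_type):
    # type="request" is the whole request, "upstream" the upstream part,
    # and "apisix" the difference described above.
    query = ('histogram_quantile(0.99, '
             'sum(rate(apisix_http_latency_bucket{type="' + latency_type + '"}[1m])) by (le))')
    url = PROM + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)["data"]["result"]

for t in ("request", "apisix", "upstream"):
    print(t, p99(t))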

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

This means that the latency is not related to the prometheus plugin, but to APISIX and the upstream.

Thanks for the reply, but the upstream latency recorded by the prometheus plugin & OpenResty XRay is very low; does this indicate that the upstream server works fine?

The upstream is several sensorsdata data-extractor servers running in the same AWS VPC as APISIX.

@tzssangglass
Member

Is it possible that the client's network is too bad, which causes the high P99 latency?

The large number of 408 and 499 errors makes me lean toward this.

for 408:

client_body_timeout (default 60s): Defines a timeout for reading client request body. The timeout is set only for a period between two successive read operations, not for the transmission of the whole request body. If a client does not transmit anything within this time, the request is terminated with the 408 (Request Time-out) error.

client_header_timeout (default 60s): Defines a timeout for reading client request header. If a client does not transmit the entire header within this time, the request is terminated with the 408 (Request Time-out) error.

for 499:

HTTP 499 in Nginx means that the client closed the connection before the server answered the request. In my experience it is usually caused by a client-side timeout. As far as I know, it's an Nginx-specific error code.

2022/09/15 06:08:26 [warn] 45#45: *479340 a client request body is buffered to a temporary file /usr/local/apisix/client_body_temp/0000002504, client: 1.2.3.4, server: _, request: "POST /sa?project=default HTTP/1.1", host: "xxx.xxx.xxx"

And based on these logs, I think the possible reasons are:

  1. The network between the client and APISIX is poor, so APISIX cannot read the client's headers or bodies properly;
  2. The client request body is very large and APISIX times out while reading it (based on the available information, this is closer to the truth).

Can you check whether the high latency is associated with large client request bodies?
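To confirm the slow-client hypothesis, here is a small sketch that opens a connection, declares a body, and then never finishes sending it; against default timeouts (client_body_timeout 60s) the server should answer with 408 or simply close the connection. The address is a placeholder, and the path and Host are taken from the log line above:

import socket

HOST, PORT = "<apisix-address>", 80  # placeholder, not from this issue

sock = socket.create_connection((HOST, PORT))
sock.sendall(
    b"POST /sa?project=default HTTP/1.1\r\n"
    b"Host: xxx.xxx\r\n"
    b"Content-Length: 1000\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"\r\n"
)
# Send nothing further: once client_body_timeout expires, the server should
# reply with 408 Request Time-out (or just close the connection).
sock.settimeout(90)
try:
    data = sock.recv(4096)
    print(data.decode(errors="replace").splitlines()[0] if data else "connection closed")
finally:
    sock.close()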

@ryan4yin
Contributor Author

ryan4yin commented Sep 15, 2022

And based on these logs, I think the possible reasons are:

1. The network between the client and APISIX is poor, so APISIX cannot read the client's headers or bodies properly;

2. The client request body is very large and APISIX times out while reading it (based on the available information, this is closer to the truth).

Can you check whether the high latency is associated with large client request bodies?

@tzssangglass Thanks very much! Now the problem is clear; it's not a problem with the APISIX gateway.

I'll talk with our Android team about this problem and close this issue later.
