bug: P99 latency is too high #7919
Comments
@tzssangglass If I remember correctly, you submitted a PR to the api7-prometheus lib to try to optimize the performance?
the flame chart was sampled 16 mins ago; all the CPU flame charts look the same, but the Lua execution flame chart shows some new info:
Can you show the Grafana monitoring after removing the prometheus plugin? Are there any error logs? It looks like you captured the exception stack. As a comparison test, you can remove the prometheus plugin and test again to see whether the average 5s latency and the high P99 latency still appear.
the openresty-xray captured some Lua error messages:
but I cannot find any error message in the container's stdout; it only contains many warn messages like this one:
and this is the Lua exception flame chart I captured before:
does this flame chart show any problem?
@tzssangglass does the APISIX latency cover the whole request lifetime? Is it possible that the client's network is too bad, which causes the high P99 latency? my network architecture:
This means that the latency is not related to the prometheus plugin, but to APISIX and upstream.
APISIX latency = whole request latency - upstream latency; it indicates the time APISIX took to process the request.
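For reference, this split can be checked directly in Prometheus/Grafana. A minimal sketch, assuming the default histogram exported by APISIX's prometheus plugin (`apisix_http_latency_bucket` with a `type` label of request/upstream/apisix):

```
# P99 of the time APISIX itself spends on a request (type="apisix"),
# compared with P99 of the upstream response time (type="upstream")
histogram_quantile(0.99, sum(rate(apisix_http_latency_bucket{type="apisix"}[1m])) by (le))
histogram_quantile(0.99, sum(rate(apisix_http_latency_bucket{type="upstream"}[1m])) by (le))
```

If the first query is high while the second stays low, the extra time is being spent inside APISIX (or waiting on the client) rather than in the upstream.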
thanks for the reply, but the upstream latency recorded by the prometheus plugin & openresty-xray is very low; does this indicate that the upstream server works fine? the upstream server is several of SensorsData's data-extractor servers running in the same AWS VPC as APISIX.
The large number of 408/499 errors makes me feel like this.
for 408:
client_body_timeout (default 60s): defines a timeout for reading the client request body. The timeout is set only for a period between two successive read operations, not for the transmission of the whole request body. If a client does not transmit anything within this time, the request is terminated with the 408 (Request Time-out) error.
client_header_timeout (default 60s): defines a timeout for reading the client request header. If a client does not transmit the entire header within this time, the request is terminated with the 408 (Request Time-out) error.
for 499:
HTTP 499 in Nginx means that the client closed the connection before the server answered the request. In my experience it is usually caused by a client-side timeout. As far as I know it's an Nginx-specific error code.
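For reference, these are plain Nginx directives with the stock defaults quoted above; a minimal sketch of where they sit in an nginx.conf (APISIX generates its own nginx.conf, so treat this as illustration rather than APISIX configuration syntax):

```
http {
    # a 408 is returned when the client fails to send data within these windows
    client_header_timeout 60s;  # time allowed to read the full request header
    client_body_timeout   60s;  # max gap between two successive body reads
}
```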
And based on these logs, I think the possible reasons are:
Can you trace the high latency associated with the large client request body?
@tzssangglass thanks very much! now the problem is clear; it's not a problem of the APISIX gateway. I'll talk with our Android team about this problem, and close this issue later.
Current Behavior
APISIX's P99 latency is too high, while the upstream latency is very low.
related issues:
Expected Behavior
APISIX's P99 latency should not be so high.
Error Logs
the CPU flame graph I captured using openresty-xray shows the exporter/prometheus plugin consumed too much time in the request lifetime:
Steps to Reproduce
first, create an upstream using the following config:
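The original config block was lost in this export; as a stand-in, here is a minimal sketch of creating an upstream via the Admin API (the node address and admin key are placeholders, not the reporter's values):

```
curl -X PUT http://127.0.0.1:9080/apisix/admin/upstreams/1 \
  -H 'X-API-KEY: <admin-key>' \
  -d '{
    "type": "roundrobin",
    "nodes": { "10.0.0.10:8080": 1 }
  }'
```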
then create a route with the following config:
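Again, the original config was not captured; a hypothetical route bound to the upstream above, with the prometheus plugin enabled as described in the report:

```
curl -X PUT http://127.0.0.1:9080/apisix/admin/routes/1 \
  -H 'X-API-KEY: <admin-key>' \
  -d '{
    "uri": "/*",
    "upstream_id": "1",
    "plugins": { "prometheus": {} }
  }'
```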
and then drive about 30 QPS to the APISIX instance; the high P99 latency problem will occur (one way to generate the load is sketched below).
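The load-generation tool is my choice, not from the report; the wrk2 fork of wrk can hold a fixed request rate via its -R flag:

```
# 2 threads, 10 connections, 5 minutes at a constant 30 req/s,
# printing the latency distribution including P99
wrk -t2 -c10 -d300s -R30 --latency http://<apisix-host>:9080/
```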
Environment
running in Kubernetes, using helm chart apisix/apisix with version 0.11.0
APISIX version (run apisix version): docker.io/apache/apisix:2.15.0-alpine
Operating system (run uname -a): Linux apisix-edge-765f88c49f-8tslz 5.4.190-107.353.amzn2.x86_64 #1 SMP Wed Apr 27 21:16:35 UTC 2022 x86_64 Linux
etcd version: docker.io/bitnami/etcd:3.5.4-debian-11-r22