Performance issue of p95 latency for a simple Python service #8057
/cc @tcnghia @vagababov
Found the corresponding trace ID logged in queue-proxy, which shows 2.9ms, so is this likely a networking issue?
I set
I think this is the same issue as #7349.
If you have only 2 workers, though, it might make sense to set
I tried that; it does not make much difference. Also, I am testing with very low QPS, sending one request at a time, so random load balancing should also work. I think we have concluded that it is a queue-proxy resource limit issue; it is getting throttled sometimes.
The queue-proxy limits are the following, which is 40% of the user container's limits of 1 CPU/2Gi.
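(The resource block from the original comment is not preserved above; 40% of 1 CPU / 2Gi works out to roughly 400m CPU and ~820Mi memory for the sidecar.) As an illustration only, this kind of ratio is typically set per revision, assuming a Knative Serving version that supports the `queue.sidecar.serving.knative.dev/resourcePercentage` annotation; the service name and image below are placeholders:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: simple-python-service              # hypothetical name
spec:
  template:
    metadata:
      annotations:
        # Size queue-proxy at 40% of the user container's resources.
        # Annotation name/availability depends on the Knative Serving version.
        queue.sidecar.serving.knative.dev/resourcePercentage: "40"
    spec:
      containers:
        - image: example.com/simple-python-service:latest   # hypothetical image
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
```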
Also, I created my own service which sends directly to the user container, bypassing queue-proxy; after that I get pretty stable performance.
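A sketch of what "hitting the user container directly" could look like, assuming a plain Kubernetes Service that selects the revision's pods by the `serving.knative.dev/revision` label and targets the user container's port (8080 is the Knative default; both the revision name and the port here are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: bypass-queue-proxy                                    # hypothetical name
spec:
  selector:
    # Pods of a revision carry this label; the value here is an assumed revision name.
    serving.knative.dev/revision: simple-python-service-00001
  ports:
    - port: 80
      # Send traffic to the user container's port (Knative default 8080)
      # instead of queue-proxy's port 8012.
      targetPort: 8080
```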
I have fixed the issue by setting resource request == limit for
After that, the performance is comparable to directly hitting the user container.
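For context, Kubernetes classifies a pod as Guaranteed QoS only when every container sets requests equal to limits for both CPU and memory. A minimal sketch of that shape (illustrative values; how this gets applied to the queue-proxy sidecar depends on the Knative version):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-qos-example            # hypothetical
spec:
  containers:
    - name: sidecar
      image: example.com/sidecar:latest   # hypothetical
      resources:
        requests:
          cpu: 400m
          memory: 800Mi
        limits:
          cpu: 400m        # equal to the request
          memory: 800Mi    # equal to the request
```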
/cc @julz I heard you are working on some changes to allow setting resource limits on
According to https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/, all Burstable, BestEffort, and Guaranteed-with-non-integer-CPU containers run in a shared CPU pool.
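In other words, a container only gets dedicated cores when the node's kubelet runs the static CPU manager policy (`--cpu-manager-policy=static`) and the container sits in a Guaranteed pod requesting a whole number of CPUs; anything fractional, like the ~400m queue-proxy sidecar above, stays in the shared pool and remains subject to CFS throttling. A sketch of a pod that would qualify for exclusive cores under those assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: exclusive-cpu-example         # hypothetical
spec:
  containers:
    - name: app
      image: example.com/app:latest   # hypothetical
      resources:
        # Integer CPU count with requests == limits makes the pod Guaranteed,
        # which the static CPU manager policy can pin to dedicated cores.
        requests:
          cpu: "2"
          memory: 1Gi
        limits:
          cpu: "2"
          memory: 1Gi
```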
@yuzisun Yes, it is throttled from time to time, causing tail latency to suffer. I had to increase the queue-proxy CPU request from 100m to 200m to minimize this. However, I noticed that some models with larger payloads still get throttled, so I believe optimizing queue-proxy is also important.
Is this related to this kernel issue: https://kccncna19.sched.com/event/Uae1/throttling-new-developments-in-application-performance-with-cpu-limits-dave-chiluk-indeed?
Yes, I suspect it's due to that issue.
IIUC this was resolved? cc @vagababov
I think so.
@vagababov: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/area networking
Ask your question here:
Benchmarking a simple Python service that consistently processes requests in 1-2ms, sending 10 QPS to 10 replicas of the service, the p95 latency is always 45-60ms and there is a black hole of 40-50ms for 5% of the requests showing on the distributed tracing graph. Wondering what could be the issue.