
performance issue of p95 latency for a simple python service #8057

Closed · yuzisun opened this issue May 25, 2020 · 16 comments

Labels: area/networking, kind/question (Further information is requested)

Comments

yuzisun commented May 25, 2020

/area networking

Ask your question here:

Benchmarking a simple Python service that consistently processes a request in 1-2ms: sending 10 qps to 10 replicas of the service, the p95 latency is always 45-60ms, and the distributed tracing graph shows an unexplained gap of 40-50ms for about 5% of the requests. Wondering what the issue could be.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sklearn-iris1-predictor-default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/target: "2"
        autoscaling.knative.dev/targetBurstCapacity: "0"
        identities.bloomberg.com/bcs: dsun-bcs-gen
        identities.bloomberg.com/initContainerNames: storage-initializer
        internal.serving.kubeflow.org/storage-initializer-sourceuri: s3://platform-test/sklearn/iris
        prometheus.io/port: "9091"
        prometheus.io/scrape: "true"
        queue.sidecar.serving.knative.dev/resourcePercentage: "20"
        sidecar.istio.io/inject: "false"
      labels:
        component: predictor
        endpoint: default
        model: sklearn-iris1
        serving.kubeflow.org/inferenceservice: sklearn-iris1
    spec:
      containerConcurrency: 0
      containers:
      - args:
        - --model_name=sklearn-iris1
        - --model_dir=/mnt/models
        - --http_port=8080
        - --workers=2
        image: gcr.io/kfserving/sklearnserver:v0.3.0
        name: kfserving-container
        readinessProbe:
          successThreshold: 1
          tcpSocket:
            port: 0
        resources:
          limits:
            cpu: "2"
            memory: 1Gi
          requests:
            cpu: "2"
            memory: 1Gi
      timeoutSeconds: 60
Load test results:

Requests      [total, rate]            1800, 10.01
Duration      [total, attack, wait]    2m59.904985081s, 2m59.899883762s, 5.101319ms
Latencies     [mean, 50, 95, 99, max]  13.455032ms, 5.287963ms, 47.477641ms, 50.651898ms, 705.943588ms
Bytes In      [total, mean]            41400, 23.00
Bytes Out     [total, mean]            106200, 59.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:1800  
Error Set:

[Screenshots: distributed tracing views captured 2020-05-24]


yuzisun commented May 25, 2020

/cc @tcnghia @vagababov


yuzisun commented May 25, 2020

Found the corresponding trace ID logged in the queue-proxy, which shows 2.9ms, so this is likely a networking issue?

{"httpRequest": {"requestMethod": "POST", 
"requestUrl": "/v1/models/sklearn-iris1:predict", 
"requestSize": "59",
"status": 200, 
"responseSize": "23", 
"userAgent": "Go-http-client/2.0", 
"remoteIp": "100.80.59.207:50948", 
"serverIp": "100.80.51.95", "referer": "", 
"latency": "0.002965221s", "protocol": "HTTP/1.1"}, 
"traceId": "[cb1eda854e1dac1f89fa8faa2ee7fc3e]"}

markusthoemmes (Contributor) commented:

autoscaling.knative.dev/target: "2" combined with containerConcurrency: 0 seems odd. Is there a limit to how many requests your app can handle in parallel?
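For reference, a hard limit is expressed on the revision spec itself rather than via the autoscaling target annotation; a minimal sketch (the value 2 mirrors the --workers=2 setting discussed below):

spec:
  template:
    spec:
      # hard cap: the queue-proxy will not send more than 2 in-flight requests to the container
      containerConcurrency: 2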


yuzisun commented May 25, 2020

I set --workers=2, so there are 2 Python processes in the container which can process 2 requests at a time. I think I have concluded that the latency is not in the queue-proxy, and I now suspect some networking issue between the gateway and the queue-proxy, as the timestamps show that the server (in this case the queue-proxy) receives the request almost 50ms later.

5/24/2020, 10:43:00 PM	692μs	Client Start	100.80.48.110 (cluster-local-gateway)
5/24/2020, 10:43:00 PM	51.410ms	Server Start	100.80.57.124:80 (sklearn-iris2-predictor-default-rvlpk-deployment-7dfd9bdfdjp4m2)
5/24/2020, 10:43:00 PM	51.433ms	Received 59 bytes	100.80.57.124:80 (sklearn-iris2-predictor-default-rvlpk-deployment-7dfd9bdfdjp4m2)
5/24/2020, 10:43:00 PM	54.364ms	Sent 23 bytes	100.80.57.124:80 (sklearn-iris2-predictor-default-rvlpk-deployment-7dfd9bdfdjp4m2)
5/24/2020, 10:43:00 PM	54.665ms	Server Finish	100.80.57.124:80 (sklearn-iris2-predictor-default-rvlpk-deployment-7dfd9bdfdjp4m2)
5/24/2020, 10:43:00 PM	54.811ms	Client Finish	100.80.48.110 (cluster-local-gateway)


yuzisun commented May 25, 2020

I think this is the same issue as #7349.
@pradithya it seems that even with the CPU upper bound (500m) it still throttles sometimes, right?

markusthoemmes (Contributor) commented:

If you have only 2 workers though, it might make sense to set containerConcurrency (a hard limit) rather than the target (which only influences autoscaling recommendations). With such a low CC, you'll probably also want the activator to be in the path most of the time to be somewhat smart about load balancing.
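A sketch of that suggestion on the revision template; a targetBurstCapacity of "-1" keeps the activator in the request path at all times (the values here are illustrative, not taken from the issue):

spec:
  template:
    metadata:
      annotations:
        # -1: always route through the activator so it can do request-aware load balancing
        autoscaling.knative.dev/targetBurstCapacity: "-1"
    spec:
      # hard limit matching the two worker processes
      containerConcurrency: 2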


yuzisun commented May 25, 2020

If you have only 2 workers though, it might make sense to set containerConcurrency (a hard limit) rather than the target (which only influences autoscaling recommendations). With such a low CC, you'll probably also want the activator to be in the path most of the time to be somewhat smart about load balancing.

I tried that; it does not make much difference. Also, I am testing with very low qps, essentially sending one request at a time, so random load balancing should also work.

I think we have concluded that it is a queue-proxy resource limit issue: it is getting throttled sometimes.

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.803210015s, 59.799998772s, 3.211243ms
Latencies     [mean, 50, 95, 99, max]  12.490487ms, 3.088503ms, 76.321081ms, 81.580211ms, 409.107397ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.840230434s, 59.800002674s, 40.22776ms
Latencies     [mean, 50, 95, 99, max]  6.591095ms, 3.027777ms, 39.89522ms, 40.068618ms, 40.781162ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300 

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.803550754s, 59.799928867s, 3.621887ms
Latencies     [mean, 50, 95, 99, max]  3.211671ms, 3.199555ms, 3.685837ms, 4.382211ms, 10.002046ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  

The queue-proxy limits are as follows, which is 40% of the user container limits (1 CPU / 2Gi):

    resources:
      limits:
        cpu: 400m
        memory: 429496729600m
      requests:
        cpu: 100m
        memory: 200Mi
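These limits are governed by the queue.sidecar.serving.knative.dev/resourcePercentage annotation on the revision template (set to "20" in the Service spec quoted at the top of the issue); a sketch of raising it, with the percentage value being illustrative:

spec:
  template:
    metadata:
      annotations:
        # sizes the queue-proxy's requests/limits relative to the user container's resources
        queue.sidecar.serving.knative.dev/resourcePercentage: "40"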

Also, I created my own Kubernetes Service which sends traffic directly to the user container, bypassing the queue-proxy:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: sklearn-iris-predictor-default-5cj8t
    serving.kubeflow.org/inferenceservice: sklearn-iris
  name: sklearn-iris-predictor-default-raw
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: sklearn-iris-predictor-default-5cj8t
  sessionAffinity: None
  type: ClusterIP

After that I get pretty stable performance:

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.802380963s, 59.800035342s, 2.345621ms
Latencies     [mean, 50, 95, 99, max]  2.349571ms, 2.307148ms, 2.667784ms, 3.267413ms, 10.876731ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300 

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.802390135s, 59.799971604s, 2.418531ms
Latencies     [mean, 50, 95, 99, max]  2.422852ms, 2.450808ms, 2.681113ms, 3.315ms, 7.843621ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.802135338s, 59.800008978s, 2.12636ms
Latencies     [mean, 50, 95, 99, max]  2.378033ms, 2.359145ms, 2.65627ms, 3.292868ms, 11.837204ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  


yuzisun commented May 25, 2020

I have fixed the issue by setting resource request == limit for the queue-proxy, with 1 CPU and 500Mi. Previously the pod was classified as Burstable; now it is Guaranteed:

  phase: Running
  podIP: 10.210.176.74
  qosClass: Guaranteed
  startTime: "2020-05-25T18:13:30Z"
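For reference, a queue-proxy resources stanza that yields this Guaranteed class would look roughly like the following (values taken from the comment above; every container in the pod must have requests equal to limits to qualify):

resources:
  requests:
    cpu: "1"
    memory: 500Mi
  limits:
    cpu: "1"
    memory: 500Mi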

After that, the performance is comparable to hitting the user container directly:

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.808066664s, 59.799953103s, 8.113561ms
Latencies     [mean, 50, 95, 99, max]  8.238343ms, 8.140814ms, 8.792159ms, 9.229659ms, 19.632035ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.806826689s, 59.800011691s, 6.814998ms
Latencies     [mean, 50, 95, 99, max]  6.723181ms, 6.651973ms, 7.34629ms, 7.713206ms, 26.301761ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.810027733s, 59.801079742s, 8.947991ms
Latencies     [mean, 50, 95, 99, max]  225.706375ms, 8.218174ms, 2.115340474s, 4.515755043s, 5.016317553s
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.804045855s, 59.799961298s, 4.084557ms
Latencies     [mean, 50, 95, 99, max]  4.427072ms, 4.306852ms, 4.902942ms, 5.399814ms, 33.870577ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300 


yuzisun commented May 25, 2020

/cc @julz I heard you are working on some changes to allow setting resource limits on queue-proxy; can we also make this configurable?
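For context, later Knative Serving releases expose these as keys in the config-deployment ConfigMap; a sketch along those lines, noting that the key names are taken from newer releases and may not exist in the version discussed in this issue:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  # key names assumed from newer Knative releases; verify against your version
  queue-sidecar-cpu-request: "1"
  queue-sidecar-cpu-limit: "1"
  queue-sidecar-memory-request: "500Mi"
  queue-sidecar-memory-limit: "500Mi"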


yuzisun commented May 25, 2020

According to https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/, all BestEffort and Burstable containers, as well as Guaranteed containers with non-integer CPU requests, run in a shared CPU pool.
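To illustrate the distinction made in that post: under the kubelet's static CPU manager policy, only containers in a Guaranteed pod with integer CPU requests get exclusive cores; everything else shares the pool. A minimal, hypothetical container spec that would qualify:

containers:
- name: example            # hypothetical container, for illustration only
  image: example-image
  resources:
    requests:
      cpu: "2"             # integer CPU count
      memory: 1Gi
    limits:
      cpu: "2"             # equal to requests => Guaranteed QoS
      memory: 1Gi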

pradithya commented:

@yuzisun Yes, it is throttled from time to time, which causes tail latency to suffer. I had to increase the queue-proxy CPU request from 100m to 200m to minimize this. However, I noticed that some models with larger payloads still get throttled, so I believe optimizing the queue-proxy is also important.

JRBANCEL (Contributor) commented:

pradithya commented:

Yes, I suspect it's due to that issue.

mattmoor (Member) commented:

IIUC this was resolved? cc @vagababov

vagababov (Contributor) commented:

I think so.
/close

knative-prow-robot (Contributor) commented:

@vagababov: Closing this issue.

In response to this:

I think so.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
