
performance issue of p95 latency for a simple python service #8057

Closed · yuzisun opened this issue May 25, 2020 · 16 comments

Labels: area/networking, kind/question (Further information is requested)

Comments

yuzisun commented May 25, 2020

/area networking

Ask your question here:

Benchmarking a simple Python service that consistently processes a request in 1-2ms: sending 10 qps to 10 replicas of the service, the p95 latency is always 45-60ms, and the distributed tracing graph shows an unexplained gap of 40-50ms for about 5% of the requests. Wondering what the issue could be.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sklearn-iris1-predictor-default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/target: "2"
        autoscaling.knative.dev/targetBurstCapacity: "0"
        identities.bloomberg.com/bcs: dsun-bcs-gen
        identities.bloomberg.com/initContainerNames: storage-initializer
        internal.serving.kubeflow.org/storage-initializer-sourceuri: s3://platform-test/sklearn/iris
        prometheus.io/port: "9091"
        prometheus.io/scrape: "true"
        queue.sidecar.serving.knative.dev/resourcePercentage: "20"
        sidecar.istio.io/inject: "false"
      labels:
        component: predictor
        endpoint: default
        model: sklearn-iris1
        serving.kubeflow.org/inferenceservice: sklearn-iris1
    spec:
      containerConcurrency: 0
      containers:
      - args:
        - --model_name=sklearn-iris1
        - --model_dir=/mnt/models
        - --http_port=8080
        - --workers=2
        image: gcr.io/kfserving/sklearnserver:v0.3.0
        name: kfserving-container
        readinessProbe:
          successThreshold: 1
          tcpSocket:
            port: 0
        resources:
          limits:
            cpu: "2"
            memory: 1Gi
          requests:
            cpu: "2"
            memory: 1Gi
      timeoutSeconds: 60
Load test results:

Requests      [total, rate]            1800, 10.01
Duration      [total, attack, wait]    2m59.904985081s, 2m59.899883762s, 5.101319ms
Latencies     [mean, 50, 95, 99, max]  13.455032ms, 5.287963ms, 47.477641ms, 50.651898ms, 705.943588ms
Bytes In      [total, mean]            41400, 23.00
Bytes Out     [total, mean]            106200, 59.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:1800  
Error Set:

[Screenshots: distributed tracing views captured 2020-05-24]


yuzisun commented May 25, 2020

/cc @tcnghia @vagababov


yuzisun commented May 25, 2020

Found the corresponding trace ID logged in the queue-proxy, which shows 2.9ms, so this is likely a networking issue?

{"httpRequest": {"requestMethod": "POST", 
"requestUrl": "/v1/models/sklearn-iris1:predict", 
"requestSize": "59",
"status": 200, 
"responseSize": "23", 
"userAgent": "Go-http-client/2.0", 
"remoteIp": "100.80.59.207:50948", 
"serverIp": "100.80.51.95", "referer": "", 
"latency": "0.002965221s", "protocol": "HTTP/1.1"}, 
"traceId": "[cb1eda854e1dac1f89fa8faa2ee7fc3e]"}

markusthoemmes (Contributor) commented:

autoscaling.knative.dev/target: "2" combined with containerConcurrency: 0 seems odd. Is there a limit to how many requests your app can handle in parallel?
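For reference, a hard limit is expressed on the revision spec itself rather than via the autoscaling target annotation; a minimal sketch (the value 2 mirrors the --workers=2 setting discussed below):

spec:
  template:
    spec:
      # hard cap: the queue-proxy will not send more than 2 in-flight requests to the container
      containerConcurrency: 2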


yuzisun commented May 25, 2020

I set --workers=2, so there are 2 Python processes in the container which can process 2 requests at a time. I think I have concluded that the latency is not in the queue-proxy, and I now suspect some networking issue between the gateway and the queue-proxy, as the timestamps show that the server (in this case the queue-proxy) receives the request almost 50ms later.

5/24/2020, 10:43:00 PM	692μs	Client Start	100.80.48.110 (cluster-local-gateway)
5/24/2020, 10:43:00 PM	51.410ms	Server Start	100.80.57.124:80 (sklearn-iris2-predictor-default-rvlpk-deployment-7dfd9bdfdjp4m2)
5/24/2020, 10:43:00 PM	51.433ms	Received 59 bytes	100.80.57.124:80 (sklearn-iris2-predictor-default-rvlpk-deployment-7dfd9bdfdjp4m2)
5/24/2020, 10:43:00 PM	54.364ms	Sent 23 bytes	100.80.57.124:80 (sklearn-iris2-predictor-default-rvlpk-deployment-7dfd9bdfdjp4m2)
5/24/2020, 10:43:00 PM	54.665ms	Server Finish	100.80.57.124:80 (sklearn-iris2-predictor-default-rvlpk-deployment-7dfd9bdfdjp4m2)
5/24/2020, 10:43:00 PM	54.811ms	Client Finish	100.80.48.110 (cluster-local-gateway)


yuzisun commented May 25, 2020

I think this is the same issue as #7349.
@pradithya it seems that even with the CPU upper bound (500m) it still throttles sometimes, right?

markusthoemmes (Contributor) commented:

If you have only 2 workers though, it might make sense to set containerConcurrency (a hard limit) rather than the target (which only influences autoscaling recommendations). With such a low CC, you'll probably also want the activator to be in the path most of the time to be somewhat smart about load balancing.
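A sketch of that suggestion on the revision template; a targetBurstCapacity of "-1" keeps the activator in the request path at all times (the values here are illustrative, not taken from the issue):

spec:
  template:
    metadata:
      annotations:
        # -1: always route through the activator so it can do request-aware load balancing
        autoscaling.knative.dev/targetBurstCapacity: "-1"
    spec:
      # hard limit matching the two worker processes
      containerConcurrency: 2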


yuzisun commented May 25, 2020

If you have only 2 workers though, it might make sense to set containerConcurrency (a hard limit) rather than the target (which only influences autoscaling recommendations). With such a low CC, you'll probably also want the activator to be in the path most of the time to be somewhat smart about load balancing.

I tried that; it does not make much difference. Also, I am testing with very low qps, essentially sending one request at a time, so random load balancing should also work.

I think we have concluded that it is a queue-proxy resource limit issue: it is getting throttled sometimes.

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.803210015s, 59.799998772s, 3.211243ms
Latencies     [mean, 50, 95, 99, max]  12.490487ms, 3.088503ms, 76.321081ms, 81.580211ms, 409.107397ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.840230434s, 59.800002674s, 40.22776ms
Latencies     [mean, 50, 95, 99, max]  6.591095ms, 3.027777ms, 39.89522ms, 40.068618ms, 40.781162ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300 

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.803550754s, 59.799928867s, 3.621887ms
Latencies     [mean, 50, 95, 99, max]  3.211671ms, 3.199555ms, 3.685837ms, 4.382211ms, 10.002046ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  

The queue-proxy limits are as follows, which is 40% of the user container limits (1 CPU / 2Gi):

    resources:
      limits:
        cpu: 400m
        memory: 429496729600m
      requests:
        cpu: 100m
        memory: 200Mi
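These limits are governed by the queue.sidecar.serving.knative.dev/resourcePercentage annotation on the revision template (set to "20" in the Service spec quoted at the top of the issue); a sketch of raising it, with the percentage value being illustrative:

spec:
  template:
    metadata:
      annotations:
        # sizes the queue-proxy's requests/limits relative to the user container's resources
        queue.sidecar.serving.knative.dev/resourcePercentage: "40"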

Also, I created my own Kubernetes Service which sends traffic directly to the user container, bypassing the queue-proxy:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: sklearn-iris-predictor-default-5cj8t
    serving.kubeflow.org/inferenceservice: sklearn-iris
  name: sklearn-iris-predictor-default-raw
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: sklearn-iris-predictor-default-5cj8t
  sessionAffinity: None
  type: ClusterIP

After that I get pretty stable performance:

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.802380963s, 59.800035342s, 2.345621ms
Latencies     [mean, 50, 95, 99, max]  2.349571ms, 2.307148ms, 2.667784ms, 3.267413ms, 10.876731ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300 

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.802390135s, 59.799971604s, 2.418531ms
Latencies     [mean, 50, 95, 99, max]  2.422852ms, 2.450808ms, 2.681113ms, 3.315ms, 7.843621ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.802135338s, 59.800008978s, 2.12636ms
Latencies     [mean, 50, 95, 99, max]  2.378033ms, 2.359145ms, 2.65627ms, 3.292868ms, 11.837204ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  


yuzisun commented May 25, 2020

I have fixed the issue by setting resource request == limit for the queue-proxy, with 1 CPU and 500Mi. Previously the pod was classified as Burstable; now it is Guaranteed:

  phase: Running
  podIP: 10.210.176.74
  qosClass: Guaranteed
  startTime: "2020-05-25T18:13:30Z"
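For reference, a queue-proxy resources stanza that yields this Guaranteed class would look roughly like the following (values taken from the comment above; every container in the pod must have requests equal to limits to qualify):

resources:
  requests:
    cpu: "1"
    memory: 500Mi
  limits:
    cpu: "1"
    memory: 500Mi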

After that, the performance is comparable to hitting the user container directly:

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.808066664s, 59.799953103s, 8.113561ms
Latencies     [mean, 50, 95, 99, max]  8.238343ms, 8.140814ms, 8.792159ms, 9.229659ms, 19.632035ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.806826689s, 59.800011691s, 6.814998ms
Latencies     [mean, 50, 95, 99, max]  6.723181ms, 6.651973ms, 7.34629ms, 7.713206ms, 26.301761ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.810027733s, 59.801079742s, 8.947991ms
Latencies     [mean, 50, 95, 99, max]  225.706375ms, 8.218174ms, 2.115340474s, 4.515755043s, 5.016317553s
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.804045855s, 59.799961298s, 4.084557ms
Latencies     [mean, 50, 95, 99, max]  4.427072ms, 4.306852ms, 4.902942ms, 5.399814ms, 33.870577ms
Bytes In      [total, mean]            6900, 23.00
Bytes Out     [total, mean]            24600, 82.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300 


yuzisun commented May 25, 2020

/cc @julz I heard you are working on some changes to allow setting resource limits on queue-proxy; can we also make this configurable?
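For context, later Knative Serving releases expose these as keys in the config-deployment ConfigMap; a sketch along those lines, noting that the key names are taken from newer releases and may not exist in the version discussed in this issue:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  # key names assumed from newer Knative releases; verify against your version
  queue-sidecar-cpu-request: "1"
  queue-sidecar-cpu-limit: "1"
  queue-sidecar-memory-request: "500Mi"
  queue-sidecar-memory-limit: "500Mi"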


yuzisun commented May 25, 2020

According to https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/, all BestEffort and Burstable containers, as well as Guaranteed containers with non-integer CPU requests, run in a shared CPU pool.
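To illustrate the distinction made in that post: under the kubelet's static CPU manager policy, only containers in a Guaranteed pod with integer CPU requests get exclusive cores; everything else shares the pool. A minimal, hypothetical container spec that would qualify:

containers:
- name: example            # hypothetical container, for illustration only
  image: example-image
  resources:
    requests:
      cpu: "2"             # integer CPU count
      memory: 1Gi
    limits:
      cpu: "2"             # equal to requests => Guaranteed QoS
      memory: 1Gi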

pradithya commented:

@yuzisun Yes, it is throttled from time to time, which causes tail latency to suffer. I had to increase the queue-proxy CPU request from 100m to 200m to minimize this. However, I noticed that some models with larger payloads still get throttled, so I believe optimizing the queue-proxy is also important.

JRBANCEL (Contributor) commented:

pradithya commented:

Yes, I suspect it's due to that issue.

mattmoor (Member) commented:

IIUC this was resolved? cc @vagababov

vagababov (Contributor) commented:

I think so.
/close

knative-prow-robot (Contributor) commented:

@vagababov: Closing this issue.

In response to this:

I think so.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
