Knative probes are failing after deploying sample sk-learn model, not able to call the URL #1153

Closed
kd303 opened this issue Oct 21, 2020 · 20 comments

Comments

@kd303
Contributor

kd303 commented Oct 21, 2020

/kind bug

What steps did you take and what happened:
Installed https://github.com/kubeflow/kfserving/tree/master/docs/samples/sklearn on an on-premise Kubernetes cluster.
Created a sample service; all containers are up and running (istio-init, sklearnserving, storage-initializer, etc.).
I installed the default KFServing that ships with the Kubeflow cluster.

What did you expect to happen:
I expect the inference service to be in "ready" state

Anything else you would like to add:
Followed the debugging guide
Environment:
Installed using https://github.com/kubeflow/kfctl/releases/tag/v1.1.0

  • Istio Version:
  • Knative Version:
  • KFServing Version:
  • Kubeflow version:
  • Kfdef:[k8s_istio/istio_dex/gcp_basic_auth/gcp_iap/aws/aws_cognito/ibm]
  • Minikube version:
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.6", GitCommit:"7015f71e75f670eb9e7ebd4b5749639d42e20079", GitTreeState:"clean", BuildDate:"2019-11-13T11:11:50Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release):

Logs:

NAME              URL                                                                        READY   DEFAULT TRAFFIC   CANARY TRAFFIC   AGE
sklearn-iris-v2   http://sklearn-iris-v2.mmskubeflow.example.com/v1/models/sklearn-iris-v2   True    100                                81m

Logs from Autoscaler:

{"level":"error","ts":"2020-10-21T13:41:21.808Z","logger":"autoscaler.collector","caller":"autoscaler/collector.go:276","msg":"Failed to scrape metrics","commit":"c9be0ab","knative.dev/key":"mmskubeflow/sklearn-iris-v2-predictor-default-clrb4","error":"unsuccessful scrape, sampleSize=1: Get http://sklearn-iris-v2-predictor-default-clrb4-private.mmskubeflow:9090/metrics: dial tcp 10.99.145.251:9090: connect: connection refused","stacktrace":"knative.dev/serving/pkg/autoscaler.newCollection.func1\n\t/home/prow/go/src/knative.dev/serving/pkg/autoscaler/collector.go:276"}

Logs from Activator:

{"level":"error","ts":"2020-10-21T14:03:11.081Z","logger":"activator","caller":"net/revision_backends.go:285","msg":"Failed to probe clusterIP 10.99.145.251:80","commit":"c9be0ab","knative.dev/controller":"activator","knative.dev/pod":"activator-648766d5d-txt7n","error":"error roundtripping http://10.99.145.251:80: dial tcp 10.99.145.251:80: connect: connection refused","stacktrace":"knative.dev/serving/pkg/activator/net.(*revisionWatcher).checkDests\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:285\nknative.dev/serving/pkg/activator/net.(*revisionWatcher).run\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:326"}
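The Knative component logs above are one JSON object per line, so the useful fields can be pulled out mechanically. A minimal sketch, assuming only the field names visible in this thread (`ts`, `logger`, `msg`, `error`); the `summarize` helper is hypothetical and the sample line is an abridged copy of the activator entry above:

```python
import json

# Abridged copy of the activator log line above (stacktrace and a few keys
# dropped for brevity; the remaining values are unchanged).
SAMPLE = (
    '{"level":"error","ts":"2020-10-21T14:03:11.081Z","logger":"activator",'
    '"caller":"net/revision_backends.go:285",'
    '"msg":"Failed to probe clusterIP 10.99.145.251:80",'
    '"error":"error roundtripping http://10.99.145.251:80: '
    'dial tcp 10.99.145.251:80: connect: connection refused"}'
)


def summarize(line):
    """Return (ts, logger, msg, error) for a JSON log line, or None.

    None is returned for lines that are not a complete JSON object,
    for example a log line that was cut off when it was pasted.
    """
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(entry, dict):
        return None
    return (entry.get("ts"), entry.get("logger"), entry.get("msg"), entry.get("error"))


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Feeding it the output of `kubectl logs` line by line makes it easier to see whether a probe failed with `connection refused` or with some other error.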

Additional Information

  1. My cluster runs on Rancher; KFServing was deployed with the default Kubeflow 1.1 installation.
  2. All revisions, KVC, and VS are shown as ready.
  3. The Istio ingress gateway is configured as NodePort and has no external IP; I can open the Kubeflow dashboard on that NodePort.
  4. Sometimes the InferenceService is shown as Ready=False with the error IngressNotConfigured; restarting the network-istio and cluster-local-gateway Pods gets it working again. Sometimes the InferenceService also requires a restart.
  5. The IPs resolved in the logs above are exactly the same as the Pod IPs.

Kindly help; I am at my wits' end as to why this is happening.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/inference 0.68

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/engprod 0.56

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

@kd303
Contributor Author

kd303 commented Oct 23, 2020

My Kubeflow installation is 1.1, installed on premise with no Dex configured, yet I am still getting a 403 error. Can you suggest whether installing a separate gateway is the only option here?

@yuzisun - apologies for tagging but is there any other solution to this?

{"level":"error","ts":"2020-10-23T10:54:55.113Z","logger":"activator","caller":"net/revision_backends.go:251","msg":"Failed probing","commit":"c9be0ab","knative.dev/controller":"activator","knative.dev/pod":"activator-7b6b444684-zjq9v","error":"unexpected body: want \"queue\", got \"RBAC: access denied\"","stacktrace":"knative.dev/serving/pkg/activator/net.(*revisionWatcher).checkDests\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:251\nknative.dev/serving/pkg/activator/net.(*revisionWatcher).run\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:326"}
{"level":"error","ts":"2020-10-23T10:54:55.113Z","logger":"activator","caller":"net/revision_backends.go:285","msg":"Failed to probe clusterIP 10.104.163.196:80","commit":"c9be0ab","knative.dev/controller":"activator","knative.dev/pod":"activator-7b6b444684-zjq9v","error":"unexpected body: want \"queue\", got \"RBAC: access denied\"","stacktrace":"knative.dev/serving/pkg/activator/net.(*revisionWatcher).checkDests\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:285\nknative.dev/serving/pkg/activator/net.(*revisionWatcher).run\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:326"}
{"level":"error","ts":"2020-10-23T10:54:55.125Z","logger":"activator","caller":"net/revision_backends.go:251","msg":"Failed probing","commit":"c9be0ab",

@yuzisun
Member

yuzisun commented Oct 23, 2020

@kd303 Do you have the Istio sidecar injected in the inference service pod? I think you have Istio security turned on in the cluster, so Istio is blocking the request from hitting the inference service main container.

@kd303
Contributor Author

kd303 commented Oct 26, 2020

@yuzisun I have the sidecar injected, please see the output. I don't think I have any security turned on; it's a plain cluster where I don't have to log in to the Kubeflow dashboard.

Please note I have Envoy debugging turned on, with rbac and http2 logs enabled. On the activator my probes are failing with an RBAC error, and I am not sure what is wrong.

{"level":"error","ts":"2020-10-26T07:28:30.850Z","logger":"istiocontroller.ingress-controller.status-manager","caller":"ingress/status.go:365","msg":"Probing of http://sklearn-iris-predictor-default.mmskubeflow:80/ failed, IP: 10.244.2.189:80, ready: false, error: error roundtripping http://sklearn-iris-predictor-default.mmskubeflow:80/: dial tcp 10.244.2.189:80: connect: cannot assign requested address (depth: 0)","commit":"c9be0ab","knative.dev/controller":"ingress-controller","stacktrace":"knative.dev/serving/pkg/reconciler/ingress.(*StatusProber).processWorkItem\n\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:365\nknative.dev/serving/pkg/reconciler/ingress.(*StatusProber).Start.func1\n\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:268"}
{"level":"info","ts":"2020-10-26T07:30:45.693Z","logger":"istiocontroller.ingress-controller.status-manager","caller":"ingress/status.go:331","msg":"Processing probe for http://sklearn-iris-predictor-default.mmskubeflow.example.com:80/, IP: 10.244.2.189:80 (depth: 0)","commit":"c9be0ab","knative.dev/controller":"ingress-controller"}
{"level":"error","ts":"2020-10-26T07:30:45.880Z","logger":"istiocontroller.ingress-controller.status-manager","caller":"ingress/status.go:365","msg":"Probing of http://sklearn-iris-predictor-default.mmskubeflow.example.com:80/ failed, IP: 10.244.2.189:80, ready: false, error: unexpected status code: want 200, got 404 (depth: 0)","commit":"c9be0ab","knative.dev/controller":"ingress-controller","stacktrace":"knative.dev/serving/pkg/reconciler/ingress.(*StatusProber).processWorkItem\n\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:365\nknative.dev/serving/pkg/reconciler/ingress.(*StatusProber).Start.func1\n\t/home/prow/go/src/knative.dev/serving/pkg/reconciler/ingress/status.go:268"}

http://10.2

kubectl -n mmskubeflow logs sklearn-iris-predictor-default-xtnds-deployment-7f948994d4p9zbz -c istio-proxy
[2020-10-26 11:44:11.387][20][warning][filter] [src/envoy/http/authn/http_filter_factory.cc:102] mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS, and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode for more secure configuration that only allows TLS connection with client cert. See https://istio.io/docs/tasks/security/mtls-migration/
[2020-10-26 11:44:11.390][20][warning][config] [external/envoy/source/common/config/grpc_mux_subscription_impl.cc:81] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener(s) 10.244.1.211_9090: Found header(name: "kubeflow-userid"
exact_match: "anonymous@kubeflow.org"
) rule,not supported by RBAC network filter, virtualInbound: Found header(name: "kubeflow-userid"
exact_match: "anonymous@kubeflow.org"
) rule,not supported by RBAC network filter
[2020-10-26 11:49:11.422][20][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:742] [C23] stream closed: 0
[2020-10-26 11:49:11.422][20][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2020-10-26 11:49:11.731][20][warning][filter] [src/envoy/http/authn/http_filter_factory.cc:102] mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS, and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode for more secure configuration that only allows TLS connection with client cert. See https://istio.io/docs/tasks/security/mtls-migration/
[2020-10-26 11:49:11.735][20][warning][filter] [src/envoy/http/authn/http_filter_factory.cc:102] mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS, and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode for more secure configuration that only allows TLS connection with client cert. See https://istio.io/docs/tasks/security/mtls-migration/
[2020-10-26 11:49:11.737][20][warning][filter] [src/envoy/http/authn/http_filter_factory.cc:102] mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS, and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode for more secure configuration that only allows TLS connection with client cert. See https://istio.io/docs/tasks/security/mtls-migration/
[2020-10-26 11:49:11.737][20][warning][filter] [src/envoy/http/authn/http_filter_factory.cc:102] mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS, and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode for more secure configuration that only allows TLS connection with client cert. See https://istio.io/docs/tasks/security/mtls-migration/
[2020-10-26 11:49:11.739][20][warning][config] [external/envoy/source/common/config/grpc_mux_subscription_impl.cc:81] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener(s) 10.244.1.211_9090: Found header(name: "kubeflow-userid"
exact_match: "anonymous@kubeflow.org"
) rule,not supported by RBAC network filter, virtualInbound: Found header(name: "kubeflow-userid"
exact_match: "anonymous@kubeflow.org"
) rule,not supported by RBAC network filter

@kd303
Contributor Author

kd303 commented Oct 27, 2020

@yuzisun Please ignore the logs above; I redeployed the cluster just to be sure.

OK, I cleaned up all the services and everything else that was on the cluster, and I no longer get the RBAC error. Please see the current state of the issue:

kubectl get inferenceservice -nmmskubeflow
NAME           URL                                                                  READY   DEFAULT TRAFFIC   CANARY TRAFFIC   AGE
sklearn-iris   http://sklearn-iris.mmskubeflow.example.com/v1/models/sklearn-iris   True    100                                23m
kubectl get svc -n mmskubeflow

sklearn-iris                                   ExternalName   <none>           cluster-local-gateway.istio-system.svc.cluster.local   <none>                              11m
sklearn-iris-predictor-default                 ExternalName   <none>           cluster-local-gateway.istio-system.svc.cluster.local   <none>                              11m
sklearn-iris-predictor-default-b4kgt           ClusterIP      10.97.87.144     <none>                                                 80/TCP                              11m
sklearn-iris-predictor-default-b4kgt-private   ClusterIP      10.98.251.28     <none>                                                 80/TCP,9090/TCP,9091/TCP,8022/TCP   11m

activator logs are below:

Failed probing","commit":"c9be0ab","knative.dev/controller":"activator","knative.dev/pod":"activator-564f8996c5-fp5zd","error":"error roundtripping http://10.244.3.43:8012: dial tcp 10.244.3.43:8012: connect: connection refused","stacktrace":"knative.dev/serving/pkg/activator/net.(*revisionWatcher).checkDests\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:251\nknative.dev/serving/pkg/activator/net.(*revisionWatcher).run\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:326"}
{"level":"error","ts":"2020-10-27T07:39:50.224Z","logger":"activator","caller":"net/revision_backends.go:285","msg":"Failed to probe clusterIP 10.98.251.28:80","commit":"c9be0ab","knative.dev/controller":"activator","knative.dev/pod":"activator-564f8996c5-fp5zd","error":"error roundtripping http://10.98.251.28:80: dial tcp 10.98.251.28:80: connect: connection refused","stacktrace":"knative.dev/serving/pkg/activator/net.(*revisionWatcher).checkDests\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:285\nknative.dev/serving/pkg/activator/net.(*revisionWatcher).run\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:326"}
{"level":"error","ts":"2020-10-27T07:39:50.424Z","logger":"activator","caller":"net/revision_backends.go:251","msg":"Failed probing","commit":"c9be0ab","knative.dev/controller":"activator","knative.dev/pod":"activator-564f8996c5-fp5zd","error":"error roundtripping http://10.244.3.43:8012: dial tcp 10.244.3.43:8012: connect: connection refused","stacktrace":"knative.dev/serving/pkg/activator/net.(*revisionWatcher).checkDests\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:251\nknative.dev/serving/pkg/activator/net.(*revisionWatcher).run\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:326"}
{"level":"error","ts":"2020-10-27T07:39:50.424Z","logger":"activator","caller":"net/revision_backends.go:285","msg":"Failed to probe clusterIP 10.98.251.28:80","commit":"c9be0ab","knative.dev/controller":"activator","knative.dev/pod":"activator-564f8996c5-fp5zd","error":"error roundtripping http://10.98.251.28:80: dial tcp 10.98.251.28:80: connect: connection refused","stacktrace":"knative.dev/serving/pkg/activator/net.(*revisionWatcher).checkDests\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:285\nknative.dev/serving/pkg/activator/net.(*revisionWatcher).run\n\t/home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:326"}

It is trying to probe 10.244.3.43, which is the IP of the Pod where the sklearn service is deployed. The probe is also failing at 10.98.251.28:80, which is the IP address of my svc sklearn-iris-predictor-default-b4kgt-private. Its definition is as follows:

kubectl get svc sklearn-iris-predictor-default-b4kgt-private -nmmskubeflow -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
    autoscaling.knative.dev/minScale: "1"
    autoscaling.knative.dev/target: "1"
    internal.serving.kubeflow.org/storage-initializer-sourceuri: s3://mms-objects/models/sklearn/iris
    queue.sidecar.serving.knative.dev/resourcePercentage: "20"
    serving.knative.dev/creator: system:serviceaccount:kubeflow:default
  creationTimestamp: "2020-10-27T07:38:53Z"
  labels:
    app: sklearn-iris-predictor-default-b4kgt
    component: predictor
    endpoint: default
    model: sklearn-iris
    networking.internal.knative.dev/serverlessservice: sklearn-iris-predictor-default-b4kgt
    networking.internal.knative.dev/serviceType: Private
    serving.knative.dev/configuration: sklearn-iris-predictor-default
    serving.knative.dev/configurationGeneration: "1"
    serving.knative.dev/revision: sklearn-iris-predictor-default-b4kgt
    serving.knative.dev/revisionUID: 6ec05726-373e-41ae-bd68-4075f4f6e0c4
    serving.knative.dev/service: sklearn-iris-predictor-default
    serving.kubeflow.org/inferenceservice: sklearn-iris
  name: sklearn-iris-predictor-default-b4kgt-private
  namespace: mmskubeflow
  ownerReferences:
  - apiVersion: networking.internal.knative.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: ServerlessService
    name: sklearn-iris-predictor-default-b4kgt
    uid: b6a15895-e439-44e2-995d-60e4634dd104
  resourceVersion: "5324563"
  selfLink: /api/v1/namespaces/mmskubeflow/services/sklearn-iris-predictor-default-b4kgt-private
  uid: 2cb891e2-b65f-4eaa-a8b2-21f7fbf4b5c6
spec:
  clusterIP: 10.98.251.28
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8012
  - name: queue-metrics
    port: 9090
    protocol: TCP
    targetPort: queue-metrics
  - name: http-usermetric
    port: 9091
    protocol: TCP
    targetPort: http-usermetric
  - name: http-queueadm
    port: 8022
    protocol: TCP
    targetPort: 8022
  selector:
    serving.knative.dev/revisionUID: 6ec05726-373e-41ae-bd68-4075f4f6e0c4
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
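The Service above forwards port 80 to targetPort 8012 on the pod, which is why the activator log shows both `10.98.251.28:80` and `10.244.3.43:8012` refusing connections. A minimal sketch of a plain TCP check that could be run from another pod in the cluster to tell "nothing is listening yet" apart from a rejection at a higher layer; the `tcp_probe` helper is hypothetical, and the addresses are copied from the output above:

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return None if a TCP connection succeeds, otherwise the error text."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return str(exc)


# The private Service's clusterIP on port 80 and the pod IP on the
# Service's targetPort 8012, both taken from the output in this comment.
TARGETS = [("10.98.251.28", 80), ("10.244.3.43", 8012)]


def main():
    for host, port in TARGETS:
        error = tcp_probe(host, port)
        status = "open" if error is None else "failed: " + error
        print(f"{host}:{port} {status}")


if __name__ == "__main__":
    main()
```

If the pod port opens but the clusterIP does not, the Service selector or endpoints are the next thing to inspect; if both open while HTTP requests are still rejected, the problem is more likely at the mesh or policy level.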

@kd303
Contributor Author

kd303 commented Oct 27, 2020

When I follow the Knative debugging guide, the step below does not give the expected result: route names are not shown in the labels.

kubectl get ingresses.networking.internal.knative.dev -o=custom-columns='NAME:.metadata.name,LABELS:.metadata.labels' -n mmskubeflow
NAME                             LABELS
sklearn-iris-predictor-default   <none>

@yuzisun
Member

yuzisun commented Oct 28, 2020

@kd303 It looks like the inference service is actually in the ready state now? Can you curl the service?

@kd303
Contributor Author

kd303 commented Oct 28, 2020

@yuzisun the curl returns 404

Response headers:

x-powered-by: Express
content-security-policy: default-src 'none'
x-content-type-options: nosniff
content-type: text/html; charset=utf-8
content-length: 168
date: Tue, 27 Oct 2020 18:01:33 GMT
x-envoy-upstream-service-time: 1
server: istio-envoy


<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot POST /workflow/deployments/predict</pre>
</body>
</html>

@kd303
Contributor Author

kd303 commented Oct 28, 2020

@yuzisun Thanks for all the help. I was able to resolve the issue, but I would like to describe it here so that it helps others and can perhaps be included in the documentation; it has taken some work over these days.

The failing probes remain a mystery: after a couple of redeployments of the Knative pods, things work fine.

I was testing with Postman and other tools; again, this is a dev environment where no external domains are configured with Istio.

The VirtualService created by the InferenceService matches on the authority header. Tools like Postman set that header themselves; in my case it contained the IP address, as shown in the headers further down.

The regex ^sklearn-iris\.mmskubeflow\.example\.com(?::\d{1,5})?$ cannot match that header, so Envoy responds with 404. (I added the from header to make the request easy to find in the Istio ingress tracing logs :) )

IMO, for development environments and the like there should be a way to either:

  1. update the VS definitions to use matchers such as uri or custom headers, or
  2. relax the regex so that the model name/predictor part does not have to be at the start (maybe remove '^').
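The mismatch can be reproduced offline. A minimal sketch, assuming Python's `re` module treats this simple pattern the same way Envoy's regex matcher does (an assumption; they are different engines). The pattern is copied from the VirtualService in this comment, and the IP, port, and combined authority value come from the request headers reported here:

```python
import re

# Authority regex copied from the generated VirtualService.
AUTHORITY_RE = re.compile(r"^sklearn-iris\.mmskubeflow\.example\.com(?::\d{1,5})?$")

CANDIDATES = [
    # The host the VirtualService expects, without and with a port.
    "sklearn-iris.mmskubeflow.example.com",
    "sklearn-iris.mmskubeflow.example.com:31380",
    # What a client sends when it targets the NodePort by IP.
    "10.85.43.223:31380",
    # The combined value seen in the ingress logs for this request.
    "10.85.43.223:31380,sklearn-iris-predictor-default.mmskubeflow.example.com",
]


def matches(authority):
    """True when the whole authority value satisfies the VirtualService regex."""
    return AUTHORITY_RE.fullmatch(authority) is not None


if __name__ == "__main__":
    for value in CANDIDATES:
        print(f"{matches(value)!s:5}  {value}")
```

Only the first two values match, which is consistent with the 404: a request whose authority is the node IP never reaches the route for the model.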

HTTP request headers:

':authority', '10.85.43.223:31380,sklearn-iris-predictor-default.mmskubeflow.example.com'
':path', '/'
':method', 'POST'
'content-type', 'application/json'
'from', 'QWEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEFQWEQWARRRRRRRRRRRRRRRR'
'content-length', '148'
kubectl get vs sklearn-iris  -nmmskubeflow -oyaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"serving.kubeflow.org/v1alpha2","kind":"InferenceService","metadata":{"annotations":{},"name":"sklearn-iris","namespace":"mmskubeflow"},"spec":{"default":{"predictor":{"serviceAccountName":"miniosa","sklearn":{"storageUri":"s3://mms-objects/models/sklearn/iris"}}}}}
  creationTimestamp: "2020-10-28T10:54:30Z"
  generation: 1
  name: sklearn-iris
  namespace: mmskubeflow
  ownerReferences:
  - apiVersion: serving.kubeflow.org/v1alpha2
    blockOwnerDeletion: true
    controller: true
    kind: InferenceService
    name: sklearn-iris
    uid: 580f4551-50ca-4263-a7e6-dee2ac4c2d5d
  resourceVersion: "5873888"
  selfLink: /apis/networking.istio.io/v1alpha3/namespaces/mmskubeflow/virtualservices/sklearn-iris
  uid: 6dc418fd-8882-4c79-9a7c-fc3360ac6b8b
spec:
  gateways:
  - kubeflow-gateway.kubeflow
  - knative-serving/cluster-local-gateway
  hosts:
  - sklearn-iris.mmskubeflow.example.com
  - sklearn-iris.mmskubeflow.svc.cluster.local
  http:
  - match:
    - authority:
        regex: ^sklearn-iris\.mmskubeflow\.example\.com(?::\d{1,5})?$
      gateways:
      - kubeflow-gateway.kubeflow
      uri:
        prefix: /v1/models/sklearn-iris:predict
    - authority:
        regex: ^sklearn-iris\.mmskubeflow(\.svc(\.cluster\.local)?)?(?::\d{1,5})?$
      gateways:
      - knative-serving/cluster-local-gateway
      uri:
        prefix: /v1/models/sklearn-iris:predict
    retries:
      attempts: 3
      perTryTimeout: 600s
    route:
    - destination:
        host: cluster-local-gateway.istio-system.svc.cluster.local
        port:
          number: 80
      headers:
        request:
          set:
            Host: sklearn-iris-predictor-default.mmskubeflow.svc.cluster.local
      weight: 100
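Given the `authority` and `uri` matchers above, a client that reaches the gateway by IP has to supply the expected host name itself. A minimal sketch using only the standard library; the `build_predict_request` helper is hypothetical, the ingress address is the one from the headers in this comment, and the `{"instances": ...}` payload shape is assumed from the KFServing sklearn sample rather than shown in this thread:

```python
import json
import urllib.request


def build_predict_request(ingress, model, namespace, domain, instances):
    """Build (but do not send) a predict request with an explicit Host header.

    ingress is the host:port actually dialled, e.g. the node IP and NodePort.
    The Host header carries the name that the VirtualService regex expects.
    """
    url = f"http://{ingress}/v1/models/{model}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {
        "Host": f"{model}.{namespace}.{domain}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_predict_request(
        "10.85.43.223:31380", "sklearn-iris", "mmskubeflow", "example.com",
        [[6.8, 2.8, 4.8, 1.4]],  # illustrative iris feature row (assumption)
    )
    print(req.get_method(), req.full_url)
    print("Host:", req.get_header("Host"))
```

Passing the result to `urllib.request.urlopen(req)` sends it; `urllib` keeps a caller-supplied Host header rather than deriving one from the URL, so the gateway sees `sklearn-iris.mmskubeflow.example.com` as the authority. In Postman the equivalent is overriding the Host header on the request.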

@kd303
Contributor Author

kd303 commented Oct 29, 2020

@yuzisun further updates: my other hunch is that ClusterRbacConfig has only the istio-system namespace in its exclusion list by default, even on a non-Dex on-premise cluster, so one may have to add the namespace where the KFServing models are deployed.

I think some updates to the KFServing debugging documentation may be necessary.

@omrishiv

Just ran into this issue as well. To clarify for future me who googles this next time:
kubectl edit clusterrbacconfig default and add the model namespace to the list
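For anyone who prefers a scripted change over `kubectl edit`, the same edit can be expressed as a small transformation of the resource. A minimal sketch, assuming the ClusterRbacConfig uses the `rbac.istio.io/v1alpha1` schema with `spec.exclusion.namespaces`, and that the starting state matches what this thread describes (only `istio-system` excluded); the `exclude_namespace` helper is hypothetical and the real object should be fetched with `kubectl get clusterrbacconfig default -o json` first:

```python
import copy
import json


def exclude_namespace(config, namespace):
    """Return a copy of a ClusterRbacConfig dict with namespace excluded.

    The input is left untouched and the call is idempotent: a namespace
    that is already in the exclusion list is not added twice.
    """
    updated = copy.deepcopy(config)
    spec = updated.setdefault("spec", {})
    namespaces = spec.setdefault("exclusion", {}).setdefault("namespaces", [])
    if namespace not in namespaces:
        namespaces.append(namespace)
    return updated


# Assumed starting point, reconstructed from the description in this thread.
CURRENT = {
    "apiVersion": "rbac.istio.io/v1alpha1",
    "kind": "ClusterRbacConfig",
    "metadata": {"name": "default"},
    "spec": {
        "mode": "ON_WITH_EXCLUSION",
        "exclusion": {"namespaces": ["istio-system"]},
    },
}

if __name__ == "__main__":
    # mmskubeflow is the model namespace used earlier in this issue.
    print(json.dumps(exclude_namespace(CURRENT, "mmskubeflow"), indent=2))
```

The printed JSON can be reviewed and then applied with `kubectl apply -f`. Note that excluding a namespace turns Istio RBAC off for everything in it, which may be acceptable on a dev cluster but is a broad change elsewhere.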

@VIjayHP

VIjayHP commented Dec 21, 2020

Ran into the same issue while performing e2e tests for net-istio:
TestIstioProbing: stream.go:248: E 08:31:32.699 networking-istio-7d9c95f99c-299ws [istio-ingress-controller] [serving-tests/istio-probing-http-leiilaxl] Probing of http://istio-probing-http-notmuhkt.example.com:80 failed , ready: false, error: unexpected status code: want 200, got 404 (depth: 0)

@yuzisun
Member

yuzisun commented Dec 30, 2020

> @yuzisun further updates: my other hunch is that ClusterRbacConfig has only the istio-system namespace in its exclusion list by default, even on a non-Dex on-premise cluster, so one may have to add the namespace where the KFServing models are deployed.
>
> I think some updates to the KFServing debugging documentation may be necessary.

Good catch! Can you help add this to the debugging guide?

@kd303
Contributor Author

kd303 commented Dec 31, 2020

@yuzisun Maybe I am a bit confused about the ClusterRbac part. We installed a new cluster with Dex and RBAC and ran into the same issue again; it was resolved by changing ClusterRbacConfig and adding the namespace to the ON_WITH_EXCLUSION list. What is puzzling is that ServiceRole and ServiceRoleBindings are created, so why would we still have to add the namespace to ON_WITH_EXCLUSION?

I would be happy to add this to the debugging guide :)

kd303 added a commit to kd303/kfserving that referenced this issue Jan 7, 2021
Added the description for failing probes and 404 as per suggestion by @yuzisun on issue [1153]#(kserve#1153)
@yuzisun
Member

yuzisun commented Jan 7, 2021

> @yuzisun Maybe I am a bit confused about the ClusterRbac part. We installed a new cluster with Dex and RBAC and ran into the same issue again; it was resolved by changing ClusterRbacConfig and adding the namespace to the ON_WITH_EXCLUSION list. What is puzzling is that ServiceRole and ServiceRoleBindings are created, so why would we still have to add the namespace to ON_WITH_EXCLUSION?
>
> I would be happy to add this to the debugging guide :)

@kd303 Do the ServiceRole and ServiceRoleBinding that were created cover the inference service? Can you paste the output of the Istio RBAC resources?

animeshsingh added a commit that referenced this issue Jan 12, 2021
* Update KFSERVING_DEBUG_GUIDE.md

Added the description for failing probes and 404 as per suggestion by @yuzisun on issue [1153]#(#1153)

* Apply suggestions from code review

Co-authored-by: Animesh Singh <singhan@us.ibm.com>
@yuzisun
Member

yuzisun commented Apr 9, 2023

Closing the stale issue.

@yuzisun yuzisun closed this as completed Apr 9, 2023