Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better support Istio Mesh Internal to cluster #3485

Closed
mwm5945 opened this issue Aug 12, 2021 · 13 comments · Fixed by #3609
Closed

Better support Istio Mesh Internal to cluster #3485

mwm5945 opened this issue Aug 12, 2021 · 13 comments · Fixed by #3609

Comments

@mwm5945
Copy link
Contributor

mwm5945 commented Aug 12, 2021

Currently, the VirtualServices created by Seldon are attached to a gateway, which means that all models must be invoked via the hosts in that gateway, thereby also forcing all traffic through the Istio IngressGateway. While this works for traffic coming in from outside the cluster, its cumbersome when traffic is coming from within the same cluster (having to go through multiple side cars/IGW to get to the model).

Ex Current VS:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: test-model-1-http
  namespace: my-ns
  ownerReferences:
  - apiVersion: machinelearning.seldon.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: SeldonDeployment
    name: test-model-1
spec:
  gateways:
  - my-ns/seldon-gateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /seldon/bonsai-maas/test-model-1/
    rewrite:
      uri: /
    route:
    - destination:
        host: test-model-1-test-model-0
        port:
          number: 9000
        subset: test-model-0

Since gateway is set to my-ns/seldon-gateway, this VS will only apply to traffic coming out of the Istio IngressGateway.

Enabling the mesh gateway means the sidecar can now direct traffic through the Istio mesh, instead of forcing it through the IGW.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: test-shadow-graph-1-http-2
  namespace: my-ns
spec:
  gateways:
  - mesh
  hosts:
  #  - '*' wildcard hosts aren't allowed with 'mesh' gateways
  - test-shadow-graph-1-test-shadow-graph-0
  http:
  - match:
    - uri:
        prefix: /seldon/my-ns/test-shadow-graph-1/
    mirror:
      host: test-shadow-graph-1-test-shadow-graph-1
      port:
        number: 9001
      subset: test-shadow-graph-1
    rewrite:
      uri: /
    route:
    - destination:
        host: test-shadow-graph-1-test-shadow-graph-0
        port:
          number: 9000
        subset: test-shadow-graph-0

This would allow the models Service to be invoked, and still be able to utilize advance traffic management patterns (i.e. shadow, canary, etc)

The second option is fine for REST, but might be a bit annoying for GRPC (having to create a new GRPC connection / host). Ideally, there'd be one host name that can be invoked without having to go through the IGW, something like:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: test-model-1-gprc
  namespace: my-ns
spec:
  gateways:
  - mesh
  hosts:
  - my-common-service-hostname
  http:
  - match:
    - headers:
        namespace:
          exact: my-ns
        seldon:
          exact: test-model-q
      uri:
        regex: .*tensorflow.*|.*seldon.protos.*
    mirror:
      host: my-other-model-1-0
      port:
        number: 9001
      subset: my-other-model-0
    route:
    - destination:
        host: test-model-1-test-model-0
        port:
          number: 5001
        subset: test-model-0
@mwm5945 mwm5945 added the triage Needs to be triaged and prioritised accordingly label Aug 12, 2021
@ukclivecox
Copy link
Contributor

This seems reasonable. Do you have suggestions on how this could be configured maybe via some additional annotations?

@mwm5945
Copy link
Contributor Author

mwm5945 commented Aug 12, 2021

Thats an excellent question...not sure to be honest. For us, I know its more "all or nothing,", meaning either the traffic is coming from within the same cluster, or its not, so it could be set at installation time. I'm also wondering if it would be possible to remove the wildcard hostname, and have multiple gateways so that you can do both. Something like

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: test-model-1-gprc
  namespace: my-ns
spec:
  gateways:
  - mesh
  - my-ns/seldon-gateway
  hosts:
  - my-common-service-hostname
  - my-external.dns.name.com
  http:
  - match:
    - headers:
        namespace:
          exact: my-ns
        seldon:
          exact: test-model-q
      uri:
        regex: .*tensorflow.*|.*seldon.protos.*
    mirror:
      host: my-other-model-1-0
      port:
        number: 9001
      subset: my-other-model-0
    route:
    - destination:
        host: test-model-1-test-model-0
        port:
          number: 5001
        subset: test-model-0

I think this would probably be an ideal "best of both worlds" situation.

@agilgur5
Copy link

agilgur5 commented Aug 22, 2021

Hi, I work with @mwm5945 and after thinking about it a bit, it would likely be non-trivial to have some "common hostname" as that would require Seldon to effectively proxy requests to models. To keep the existing paradigm, Seldon would have to add cluster-local hostname for each model, which should basically be the equivalent of a Service for the model, e.g. test-model-1.namespace.svc.cluster.local.

The other thing I realized is that this current issue actually presents a security issue as well. Normally, the Istio Ingress Gateway (IGW) only takes traffic from its LoadBalancer (LB), and therefore we can craft a specific NetworkPolicy (NP) for that to deny traffic from inside the cluster.

Due to this issue, we have to allow intra-cluster traffic to the IGW. As the IGW might be forwarding traffic to all sorts of Istio enabled services (not just Seldon models), allowing that traffic also allows for existing NPs to be bypassed. Direct communication between two services may be denied via an NP, but if they go through the IGW, then it's allowed.

I think that raises the stakes a decent bit on this issue as the current workaround is a security issue, not just a latency / efficiency issue

@ukclivecox
Copy link
Contributor

Can you presently get what you need by specifying the gateway as an annotation on the sdep as described here: https://docs.seldon.io/projects/seldon-core/en/latest/ingress/istio.html#istio-configuration-annotation-reference

@ukclivecox ukclivecox added awaiting-feedback and removed triage Needs to be triaged and prioritised accordingly labels Sep 2, 2021
@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 21, 2021

@cliveseldon sorry for the delay--yes, with a bit of trickery, you can avoid using a Gateway if the traffic is only internal.

You'll need to create a "common service" that can be defined in the virtual service, and set the gateway as mesh.

Ex:

apiVersion: v1
kind: Service
metadata:
  name: seldon
  namespace: my-ns
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  sessionAffinity: None
  type: ClusterIP

SDEP Annotations

...
spec:
    seldon.io/istio-gateway: mesh
    seldon.io/istio-host: seldon
...

From the same namespace, in a pod that has the istio sidecar enabled, somthing like the following will work

curl -X POST -v -H 'Content-Type: application/json' -d '{"data": { "names": ["a", "b"], "ndarray": [[1,2]]}}' http://seldon/seldon/my-ns/test-model-1/api/v1.0/predictions

@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 21, 2021

@cliveseldon however things do break when it comes to GRPC and the Virtual Services. Because the VS's are separate from one another, and (for some reason) merging VS's for Host (mesh) networks isn't supported, you're not able to use the mesh for GRPC.

From their docs:

A VirtualService can only be fragmented this way if it is bound to a gateway. Host merging is not supported in sidecars.

Is there a specific reason why the VS's are split the way they are? I've attempted a simple combind VS and everything works correctly:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-vs
  namespace: my-ns
spec:
  gateways:
  - mesh
  hosts:
  - seldon # or whatever you defined your service to be
  http:
  - match:
    - uri:
        prefix: /seldon/my-ns/test-model-1/
    rewrite:
      uri: /
    route:
    - destination:
        host: test-model-1-test-model-0
        port:
          number: 9000
        subset: test-model-0
  - match:
    - headers:
        namespace:
          exact: my-ns
        seldon:
          exact: test-model-1
      uri:
        regex: .*tensorflow.*|.*seldon.protos.*|.*inference.*
    route:
    - destination:
        host: test-model-1-test-model-0
        port:
          number: 9500
        subset: test-model-0

@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 21, 2021

Sure seems like these could just be combined into one:

Spec: istio_networking.VirtualService{

@ukclivecox
Copy link
Contributor

I see you have open a PR for this. Be good to discuss your use case. We are looking for future API updates to create a single SVC entrypoint and allow customers to create whatever service mesh they wish above.

@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 23, 2021

With the annotation to set the hostname in the virtual service, we can essentially cut out the gateway (assuming we don't have traffic coming from outside the cluster, only internally, which is the case for us). I've created a service that doesn't have a selector, and listens on the http2 protocol that the virutal services will attach to in the mesh, after specifying the mesh "special" gateway through the other annotation, thereby allowing mesh invocation of the model, instead of having to hairpin through the IngressGateway. I see you've added the v2 label, when is that slated to be created?

@ukclivecox
Copy link
Contributor

v2 label is for current early stage discussion on work for updated APIs so is still early stage thinking of how we can improve. So your PR would still be valid for the medium term.

@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 24, 2021

@cliveseldon i've verified that the change in the PR works, however i needed to add an additional work around. If there's one common service for all models in a NS, we still encounter the same issue of multiple VS's not being able to be applied in the mesh network. For example:
I create a service seldon in my-ns, but if I try to deploy multiple models into the NS, we encounter the same issue of only one VS being applied to the hostname.

Workaround:
Use the seldon.io/svc-name annotation to create a unique service for each of the predictors, and set the seldon.io/istio-hostname to be the "primary" one to invoke.

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  labels:
    app: seldon
  name: canary-example-1
spec:
  annotations:
    seldon.io/istio-gateway: mesh            # NOTE
    seldon.io/istio-host: mysvcname        # NOTE
  name: canary-example-1
  predictors:
  - annotations:
      seldon.io/svc-name: mysvcname      # NOTE
    componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.11.0
          imagePullPolicy: IfNotPresent
          name: classifier
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: GRPC
      name: classifier
      type: MODEL
    labels:
      sidecar.istio.io/inject: "true"
    name: main
    replicas: 1
    traffic: 75
  - annotations:
      seldon.io/svc-name: mysvcname2        # NOTE
    componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.11.0
          imagePullPolicy: IfNotPresent
          name: classifier
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: GRPC
      name: classifier
      type: MODEL
    labels:
      sidecar.istio.io/inject: "true"
    name: canary
    replicas: 1
    traffic: 25

Then something like this will work, and correctly split traffic:

curl -X POST -H 'Content-Type: application/json' -d '{"data": { "names": ["a", "b"], "ndarray": [[1,2]]}}' http://mysvcname:8000/seldon/batest/canary-example-1/api/v1.0/predictions

@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 28, 2021

@cliveseldon or @axsaucedo would either of you be able to take a look at my PR for this? We're working on releasing something relatively soon, and it would be awesome if we could invoke the model through the mesh over through the gateway :D

@ukclivecox
Copy link
Contributor

Hi @mwm5945 looks good. Added one comment about upgrades and started the tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants