Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ambassador a dependency #445

Closed

Conversation

naseemkullah
Copy link
Contributor

Tackles #258

Please review!
Thanks

@naseemkullah naseemkullah force-pushed the seldon-core-ambassador branch from fb78fa4 to e696eac Compare February 12, 2019 13:23
@ukclivecox ukclivecox self-requested a review February 12, 2019 13:59
Copy link
Contributor

@ukclivecox ukclivecox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if we need to change/remove the RBAC

{{- if .Values.ambassador.enabled }}
{{- if .Values.single_namespace }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: ambassador
rules:
- apiGroups: [""]
resources:
- services
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources:
- configmaps
verbs: ["create", "update", "patch", "get", "list", "watch"]
- apiGroups: [""]
resources:
- secrets
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: ambassador
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: ambassador
subjects:
- kind: ServiceAccount
name: {{ .Values.rbac.service_account.name }}
namespace: {{ .Release.Namespace }}
{{- end }}
{{- if not .Values.single_namespace }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: ambassador
rules:
- apiGroups: [""]
resources:
- services
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources:
- configmaps
verbs: ["create", "update", "patch", "get", "list", "watch"]
- apiGroups: [""]
resources:
- secrets
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: ambassador
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: ambassador
subjects:
- kind: ServiceAccount
name: {{ .Values.rbac.service_account.name }}
namespace: {{ .Release.Namespace }}
{{- end }}

@naseemkullah
Copy link
Contributor Author

Good catch, this would indeed be redundant since https://github.com/helm/charts/blob/master/stable/ambassador/templates/rbac.yaml would be generated with rbac enabled.

Mind you, it is cluster wide and not configurable to be namespace scoped. (e.g. ClusterRole not Role)

@naseemkullah
Copy link
Contributor Author

@ukclivecox
Copy link
Contributor

Good catch, this would indeed be redundant since https://github.com/helm/charts/blob/master/stable/ambassador/templates/rbac.yaml would be generated with rbac enabled.

Mind you, it is cluster wide and not configurable to be namespace scoped. (e.g. ClusterRole not Role)

Ahh.. that is an issue. We need to allow DevOps to deploy with just Roles if namespaced and ClusterRoles if clusterwide. This allows fine grained permissions for users. Maybe this was the main blocker for not using the Helm/Datawire chart... ?

@naseemkullah
Copy link
Contributor Author

I suppose it is!
We should create a proposal upstream for this.

We can take the grafana chart as an example where they do this:

https://github.com/helm/charts/blob/4df52d5628e776d943ba59cabf975907302317a8/stable/grafana/values.yaml#L5

@ukclivecox
Copy link
Contributor

Yes - we should get a PR for the upstream change. Are you willing to do that and link to this PR?

@naseemkullah
Copy link
Contributor Author

Sure!

@naseemkullah
Copy link
Contributor Author

helm/charts#11354 created.

@naseemkullah
Copy link
Contributor Author

naseemkullah commented Feb 12, 2019

I've made the required changes pre-emptively assuming the upstream PR will go through (removed ambassador rbac related objects, added rbac.namespaced value)

@naseemkullah naseemkullah mentioned this pull request Feb 12, 2019
@naseemkullah
Copy link
Contributor Author

Alright we are good to with regards to rbac. The PR was merged upstream and it is now configurable with the ambassador.rbac.namespaced value.

Something worth noting is that we still have the single_namespace value as well.

@naseemkullah
Copy link
Contributor Author

bump

@ukclivecox ukclivecox added this to the 0.2.x milestone Feb 19, 2019
@ukclivecox ukclivecox self-assigned this Feb 19, 2019
@ryandawsonuk
Copy link
Contributor

One knock-on effect of this is that the labels on ambassador become different. This is used in notebooks which match using "service=ambassador". Using the ambassador chart we get a standard app.kubernetes.io label so it becomes "app.kubernetes.io/name=ambassador"

@ryandawsonuk
Copy link
Contributor

Am also noticing that when I run through the helm_examples notebook with this change the grpc call for the a/b test fails with

_Rendezvous: <_Rendezvous of RPC that terminated with:
    status = StatusCode.UNIMPLEMENTED
    details = ""
    debug_error_string = "{"created":"@1553618148.143629760","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"","grpc_status":12}"

The REST call works. Both calls work when using master branch.

@ryandawsonuk
Copy link
Contributor

ryandawsonuk commented Mar 26, 2019

Turns out the "StatusCode.UNIMPLEMENTED" is a known issue not related to this PR and happens intermittently on GRPC requests, especially the first one. After waiting a while and trying again it did work for the PR branch. Will raise that issue separately.

@ryandawsonuk
Copy link
Contributor

ryandawsonuk commented Mar 27, 2019

I'm working on getting the E2E tests to pass for these changes in a seldon-core-ambassador-pr445 branch based on this. Currently hit a problem with multi-armed bandit test where the seldon-container-engine container in the 'mymab-abtest' pod never becomes ready:

mymab-abtest-never-becomes-ready

Not seeing this problem on master and have merged master into seldon-core-ambassador-pr445 branch. Not sure how it can be related to ambassador though as the eg-router that it is looking for really isn't there.

The eg-router is in the graph when I do kubectl get seldondeployment mymab -o json -n test1 and the spec section specifying the image is also there. The logs for the cluster-manager also show that this was there in the resource that it read:

image

But no service is created for the eg-router.

I took the json from the cluster-manager log output:

{"apiVersion":"machinelearning.seldon.io/v1alpha2","kind":"SeldonDeployment","metadata":{"creationTimestamp":"2019-03-27T15:03:22Z","generation":1.0,"labels":{"app":"seldon"},"name":"mymab","namespace":"test1","resourceVersion":"60372","selfLink":"/apis/machinelearning.seldon.io/v1alpha2/namespaces/test1/seldondeployments/mymab","uid":"76a66823-50a1-11e9-ab91-e07443ff3b3d"},"spec":{"name":"mymab","oauth_key":"oauth-key","oauth_secret":"oauth-secret","predictors":[{"componentSpecs":[{"spec":{"containers":[{"image":"seldonio/mock_classifier:1.0","imagePullPolicy":"IfNotPresent","name":"classifier-1","resources":{"requests":{"memory":"1Mi"}}}],"terminationGracePeriodSeconds":20.0}},{"metadata":{"labels":{"version":"v2"}},"spec":{"containers":[{"image":"seldonio/mock_classifier:1.0","imagePullPolicy":"IfNotPresent","name":"classifier-2","resources":{"requests":{"memory":"1Mi"}}}],"terminationGracePeriodSeconds":20.0}},{"spec":{"containers":[{"image":"seldonio/mab_epsilon_greedy:1.1","name":"eg-router"}],"terminationGracePeriodSeconds":20.0}}],"graph":{"children":[{"children":[],"endpoint":{"type":"REST"},"name":"classifier-1","type":"MODEL"},{"children":[],"endpoint":{"type":"REST"},"name":"classifier-2","type":"MODEL"}],"name":"eg-router","parameters":[{"name":"n_branches","type":"INT","value":"2"},{"name":"epsilon","type":"FLOAT","value":"0.2"},{"name":"verbose","type":"BOOL","value":"1"}],"type":"ROUTER"},"name":"abtest","replicas":1.0}]},"status":{"description":"timeout","state":"Failed"}}

Then I ran it through a json compare tool against epsilon_greedy.json and couldn't see any important differences. Next thing I'll try is to see whether a notebook that uses it works for me from the seldon-core-ambassador-pr445 branch.

@ryandawsonuk
Copy link
Contributor

ryandawsonuk commented Mar 28, 2019

When stepping through the notebook the mymab-abtest-eg-router-seldonio-mab-epsilon-greedy-1-1 service for eg-router does get created and the seldon-container-engine pod then does become ready. So I compared that json and again it's the same (apart from the status section).

Am now running E2E tests again and they have got past this. I guess it was non-deterministic failure. Perhaps it's working now because I've restarted minikube.

So now I do have the E2E tests running but have some failures related to connections.

@ryandawsonuk
Copy link
Contributor

ryandawsonuk commented Mar 28, 2019

So the above gist shows 13 failures and now down to 7. A lot of grpc call failures resolved in part through adding resource to minikube and specify resources on ambassador (this dealt with pod restarts I was seeing) and also by adding retries to deal with #473 But now it seems there are REST-related failures, some even in the ksonnet tests and not helm. This is strange and don't see those when running particular tests in isolation.

Also need to map the current use of --set single_namespace=false to what the official chart supports.

@ryandawsonuk
Copy link
Contributor

ryandawsonuk commented Mar 29, 2019

The main problem with those failures was that the labels became different for a helm vs ksonnet deployment and the code to port-forward is common and matches on labels. So have changed the ambassador labels for ksonnet too

Now the only remaining problems seem to be related to enabling cluster-wide access for ambassador as with the latest changes those particular tests are failing consistently. The problem now seems to be that I'm setting the single namespace env var to false and it actually doesn't support that - it works by being set or not set and ignores the content of the value.

Unfortunately I don't see a way to default this env var to enable single namespace in the seldon-core chart and turn it off in particular situations. I tried setting it to "" but that still counts as on. I also tried setting the ambassador.env to {} but then I get "Cannot overwrite table item 'env', with non table value: map[AMBASSADOR_SINGLE_NAMESPACE:true]" - which it seems others have also hit and there's an open helm issue about it.

The way it was done in the datawire version of the chart would've enabled us to default to off and then turn on. But that isn't the official chart anymore.

We could change our defaults to default to cluster wide or we submit a change to official chart and/or raise the issue with ambassador.

@ryandawsonuk
Copy link
Contributor

ryandawsonuk commented Mar 29, 2019

Have created a branch with the above changes and changing the default for ambassador to cluster wide as a workaround for the namespace var issue. The workaround can be reversed if the helm/charts PR gets merged.

So closing this PR and replacing with #480

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants