
Issue installing Helm chart on cluster with no internet access #2924

Closed
ntorba opened this issue Feb 3, 2021 · 9 comments
Assignees
Labels
bug triage Needs to be triaged and prioritised accordingly

Comments

@ntorba

ntorba commented Feb 3, 2021

Describe the bug

I am running into a bug when installing seldon-core-operator on EKS with no outbound internet access.

I moved all the required Docker images to an internal dockerhub server that the cluster has access to, and I am running the Helm install from source.

I'm seeing the following events when running kubectl describe pod seldon-controller-manager-

Events:
  Type     Reason     Age                From                                     Message
  ----     ------     ----               ----                                     -------
  Normal   Scheduled  50s                default-scheduler                        Successfully assigned amp-ops/seldon-controller-manager to ip-nnn.ec2.internal
  Warning  BackOff    27s (x3 over 47s)  kubelet, ip-nnn-.ec2.internal  Back-off restarting failed container
  Normal   Pulled     12s (x4 over 49s)  kubelet, ip-nnn-.ec2.internal  Container image "internal-dockerhub/repo/seldon-core-executor:1.2.3" already present on machine
  Normal   Created    12s (x4 over 49s)  kubelet, ip-nnn.ec2.internal  Created container manager
  Warning  Failed     12s (x4 over 48s)  kubelet, ip-nnn.ec2.internal  Error: failed to start container "manager": Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "/manager": stat /manager: no such file or directory: unknown

(I removed the numbers from the IP addresses.) It looks like the manager command is failing, but I'm not sure what is causing this.

The logs from kubectl logs seldon-controller are empty.

To reproduce

Install the chart on a cluster with no internet access.

Expected behaviour

Successful install

Environment

EKS v1.17

@ntorba ntorba added bug triage Needs to be triaged and prioritised accordingly labels Feb 3, 2021
@axsaucedo
Contributor

It seems that the command fails when trying to run the Dockerfile's entrypoint, rather than on anything that requires downloading. My initial assumption is that this may be a file-permission issue: it may be worth checking which user ID is used to run the executor and which user ID is allowed to execute the /manager file. Could you provide further information on these two? Thank you @ntorba

@axsaucedo axsaucedo self-assigned this Feb 3, 2021
@ntorba
Author

ntorba commented Feb 3, 2021

@axsaucedo Thanks for the quick reply!

Where do I find that user ID info? I'm not sure what you mean by user ID. Is it something I can find in Kubernetes?

@axsaucedo
Contributor

Yeah, it doesn't appear as userID; you can find it inside the securityContext in the output of kubectl get pods -n seldon-system <seldon-controller-pod> -o yaml. However, if it's not explicitly set, nothing may be shown there, in which case I assume it runs as the default user built into the Docker image. It's not 100% clear what the issue is, but since the error message complains that it can't find the command, I assume it most probably doesn't have the permissions.
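As a sketch of where to look: Kubernetes calls the user ID `runAsUser`, and it can be set in the pod-level `securityContext` or in a container's own `securityContext`, with the container-level value taking precedence. The hypothetical helper below takes the parsed JSON from `kubectl get pod <seldon-controller-pod> -n seldon-system -o json` and reports the effective value per container; `None` means nothing is set, so the user baked into the image applies (root, unless the image sets `USER`).

```python
from typing import Dict, Optional


def effective_run_as_user(pod: dict) -> Dict[str, Optional[int]]:
    """Map each container name to the UID Kubernetes will run it as.

    A container-level securityContext.runAsUser overrides the pod-level one.
    None means neither is set, so the image's default user applies.
    """
    spec = pod.get("spec", {})
    # Pod-level default shared by all containers (may be absent).
    pod_uid = (spec.get("securityContext") or {}).get("runAsUser")
    result: Dict[str, Optional[int]] = {}
    for container in spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        # Fall back to the pod-level value when the container does not set one.
        result[container["name"]] = ctx.get("runAsUser", pod_uid)
    return result
```

For example, save the kubectl output to `pod.json` and call `effective_run_as_user(json.load(open("pod.json")))`; an empty `securityContext`, as reported later in this thread, yields `{"manager": None}`.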

@ntorba
Author

ntorba commented Feb 3, 2021

It looks like the securityContext is empty. On an install that ran successfully on a cluster with internet access, the securityContext is empty as well.

The install on the cluster with no internet access is being done from a jump host; do you think that could cause an issue?

Are there any fields that can be passed to the values.yaml for securityContext that would set up the right permissions?

@axsaucedo
Contributor

Hmm, I can't see why running the install from a jump host would cause the issue.

Unfortunately, there are no such fields in 1.2.3; we only just introduced a securityContext override in 1.6.0 (the current release).

At this point, my only guess is that, since no security context is configured, the public cloud cluster allows containers to run as user 0 (root), which I believe is what the container uses by default, whereas the new cluster may not allow it. I can't see many other reasons why the container would be unable to run the /manager command if it's available inside the image. Either that, or the image you've pulled for some reason doesn't contain /manager, but I can't really see how that could be the case.
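Both guesses can be checked outside the cluster. The events report `stat /manager: no such file or directory`, which is the wording Linux uses for a missing path (a permission problem normally reads `permission denied`), so it is worth confirming that the file exists in the image at all. A minimal sketch, assuming you have Docker on some machine that can reach the internal registry: `docker create <image>` followed by `docker export <container-id> -o rootfs.tar` writes the container filesystem to a tar file (for instance for the image named in the `Pulled` event above). The hypothetical helpers below look up `/manager` in that archive and apply the standard owner/other execute-bit rules for a given UID; group membership and symlinks are deliberately ignored.

```python
import stat
import tarfile
from dataclasses import dataclass
from typing import BinaryIO, Optional


@dataclass
class FileInfo:
    exists: bool
    is_file: bool = False
    mode: Optional[int] = None  # permission bits, e.g. 0o755
    uid: Optional[int] = None   # owner UID recorded in the tar header


def _normalize(name: str) -> str:
    # Archive members may be stored as "manager", "./manager" or "/manager".
    while name.startswith("./"):
        name = name[2:]
    return name.lstrip("/")


def inspect_path(rootfs_tar: BinaryIO, path: str = "/manager") -> FileInfo:
    """Look up `path` in a tar of a container filesystem (e.g. from `docker export`)."""
    target = _normalize(path)
    with tarfile.open(fileobj=rootfs_tar, mode="r:*") as tar:
        for member in tar:
            if _normalize(member.name) == target:
                return FileInfo(True, member.isfile(), stat.S_IMODE(member.mode), member.uid)
    return FileInfo(False)


def can_execute(info: FileInfo, run_as_uid: int) -> bool:
    """Apply the owner/other execute-bit rules (group membership is ignored)."""
    if not (info.exists and info.is_file) or info.mode is None:
        return False
    if run_as_uid == 0:
        # root can execute a regular file if any execute bit is set
        return bool(info.mode & 0o111)
    if run_as_uid == info.uid:
        return bool(info.mode & stat.S_IXUSR)
    return bool(info.mode & stat.S_IXOTH)
```

Usage: `with open("rootfs.tar", "rb") as f: info = inspect_path(f)`, then `can_execute(info, 0)` for root or `can_execute(info, <uid>)` for another user. If `info.exists` is `False`, permissions are not the problem.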

@ntorba
Author

ntorba commented Feb 10, 2021

@axsaucedo Do you know how I could recreate the command Kubernetes uses to launch this container, so that I can reproduce it locally?

I'm unable to run the Docker container on the worker node because the Docker CLI is not installed.
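One approximate way to do this, assuming you can run Docker on another machine that can reach the internal registry, is to read the container's `image`, `command`, `args` and literal `env` values from `kubectl get pod <pod> -o json` and translate them into `docker run` flags. Kubernetes `command` overrides the image's `ENTRYPOINT` and `args` overrides its `CMD`; the hypothetical helper below maps those onto `--entrypoint` plus trailing arguments. It ignores volumes, mounted service-account tokens, `valueFrom` env entries and the security context, so the manager may still fail later for unrelated reasons; it only aims to reproduce the launch step that errors in the events.

```python
import shlex
from typing import List


def docker_run_equivalent(pod: dict, container_name: str) -> List[str]:
    """Build an approximate `docker run` argv for one container of a pod.

    `docker run --entrypoint` accepts only the executable, so the rest of the
    Kubernetes `command` list goes after the image name, followed by `args`.
    """
    container = next(c for c in pod["spec"]["containers"] if c["name"] == container_name)
    argv = ["docker", "run", "--rm"]
    for env in container.get("env", []):
        # Only literal values are copied; valueFrom (secrets, fieldRef...) is skipped.
        if "value" in env:
            argv += ["-e", f"{env['name']}={env['value']}"]
    command = container.get("command", [])
    if command:
        argv += ["--entrypoint", command[0]]
    argv.append(container["image"])
    argv += command[1:] + container.get("args", [])
    return argv


def as_shell_command(pod: dict, container_name: str) -> str:
    """Quote the argv so it can be pasted into a shell."""
    return shlex.join(docker_run_equivalent(pod, container_name))
```

Calling `print(as_shell_command(pod, "manager"))` on the parsed pod JSON prints a command line to try on a machine with Docker; `shlex.join` requires Python 3.8 or later.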

@axsaucedo
Contributor

@ntorba you should be able to override runAsUser in the securityContext via the pod spec override. You could try setting user 0 and then another user, like user 3 (or another number), to see if the behaviour is the same. Having said that, a cluster policy may mean it works on one cluster but not on the other. It sounds quite strange, but you would normally see this issue if the permissions are not set for the user ID running the container.
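A minimal sketch of that experiment, assuming the controller runs as a Deployment whose container is named `manager` (as in the events above) and that you patch the Deployment directly rather than through the chart. The hypothetical helper builds a strategic-merge patch; `kubectl patch` merges `containers` entries by `name`, so only the container's security context changes and a new rollout is triggered.

```python
import json


def run_as_user_patch(container_name: str, uid: int) -> str:
    """Strategic-merge patch for a Deployment that sets runAsUser on one container.

    Containers are merged by `name`, so the container's other fields are kept.
    """
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": container_name, "securityContext": {"runAsUser": uid}}
                    ]
                }
            }
        }
    }
    return json.dumps(patch)
```

The output can be passed to `kubectl patch deployment <deployment> -n <namespace> -p '<json>'` (strategic merge is the default patch type), first with UID 0 and then with another UID such as 3, checking the new pod's events after each rollout.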

@ntorba
Author

ntorba commented Feb 10, 2021

@aarondav I was using Helm chart 1.2.3, which is pretty old. I just upgraded to 1.6.0 and it is running fine. It's better to use the newest version anyway.

Thanks for your help!

@axsaucedo
Contributor

Great @ntorba, it does seem like it may be related to the updated version supporting non-root runs.

@ntorba ntorba closed this as completed Feb 10, 2021