TEP-0041: Move Image Entrypoint Lookup to the TaskRun Pod #310
Conversation
/assign @imjasonh
This proposal looks good to me. I don't mind if you are planning to land this for now and send the detailed design later. LGTM.
Force-pushed from ae557c7 to 994a811
The proposal lgtm, I think the next step would be some prototyping to demonstrate everything actually works the way it's described. It'd be great to not require users to give read access to their images to the one all-powerful Tekton Service Account.
Currently, the Tekton controller pod has access to all service accounts and will do
entrypoint lookup using `go-containerregistry` for the images that don't have
any command specified for them. This potentially might cause a permission denied
Not immediately relevant for this proposal, but FWIW I think some of this behavior is buggy. I think this should fail for any spec that doesn't reference an image digest and doesn't specify the command - part of what the entrypoint lookup wants to guarantee here is a consistent image for all steps in a pod.
This also raises an interesting question of whether or not we want/need similar image consistency at the PipelineRun level 🤔
I don't believe we try to do this today, but theoretically images could change between pods.
@wlynch indeed. I don't think we currently are using the image by-digest in the pod we create, which means images could be different between pods (in some cases). knative does this (using the image by-digest in the resulting pod) but we don't.
-image '<image_path>' -taskrun_namespace 'default' -taskrun_service_account 'default'
```
### Resolve Image in EntryPoint Binary |
Part of what the image metadata lookup does is resolve images to their SHA digests, which seems useful for auditing. We should figure out how we can retain this information / bubble it back up to the controller.
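For reference, a minimal sketch (not part of the proposal) of how a tag can be resolved to its digest with go-containerregistry; the image reference is a placeholder, and the default keychain is assumed for auth:

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

func main() {
	// Hypothetical image reference; in Tekton this would come from the step spec.
	ref, err := name.ParseReference("gcr.io/example/my-image:latest")
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the image manifest using whatever credentials the default keychain finds.
	img, err := remote.Image(ref, remote.WithAuthFromKeychain(authn.DefaultKeychain))
	if err != nil {
		log.Fatal(err)
	}

	// The digest is the piece we'd want to bubble back up to the controller for auditing.
	digest, err := img.Digest()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s resolved to %s@%s\n", ref, ref.Context().Name(), digest)
}
```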
lookup logic, because we resolve the entrypoint in different images. Some
techniques like inter-pod communication or permanent storage must be involved
to do this. In any case, with or without a cache, the time cost will definitely be
Could you go into more detail on how the image resolution cache would function when running on the pod (particularly how data would be shared between entrypoint invocations?)
To implement the cache for steps in the same pod, I plan to serialize the image and store it in the shared volume. One image will be serialized into one file, and the filename is the image reference (whether the tag or the digest). When looking up the image entrypoint, we use the image reference to check if the file exists. If it does, grab it from the shared volume. Otherwise, fetch it from the remote registry.
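A rough sketch of that per-pod cache idea, assuming a hypothetical shared volume mounted at `/tekton/image-cache` and caching just the fetched image config (rather than the whole image) keyed by reference:

```go
package entrypoint

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"

	"github.com/google/go-containerregistry/pkg/name"
	v1 "github.com/google/go-containerregistry/pkg/v1"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

// cacheDir is a hypothetical emptyDir volume shared by every step container in the pod.
const cacheDir = "/tekton/image-cache"

// lookupEntrypoint returns the image's Entrypoint/Cmd, consulting the file cache on the
// shared volume first and falling back to the remote registry on a miss.
func lookupEntrypoint(ref name.Reference, opts ...remote.Option) (entrypoint, cmd []string, err error) {
	// Derive a stable file name from the image reference (tag or digest).
	sum := sha256.Sum256([]byte(ref.String()))
	path := filepath.Join(cacheDir, hex.EncodeToString(sum[:]))

	raw, err := os.ReadFile(path)
	if err != nil {
		// Cache miss: fetch the image config from the registry.
		img, ierr := remote.Image(ref, opts...)
		if ierr != nil {
			return nil, nil, ierr
		}
		raw, ierr = img.RawConfigFile()
		if ierr != nil {
			return nil, nil, ierr
		}
		// Best-effort write; if this fails, the next step simply re-fetches.
		_ = os.WriteFile(path, raw, 0o600)
	}

	cf, err := v1.ParseConfigFile(bytes.NewReader(raw))
	if err != nil {
		return nil, nil, err
	}
	return cf.Config.Entrypoint, cf.Config.Cmd, nil
}
```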
For the cache beyond the same pod, the only way I can think of is to bubble the whole image back up to the controller and have every step's container request it from the controller. But that involves network communication between different pods, and I'm not sure things would get faster if we introduced it. There is a good chance that consulting the remote registry directly costs less time than this kind of cache.
Is there any way to reuse the cache part of knative (aka have an implementation of it)? I am fuzzy on the subject but felt it was worth mentioning.
### Use Cases

TaskRun users encounter a permission denied error if they don't specify a command in
their container specs on some cloud providers. The error happens when the TaskRun
To play devil's advocate here - poking at the code, it looks like go-containerregistry only looks at imagePullSecrets before defaulting to anonymous fetching (this is where any provider-specific identity tied to the SA would come into play).
Would we be able to provide an authn.Keychain that would replicate the behavior we want? (And what extra permissions would be needed, if any?)
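If that's the direction, here is a sketch of what providing such a keychain might look like with go-containerregistry's k8schain package; the namespace and service-account values are placeholders, and the exact option names/signature are an assumption about the current API:

```go
package main

import (
	"context"
	"log"

	"github.com/google/go-containerregistry/pkg/authn/k8schain"
	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

func main() {
	ctx := context.Background()

	// Build a keychain that resolves credentials the way the kubelet would:
	// imagePullSecrets, the service account's secrets, and cloud-provider helpers.
	kc, err := k8schain.NewInCluster(ctx, k8schain.Options{
		Namespace:          "default", // placeholder: the TaskRun's namespace
		ServiceAccountName: "default", // placeholder: the TaskRun's service account
	})
	if err != nil {
		log.Fatal(err)
	}

	ref, err := name.ParseReference("gcr.io/example/private-image:latest") // placeholder
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the image using credentials resolved for the TaskRun's identity.
	img, err := remote.Image(ref, remote.WithAuthFromKeychain(kc))
	if err != nil {
		log.Fatal(err)
	}

	cfg, err := img.ConfigFile()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("entrypoint: %v, cmd: %v", cfg.Config.Entrypoint, cfg.Config.Cmd)
}
```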
cc @mattmoor
Hey @yaoxiaoqi ! Before we get too deep into the solution, I want to explore the problem a bit more.
The problem statement makes it sound like it's not possible for the controller to reference private images, but it should be something the controller supports, so I'm wondering if this is actually a documentation problem.
That being said, I'm totally into making this easier if needed and addressing any problems with it, but I'd really like to understand what the specific problem is (and what the alternatives are).
/hold
any command specified for them. This potentially might cause a permission denied
error when users are trying to pull private images from a registry, since the
default service account and the service account that the Tekton controller is using
can't authenticate the requests on some cloud platforms. The private image is
I'm wondering if this could possibly be a documentation problem - it should be possible to configure the controller to pull from a private registry
It looks like the docs on how to do this got lost over here: https://github.com/tektoncd/pipeline/blob/master/docs/container-contract.md#container-contract (and probably should be over in https://github.com/tektoncd/pipeline/blob/master/docs/install.md instead 😬 )
Yes, pulling images from a private registry is not a problem. Things go wrong when you're trying to pull from a private registry and do not specify the command for the image in the Steps. The Tekton controller needs to look up the image entrypoint to know what command to execute, but it can't fetch the image and encounters a permission denied error. It happens for any spec that doesn't reference an image digest and doesn't specify the command. I will dig into this bug and update the TEP with more details.
So, as far as I remember, when doing the lookup, the pipeline controller does use/refer to the ServiceAccount the PipelineRun is run with (I also thought it would look at imagePullSecret, but I'm not sure though…). But yeah, the problem is mainly if the image is private and none of the SAs provided (to the controller or the pipelinerun/taskrun) have the correct credentials to authenticate with the registry.
… looking at @wlynch's comment, it might be the opposite…
Anyway, I also wonder if it's more a documentation problem than anything else.
Yes, pulling images from a private registry is not a problem. Things go wrong when you're trying to pull from a private registry and do not specify the command for the image in the Steps.
I should have been more specific @yaoxiaoqi , echoing what @vdemeester says and looking at the docs at https://github.com/tektoncd/pipeline/blob/master/docs/container-contract.md#container-contract I think that we DO support the scenario you are describing today:
I think that we DO support the scenario you are describing today
This is true, but I think there's still an issue here because we're inconsistent about how we handle OIDC-based auth. The example @yaoxiaoqi gave highlights this - if we bypass the image lookup by passing in a Command, the pod will authenticate successfully with the container registry without an ImagePullSecret because it falls back to OIDC when performing the image fetch on the TaskRun pod using the TaskRun's service account. While the controller will also fall back to OIDC, it's using a different identity (the Pipeline controller service account), which isn't really what users expect.
This is likely the root cause of issues like tektoncd/pipeline#2316. As OIDC becomes more common in Kubernetes (see https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#service-account-issuer-discovery) this will likely become more of a problem.
If that is the problem that this TEP is trying to address, is it possible to update the TEP with a detailed description and example? I'm not familiar enough with OIDC or the flow you've described to be able to follow without more details (and the TEP as currently stated isn't as specific as what you're describing)
to query the entrypoint of the image that is running on this pod. This will also
avoid the authentication problem when requests come from different pods on some
cloud providers, and the confusing restricted access to the image when the command is not
specified.
can you elaborate on this authentication problem a bit?
Sorry I didn't put it clearly in the TEP. I updated the section after I did some experiments on entrypoint lookup and reproduced the issue tektoncd/pipeline#2316 using a GKE cluster. When Workload Identity is enabled, the user has to use the Tekton pipelines controller service account to impersonate the corresponding GSA, since the request comes from the controller pod. But users usually don't know they should do this and only bind the service account for the target pod to the GSA. Billy gave a better explanation here: https://wlyn.ch/posts/gcp-tekton-workload-identity/
The experiments I did are listed below:

Cluster | Config | Image fetching status | Note
---|---|---|---
minikube | Running registry in the cluster | Successful | Must ensure the connection between the controller pod and the registry pod first
GKE | Workload Identity disabled | Successful | GCE metadata server takes care of authentication
GKE | Workload Identity enabled | Failed | Only linked the target service account with the GSA
GKE | Workload Identity enabled | Successful | Only linked the Tekton pipelines controller service account with the GSA
It's definitely not a problem. It's a bit more work than when using a public image, but it does work.
We will resolve the entrypoint in the binary if `image` is specified. The steps are
listed below:

1. set up `kubeClient`
This is an immediate problem for us... in our Tekton environment we explicitly do not permit TaskRun pods to have access to the Kube API Service. e.g. accessing the Kube API is a "privilege" that we should not rely on.
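For context, a minimal sketch of what the "set up `kubeClient`" step could look like from inside the entrypoint binary, assuming in-cluster config is available (which, as the comment above points out, is not something every environment grants to TaskRun pods):

```go
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newKubeClient builds a client from the pod's in-cluster service account token.
// This only works if the TaskRun pod is allowed to reach the Kube API server.
func newKubeClient() (kubernetes.Interface, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(cfg)
}

func main() {
	client, err := newKubeClient()
	if err != nil {
		log.Fatalf("failed to set up kubeClient: %v", err)
	}
	_ = client // the client would then be used to build a keychain for the entrypoint lookup
}
```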
Force-pushed from 994a811 to aaa23a5
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Hi Yumeng, thanks for staying on top of this TEP. I think the next step would be some exploratory prototyping to make sure things work the way we expect them to with this design. Simon's comment indicates some users might not accept the new requirements this puts in place, and I'm not completely sure that things will work exactly like we expect even with that approach. I'd like to put this TEP on hold until we can see a working demo of this approach, to more fully understand the size of the change being proposed, and any new requirements it will inflict on end users and operators.
This TEP proposes to move the image entrypoint lookup to entrypoint binary that runs in the TaskRun pod.
Force-pushed from aaa23a5 to 9adfd63
Yes, I think I can figure out if we can fetch the image using
@yaoxiaoqi: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
@pritidesai to create an issue and close this PR so that we can track the requirements.
@pritidesai: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This TEP proposes to move the image entrypoint lookup to entrypoint binary that
runs in the TaskRun pod.