TEP-0041: Move Image Entrypoint Lookup to the TaskRun Pod #310

Closed

Conversation

yaoxiaoqi
Member

This TEP proposes to move the image entrypoint lookup to the entrypoint binary that
runs in the TaskRun pod.

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 21, 2021
@vdemeester vdemeester added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Jan 21, 2021
@bobcatfish
Contributor

/assign @imjasonh
/assign @bobcatfish

@NavidZ
Member

NavidZ commented Jan 27, 2021

This proposal looks good to me. I don't mind if you are planning to land this for now and send the detailed design later. LGTM.
@bobcatfish @imjasonh @vdemeester any comments?

@yaoxiaoqi yaoxiaoqi force-pushed the move-image-entrypoint-lookup branch from ae557c7 to 994a811 Compare January 27, 2021 17:41
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2021
@imjasonh
Member

The proposal lgtm; I think the next step would be some prototyping to demonstrate everything actually works the way it's described. It'd be great to not require users to give read access to their images to the one all-powerful Tekton Service Account.

Comment on lines 35 to 37
Currently, the Tekton controller pod has access to all service accounts and will do
entrypoint lookup using `go-containerregistry` for the images that don't have
any command specified for them. This potentially might cause a permission denied
Member

Not immediately relevant for this proposal, but FWIW I think some of this behavior is buggy. I think this should fail for any spec that doesn't reference an image digest and doesn't specify the command - part of what the entrypoint lookup wants to guarantee here is a consistent image for all steps in a pod.

Member

This also raises an interesting question of whether or not we want/need similar image consistency at the PipelineRun level 🤔

I don't believe we try to do this today, but theoretically images could change between pods.

Member

@wlynch indeed. I don't think we currently use the image by-digest in the pod we create, which means images could be different between pods (in some cases). knative does this (using the image by-digest in the resulting pod) but we don't.

```
-image '<image_path>' -taskrun_namespace 'default' -taskrun_service_account 'default'
```
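For illustration, a minimal sketch of how the entrypoint binary might consume the flags shown above (the flag names come from the TEP; the wiring around them is an assumption):

```go
// Hypothetical sketch: parse the new entrypoint flags from the TEP.
package main

import (
	"flag"
	"fmt"
)

var (
	image            = flag.String("image", "", "image whose entrypoint should be resolved")
	taskrunNamespace = flag.String("taskrun_namespace", "", "namespace the TaskRun runs in")
	taskrunSA        = flag.String("taskrun_service_account", "", "service account the TaskRun runs as")
)

func main() {
	flag.Parse()
	// The real binary would resolve *image's entrypoint here, authenticating
	// as *taskrunSA in *taskrunNamespace.
	fmt.Printf("would resolve entrypoint for %q as %s/%s\n", *image, *taskrunNamespace, *taskrunSA)
}
```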

### Resolve Image in EntryPoint Binary
Member

Part of what the image metadata lookup does is resolve images to their SHA digests, which seems useful for auditing. We should figure out how we can retain this information / bubble it back up to the controller.
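One hedged sketch of how that could work: have the entrypoint binary resolve the reference to a digest and write it to the step's termination message, which the controller can read back from the pod status (the path and JSON shape here are assumptions, not the TEP's design):

```go
// Sketch: resolve the image reference to a digest and surface it to the
// controller via the container's terminationMessagePath.
package entrypoint

import (
	"fmt"
	"os"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

func recordDigest(imageRef string) error {
	ref, err := name.ParseReference(imageRef)
	if err != nil {
		return err
	}
	// A HEAD request is enough to learn the digest; no layers are pulled.
	desc, err := remote.Head(ref, remote.WithAuthFromKeychain(authn.DefaultKeychain))
	if err != nil {
		return err
	}
	msg := fmt.Sprintf(`{"image":%q,"digest":%q}`, imageRef, desc.Digest.String())
	// /tekton/termination is assumed to be the step's terminationMessagePath.
	return os.WriteFile("/tekton/termination", []byte(msg), 0o644)
}
```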

Comment on lines +133 to +140
lookup logic, because we resolve the entrypoint in different images. Some
techniques like inter-pod communication or permanent storage must be involved
to do this. Anyway, with or without cache, the cost of time will definitely be
Member

Could you go into more detail on how the image resolution cache would function when running on the pod (particularly how data would be shared between entrypoint invocations)?

Member Author

To implement a cache for steps in the same pod, I plan to serialize the image and store it in a shared volume. One image will be serialized into one file. The filename is the image reference (either the tag or the digest). When looking up the image entrypoint, we use the image reference to check whether the file exists. If it does, grab it from the shared volume; otherwise, fetch it from the remote registry.
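A minimal sketch of that scheme, assuming a shared `emptyDir` mounted at `/tekton/image-cache` and caching just the image config (a simplification of serializing the whole image); names and paths are illustrative:

```go
// Hypothetical sketch: cache image config in a volume shared by all steps.
package entrypoint

import (
	"encoding/json"
	"os"
	"path/filepath"
	"strings"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/name"
	v1 "github.com/google/go-containerregistry/pkg/v1"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

const cacheDir = "/tekton/image-cache" // assumed shared emptyDir mount

// lookupEntrypoint returns the image's entrypoint, consulting the
// shared-volume cache before falling back to the remote registry.
func lookupEntrypoint(imageRef string) ([]string, error) {
	key := filepath.Join(cacheDir,
		strings.NewReplacer("/", "-", ":", "-", "@", "-").Replace(imageRef))

	// Cache hit: another step in this pod already resolved the image.
	if b, err := os.ReadFile(key); err == nil {
		var cfg v1.ConfigFile
		if err := json.Unmarshal(b, &cfg); err == nil {
			return cfg.Config.Entrypoint, nil
		}
	}

	// Cache miss: fetch the image config from the registry.
	ref, err := name.ParseReference(imageRef)
	if err != nil {
		return nil, err
	}
	img, err := remote.Image(ref, remote.WithAuthFromKeychain(authn.DefaultKeychain))
	if err != nil {
		return nil, err
	}
	cfg, err := img.ConfigFile()
	if err != nil {
		return nil, err
	}
	if b, err := json.Marshal(cfg); err == nil {
		_ = os.WriteFile(key, b, 0o644) // best-effort cache write
	}
	return cfg.Config.Entrypoint, nil
}
```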

Member Author

For a cache beyond the same pod, the only approach I could think of is to bubble the whole image back to the controller and have every step's container request it from the controller. But this involves network communication between different pods, and I'm not sure things would get faster if we introduced it. There is a big chance that consulting the remote registry directly costs less time than this kind of cache.

Member

Any way to reuse the cache part of knative (aka have an implementation of it)? I am fuzzy on the subject but felt it was worth mentioning.

### Use Cases

TaskRun users encounter a permission denied error if they don't specify a command in
their container specs on some cloud providers. The error happens when the TaskRun
Member

To play devil's advocate here - poking at the code, it looks like go-containerregistry only looks at imagePullSecrets before defaulting to anonymous fetching (this is where any provider-specific identity tied to the SA would come into play).

Would we be able to provide an authn.Keychain that would replicate the behavior we want? (and what extra permissions would be needed if any?)
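For what it's worth, go-containerregistry's `k8schain` package can assemble such a keychain, layering imagePullSecrets, the service account, and the cloud providers' credential helpers; a sketch with illustrative parameters:

```go
// Sketch: build an authn.Keychain that resolves credentials the same way
// the kubelet does (imagePullSecrets, service account, cloud helpers).
package entrypoint

import (
	"context"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/authn/k8schain"
	"k8s.io/client-go/kubernetes"
)

func keychainFor(ctx context.Context, client kubernetes.Interface, namespace, sa string) (authn.Keychain, error) {
	kc, err := k8schain.New(ctx, client, k8schain.Options{
		Namespace:          namespace, // e.g. the TaskRun's namespace
		ServiceAccountName: sa,        // e.g. the TaskRun's service account
	})
	if err != nil {
		return nil, err
	}
	// Fall back to ambient credentials (e.g. DOCKER_CONFIG) if k8schain
	// finds nothing for this image.
	return authn.NewMultiKeychain(kc, authn.DefaultKeychain), nil
}
```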


Contributor

@bobcatfish bobcatfish left a comment

Hey @yaoxiaoqi ! Before we get too deep into the solution, I want to explore the problem a bit more.

The problem statement makes it sound like it's not possible for the controller to reference private images, but it should be something the controller supports and I'm wondering if this is actually a documentation problem.

That being said, totally into making this easier if needed and addressing any problems with it, but I'd really like to understand what the specific problem is (and what the alternatives are)

/hold

any command specified for them. This potentially might cause a permission denied
error when users are trying to pull private images from a registry, since the
default service account and the service account that the Tekton controller is using
can't authenticate the requests on some cloud platforms. The private image is
Contributor

I'm wondering if this could possibly be a documentation problem - it should be possible to configure the controller to pull from a private registry

It looks like the docs on how to do this got lost over here: https://github.com/tektoncd/pipeline/blob/master/docs/container-contract.md#container-contract (and probably should be over in https://github.com/tektoncd/pipeline/blob/master/docs/install.md instead 😬 )

Member Author

Yes, pulling images from a private registry is not a problem. Things go wrong when you're trying to pull from a private registry and do not specify the command for the image in the Steps. The Tekton controller needs to look up the image entrypoint to know what command to execute, but it can't fetch the image and encounters a permission denied error. It happens for any spec that doesn't reference an image digest and doesn't specify the command. I will dig into this bug and update the TEP with more details.

Member

So, as far as I remember, when doing the lookup, the pipeline controller does use/refer to the ServiceAccount the PipelineRun is run with (I also thought it would look at imagePullSecrets, but I'm not sure though…). But yeah, the problem is mainly if the image is private and none of the SAs provided (to the controller or the pipelinerun/taskrun) have the correct credentials to authenticate with the registry.

Member

… looking at @wlynch's comment, it might be the opposite…
Anyway, I also wonder if it's more a documentation problem than anything else.

Contributor

Yes, pulling images from a private registry is not a problem. Things go wrong when you're trying to pull from a private registry and do not specify the command for the image in the Steps.

I should have been more specific @yaoxiaoqi; echoing what @vdemeester says and looking at the docs at https://github.com/tektoncd/pipeline/blob/master/docs/container-contract.md#container-contract, I think that we DO support the scenario you are describing today:

[screenshot of the container-contract documentation]

Member

I think that we DO support the scenario you are describing today

This is true, but I think there's still an issue here because we're inconsistent about how we handle OIDC-based auth. The example @yaoxiaoqi gave highlights this - if we bypass the image lookup by passing in a Command, the pod will authenticate successfully with the container registry without an ImagePullSecret, because it falls back to OIDC when performing the image fetch on the TaskRun pod using the TaskRun's service account. While the controller will also fall back to OIDC, it's using a different identity (the Pipeline controller service account), which isn't really expected by users.

This is likely the root cause of issues like tektoncd/pipeline#2316. As OIDC becomes more common in Kubernetes (see https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#service-account-issuer-discovery) this will likely become more of a problem.

Contributor

If that is the problem that this TEP is trying to address, is it possible to update the TEP with a detailed description and example? I'm not familiar enough with OIDC or the flow you've described to be able to follow without more details (and the TEP as currently stated isn't as specific as what you're describing)

to query the entrypoint of the image that is running on this pod. This will also
avoid the authentication problem when requests come from different pods on some
cloud providers, and the confusing restricted access to the image when the command is not
specified.
Contributor

can you elaborate on this authentication problem a bit?

Member Author

Sorry, I didn't put it clearly in the TEP. I updated the section after I did some experiments on entrypoint lookup and reproduced the issue tektoncd/pipeline#2316 using a GKE cluster. When Workload Identity is enabled, the user has to use the Tekton Pipelines controller service account to impersonate the corresponding GSA, since the request comes from the controller pod. But users usually don't know they should do this and only bind the service account for the target pod to the GSA. Billy gave a better explanation here: https://wlyn.ch/posts/gcp-tekton-workload-identity/
The experiments I did are listed below:

| Cluster | Config | Image fetching status | Note |
| --- | --- | --- | --- |
| minikube | Running registry in the cluster | Successful | Must ensure the connection between controller pod and registry pod first |
| GKE | Workload Identity disabled | Successful | GCE metadata server takes care of authentication |
| GKE | Workload Identity enabled | Failed | Only linked the target service account with the GSA |
| GKE | Workload Identity enabled | Successful | Only linked the Tekton Pipelines controller service account with the GSA |

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2021
Base automatically changed from master to main February 3, 2021 16:34
@vdemeester
Member

The problem statement makes it sound like it's not possible for the controller to reference private images, but it should be something the controller supports and I'm wondering if this is actually a documentation problem.

It's definitely not a problem. It's a bit more work than when using a public image, but it does work.

We will resolve the entrypoint in the binary if `image` is specified. The steps are listed
below:

1. set up `kubeClient`
Contributor

@skaegi skaegi Mar 1, 2021

This is an immediate problem for us... in our Tekton environment we explicitly do not permit TaskRun pods to have access to the Kube API service, i.e. accessing the Kube API is a "privilege" that we should not rely on.
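For context, step 1 above ("set up `kubeClient`") would presumably be the standard in-cluster setup sketched below; that call path is exactly the Kube API access this comment objects to granting TaskRun pods:

```go
// Sketch: the in-cluster client setup that step 1 of the TEP implies.
// rest.InClusterConfig uses the pod's mounted service account token,
// so it requires the TaskRun pod to reach the Kube API server.
package entrypoint

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newKubeClient() (kubernetes.Interface, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(cfg)
}
```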

@yaoxiaoqi yaoxiaoqi force-pushed the move-image-entrypoint-lookup branch from 994a811 to aaa23a5 Compare March 11, 2021 09:31
@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign bobcatfish
You can assign the PR to them by writing /assign @bobcatfish in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2021
@imjasonh
Member

Hi Yumeng, thanks for staying on top of this TEP. I think the next step would be some exploratory prototyping to make sure things work the way we expect them to with this design.

Simon's comment indicates some users might not accept the new requirements this puts in place, and I'm not completely sure that things will work exactly like we expect even with that approach.

I'd like to put this TEP on hold until we can see a working demo of this approach, to more fully understand the size of the change being proposed, and any new requirements it will inflict on end users and operators.

This TEP proposes to move the image entrypoint lookup to the entrypoint binary that
runs in the TaskRun pod.
@yaoxiaoqi yaoxiaoqi force-pushed the move-image-entrypoint-lookup branch from aaa23a5 to 9adfd63 Compare March 11, 2021 14:18
@yaoxiaoqi
Member Author

Simon's comment indicates some users might not accept the new requirements this puts in place, and I'm not completely sure that things will work exactly like we expect even with that approach.

Yes, I think I can first figure out whether we can fetch the image using go-containerregistry in the TaskRun pod.

@tekton-robot
Contributor

@yaoxiaoqi: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 15, 2021
@pritidesai
Member

@pritidesai to create an issue and close this PR so that we can track the requirements.

@pritidesai
Member

pritidesai commented Jun 21, 2021

Here is the feature request for this in the pipeline repo.

As per the discussion in the API WG, I am closing this PR. We need to first showcase this in the experimental repo to understand the impact of the proposal/changes before changing the API.

/close

@tekton-robot
Contributor

@pritidesai: Closed this PR.

In response to this:

Here is the feature request for this in the pipeline repo.

As per the discussion in the API WG, I am closing the TEP PR. We need to first showcase this in the experimental repo to understand the impact of the proposal/changes before changing the API.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
