TEP-0041: Move Image Entrypoint Lookup to the TaskRun Pod #310

Closed

Conversation

yaoxiaoqi
Member

This TEP proposes to move the image entrypoint lookup to the entrypoint binary that
runs in the TaskRun pod.

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 21, 2021
@vdemeester vdemeester added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Jan 21, 2021
@bobcatfish
Contributor

/assign @imjasonh
/assign @bobcatfish

@NavidZ
Member

NavidZ commented Jan 27, 2021

This proposal looks good to me. I don't mind if you are planning to land this for now and send the detailed design later. LGTM.
@bobcatfish @imjasonh @vdemeester any comments?

@yaoxiaoqi yaoxiaoqi force-pushed the move-image-entrypoint-lookup branch from ae557c7 to 994a811 Compare January 27, 2021 17:41
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2021
@imjasonh
Member

The proposal lgtm; I think the next step would be some prototyping to demonstrate everything actually works the way it's described. It'd be great to not require users to give read access to their images to the one all-powerful Tekton Service Account.

Comment on lines 35 to 37
Currently, the Tekton controller pod has access to all service accounts and will do
entrypoint lookup using `go-containerregistry` for the images that don't have
any command specified for them. This potentially might cause a permission denied
Member

Not immediately relevant for this proposal, but FWIW I think some of this behavior is buggy. I think this should fail for any spec that doesn't reference an image digest and doesn't specify the command - part of what the entrypoint lookup wants to guarantee here is a consistent image for all steps in a pod.

Member

This also raises an interesting question of whether or not we want/need similar image consistency at the PipelineRun level 🤔

I don't believe we try to do this today, but theoretically images could change between pods.

Member

@wlynch indeed. I don't think we currently use the image by-digest in the pod we create, which means images could be different between pods (in some cases). knative does this (using the image by-digest in the resulting pod) but we don't.

```
-image '<image_path>' -taskrun_namespace 'default' -taskrun_service_account 'default'
```
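For illustration, a minimal sketch of how the entrypoint binary might consume the flags shown above (the flag names come from the TEP; the wiring around them is an assumption):

```go
// Hypothetical sketch: parse the new entrypoint flags from the TEP.
package main

import (
	"flag"
	"fmt"
)

var (
	image            = flag.String("image", "", "image whose entrypoint should be resolved")
	taskrunNamespace = flag.String("taskrun_namespace", "", "namespace the TaskRun runs in")
	taskrunSA        = flag.String("taskrun_service_account", "", "service account the TaskRun runs as")
)

func main() {
	flag.Parse()
	// The real binary would resolve *image's entrypoint here, authenticating
	// as *taskrunSA in *taskrunNamespace.
	fmt.Printf("would resolve entrypoint for %q as %s/%s\n", *image, *taskrunNamespace, *taskrunSA)
}
```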

### Resolve Image in EntryPoint Binary
Member

Part of what the image metadata lookup does is resolve images to their SHA digests, which seems useful for auditing. We should figure out how we can retain this information / bubble it back up to the controller.
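One hedged sketch of how that could work: have the entrypoint binary resolve the reference to a digest and write it to the step's termination message, which the controller can read back from the pod status (the path and JSON shape here are assumptions, not the TEP's design):

```go
// Sketch: resolve the image reference to a digest and surface it to the
// controller via the container's terminationMessagePath.
package entrypoint

import (
	"fmt"
	"os"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

func recordDigest(imageRef string) error {
	ref, err := name.ParseReference(imageRef)
	if err != nil {
		return err
	}
	// A HEAD request is enough to learn the digest; no layers are pulled.
	desc, err := remote.Head(ref, remote.WithAuthFromKeychain(authn.DefaultKeychain))
	if err != nil {
		return err
	}
	msg := fmt.Sprintf(`{"image":%q,"digest":%q}`, imageRef, desc.Digest.String())
	// /tekton/termination is assumed to be the step's terminationMessagePath.
	return os.WriteFile("/tekton/termination", []byte(msg), 0o644)
}
```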

Comment on lines +133 to +140
lookup logic, because we resolve the entrypoint in different images. Some
techniques like inter-pod communication or permanent storage must be involved
to do this. Anyway, with or without cache, the cost of time will definitely be
Member

Could you go into more detail on how the image resolution cache would function when running on the pod (particularly how data would be shared between entrypoint invocations)?

Member Author

To implement a cache for steps in the same pod, I plan to serialize the image and store it in a shared volume. One image will be serialized into one file. The filename is the image reference (either the tag or the digest). When looking up the image entrypoint, we use the image reference to check whether the file exists. If it does, grab it from the shared volume; otherwise, fetch it from the remote registry.
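A minimal sketch of that scheme, assuming a shared `emptyDir` mounted at `/tekton/image-cache` and caching just the image config (a simplification of serializing the whole image); names and paths are illustrative:

```go
// Hypothetical sketch: cache image config in a volume shared by all steps.
package entrypoint

import (
	"encoding/json"
	"os"
	"path/filepath"
	"strings"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/name"
	v1 "github.com/google/go-containerregistry/pkg/v1"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

const cacheDir = "/tekton/image-cache" // assumed shared emptyDir mount

// lookupEntrypoint returns the image's entrypoint, consulting the
// shared-volume cache before falling back to the remote registry.
func lookupEntrypoint(imageRef string) ([]string, error) {
	key := filepath.Join(cacheDir,
		strings.NewReplacer("/", "-", ":", "-", "@", "-").Replace(imageRef))

	// Cache hit: another step in this pod already resolved the image.
	if b, err := os.ReadFile(key); err == nil {
		var cfg v1.ConfigFile
		if err := json.Unmarshal(b, &cfg); err == nil {
			return cfg.Config.Entrypoint, nil
		}
	}

	// Cache miss: fetch the image config from the registry.
	ref, err := name.ParseReference(imageRef)
	if err != nil {
		return nil, err
	}
	img, err := remote.Image(ref, remote.WithAuthFromKeychain(authn.DefaultKeychain))
	if err != nil {
		return nil, err
	}
	cfg, err := img.ConfigFile()
	if err != nil {
		return nil, err
	}
	if b, err := json.Marshal(cfg); err == nil {
		_ = os.WriteFile(key, b, 0o644) // best-effort cache write
	}
	return cfg.Config.Entrypoint, nil
}
```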

Member Author

For a cache beyond the same pod, the only approach I could think of is to bubble the whole image back to the controller and have every step's container request it from the controller. But this involves network communication between different pods, and I'm not sure things would get faster if we introduced it. There is a big chance that consulting the remote registry directly costs less time than this kind of cache.

Member

Any way to reuse the cache part of knative (aka have an implementation of it)? I am fuzzy on the subject but felt it was worth mentioning.

### Use Cases

TaskRun users encounter a permission denied error if they don't specify a command in
their container specs on some cloud providers. The error happens when the TaskRun
Member

To play devil's advocate here - poking at the code, it looks like go-containerregistry only looks at imagePullSecrets before defaulting to anonymous fetching (this is where any provider-specific identity tied to the SA would come into play).

Would we be able to provide an authn.Keychain that would replicate the behavior we want? (and what extra permissions would be needed if any?)
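For what it's worth, go-containerregistry's `k8schain` package can assemble such a keychain, layering imagePullSecrets, the service account, and the cloud providers' credential helpers; a sketch with illustrative parameters:

```go
// Sketch: build an authn.Keychain that resolves credentials the same way
// the kubelet does (imagePullSecrets, service account, cloud helpers).
package entrypoint

import (
	"context"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/authn/k8schain"
	"k8s.io/client-go/kubernetes"
)

func keychainFor(ctx context.Context, client kubernetes.Interface, namespace, sa string) (authn.Keychain, error) {
	kc, err := k8schain.New(ctx, client, k8schain.Options{
		Namespace:          namespace, // e.g. the TaskRun's namespace
		ServiceAccountName: sa,        // e.g. the TaskRun's service account
	})
	if err != nil {
		return nil, err
	}
	// Fall back to ambient credentials (e.g. DOCKER_CONFIG) if k8schain
	// finds nothing for this image.
	return authn.NewMultiKeychain(kc, authn.DefaultKeychain), nil
}
```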


Contributor

@bobcatfish bobcatfish left a comment

Hey @yaoxiaoqi ! Before we get too deep into the solution, I want to explore the problem a bit more.

The problem statement makes it sound like it's not possible for the controller to reference private images, but it should be something the controller supports and I'm wondering if this is actually a documentation problem.

That being said, totally into making this easier if needed and addressing any problems with it, but I'd really like to understand what the specific problem is (and what the alternatives are)

/hold

any command specified for them. This potentially might cause a permission denied
error when users are trying to pull private images from a registry, since the
default service account and the service account that the Tekton controller is using
can't authenticate the requests on some cloud platforms. The private image is
Contributor

I'm wondering if this could possibly be a documentation problem - it should be possible to configure the controller to pull from a private registry

It looks like the docs on how to do this got lost over here: https://github.com/tektoncd/pipeline/blob/master/docs/container-contract.md#container-contract (and probably should be over in https://github.com/tektoncd/pipeline/blob/master/docs/install.md instead 😬 )

Member Author

Yes, pulling images from a private registry is not a problem. Things go wrong when you're trying to pull from a private registry and do not specify the command for the image in the Steps. The Tekton controller needs to look up the image entrypoint to know what command to execute, but it can't fetch the image and encounters a permission denied error. It happens for any spec that doesn't reference an image digest and doesn't specify the command. I will dig into this bug and update the TEP with more details.

Member

So, as far as I remember, when doing the lookup, the pipeline controller does use/refer to the ServiceAccount the PipelineRun is run with (I also thought it would look at imagePullSecrets, but I'm not sure though…). But yeah, the problem is mainly if the image is private and none of the SAs provided (to the controller or the pipelinerun/taskrun) have the correct credentials to authenticate with the registry.

Member

… looking at @wlynch's comment, it might be the opposite…
Anyway, I also wonder if it's more a documentation problem than anything else.

Contributor

Yes, pulling images from a private registry is not a problem. Things go wrong when you're trying to pull from a private registry and do not specify the command for the image in the Steps.

I should have been more specific @yaoxiaoqi; echoing what @vdemeester says and looking at the docs at https://github.com/tektoncd/pipeline/blob/master/docs/container-contract.md#container-contract, I think that we DO support the scenario you are describing today:

[screenshot of the container-contract documentation]

Member

I think that we DO support the scenario you are describing today

This is true, but I think there's still an issue here because we're inconsistent about how we handle OIDC-based auth. The example @yaoxiaoqi gave highlights this - if we bypass the image lookup by passing in a Command, the pod will authenticate successfully with the container registry without an ImagePullSecret, because it falls back to OIDC when performing the image fetch on the TaskRun pod using the TaskRun's service account. While the controller will also fall back to OIDC, it's using a different identity (the Pipeline controller service account), which isn't really expected by users.

This is likely the root cause of issues like tektoncd/pipeline#2316. As OIDC becomes more common in Kubernetes (see https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#service-account-issuer-discovery) this will likely become more of a problem.

Contributor

If that is the problem that this TEP is trying to address, is it possible to update the TEP with a detailed description and example? I'm not familiar enough with OIDC or the flow you've described to be able to follow without more details (and the TEP as currently stated isn't as specific as what you're describing)

to query the entrypoint of the image that is running on this pod. This will also
avoid the authentication problem when requests come from different pods on some
cloud providers, and the confusing restricted access to the image when the command is not
specified.
Contributor

can you elaborate on this authentication problem a bit?

Member Author

Sorry, I didn't put it clearly in the TEP. I updated the section after I did some experiments on entrypoint lookup and reproduced the issue tektoncd/pipeline#2316 using a GKE cluster. When Workload Identity is enabled, the user has to use the Tekton Pipelines controller service account to impersonate the corresponding GSA, since the request comes from the controller pod. But users usually don't know they should do this and only bind the service account for the target pod to the GSA. Billy gave a better explanation here: https://wlyn.ch/posts/gcp-tekton-workload-identity/
The experiments I did are listed below:

| Cluster | Config | Image fetching status | Note |
| --- | --- | --- | --- |
| minikube | Running registry in the cluster | Successful | Must ensure the connection between controller pod and registry pod first |
| GKE | Workload Identity disabled | Successful | GCE metadata server takes care of authentication |
| GKE | Workload Identity enabled | Failed | Only linked the target service account with the GSA |
| GKE | Workload Identity enabled | Successful | Only linked the Tekton Pipelines controller service account with the GSA |

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2021
Base automatically changed from master to main February 3, 2021 16:34
@vdemeester
Member

The problem statement makes it sound like it's not possible for the controller to reference private images, but it should be something the controller supports and I'm wondering if this is actually a documentation problem.

It's definitely not a problem. It's a bit more work than when using a public image, but it does work.

We will resolve the entrypoint in the binary if `image` is specified. The steps are listed
below:

1. set up `kubeClient`
Contributor

@skaegi skaegi Mar 1, 2021

This is an immediate problem for us... in our Tekton environment we explicitly do not permit TaskRun pods to have access to the Kube API service, i.e. accessing the Kube API is a "privilege" that we should not rely on.
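For context, step 1 above ("set up `kubeClient`") would presumably be the standard in-cluster setup sketched below; that call path is exactly the Kube API access this comment objects to granting TaskRun pods:

```go
// Sketch: the in-cluster client setup that step 1 of the TEP implies.
// rest.InClusterConfig uses the pod's mounted service account token,
// so it requires the TaskRun pod to reach the Kube API server.
package entrypoint

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newKubeClient() (kubernetes.Interface, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(cfg)
}
```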

@yaoxiaoqi yaoxiaoqi force-pushed the move-image-entrypoint-lookup branch from 994a811 to aaa23a5 Compare March 11, 2021 09:31
@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign bobcatfish
You can assign the PR to them by writing /assign @bobcatfish in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2021
@imjasonh
Member

Hi Yumeng, thanks for staying on top of this TEP. I think the next step would be some exploratory prototyping to make sure things work the way we expect them to with this design.

Simon's comment indicates some users might not accept the new requirements this puts in place, and I'm not completely sure that things will work exactly like we expect even with that approach.

I'd like to put this TEP on hold until we can see a working demo of this approach, to more fully understand the size of the change being proposed, and any new requirements it will inflict on end users and operators.

This TEP proposes to move the image entrypoint lookup to the entrypoint binary that
runs in the TaskRun pod.
@yaoxiaoqi yaoxiaoqi force-pushed the move-image-entrypoint-lookup branch from aaa23a5 to 9adfd63 Compare March 11, 2021 14:18
@yaoxiaoqi
Member Author

Simon's comment indicates some users might not accept the new requirements this puts in place, and I'm not completely sure that things will work exactly like we expect even with that approach.

Yes, I think I can first figure out whether we can fetch the image using go-containerregistry in the TaskRun pod.

@tekton-robot
Contributor

@yaoxiaoqi: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 15, 2021
@pritidesai
Member

@pritidesai to create an issue and close this PR so that we can track the requirements.

@pritidesai
Member

pritidesai commented Jun 21, 2021

Here is the feature request for this in the pipeline repo.

As per the discussion in the API WG, I am closing this PR. We need to first showcase this in the experimental repo to understand the impact of the proposal/changes before changing the API.

/close

@tekton-robot
Contributor

@pritidesai: Closed this PR.

In response to this:

Here is the feature request for this in the pipeline repo.

As per the discussion in the API WG, I am closing the TEP PR. We need to first showcase this in the experimental repo to understand the impact of the proposal/changes before changing the API.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
