
Get credentials from AWS ECR when needed #174

Merged: 6 commits, Oct 14, 2021

Conversation

@squaremo (Member) commented Sep 30, 2021

A rebase of #147 (to avoid force pushing over a main branch).

As discussed in #139

If the flag --user-aws-ecr is set, the controller will log in to AWS
Elastic Container Registry by fetching a temporary token from AWS
(when a given image is hosted in ECR).

TODOs carried over (in a comment, and based on Flux v1's implementation):

  • Caching of tokens (one per account/region pair); this fetches a fresh token every time
    • Handling of expiry

The quota for getting an auth token is 500/second/region: https://docs.aws.amazon.com/general/latest/gr/ecr.html. Since image scanning tends to happen on periods of O(minutes), you would need more than 500 * 60 images to scan (ImageRepository objects) before fetching a fresh auth token for each scan became unviable. You might have that many if you run automation in lots of clusters in the same region, under the same account; but in my estimation, not bothering with caching is the 80/20 solution (80% of the value with 20% of the complexity, in this case).

Expiry only needs to be accounted for if tokens are cached.

  • Back-Off in case of errors

Yes, retrying after failures is an easy way to burn through your API quota (see above). As it happens though, if an image is detected as being from ECR, and the flag for ECR auto-login was set, then failing to get a token will error out of .Reconcile(), and the controller-runtime machinery will back off.

  • Possibly: special behaviour for non-global partitions (China, GovCloud)

I'm not aware of any special treatment needed for these -- perhaps if you wanted to scan an image in ECR for some non-global partition, from another partition. But let's cross that bridge when we get to it.
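
For concreteness, here is a minimal sketch of the token exchange described above, using the AWS SDK for Go v1. The function name and structure are illustrative, not the exact code in this PR, and it assumes the region has already been parsed from the image host (see the parseAwsImage sketch further down). The pod's ambient IAM identity is exchanged for a short-lived registry credential, and any error is returned so the reconcile fails and controller-runtime backs off.

```go
package ecrauth

import (
	"encoding/base64"
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecr"
)

// getECRLogin exchanges the controller's AWS identity for a temporary
// username/password pair accepted by the ECR registry in the given region.
func getECRLogin(region string) (username, password string, err error) {
	sess, err := session.NewSession(&aws.Config{Region: aws.String(region)})
	if err != nil {
		return "", "", err
	}
	out, err := ecr.New(sess).GetAuthorizationToken(&ecr.GetAuthorizationTokenInput{})
	if err != nil {
		// Returning this error out of Reconcile() lets controller-runtime
		// requeue with back-off, rather than hot-looping against the quota.
		return "", "", err
	}
	if len(out.AuthorizationData) == 0 {
		return "", "", fmt.Errorf("no authorization data in ECR response")
	}
	// The token is base64("AWS:<password>").
	decoded, err := base64.StdEncoding.DecodeString(aws.StringValue(out.AuthorizationData[0].AuthorizationToken))
	if err != nil {
		return "", "", err
	}
	parts := strings.SplitN(string(decoded), ":", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("unexpected ECR token format")
	}
	return parts[0], parts[1], nil
}
```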

@squaremo (Member, Author)

I would love to see tests for this, but mocking AWS services is no small task. Perhaps something like https://github.com/localstack/localstack could help.

(Resolved review threads on main.go and controllers/imagerepository_controller.go)
@stefanprodan (Member)

@squaremo does this work with IRSA or with node IAM bindings or with both?

@squaremo (Member, Author)

does this work with IRSA or with node IAM bindings or with both?

From the explanation in #139 (comment) I would expect it works with the node IAM bindings that EKS does automatically, as well as IRSA (IAM roles for service accounts) and possibly other mechanisms, since "one can grant additional IAM permissions via a multitude of mechanics (IAM node role, IAM service account, or even configuring access keys manually via environment variables), and all official AWS SDKs will automatically try each until it gets a valid identity". @MartinEmrich Are you able to confirm that this is the correct expectation?
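
To make the "official SDKs try each mechanism until one yields an identity" point concrete, here is a hedged sketch (not code from this PR) that asks STS which identity the default credential chain resolved: on EKS that is typically the node role, or the IRSA role if the controller's service account is annotated for one. It assumes a region is available from the environment (IRSA injects AWS_REGION; otherwise it must be configured explicitly).

```go
package ecrauth

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sts"
)

// whoAmI reports the IAM identity resolved by the default credential chain:
// environment variables, then web identity (IRSA), then the EC2 instance/node role.
func whoAmI() (string, error) {
	sess, err := session.NewSession() // region taken from the environment
	if err != nil {
		return "", err
	}
	out, err := sts.New(sess).GetCallerIdentity(&sts.GetCallerIdentityInput{})
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("account=%s arn=%s", *out.Account, *out.Arn), nil
}
```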

@MartinEmrich (Contributor)

@squaremo @stefanprodan

It works with the IAM instance profile that each EKS node has to have.
I guess AWS puts the ECR credential helper (https://github.com/awslabs/amazon-ecr-credential-helper) in their provided AMIs, or they run a patched kubelet which regularly gets an ECR login token.
See also: https://docs.aws.amazon.com/AmazonECR/latest/userguide/ECR_on_EKS.html
The authenticated entity here is the Node/Kubelet itself, not the user workload.

Container image pulling should have no relationship to IAM service accounts/IRSA (after all, service accounts are assigned to pods, which run an already pulled image).

IRSA (or putting access keys somewhere in your workloads, which is always bad and unnecessary) operates within your workloads (Pods), so it is on a completely different level.

(Disclaimer: I'm just a dude with Google; I have no hidden insight into how AWS does stuff internally.)

@squaremo (Member, Author)

Regarding the name of the flag: I don't really have a strong opinion on use vs enable vs with; consistency can guide us there. But I would like it to be precise and descriptive, so I would go for --use-aws-iam-for-ecr or --aws-use-iam-for-ecr (I'm undecided on how I would want the flags to sort, whether by aws or by enable or use; but again, consistency should be the guide).

@MartinEmrich (Contributor)

As for the flag, I would like to urge you just one final time to make it enabled by default, to follow the Principle of Least Surprise. Again, this feature has no security implications, drawbacks, or unexpected behaviour. (IMHO, even a flag to disable it makes no sense, as it does not protect anything from anything.)

@squaremo (Member, Author) commented Oct 5, 2021

This is my understanding:

  • Flux v1 uses ecrClient.GetAuthorizationToken. This will work if the pod has permissions to read ECR in the account given.
    • A controller pod in EKS will generally have read permissions to ECR, granted through the node IAM role (mysteriously, the mechanism whereby pods get the node permissions is not mentioned there, but it is implied elsewhere)
    • Another possible mechanism for granting permissions in EKS is using IAM Roles for Service Accounts (IRSA) to give the controller service account the appropriate IAM role; it's possible to block the node permissions being granted to pods, in which case this would become necessary.
    • If you create an EKS cluster with eksctl or by following the getting started guide then the cluster nodes are given permission to read ECR
  • Flux v1 will only use this facility if you specifically allow it to, or if it detects (via the metadata service) that it's running in EC2 (which has its own problems).
  • At present, Flux v2 does not try to get authentication credentials for ECR.

This PR unconditionally asks ECR for credentials (given the enable flag, or if it were always enabled). That has the following effect:

| Controller environment | Effect |
| --- | --- |
| EKS | By default, all ImageRepository objects can read ECR for the account under which the EKS cluster runs |
| AWS in general | All ImageRepository objects can read ECR according to assigned IAM permissions |
| Outside EC2 | For ECR images, ecr.GetAuthorizationToken will fail and the ImageRepository object will be marked unready (whether or not another secret was supplied) |

So I think there are two reasons to guard this feature with a flag:

  • if you are not running in AWS, but you are using ECR for some reason, you don't want it to fail unconditionally (this can be fixed in the code by only trying ecr.GetAuthorizationToken when there's no secret referenced; see the sketch after this list)
  • Usually, every pod can read everything in (the cluster account in) ECR, and having ImageRepository objects work differently would be strange; however, it's possible to remove that global permission for pods, so it ought to be possible to remove it for ImageRepository objects.
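
A sketch of that guard, reusing getECRLogin from the earlier sketch together with a parseAwsImage helper (sketched further down, next to the commit that names it); the field and function names are illustrative rather than the controller's actual API:

```go
package ecrauth

import (
	"github.com/google/go-containerregistry/pkg/authn"
)

// ecrAuthIfNeeded returns an ECR authenticator only when the flag is on,
// no secret is referenced, and the image really is hosted in ECR; in every
// other case it returns nil and the usual secret/keychain path applies.
func ecrAuthIfNeeded(autoLogin bool, secretRefName, image string) (authn.Authenticator, error) {
	if !autoLogin || secretRefName != "" {
		return nil, nil
	}
	_, region, ok := parseAwsImage(image)
	if !ok {
		return nil, nil // not an ECR image, nothing to do
	}
	username, password, err := getECRLogin(region)
	if err != nil {
		// Surfacing this from Reconcile() marks the object unready and
		// lets controller-runtime retry with back-off.
		return nil, err
	}
	return &authn.Basic{Username: username, Password: password}, nil
}
```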

@stefanprodan (Member) commented Oct 5, 2021

if you are not running in AWS, but you are using ECR for some reason, you don't want it to fail unconditionally (this can be fixed in the code by only trying ecr.GetAuthorizationToken when there's no secret referenced)

I think this is the way to go, especially since many orgs don't use the same account for ECR and EKS. We can't assume that credentials present on nodes are for the registry that hosts the app images.

@stefanprodan (Member) left a comment

Can we add an example to the API docs and explain how people can enable auto login of ECR?

@squaremo (Member, Author) commented Oct 7, 2021

Can we add an example to the API docs and explain how people can enable auto login of ECR?

I was wondering about that. It's not part of the API, and there are no "running the controller" docs kept in the individual repo. I think it would go somewhere on the docs website.

Having said that, it can be mentioned in the API guide somewhere around the SecretRef explanation.

MartinEmrich and others added 6 commits October 14, 2021 10:00
As discussed in #139

If the flag `--user-aws-ecr` is set, the controller will log in to AWS
Elastic Container Registry by fetching a temporary token from AWS
(when a given image is hosted in ECR).

Signed-off-by: Martin Emrich <martin@martinemrich.me>
 - rename to parseAwsImage, since it's not a URL
 - make return value include a bool rather than an error, since it's
   not an error for the image to not be ECR
 - return values explicitly rather than assigning them, since that can
   be prone to shadowing mistakes

Signed-off-by: Michael Bridgen <michael@weave.works>
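
As an illustration of the signature described in the commit above (a rough sketch only; the actual regular expression and package layout in the PR may differ):

```go
package ecrauth

import "regexp"

// ecrHostRe matches image refs hosted in ECR, e.g.
// 123456789012.dkr.ecr.eu-west-1.amazonaws.com/myrepo:tag
var ecrHostRe = regexp.MustCompile(`^([0-9]{12})\.dkr\.ecr\.([^/.]+)\.amazonaws\.com(\.cn)?/`)

// parseAwsImage extracts the account ID and region from an ECR image ref.
// Not being an ECR image is not an error, so that case is reported with a
// bool rather than an error value.
func parseAwsImage(image string) (accountID, region string, ok bool) {
	m := ecrHostRe.FindStringSubmatch(image)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}
```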
Checking for ECR whether or not there's a secret referenced means the
reconciliation can fail despite there being a valid credential. If you
are not running in EKS (or otherwise with AWS IAM permissions), but
using ECR, you will still want to be able to supply a secret.

Signed-off-by: Michael Bridgen <michael@weave.works>
This is my attempt at a descriptive flag for this feature. It mentions
AWS and ECR, and suggests automatically getting credentials. It starts
with `aws` so it will sort alongside any other AWS-specific flags that
come along later.

Signed-off-by: Michael Bridgen <michael@weave.works>
The TODO items for scanning ECR were based on Flux v1's ECR-specific
code for image scanning; but it turns out they are not necessary for a
viable implementation. I have removed the TODOs, and given an
explanation why it's fit for purpose as it is.

Signed-off-by: Michael Bridgen <michael@weave.works>
Signed-off-by: Michael Bridgen <michael@weave.works>
(Resolved review thread on main.go)
@stefanprodan (Member) left a comment

LGTM

Thanks @squaremo

PS. Maybe we should create issues for Azure and GCP auto login and point to this PR as an example for future contributors.

@squaremo (Member, Author)

Maybe we should create issues for Azure and GCP auto login and point to this PR as an example for future contributors.

Good thinking:

There's also #11 which is something like an umbrella issue for these.

@jtnz commented Oct 21, 2021

Hi @squaremo, I just tried this in Production today and it works. Thanks! Just one thing to note: I believe you only have to log into a registry (account + region combination) once. This implementation appears to be doing it for every repository within the same registry, which will bulk out the logs and hit the API quite a lot when we start scanning our 100s of repos every minute or so. Perhaps some naive approach of working out all the registries in the loop that need logging into, then just doing each once (in our use case that's 1 login vs. n logins for n repos)?

@squaremo (Member, Author) commented Oct 21, 2021

I just tried this in Production today and it works.

🎉 Thank you for guinea-pigging this feature!

I believe that you only have to log into a registry (account + region combination) once. This implementation appears to be doing if for every repository within the same registry, which will bulk out the logs and hit the API quite a lot when we start scanning our 100s of repos every minute or so.

Yes, it's the simplest possible implementation to start with. I did a quick calculation to check that it would be well within the quota for requests for most users (you can check my assumption here). But, I do understand being nervous about quotas, and there is certainly room to reduce the number of requests required. Flux v1 caches the tokens it gets, and since this is a flag that applies to all image scans, the implementation here could do the same.
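
For the record, the caching being discussed could look roughly like this: one token per account/region pair, refreshed only when close to expiry. This is a sketch only, reusing getECRLogin from the earlier sketch and assuming the documented 12-hour ECR token lifetime.

```go
package ecrauth

import (
	"sync"
	"time"
)

// cachedToken holds one ECR login per account/region pair.
type cachedToken struct {
	username, password string
	expiresAt          time.Time
}

// tokenCache avoids calling GetAuthorizationToken on every scan: most
// reconciles reuse the cached token until it nears expiry.
type tokenCache struct {
	mu     sync.Mutex
	tokens map[string]cachedToken // keyed by "<accountID>/<region>"
}

func (c *tokenCache) get(accountID, region string) (string, string, error) {
	key := accountID + "/" + region
	c.mu.Lock()
	defer c.mu.Unlock()
	if tok, ok := c.tokens[key]; ok && time.Until(tok.expiresAt) > 10*time.Minute {
		return tok.username, tok.password, nil
	}
	user, pass, err := getECRLogin(region)
	if err != nil {
		return "", "", err
	}
	if c.tokens == nil {
		c.tokens = map[string]cachedToken{}
	}
	// ECR tokens are documented to last 12 hours; the response's ExpiresAt
	// field would be the better source in real code.
	c.tokens[key] = cachedToken{username: user, password: pass, expiresAt: time.Now().Add(12 * time.Hour)}
	return user, pass, nil
}
```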

kingdonb pushed a commit to lloydchang/image-reflector-controller that referenced this pull request Oct 31, 2021
kingdonb pushed a commit to kingdonb/image-reflector-controller that referenced this pull request Oct 31, 2021
kingdonb pushed a commit to kingdonb/image-reflector-controller that referenced this pull request Dec 16, 2021