This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Error in logs with Flux for docker images repositories from AWS Public ECR endpoint #3492

Closed
allamand opened this issue Jun 17, 2021 · 28 comments

Comments

@allamand

Describe the bug

Errors keep popping up in the Flux controller logs for images hosted in the public ECR registry (see the Logs section below).

To Reproduce

Steps to reproduce the behaviour:

  1. Provide Flux install instructions
  2. Provide a GitHub repository with Kubernetes manifests

Expected behavior

No errors in the Flux logs

Logs

flux ts=2021-06-17T12:39:22.559110358Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes/pause auth={map[]} err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
flux ts=2021-06-17T12:39:24.167081125Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-observability/aws-sigv4-proxy auth={map[]} err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
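
For readers decoding the error text: the client expected a JSON error body, but the registry answered the 404 with the plain-text body `404 page not found\n`. A JSON parser accepts `404` as a complete number and then fails on the next non-space character, `p`. The sketch below reproduces that with Python's `json` module; it only illustrates the parsing failure and is not Flux's own code (the helper name is made up for this example).

```python
import json

# Plain-text body returned with the 404, as quoted in the log line.
body = "404 page not found\n"

def explain_parse_failure(text: str) -> str:
    """Try to decode text as JSON and describe where decoding stops."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # exc.pos is the offset of the first character the parser rejected.
        parsed = text[:exc.pos].strip()
        rejected = text[exc.pos:exc.pos + 1]
        return f"valid JSON value {parsed!r}, then unexpected {rejected!r}"
    return "parsed cleanly"

print(explain_parse_failure(body))
```

So the message says less about the image itself than about the endpoint: the request reached a path that the registry does not serve as a Registry API route.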

Note: I also tried giving Flux the ECR read-only registry policy, but the problem persists:

eksctl create iamserviceaccount --cluster ${CLUSTER_NAME} \
    --namespace flux \
    --name flux \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly  \
    --override-existing-serviceaccounts \
    --approve

Additional context

  • Container registry provider: AWS ECR Public
@allamand allamand added blocked-needs-validation Issue is waiting to be validated before we can proceed bug labels Jun 17, 2021
@pierluigilenoci

@allamand we also have the same problem.
@kingdonb @yebyen @stefanprodan @dholbach could you please take a look?

@pierluigilenoci

FYI

ts=2021-07-20T17:16:48.670094734Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-config>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-07-20T17:17:34.176691936Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-config>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-07-20T17:17:34.540198093Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-config>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-07-20T17:18:10.36478641Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-config>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""

@kingdonb
Member

I don't have access to an ECR registry, but this might be related to #3015 or #3124

The issue appears to be that your registry does not allow listing of tags. I am almost certain I have seen this issue before...

AWS provides this documentation about listing tags from ECR:
https://awscli.amazonaws.com/v2/documentation/api/latest/reference/ecr/list-tags-for-resource.html

That's an AWS API endpoint, not a Registry v2 API endpoint, for reference. I do not know if ECR supports publicly listing tags but it seems like if it did, that would be a special feature that might have to be separately enabled and permitted.

This leads me to believe that the normal Docker Registry v2 API method for listing tags cannot be used by Flux. This seems to agree with the 404 response; if AWS does not publish a tags/index endpoint in their ECR Registry API, then this will not work with Flux. The docs do mention support for ECR in some places though, so I am not sure if this was ever supported.
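
For reference, the Docker Registry HTTP API v2 exposes tag listing at `GET /v2/<name>/tags/list` and manifests at `GET /v2/<name>/manifests/<reference>`. The sketch below only builds those two URLs so the requests can be compared by hand (for example with curl); the helper names are hypothetical and nothing here contacts a registry.

```python
from urllib.parse import quote

def tags_list_url(registry_host: str, repository: str) -> str:
    """URL of the Registry v2 tag index for a repository."""
    # quote() leaves '/' alone by default, so nested repository paths survive.
    return f"https://{registry_host}/v2/{quote(repository)}/tags/list"

def manifest_url(registry_host: str, repository: str, reference: str) -> str:
    """URL of the manifest for a tag or digest."""
    # ':' is kept so that digest references such as sha256:... stay intact.
    return f"https://{registry_host}/v2/{quote(repository)}/manifests/{quote(reference, safe=':')}"

print(tags_list_url("public.ecr.aws", "eks-distro/kubernetes/pause"))
print(manifest_url("public.ecr.aws", "eks-distro/kubernetes/pause", "latest"))
```

A registry that answers the second URL but returns 404 for the first matches the symptom in the logs: images pull fine, yet the tag scan fails.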

Is this a new problem, (something that worked before, but stopped working?) or related to some changes in your cluster, behavior that wasn't tested before and might not have ever worked?

If it's not supported, the best you can do might be to add an exclusion to avoid scanning the public ECR endpoints. Please let me know if this information helps at all. There is documentation for Flux v2 to support ECR, but I don't know if it covers public ECR or if it can cover that (as you may be aware, Flux v2 also requires listing images in order to promote image updates via ImagePolicy and Automation.)
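
On the exclusion idea: the Flux v1 daemon documents a `--registry-exclude-image` flag that takes glob expressions. Assuming that flag is available in your version, a pattern such as `public.ecr.aws/*` should keep the warmer away from these repositories. The sketch below previews which images a pattern would cover using Python's `fnmatch`; Flux's own glob matching may differ in detail, so treat it as an approximation.

```python
from fnmatch import fnmatchcase

# Image repositories mentioned in this thread.
IMAGES = [
    "public.ecr.aws/eks-distro/kubernetes/pause",
    "public.ecr.aws/aws-observability/aws-sigv4-proxy",
    "realz/logrotate",
]

def excluded(image: str, patterns: list[str]) -> bool:
    """True if any glob pattern matches the image name."""
    return any(fnmatchcase(image, pattern) for pattern in patterns)

patterns = ["public.ecr.aws/*"]
for image in IMAGES:
    print(image, "->", "skip scan" if excluded(image, patterns) else "scan")
```

Excluding a repository only silences the scan; it also means Flux cannot automate image updates for it.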

@pierluigilenoci

@kingdonb in our specific case it does not refer to a private registry but the problem occurs with the AWS public registry public.ecr.aws.

@kingdonb
Member

kingdonb commented Jul 20, 2021

@pierluigilenoci I understand that, the issue is that ECR does not appear to respond to the Docker Registry API's method for listing existing tags as an index. AWS provides their own API instead. This will perhaps still be true for public registries.

Docker registry clients that you can use as a stand-alone tool are a bit of a dark art, but I may have one handy that I can use to confirm this expected Registry v2 endpoint is or is not supported. It must provide access to list tags in the standard way, else it will likely not be usable with Image Automation in any (current or historical) version of Flux.

@kingdonb
Member

kingdonb commented Jul 21, 2021

OK, for example, here is a conversation with the Docker Hub registry that implements the full Registry API v2:

2.7.3 :014 > r2 = DockerRegistry2.connect('https://registry.hub.docker.com/v2')
 => #<DockerRegistry2::Registry:0x00007fde28490670 @uri=#<URI::HTTPS https://registry.hub.docker.com/v2>, @base_uri="https://registry.hub.docker.com:443", @user=nil, @password=nil, @ht...
2.7.3 :015 > t2 = r2.tags('kingdonb/flux')
 => {"name"=>"kingdonb/flux", "tags"=>["1.14.2", "master-64bef7a3", "master-c86a5b31", "omnibus-branch-6f9deeda", "rationalize-use-messages-41fb1412", "rationalize-use-messages-c0a480f...

A list of tags comes back from the tag index endpoint. This is the expected reply from a conforming Docker Registry v2 (or possibly this feature was not added until version 2.1; I may have seen that while looking for more info about this).

I'm using a Ruby Docker Registry client, https://github.com/deitch/docker_registry2 – the Docker Hub registry API endpoint v2 is a well-known URL that is usually baked into docker clients somehow, which can be used for debugging, in this case just to show what a normal conversation in Registry v2 looks like according to Flux.

I can pull the manifest as an alternative way of determining if I've connected correctly to the public/unauthenticated registry endpoint (this occasionally fails with 429 Too Many Requests, but after a few moments patience and trying again, I get an affirmative reply back like this, with the manifest of the image belonging to the requested tag):

> r2.manifest('kingdonb/flux', '1.14.2')
 => {"schemaVersion"=>2, "mediaType"=>"application/vnd.docker.distribution.manifest.v2+json", "config"=>{"mediaType"=>"application/vnd.docker.container.image.v1+json", "size"=>8035, "d...

Here's what that conversation looks like when I try the same thing with the eks-d public registry that hosts, for example, public.ecr.aws/eks-distro/kubernetes/pause. I looked up a valid tag name in some documentation so I could have the same conversation and confirm that I've reached the correct endpoint by pulling the manifest. That works.

The tag list function returns the same 404 error that you were reporting though, @pierluigilenoci

2.7.3 :022 > r = DockerRegistry2.connect('https://public.ecr.aws/')
 => #<DockerRegistry2::Registry:0x00007fde24cd9dd0 @uri=#<URI::HTTPS https://public.ecr.aws/>, @base_uri="https://public.ecr.aws:443", @user=nil, @password=nil, @http_options={:open_ti...

2.7.3 :023 > t = r.manifest('eks-distro/kubernetes/pause', 'v1.18.9-eks-1-18-1')
 => {"mediaType"=>"application/vnd.oci.image.index.v1+json", "schemaVersion"=>2, "manifests"=>[{"mediaType"=>"application/vnd.oci.image.manifest.v1+json", "digest"=>"sha256:2237620baa2...

2.7.3 :024 > r.tags('eks-distro/kubernetes/pause')
Traceback (most recent call last):
       16: from /Users/kingdonb/.rvm/rubies/ruby-2.7.3/lib/ruby/2.7.0/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
...
        1: from /Users/kingdonb/.rvm/gems/ruby-2.7.3/gems/docker_registry2-1.10.0/lib/registry/registry.rb:327:in `rescue in do_bearer_req'
DockerRegistry2::NotFound (404 Not Found)

This seems to indicate the public.ecr.aws registry is not a fully v2 conformant registry. Clients that want to support it will need to implement the AWS API, (and it's not clear to me how I can know whether tag listing is even supported for public image repos. I guess I would need to use an AWS account to access the public registry data, this is very counter-intuitive.)

@pierluigilenoci

Thank you @kingdonb,
I opened a support request with AWS to understand how to fix the problem.

@kingdonb kingdonb added blocked and removed blocked-needs-validation Issue is waiting to be validated before we can proceed labels Jul 22, 2021
@pierluigilenoci

@kingdonb AWS support confirmed the problem.

Ref: aws/containers-roadmap#1262

@kingdonb
Member

Thanks for making the connection to AWS, aws/containers-roadmap#1262 (comment) (yikes!)

@hspencer77

hspencer77 commented Aug 1, 2021

Not sure if this was attempted, but I was able to get Flux to work with public ECR images by making sure the IAM role associated with the Flux deployment had the appropriate IAM access policy (policy ARN arn:aws:iam::aws:policy/AmazonElasticContainerRegistryPublicReadOnly). I was able to use the aws-otel-collector image successfully on an EKS cluster.

@kingdonb
Member

kingdonb commented Aug 1, 2021

That's interesting, so if your cluster is not on AWS and you are using images published via public.ecr.aws I guess that sort of implies you'll need to be using an AWS account and providing token credentials for the authenticator, (not exactly public!)

@hspencer77 do you happen to know if this method uses the AWS API registry index method, or if it works through the docker registry API? My assumption is since it uses an IAM role, it must be the AWS API.

@hspencer77

@kingdonb , based upon the documentation, it looks like it relies on having an authentication token. For example, when I didn't pass an authentication token, I received the following error:

$ curl -i https://public.ecr.aws/v2/aws-observability/aws-sigv4-proxy/manifests/latest
HTTP/2 401
date: Sun, 01 Aug 2021 22:54:54 GMT
content-type: application/json; charset=utf-8
content-length: 58
docker-distribution-api-version: registry/2.0
www-authenticate: Bearer realm="https://public.ecr.aws/token/",service="public.ecr.aws",scope="aws"

{"errors":[{"code":"DENIED","message":"Not Authorized"}]}

When I passed an authentication token, I was able to retrieve the manifest for the image tag:

$ curl -i -H "Authorization: Bearer $TOKEN" https://public.ecr.aws/v2/aws-observability/aws-sigv4-proxy/manifests/latest
HTTP/2 200
date: Sun, 01 Aug 2021 22:54:39 GMT
content-type: application/vnd.docker.distribution.manifest.v2+json
content-length: 737
docker-distribution-api-version: registry/2.0

{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
   "config": {
      "mediaType": "application/vnd.docker.container.image.v1+json",
      "size": 910,
      "digest": "sha256:d715c3ce0bfe87ff99e8992868b2205d47e4da0dceb14ca922cbee2bc9243316"
   },
   "layers": [
      {
         "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
         "size": 122912,
         "digest": "sha256:c30f89a20efc79f274fd67906263abd59c1f7f90074b819fe6f3eb0cb7a365b2"
      },
      {
         "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
         "size": 6487633,
         "digest": "sha256:31d49d395b79017c89705728cdb858dea63f896bdcf4e39c4512c2d8073ce4e9"
      }
   ]
}

Another thing I noticed in the documentation is that there wasn't any example of docker images. Pretty much, it only shows pull and push examples. The documentation about supported image manifests may help as well.
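
The `www-authenticate` header in the 401 response above is a standard Bearer challenge. Under the Docker registry token authentication scheme, a client requests a token from the `realm` URL, passing `service` and `scope` as query parameters, and then retries with `Authorization: Bearer <token>`. The sketch below parses the challenge and builds the token request URL. It makes no network call, and whether public.ecr.aws hands out anonymous tokens this way is an assumption not confirmed in this thread; the function names are hypothetical.

```python
import re
from urllib.parse import urlencode

# Challenge header copied from the 401 response above.
CHALLENGE = 'Bearer realm="https://public.ecr.aws/token/",service="public.ecr.aws",scope="aws"'

def parse_bearer_challenge(header: str) -> dict:
    """Split a Bearer challenge into its key="value" parameters."""
    scheme, _, params = header.partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError(f"unsupported auth scheme: {scheme!r}")
    return dict(re.findall(r'(\w+)="([^"]*)"', params))

def token_url(challenge: dict) -> str:
    """Build the token request URL from the parsed challenge."""
    query = {key: challenge[key] for key in ("service", "scope") if key in challenge}
    return f"{challenge['realm']}?{urlencode(query)}"

parsed = parse_bearer_challenge(CHALLENGE)
print(parsed)
print(token_url(parsed))
```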

@hspencer77

@kingdonb as a follow-up, even though I was able to pull the image successfully and have it running on my EKS cluster, I still see this error any time I do a fluxctl sync:

ts=2021-08-03T01:43:28.459192512Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-observability/aws-otel-collector auth={map[]} err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""

The image I am using is this:

Image:      public.ecr.aws/aws-observability/aws-otel-collector:latest

I have also confirmed that even though the error shows up in the logs, the image is successfully pulled down (in another deployment where specific version 2.18.0 is defined):

Events:
  Type     Reason           Age   From               Message
  ----     ------           ----  ----               -------
  Warning  LoggingDisabled  92s   fargate-scheduler  Disabled logging because aws-logging configmap was not found. configmap "aws-logging" not found
  Normal   Scheduled        34s   fargate-scheduler  Successfully assigned ads-dev/ads-processors-6c9f678444-5nhr9 to fargate-ip-10-27-6-165.us-west-2.compute.internal
  Normal   Pulling          33s   kubelet            Pulling image "realz/logrotate:0.1"
  Normal   Pulled           30s   kubelet            Successfully pulled image "realz/logrotate:0.1"
  Normal   Created          30s   kubelet            Created container logrotate
  Normal   Started          30s   kubelet            Started container logrotate
  Normal   Pulling          30s   kubelet            Pulling image "public.ecr.aws/aws-observability/aws-for-fluent-bit:2.18.0"
  Normal   Pulled           20s   kubelet            Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:2.18.0"
  Normal   Created          19s   kubelet            Created container aws-for-fluent-bit
  Normal   Started          18s   kubelet            Started container aws-for-fluent-bit

Here is the error in the logs:

...
ts=2021-08-03T06:19:28.143642096Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-observability/aws-for-fluent-bit auth={map[]} err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-03T06:19:30.967071929Z caller=images.go:159 component=sync-loop err="fetching image metadata for public.ecr.aws/aws-observability/aws-for-fluent-bit: item not in cache, last error: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
...

@pierluigilenoci

A little update on the situation on my side (EKS cluster).

I tried, as suggested, attaching the arn:aws:iam::aws:policy/AmazonElasticContainerRegistryPublicReadOnly policy to the Service Role for the EC2 Node Group, and I updated Flux to the latest version (1.23.2), but the error messages keep appearing.

ts=2021-08-13T08:11:25.986658906Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@github.com/[REDACTED] branch=[REDACTED] HEAD=[REDACTED]
ts=2021-08-13T08:12:39.053016064Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:13:18.695749073Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:13:19.058279705Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:13:28.145340636Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:14:17.151279983Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:15:16.102895506Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:15:22.972276987Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:15:23.344056067Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-13T08:15:54.544792709Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
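
The same few repositories repeat in these logs. A quick way to reduce a log dump to the distinct repositories that fail the tag scan (for example, to decide what to exclude from scanning) is sketched below. The sample lines are abridged from the logs above, with the `auth` value and the tail of the error text trimmed; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches warmer lines that report the 404 on the tag request.
WARMER_404 = re.compile(
    r"component=warmer canonical_name=(?P<repo>\S+)\s.*requesting tags: error parsing HTTP 404"
)

def failing_repositories(log_text: str) -> Counter:
    """Count tag-scan 404 errors per repository."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WARMER_404.search(line)
        if match:
            counts[match.group("repo")] += 1
    return counts

SAMPLE = """\
ts=2021-08-13T08:11:25.986658906Z caller=loop.go:134 component=sync-loop event=refreshed
ts=2021-08-13T08:14:17.151279983Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe auth=... err="requesting tags: error parsing HTTP 404 response body"
ts=2021-08-13T08:15:16.102895506Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe auth=... err="requesting tags: error parsing HTTP 404 response body"
ts=2021-08-13T08:15:54.544792709Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar auth=... err="requesting tags: error parsing HTTP 404 response body"
"""

for repo, count in failing_repositories(SAMPLE).most_common():
    print(count, repo)
```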

@kingdonb
Member

Hi @pierluigilenoci

I have kingdonb/flux at 0294896 which contains all of the changes I planned to release in 1.24.0.

I'm pushing that image out for testing as kingdonb/flux:master-0294896a in just a few minutes.

If you'd like to test this, there are a number of fixes and updates there which we haven't discussed individually, but I actually don't think this is going to solve your issue by itself. I think there is a configuration change needed on your end, and I'm not prepared to help with that from here. You might also need to add the AWS region to the Flux config through helm chart values, so that it bypasses the AWS metadata API for those things and just uses the role. (This is the issue from #3015)

I am a little bit lost when it comes to AWS stuff as I don't have an AWS account that I am testing against. I am sorry I cannot be more helpful with this.

If this is very important to you, we have paid support options that we can follow, and I'll be happy to contact you in private about them – but this is the limit of what I can do within the bounds of community support. There are people on my team with much more AWS knowledge than I have, but their availability is usually subject to a paid support contract.

Please see:

https://fluxcd.io/support/#i-am-stuck
and
https://fluxcd.io/support/#my-employer-needs-additional-help

If you are already engaging our paid support then I can certainly try to help escalate this.

@kingdonb
Member

Hello all

Flux 1.24.0 has been released. Please re-check this issue, if you have been waiting for the release.

I am not sure if all responders on this issue have the same issue, or if at this point the original poster has resolved their issue and we are spamming them. (If so, we can close it out and let anyone who has a separate issue report it again. You're welcome to refer back to this issue, but I need a complete and original report to follow up, so we can be sure we are not creating spam clouds around issues that don't pertain to the original reporter's issue.)

@pierluigilenoci

Hi @kingdonb,
I tried version 1.24.0 today:

kubectl get pods flux-b6c85484f-pbwb9 -o json | jq ".spec.containers | .[0].image"
"docker.io/fluxcd/flux:1.24.0"

I get the same result:

ts=2021-08-26T16:50:49.399762465Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""

@kingdonb
Member

@pierluigilenoci from my previous message, highlight:

I think there is a configuration change needed on your end

Please make sure your configuration matches the config described here: #3492 (comment)

Flux can speak the AWS API, as long as an IAM role is assigned and okToUseAWS is true. This means that either Flux must be able to reach the AWS Metadata API (so it can determine the region to use), or the AWS region must be manually set via Helm values configuration. Please make sure you have set these values as appropriate:

flux/chart/flux/values.yaml

Lines 191 to 196 in c1267f5

  # AWS ECR settings
  ecr:
    region:
    includeId:
    excludeId:
    require: false

If you are on the latest version, and your IAM role is permitted to pull from public registries with the arn:aws:iam::aws:policy/AmazonElasticContainerRegistryPublicReadOnly permission, it should work as it has been confirmed by @hspencer77 above. Please be aware though, many (standard) EKS configs will block access from pods to the AWS metadata API, so you must set the region or Flux will not be able to use ECR, as okToUseAWS will not be set to true.

This impassable condition was addressed by #3124, so until that was merged into the 1.23 release series, it would in many cases not have been possible for this to work. It is a little bit counter-intuitive that you must set the AWS region to use ECR but I think it is unavoidable. (I am not sure how to document this well or to prevent this from coming up again, except to recommend that people should move on to Flux v2 as soon as possible, as any documentation call-out in the Flux v1 docs seems likely to be overlooked/unlikely to be noticed, but I will happily accept a PR for the docs in fluxcd/website if someone can show how the docs could be improved to better support this workflow.)

Please keep us in the loop if this still isn't working, or if that information resolves it. There should be no technical reason for blocking this from working, but there are certainly a few things which you might trip over that may not be possible to resolve.

@pierluigilenoci

pierluigilenoci commented Aug 31, 2021

All EKS nodes in my cluster have an AWS::IAM::Role with these permissions:

arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
arn:aws:iam::aws:policy/AmazonElasticContainerRegistryPublicReadOnly
arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy

and Flux 1.24 is configured in this way:

registry:
  ecr:
    region: eu-west-1

Logs:

ts=2021-08-31T10:35:07.256293914Z caller=aws.go:125 component=aws info="using regions from local config"
ts=2021-08-31T10:35:07.256649046Z caller=aws.go:117 component=aws info="restricting ECR registry scans" regions=[eu-west-1] include-ids=[] exclude-ids="[602401143452 918309763551]"
ts=2021-08-31T10:35:07.499084486Z caller=checkpoint.go:24 component=checkpoint msg="up to date" latest=1.24.0
ts=2021-08-31T10:33:54.29279459Z caller=sync.go:542 method=Sync cmd=apply args= count=15
ts=2021-08-31T10:33:55.117563567Z caller=sync.go:608 method=Sync cmd="kubectl apply -f -" took=824.690172ms err=null output="helmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/flux unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged\nhelmrelease.helm.fluxcd.io/[REDACTED] unchanged"
ts=2021-08-31T10:33:55.227492868Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@github.com/[REDACTED]/[REDACTED] branch=[REDACTED] HEAD=[REDACTED]
ts=2021-08-31T10:33:55.22753494Z caller=images.go:17 component=sync-loop msg="polling for new images for automated workloads"
ts=2021-08-31T10:34:33.064357264Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/eks-distro/kubernetes-csi/external-provisioner auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""
ts=2021-08-31T10:34:35.102430814Z caller=warming.go:180 component=warmer canonical_name=public.ecr.aws/aws-secrets-manager/secrets-store-csi-driver-provider-aws auth="{map[index.docker.io:<registry creds for [REDACTED]@index.docker.io, from /dockercfg/docker-credentials>]}" err="requesting tags: error parsing HTTP 404 response body: invalid character 'p' after top-level value: \"404 page not found\\n\""

So?

@kingdonb
Member

kingdonb commented Aug 31, 2021

@pierluigilenoci I'm sorry I didn't read this thread more carefully, @hspencer77 had those same errors in their logs as well.

It might take some time to get around to reproducing this on my own. I don't have an AWS account or sponsored resources of any kind on AWS cloud. If someone from AWS was to step in and work on this, it would be more likely to get solved sooner.

Is there full consensus that the image warmer does not successfully reach public.ecr.aws? I thought someone said that it worked, but now it isn't clear that it does. If it was something wrong in an existing feature then surely we should fix it, but I honestly don't know why AWS does this differently or if they have plans to remediate this and make it work more like any other standard Docker public registry. If so, then we would be better off waiting, rather than investing more into an EOL integration which is questionable whether or not it should be in the scope of Flux's maintenance mode. (Did this ever work before?)

If it requires special support from Flux to use with their image registry, it will need to have existing feature support inside of the codebase, else it would be a new feature and therefore definitely would fall outside of the scope of maintenance mode.

I know little about Flux's AWS integration. It is clear that it is meant to work with ECR, but it is not clear that the public.ecr.aws endpoint is actually treated by Flux as an ECR endpoint that requires IAM authentication.
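
One concrete reason the public endpoint might not be handled like ECR: private ECR registries use hostnames of the form `<12-digit account id>.dkr.ecr.<region>.amazonaws.com`, while the public gallery is the single host `public.ecr.aws`. The log line `restricting ECR registry scans ... exclude-ids=[602401143452 918309763551]` and the empty `auth={map[]}` on the failing requests are both consistent with account-ID-based host matching, but that Flux matches hosts this way is an assumption here, not something verified against its source. A sketch of such a classifier, with a hypothetical function name:

```python
import re

# Hostname shape of a private ECR registry: account id, region, AWS domain.
PRIVATE_ECR_HOST = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(\.cn)?$"
)

def classify_registry(host: str) -> tuple:
    """Return (kind, account_id, region) for a registry hostname."""
    match = PRIVATE_ECR_HOST.match(host)
    if match:
        return ("private-ecr", match["account"], match["region"])
    if host == "public.ecr.aws":
        # No account id or region can be derived from this hostname.
        return ("public-ecr", None, None)
    return ("other", None, None)

for host in ("602401143452.dkr.ecr.eu-west-1.amazonaws.com", "public.ecr.aws", "index.docker.io"):
    print(host, "->", classify_registry(host))
```

If a client only attaches AWS credentials to hosts of the first kind, requests to `public.ecr.aws` go out unauthenticated regardless of which IAM policies are attached.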

Perhaps there is someone from the EKS-D team or elsewhere in AWS support who can pop in and give an opinion on how we should proceed. (Do you know anyone @pierluigilenoci ? I think you mentioned that you have an AWS support contract, so if you have someone on the team who already has context around this problem, it will be easier to get the conversation started.)

@kingdonb
Member

kingdonb commented Aug 31, 2021

It's also not clear to me that simply giving the role to the nodes can solve any problem. You need the Flux daemon pod to have the role. EKS isolates the pods and prevents them from reaching the AWS Metadata API to prevent them from grabbing arbitrary roles that have been assigned to the nodes for some higher function of administration that ought not be granted to every pod on the cluster. We discussed assigning the role to the node group, but did you place whatever annotation on the Flux daemon pod/service account that gives the role assignment to the actual Flux pod?

Some basic research indicates knowledge of this strategy (IRSA) might be important:
https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html

I guess it is the Flux pod's service account that must be annotated to give it access to the desired/specified IAM role. Again I am not skilled or trained for AWS but this is just what I found, I've heard of IRSA before and understood this information about the blocked metadata API from working on #3124.
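
With IRSA, the link between pod and role is the `eks.amazonaws.com/role-arn` annotation on the service account (this is what `eksctl create iamserviceaccount`, used in the original report, sets up). The sketch below builds such a ServiceAccount manifest as JSON, which `kubectl apply -f -` accepts. The name and namespace `flux` come from the eksctl command earlier in the thread; the role ARN is a made-up placeholder.

```python
import json

def annotated_service_account(name: str, namespace: str, role_arn: str) -> dict:
    """ServiceAccount manifest carrying the IRSA role annotation."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }

# Placeholder ARN; substitute the role that carries the ECR policies.
manifest = annotated_service_account(
    "flux", "flux", "arn:aws:iam::111122223333:role/flux-ecr-read"
)
print(json.dumps(manifest, indent=2))
```

Pods pick up the role only after they are restarted with the annotated service account, and the cluster needs an IAM OIDC provider configured for IRSA to work at all.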

Sorry again for the trouble around this. Hope some of this information helps, please do keep us informed so we can close this issue with a positive resolution if it is possible!

@hspencer77

@kingdonb , not sure if you saw this, but I think you could get an AWS account for free (given the purpose of the flux project) - https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/. Figured this would be a good way to help figure this issue out (and any others in the future).

@kingdonb
Member

kingdonb commented Aug 31, 2021

Thanks @hspencer77, that's helpful, and I will consider it for the future!

I think AWS has been forthcoming with cloud resource credits in the past for us, I haven't been with the Flux project for that long, but what we really need is someone who is already in a position to reproduce this issue to spend the time on solving it, and add any documentation for Flux support of this feature if it is necessary.

I can build an EKS cluster some time and try it out, but I will be learning a lot of AWS stuff practically from scratch; there may be time for that at some point, but unfortunately with the level of investment required to get myself started solving this issue it would be pretty counter-intuitive to me in my position and role. This especially, even doubly so, given that we don't know the state of this capability in Flux v2, and any work we do here to solve it on the Flux v1 side would potentially need to be repeated over there, else Flux v1 will potentially have features that Flux v2 doesn't.

I would absolutely like nothing more than to guarantee that AWS public ECR registries are usable with all versions of Flux! The easiest way for me to do that, would be to convince AWS that they should abide by the standard and not require AWS role assignments granting permission and special AWS API clients for pulling image index from what are billed as public (standards-compliant) Docker image repos.

We are very friendly with AWS as it should be clear from our collaboration on eksctl and so I very much don't want to say negative things, especially in public settings when I haven't personally had an opportunity to bring this feedback to them privately, so I will have to phrase this very carefully... and/but I am not a customer and as this does not fall in the scope of any support contract that I have under my purview, there isn't really a forum where I can raise this issue myself so...

https://aws.amazon.com/ecr/faqs/

To be perfectly clear, AWS has several statements on ^this FAQ page about ECR's compatibility and support. It is said to be compatible with the Docker Registry v2 and OCI formats. This appears to be a clear divergence from the Registry v2 standard, at least with respect to image indexes, and so it seems to be the case that rather than writing AWS-specific things into Flux... it will benefit many more projects and teams, if instead ECR team can be convinced to support the unabridged standard.

It would be great to have a statement from AWS on whether they will always respond to Docker Registry tag index for any public ECR repo with 404 page not found, or if they plan on supporting this part of the Registry v2 standard at some point in the future. Their FAQ isn't clear about it, (it isn't even clear to me that anyone else would have asked AWS for this before.)

@pierluigilenoci

pierluigilenoci commented Sep 1, 2021

@kingdonb I am sorry but I am unable to provide further information and currently, I do not have time to investigate this in a profitable way. 😞

In my opinion, the heart of the matter is how the ECR responds to requests and how they implemented the protocol. I believe that for ECR public images it is not necessary to be authenticated.
Ref: aws/containers-roadmap#1262

@yebyen
Contributor

yebyen commented Sep 1, 2021

@pierluigilenoci we are agreed

I was able to reach someone from AWS who is looking into this and we should have a more helpful answer soon.

In the meantime, I have been reminded that we offer free Flux 2 migration workshops for those who are struggling with parts of the migration from legacy Flux v1.

I would like to pass on this questionnaire to anyone who is interested:

https://bit.ly/FluxMigrationSurvey

@pierluigilenoci

@yebyen thank you very much, I filled out the questionnaire.

The problem is that we use helm to install Flux and Flux 2 doesn't have a chart available.
Ref: fluxcd/flux2#431 (comment)

@pierluigilenoci

For those interested, a Flux 2 chart has been available for some time:
https://github.com/fluxcd-community/helm-charts/tree/main/charts/flux2

@pjbgf pjbgf added the wontfix label Jul 26, 2022
@pjbgf
Member

pjbgf commented Jul 26, 2022

This project is in Migration and security support only, so unfortunately this issue won't be fixed. We recommend users to migrate to Flux 2 at their earliest convenience.

More information about the Flux 2 transition timetable can be found at: https://fluxcd.io/docs/migration/timetable/.

@pjbgf pjbgf closed this as completed Jul 26, 2022