Periodic e2e tests for workload cluster upgrade from v1.20 to v1.21 are failing #4463

Closed
fabriziopandini opened this issue Apr 12, 2021 · 11 comments · Fixed by #4476
Labels
kind/failing-test (categorizes issue or PR as related to a consistently or frequently failing test)
lifecycle/active (indicates that an issue or PR is actively being worked on by a contributor)

Comments

@fabriziopandini
Member

Looking at https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main-1-20-1-21&width=20 those tests are consistently failing.

/kind failing-test
@srm09
@sbueringer

@k8s-ci-robot added the kind/failing-test label Apr 12, 2021
@sbueringer
Member

I think this might be the same underlying issue as in #4462

@sbueringer
Member

/assign

@sbueringer
Member

sbueringer commented Apr 14, 2021

This test is still failing, even after the recent CoreDNS fix.

Analysis

  • build.log:
    Apr 13 17:55:14.366: INFO: At 2021-04-13 17:39:05 +0000 UTC - event for coredns-74ff55c5b-m685f: {kubelet k8s-upgrade-and-conformance-uq7ssp-md-0-768fcb9cdb-kjs62} Pulled: Container image "k8s.gcr.io/coredns:1.7.0" already present on machine
    ...
    Apr 13 17:55:14.368: INFO: At 2021-04-13 17:45:42 +0000 UTC - event for coredns-79f99479b9-2xgn4: {kubelet k8s-upgrade-and-conformance-uq7ssp-md-0-65b45f5798-vbzjb} Failed: Failed to pull image "k8s.gcr.io/coredns:v1.8.0": rpc error: code = NotFound desc = failed to pull and unpack image "k8s.gcr.io/coredns:v1.8.0": failed to resolve reference "k8s.gcr.io/coredns:v1.8.0": k8s.gcr.io/coredns:v1.8.0: not found
  • In the test we initially install Kubernetes 1.20 with CoreDNS 1.7.0. This uses the image k8s.gcr.io/coredns:1.7.0, because kubeadm 1.20 uses the old image name.
  • When we upgrade from Kubernetes 1.20 to 1.21 we update the CoreDNS deployment to use 1.8.0 instead of 1.7.0. So now we try to use k8s.gcr.io/coredns:v1.8.0, which does not exist.
  • This is not a problem for new 1.21 clusters, because kubeadm 1.21 automatically uses the new image name. But our CoreDNS deployment update is not aware of the new image name: it keeps the image name and only changes the tag, which does not work for 1.20 => 1.21 (simplified sketch below).

Some details:

  • available CoreDNS images under k8s.gcr.io:

    • k8s.gcr.io/coredns: 1.6.5 / 1.6.6 / 1.6.7 / 1.7.0 (gcloud container images list-tags k8s.gcr.io/coredns)
    • k8s.gcr.io/coredns/coredns: v1.6.6 / v1.6.7 / v1.6.9 / v1.7.0 / v1.7.1 / v1.8.0 / v1.8.3 (gcloud container images list-tags k8s.gcr.io/coredns/coredns)
  • kubeadm:

  • As far as I know this:

    • does not affect new installations of clusters: kubeadm 1.20 and 1.21 use their respective image names and they work, as long as nobody tries to combine e.g. Kubernetes 1.20 with CoreDNS 1.8.0 (because k8s.gcr.io/coredns:1.8.0 does not exist)
    • does not affect updates like 1.18 => 1.19, 1.19 => 1.20 or 1.21 => 1.22 (k/k main)
    • does not affect updates when the image tag is not changed
    • it only affects the update from 1.20 => 1.21 when we try to change the image tag. It doesn't matter whether we first update from 1.20 to 1.21 and then update the CoreDNS tag, or do both at the same time. The issue is that the cluster was created with the old image name (by kubeadm <= 1.20), and as soon as we later try to use an image tag which does not exist for k8s.gcr.io/coredns, it fails because our update handling does not automatically "fix" the image name
    • it affects every cluster which was created with kubeadm < 1.21 and goes through the CoreDNS deployment update code at some point later on

Possible solution

  • We currently calculate the image like this: imageRepository + coreDNSKey + imageTag:

    toImageRepository := fmt.Sprintf("%s/%s", parsedImage.Repository, parsedImage.Name)
    if clusterConfig.ImageRepository != "" {
        toImageRepository = fmt.Sprintf("%s/%s", clusterConfig.ImageRepository, coreDNSKey)
    }
    if clusterConfig.DNS.ImageRepository != "" {
        toImageRepository = fmt.Sprintf("%s/%s", clusterConfig.DNS.ImageRepository, coreDNSKey)
    }

  • We use the constant coreDNSKey = "coredns", which is no longer correct starting with CoreDNS 1.8.0 (kubeadm 1.21 hard-codes coredns/coredns).

  • I would suggest detecting CoreDNS versions >= 1.8.0 and then calculating the image like this: imageRepository + CoreDNSImageName + imageTag (CoreDNSImageName = "coredns/coredns"); a rough sketch is at the end of this comment.

  • I think this would automatically handle the image name change in the correct cases (trying to do this based on the Kubernetes version instead seems less safe to me).

  • Downsides:

    • using a hard-coded "coredns/coredns" image name is consistent with kubeadm, but imho a major problem is that this only works with certain registries. <imageRepo>/coredns/coredns:<imageTag> only works well in registries which allow nesting. If a registry only allows <registry-url>/org/image:tag, this requires a coredns org (afaik this affects Quay, but not Harbor).
    • An alternative would be to add the additional coredns to the repository part of the image name, but I think this is not an option as it would change the semantics of the ImageRepository field and our kubeadm types would no longer be in sync with upstream kubeadm.

To be honest, I'm not sure there is a perfect solution for this problem. It's easy to fix our CoreDNS update to handle the change, but I don't think there is a good way to work around the upstream kubeadm limitation of a fixed coredns/coredns part in the CoreDNS image name downstream in ClusterAPI. So I would suggest fixing our handling of the CoreDNS update to be consistent with kubeadm. If the coredns/coredns part should become configurable, this has to be done upstream in kubeadm first, and then we can support it in ClusterAPI.
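
Rough sketch of what I have in mind (illustrative only, not the actual CAPI code; the helper name is made up, and the version handling uses blang/semver, which I think we already use elsewhere):

    package main

    import (
        "fmt"

        "github.com/blang/semver"
    )

    const (
        oldCoreDNSImageName = "coredns"         // image name used by kubeadm <= 1.20
        newCoreDNSImageName = "coredns/coredns" // image name hard-coded by kubeadm >= 1.21
    )

    // coreDNSImageName returns the CoreDNS image name to use for the given tag:
    // 1.8.0 and newer only exist under the nested coredns/coredns name.
    func coreDNSImageName(toTag string) (string, error) {
        v, err := semver.ParseTolerant(toTag)
        if err != nil {
            return "", fmt.Errorf("failed to parse CoreDNS version %q: %v", toTag, err)
        }
        if v.GTE(semver.MustParse("1.8.0")) {
            return newCoreDNSImageName, nil
        }
        return oldCoreDNSImageName, nil
    }

    func main() {
        for _, tag := range []string{"1.7.0", "v1.8.0"} {
            name, err := coreDNSImageName(tag)
            if err != nil {
                panic(err)
            }
            fmt.Printf("k8s.gcr.io/%s:%s\n", name, tag)
        }
        // Prints:
        // k8s.gcr.io/coredns:1.7.0
        // k8s.gcr.io/coredns/coredns:v1.8.0
    }

The imageRepository handling (clusterConfig.ImageRepository / clusterConfig.DNS.ImageRepository) would stay as it is today; only the name part would be chosen based on the target CoreDNS version.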

@sbueringer
Member

/cc @fabriziopandini @neolit123

@fabriziopandini
Member Author

I'm +1 to do what is possible now in CAPI, and to address the nested registry problem in kubeadm (@neolit123 this could be part of v1beta3 changes)

Also, this is a candidate for backport, given that it could affect users on the v0.3.x series.
@CecileRobertMichon @vincepri @detiber

@fabriziopandini
Member Author

/assign
/lifecycle active

@sbueringer
Member

/reopen

The CoreDNS issue looks fixed, but I think there is something else going wrong: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-workload-upgrade-1-20-1-21-main/1382743995154698240

@k8s-ci-robot reopened this Apr 15, 2021
@k8s-ci-robot
Contributor

@sbueringer: Reopened this issue.

In response to this:

/reopen

The CoreDNS issue looks fixed, but I think there is something else going wrong: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-workload-upgrade-1-20-1-21-main/1382743995154698240

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

@fabriziopandini I think we can close this issue again, as the remaining issues don't seem to be specific to the 1.20 => 1.21 upgrade and are tracked via #4502.

WDYT?

@fabriziopandini
Member Author

/close
If we detect new problems we can open an issue and eventually reference this thread

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close
If we detect new problems we can open an issue and eventually reference this thread

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
