Periodic e2e tests for workload cluster upgrade from v1.20 to v1.21 are failing #4463

Closed
fabriziopandini opened this issue Apr 12, 2021 · 11 comments · Fixed by #4476
Labels
kind/failing-test (categorizes issue or PR as related to a consistently or frequently failing test)
lifecycle/active (indicates that an issue or PR is actively being worked on by a contributor)

Comments

@fabriziopandini
Member

Looking at https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main-1-20-1-21&width=20 those tests are consistently failing.

/kind failing-test
@srm09
@sbueringer

@k8s-ci-robot added the kind/failing-test label Apr 12, 2021
@sbueringer
Member

I think this might be the same underlying issue as in #4462

@sbueringer
Member

/assign

@sbueringer
Member

sbueringer commented Apr 14, 2021

This test is still failing, even after the recent CoreDNS fix.

Analysis

  • build.log:
    Apr 13 17:55:14.366: INFO: At 2021-04-13 17:39:05 +0000 UTC - event for coredns-74ff55c5b-m685f: {kubelet k8s-upgrade-and-conformance-uq7ssp-md-0-768fcb9cdb-kjs62} Pulled: Container image "k8s.gcr.io/coredns:1.7.0" already present on machine
    ...
    Apr 13 17:55:14.368: INFO: At 2021-04-13 17:45:42 +0000 UTC - event for coredns-79f99479b9-2xgn4: {kubelet k8s-upgrade-and-conformance-uq7ssp-md-0-65b45f5798-vbzjb} Failed: Failed to pull image "k8s.gcr.io/coredns:v1.8.0": rpc error: code = NotFound desc = failed to pull and unpack image "k8s.gcr.io/coredns:v1.8.0": failed to resolve reference "k8s.gcr.io/coredns:v1.8.0": k8s.gcr.io/coredns:v1.8.0: not found
  • In the test we initially install Kubernetes 1.20 with CoreDNS 1.7.0. This uses the image k8s.gcr.io/coredns:1.7.0, because kubeadm 1.20 uses the old image name.
  • When we upgrade from Kubernetes 1.20 to 1.21 we update the CoreDNS deployment to use 1.8.0 instead of 1.7.0. So now we try to use k8s.gcr.io/coredns:v1.8.0, which does not exist.
  • This is not a problem for new 1.21 clusters, because kubeadm 1.21 automatically uses the new image name. But our CoreDNS deployment update is not aware of the new image name: it keeps the image name and only changes the tag, which does not work for 1.20 => 1.21 (simplified sketch below).

Some details:

  • available CoreDNS images under k8s.gcr.io:

    • k8s.gcr.io/coredns: 1.6.5 / 1.6.6 / 1.6.7 / 1.7.0 (gcloud container images list-tags k8s.gcr.io/coredns)
    • k8s.gcr.io/coredns/coredns: v1.6.6 / v1.6.7 / v1.6.9 / v1.7.0 / v1.7.1 / v1.8.0 / v1.8.3 (gcloud container images list-tags k8s.gcr.io/coredns/coredns)
  • kubeadm:

  • As far as I know this:

    • does not affect new installations of clusters: kubeadm 1.20 and 1.21 use their respective image names and they work, as long as nobody tries to combine e.g. Kubernetes 1.20 with CoreDNS 1.8.0 (because k8s.gcr.io/coredns:1.8.0 does not exist)
    • does not affect updates like 1.18 => 1.19, 1.19 => 1.20 or 1.21 => 1.22 (k/k main)
    • does not affect updates when the image tag is not changed
    • it only affects the update from 1.20 => 1.21 when we try to change the image tag. It doesn't matter whether we first update from 1.20 to 1.21 and then update the CoreDNS tag, or do both at the same time. The issue is that the cluster was created with the old image name (by kubeadm <= 1.20), and as soon as we later try to use an image tag which does not exist for k8s.gcr.io/coredns, it fails because our update handling does not automatically "fix" the image name
    • it affects every cluster which was created with kubeadm < 1.21 and goes through the CoreDNS deployment update code at some point later on

Possible solution

  • We currently calculate the image like this: imageRepository + coreDNSKey + imageTag:

    toImageRepository := fmt.Sprintf("%s/%s", parsedImage.Repository, parsedImage.Name)
    if clusterConfig.ImageRepository != "" {
        toImageRepository = fmt.Sprintf("%s/%s", clusterConfig.ImageRepository, coreDNSKey)
    }
    if clusterConfig.DNS.ImageRepository != "" {
        toImageRepository = fmt.Sprintf("%s/%s", clusterConfig.DNS.ImageRepository, coreDNSKey)
    }

  • We use the constant coreDNSKey = "coredns", which is no longer correct starting with CoreDNS 1.8.0 (kubeadm 1.21 hard-codes coredns/coredns).

  • I would suggest detecting CoreDNS versions >= 1.8.0 and then calculating the image like this: imageRepository + CoreDNSImageName + imageTag (CoreDNSImageName = "coredns/coredns"); a rough sketch is at the end of this comment.

  • I think this would automatically handle the image name change in the correct cases (trying to do this based on the Kubernetes version instead seems less safe to me).

  • Downsides:

    • using a hard-coded "coredns/coredns" image name is consistent with kubeadm, but imho a major problem is that this only works with certain registries. <imageRepo>/coredns/coredns:<imageTag> only works well in registries which allow nesting. If a registry only allows <registry-url>/org/image:tag, this requires a coredns org (afaik this affects Quay, but not Harbor).
    • An alternative would be to add the additional coredns to the repository part of the image name, but I think this is not an option as it would change the semantics of the ImageRepository field and our kubeadm types would no longer be in sync with upstream kubeadm.

To be honest, I'm not sure there is a perfect solution for this problem. It's easy to fix our CoreDNS update to handle the change, but I don't think there is a good way to work around the upstream kubeadm limitation of a fixed coredns/coredns part in the CoreDNS image name downstream in ClusterAPI. So I would suggest fixing our handling of the CoreDNS update to be consistent with kubeadm. If the coredns/coredns part should become configurable, this has to be done upstream in kubeadm first, and then we can support it in ClusterAPI.
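
Rough sketch of what I have in mind (illustrative only, not the actual CAPI code; the helper name is made up, and the version handling uses blang/semver, which I think we already use elsewhere):

    package main

    import (
        "fmt"

        "github.com/blang/semver"
    )

    const (
        oldCoreDNSImageName = "coredns"         // image name used by kubeadm <= 1.20
        newCoreDNSImageName = "coredns/coredns" // image name hard-coded by kubeadm >= 1.21
    )

    // coreDNSImageName returns the CoreDNS image name to use for the given tag:
    // 1.8.0 and newer only exist under the nested coredns/coredns name.
    func coreDNSImageName(toTag string) (string, error) {
        v, err := semver.ParseTolerant(toTag)
        if err != nil {
            return "", fmt.Errorf("failed to parse CoreDNS version %q: %v", toTag, err)
        }
        if v.GTE(semver.MustParse("1.8.0")) {
            return newCoreDNSImageName, nil
        }
        return oldCoreDNSImageName, nil
    }

    func main() {
        for _, tag := range []string{"1.7.0", "v1.8.0"} {
            name, err := coreDNSImageName(tag)
            if err != nil {
                panic(err)
            }
            fmt.Printf("k8s.gcr.io/%s:%s\n", name, tag)
        }
        // Prints:
        // k8s.gcr.io/coredns:1.7.0
        // k8s.gcr.io/coredns/coredns:v1.8.0
    }

The imageRepository handling (clusterConfig.ImageRepository / clusterConfig.DNS.ImageRepository) would stay as it is today; only the name part would be chosen based on the target CoreDNS version.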

@sbueringer
Member

/cc @fabriziopandini @neolit123

@fabriziopandini
Member Author

I'm +1 to do what is possible now in CAPI, and to address the nested registry problem in kubeadm (@neolit123 this could be part of v1beta3 changes)

Also, this is a candidate for backport, given that it could affect users on the v0.3.x series.
@CecileRobertMichon @vincepri @detiber

@fabriziopandini
Member Author

/assign
/lifecycle active

@sbueringer
Member

/reopen

The CoreDNS issue looks fixed, but I think there is something else going wrong: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-workload-upgrade-1-20-1-21-main/1382743995154698240

@k8s-ci-robot reopened this Apr 15, 2021
@k8s-ci-robot
Contributor

@sbueringer: Reopened this issue.

In response to this:

/reopen

The CoreDNS issue looks fixed, but I think there is something else going wrong: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-workload-upgrade-1-20-1-21-main/1382743995154698240

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

@fabriziopandini I think we can close this issue again, as the remaining issues don't seem to be specific to the 1.20 => 1.21 upgrade and are tracked via #4502.

WDYT?

@fabriziopandini
Member Author

/close
If we detect new problems we can open an issue and eventually reference this thread

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close
If we detect new problems we can open an issue and eventually reference this thread

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
