dns-controller fails to update Route 53 zones after upgrading kOps from 1.29.0 to 1.30.0-beta.1 #16645

Closed
danports opened this issue Jul 2, 2024 · 7 comments · Fixed by #16647
Labels
kind/bug Categorizes issue or PR as related to a bug.


@danports
Contributor

danports commented Jul 2, 2024

/kind bug

1. What kops version are you running? The command kops version, will display
this information.

1.30.0-beta.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.29.6

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops update cluster --yes && kops rolling-update cluster --yes

5. What happened after the commands executed?
After the update, dns-controller reports the following in its logs:

W0702 18:21:16.134048       1 dnscontroller.go:134] Unexpected error in DNS controller, will retry in 2m40s: error querying for zones: error querying for DNS zones: error listing hosted zones: operation error Route 53: ListHostedZones, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region

The cluster's DNS records are never updated after new control plane nodes come up during the rolling update, so the rolling update eventually fails:

I0702 18:17:33.236104    2974 instancegroups.go:553] Cluster did not validate within deadline: error listing nodes: Get "https://my.cluster.com/api/v1/nodes": dial tcp x.y.z.a:443: i/o timeout.
E0702 18:17:33.236525    2974 instancegroups.go:512] Cluster did not validate within 30m0s

6. What did you expect to happen?
dns-controller should have updated the DNS records and the rolling update should have completed successfully.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

It's a pretty vanilla AWS cluster, can provide if needed though.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

The rolling update isn't the problem here, it's dns-controller.

9. Anything else we need to know?

Manually editing the dns-controller deployment and adding the AWS_DEFAULT_REGION environment variable is sufficient to get dns-controller to start updating DNS records successfully again.
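For anyone needing the same workaround, a minimal sketch follows. It assumes the deployment is named dns-controller and lives in the kube-system namespace, and us-east-1 in the usage line is a placeholder for the cluster's real region. The KUBECTL variable and the function name are illustrative only.

```shell
# Workaround sketch: set AWS_DEFAULT_REGION on the dns-controller deployment.
# Assumptions: deployment "dns-controller" in namespace "kube-system".
set -eu

# Overridable so the command can be dry-run, e.g. KUBECTL=echo.
KUBECTL="${KUBECTL:-kubectl}"

# Usage: set_dns_controller_region <aws-region>
set_dns_controller_region() {
  "$KUBECTL" set env deployment/dns-controller \
    --namespace kube-system "AWS_DEFAULT_REGION=$1"
}
```

Calling `set_dns_controller_region us-east-1` triggers a rollout of the deployment. Because this is a manual edit to a kOps-managed deployment, it may be overwritten by a later `kops update cluster`.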

Slack thread for context: https://kubernetes.slack.com/archives/C3QUFP0QM/p1719945935453279

Relevant code is here:

func newRoute53() (*Interface, error) {

Based on the dns-controller error message, it seems that IMDS is never queried for the region and that the cfg.Region == "" check does not take effect.
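To check by hand which region would be available from each source, a diagnostic sketch like the one below may help. It is not kOps code. It assumes the fallback order is the AWS_REGION and AWS_DEFAULT_REGION environment variables first, then IMDSv2. The function names and the CURL variable are hypothetical, and IMDS may not be reachable from inside a pod depending on the instance's metadata settings.

```shell
# Diagnostic sketch (not kOps code): report the region an AWS client could
# fall back to. Order assumed here: AWS_REGION, AWS_DEFAULT_REGION, then IMDSv2.
set -eu

CURL="${CURL:-curl}"
IMDS="http://169.254.169.254"

# Ask IMDSv2 for the instance's region: fetch a session token, then use it.
imds_region() {
  token="$("$CURL" -sf --max-time 2 -X PUT "$IMDS/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")" || return 1
  "$CURL" -sf --max-time 2 "$IMDS/latest/meta-data/placement/region" \
    -H "X-aws-ec2-metadata-token: $token"
}

# Print the first region found, or fail if every source is empty.
resolve_region() {
  if [ -n "${AWS_REGION:-}" ]; then
    echo "$AWS_REGION"
  elif [ -n "${AWS_DEFAULT_REGION:-}" ]; then
    echo "$AWS_DEFAULT_REGION"
  else
    imds_region
  fi
}
```

Running `resolve_region` on a control plane node with neither variable set shows whether IMDS alone can supply the region.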

k8s-ci-robot added the kind/bug label on Jul 2, 2024
danports changed the title from "dns-controller fails to update Route53 zones after upgrading kOps from 1.29.0 to 1.30.0-beta.1" to "dns-controller fails to update Route 53 zones after upgrading kOps from 1.29.0 to 1.30.0-beta.1" on Jul 2, 2024
@rifelpet
Member

rifelpet commented Jul 3, 2024

@danports I have a potential fix in #16647, any chance you are able to test it?

If you can run the kops CLI in a Linux amd64 environment, you can follow these instructions:

export KOPS_BASE_URL="https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-calico/pull-07891b013effcacf773fa1dfde89b4ba4a347e1d/1.30.0-beta.2+v1.30.0-beta.1-14-g79a3620ee6"

wget -q "$KOPS_BASE_URL/linux/amd64/kops"
chmod +x ./kops

Otherwise, you'd need to build the kops CLI from source, with the same KOPS_BASE_URL environment variable set as above.

@danports
Contributor Author

danports commented Jul 3, 2024

Can I just override the image in the dns-controller deployment? If so, what image name should I use?

@rifelpet
Member

rifelpet commented Jul 3, 2024

@danports I don't have an image registry you can use off-hand, but this tar.gz contains the dns-controller image to test:

https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-calico/pull-07891b013effcacf773fa1dfde89b4ba4a347e1d/1.30.0-beta.2+v1.30.0-beta.1-16-g56e23d81fb/images/dns-controller-amd64.tar.gz

You can docker load the tar.gz, then retag it and push it to a registry of your choice.
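Those steps could look roughly like the sketch below. The function names, the DOCKER variable, and the registry reference in the usage note are placeholders, not anything published by the kOps project. The image reference to retag is whatever `docker load` reports on its "Loaded image:" line.

```shell
# Sketch: load the CI-built dns-controller image, retag it, and push it.
# Assumption: the target reference points at a registry you control.
set -eu

# Overridable so the commands can be dry-run, e.g. DOCKER=echo.
DOCKER="${DOCKER:-docker}"

# Usage: load_image <tarball>
# docker prints the loaded reference ("Loaded image: ...").
load_image() {
  "$DOCKER" load -i "$1"
}

# Usage: retag_and_push <loaded-ref> <target-ref>
retag_and_push() {
  "$DOCKER" tag "$1" "$2"
  "$DOCKER" push "$2"
}
```

For example, run `load_image dns-controller-amd64.tar.gz`, then `retag_and_push <loaded-ref> registry.example.com/kops/dns-controller:pr-16647`. Pointing the deployment at the pushed image should then be possible with `kubectl -n kube-system set image deployment/dns-controller dns-controller=<target-ref>`, assuming the container is also named dns-controller.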

@danports
Contributor Author

danports commented Jul 3, 2024

Interesting, I would have thought images were pushed to a registry for e2e testing. I will give that a try later when I have some time, or whenever the next beta goes live, whichever happens first. 🙂

@danports
Contributor Author

This is unfortunately still broken in the final 1.30.0 release. The dns-controller logs are identical.

@rifelpet
Member

It looks like we forgot to cherry-pick this because milestones were set incorrectly. I've opened a cherry-pick in #16757, and it will be included in 1.30.1, which I'm hoping we'll release in the next week or so.

@danports
Contributor Author

Confirmed this issue is resolved in 1.30.1. Thanks for the fix!
