dns-controller fails to update Route 53 zones after upgrading kOps from 1.29.0 to 1.30.0-beta.1 #16645

danports · 2024-07-02T21:02:48Z

/kind bug

1. What kops version are you running? The command kops version will display
this information.
1.30.0-beta.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
1.29.6

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops update cluster --yes && kops rolling-update cluster --yes

5. What happened after the commands executed?
After the update, dns-controller reports the following in its logs:

W0702 18:21:16.134048       1 dnscontroller.go:134] Unexpected error in DNS controller, will retry in 2m40s: error querying for zones: error querying for DNS zones: error listing hosted zones: operation error Route 53: ListHostedZones, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region

The DNS records for the cluster are never updated after new control plane nodes are brought up during the rolling update, so the rolling update eventually fails:

I0702 18:17:33.236104    2974 instancegroups.go:553] Cluster did not validate within deadline: error listing nodes: Get "https://my.cluster.com/api/v1/nodes": dial tcp x.y.z.a:443: i/o timeout.
E0702 18:17:33.236525    2974 instancegroups.go:512] Cluster did not validate within 30m0s

6. What did you expect to happen?
dns-controller should have updated the DNS records and the rolling update should have completed successfully.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

It's a pretty vanilla AWS cluster; I can provide the manifest if needed.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

The rolling update isn't the problem here, it's dns-controller.

9. Anything else we need to know?

Manually editing the dns-controller deployment and adding the AWS_DEFAULT_REGION environment variable is sufficient to get dns-controller to start updating DNS records successfully again.

Slack thread for context: https://kubernetes.slack.com/archives/C3QUFP0QM/p1719945935453279

Relevant code is here:

kops/dnsprovider/pkg/dnsprovider/providers/aws/route53/route53.go

Line 47 in c40a9d2

func newRoute53() (*Interface, error) {

Based on the dns-controller error message, it seems like IMDS is not queried for the region, nor does the cfg.Region == "" check work.


rifelpet · 2024-07-03T03:14:36Z

@danports I have a potential fix in #16647, any chance you are able to test it?

If you can run the kops CLI in a linux amd64 environment you can follow these instructions:

export KOPS_BASE_URL="https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-calico/pull-07891b013effcacf773fa1dfde89b4ba4a347e1d/1.30.0-beta.2+v1.30.0-beta.1-14-g79a3620ee6"

wget -q "$KOPS_BASE_URL/linux/amd64/kops"
chmod +x ./kops

Otherwise, you'd need to build the kops CLI from source, setting the same KOPS_BASE_URL env var as above.

danports · 2024-07-03T03:21:24Z

Can I just override the image in the dns-controller deployment? If so, what image name should I use?

rifelpet · 2024-07-03T16:19:40Z

@danports I don't have an image registry you can use off-hand, but this tar.gz contains the dns-controller image to test:

https://storage.googleapis.com/k8s-staging-kops/pulls/pull-kops-e2e-k8s-aws-calico/pull-07891b013effcacf773fa1dfde89b4ba4a347e1d/1.30.0-beta.2+v1.30.0-beta.1-16-g56e23d81fb/images/dns-controller-amd64.tar.gz

You can docker load the tar.gz, then retag it and push it to a registry of your choice.

danports · 2024-07-03T20:53:54Z

Interesting, I would have thought images were pushed to a registry for e2e testing. I will give that a try later when I have some time, or whenever the next beta goes live, whichever happens first. 🙂

danports · 2024-08-17T02:22:15Z

This is unfortunately still broken in the final 1.30.0 release. The dns-controller logs are identical.

rifelpet · 2024-08-17T02:37:14Z

It looks like we forgot to cherry-pick this because milestones were set incorrectly. I've opened a cherry-pick in #16757, and it will be included in 1.30.1, which I'm hoping we'll release in the next week or so.

danports · 2024-09-13T21:05:40Z

Confirmed this issue is resolved in 1.30.1. Thanks for the fix!

k8s-ci-robot added the kind/bug label (Categorizes issue or PR as related to a bug.) Jul 2, 2024

danports changed the title ~~dns-controller fails to update Route53 zones after upgrading kOps from 1.29.0 to 1.30.0-beta.1~~ dns-controller fails to update Route 53 zones after upgrading kOps from 1.29.0 to 1.30.0-beta.1 Jul 2, 2024

rifelpet mentioned this issue Jul 3, 2024

Set the STS client's region via IMDS for AssumeRoleWithWebIdentity #16647

Merged

k8s-ci-robot closed this as completed in #16647 Jul 3, 2024

sridhar81 mentioned this issue Aug 20, 2024

After upgrade from Kubernetes 1.29.2 to 1.30, dns-controller fails #16761

Closed

rifelpet mentioned this issue Aug 25, 2024

dns: use resolved region rather than re-resolving every time #16778

Merged

vitaliyf mentioned this issue Sep 20, 2024

AWS sts:AssumeRole stopped working with role/OrganizationAccountAccessRole in 1.30.x #16849

Closed
