DNS-Controller hitting AWS changeset limit after 1.7 upgrade #3121

Closed
bcorijn opened this issue Aug 2, 2017 · 19 comments

@bcorijn
Contributor

bcorijn commented Aug 2, 2017

Hi,

I tried rolling my cluster from 1.6.4 to 1.7.2 with kops 1.7.0 an hour or so ago, starting with 2 out of my 3 masters. I noticed that after doing this my kubectl commands became very slow, something that in the past had been a symptom of the api.* records in Route53 being incorrect. Checking them indeed showed that the new IPs had not been recorded.
The next suspect was a faulty dns-controller, but I saw it had been updated to v1.7.1 and had started correctly. The logs, however, do show something interesting: it is trying to replace all my record sets with new IPs. I believe this is already a first bug, since the only change was to my masters; there should be no changes to these records, as they only include nodes?
Besides that, it seems that the number of nodes times the number of records I have exceeds the AWS limit, which causes the following error:

I0802 15:02:52.300886       1 dnscontroller.go:301] applying DNS changeset for zone example.com.::Z30ETTSQ28ABDA
W0802 15:02:52.648670       1 dnscontroller.go:303] error applying DNS changeset for zone example.com.::Z30ETTSQ28ABDA: InvalidChangeBatch: Number of records limit of 1000 exceeded.
	status code: 400, request id: a8247983-7793-11e7-96b3-7d2348af7e8f
W0802 15:02:52.648712       1 dnscontroller.go:119] Unexpected error in DNS controller, will retry: error applying DNS changeset for zone example.com.::Z30ETTSQ28ABDA: InvalidChangeBatch: Number of records limit of 1000 exceeded.

So none of my records are being updated currently. Rolling dns-controller back to 1.6.1 hits the same issue; I presume I just never had this many records to be changed before...

@shavo007
Contributor

shavo007 commented Aug 8, 2017

@bcorijn where are those logs located?

@bcorijn
Contributor Author

bcorijn commented Aug 8, 2017

@shavo007 It's the log output from the dns-controller pod.

I managed to fix it by deleting some annotations, which dropped me below 1000 records in the changeset. Upping the logging to v8 showed me that it thought all my nodes had been modified and every record needed to be replaced, which results in a DELETE + CREATE action for each node/annotation combination.

@bcorijn
Contributor Author

bcorijn commented Aug 29, 2017

This just happened again, this time without any update. I did add a new IG with one instance, though, which I guess triggered all records to be refreshed to add another IP to them.
Happy to share logs on this, but since they contain a lot of information about my environment, I'd prefer not to just post them here publicly...
You can clearly see, though, that it goes through all services that have an annotation and adds a new A record (containing all currently existing nodes) to the changeset batch, which in the end counts as one DELETE and one CREATE for each A record.
So as soon as #annotated-services * #nodes * 2 > 1000, you hit this problem whenever a node gets added or deleted.
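To make that concrete with illustrative numbers: in a 30-node cluster with 20 annotated services, each annotated record set holds 30 values, so replacing them all as a DELETE plus a CREATE puts 20 × 30 × 2 = 1,200 record changes into a single batch, well over Route53's 1,000-record limit.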

@bcorijn
Contributor Author

bcorijn commented Sep 29, 2017

Looking into the AWS Route53 API for some other projects, I think the DELETE + CREATE pair could actually be replaced by a single UPSERT, which would already cut the number of changes in half?
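For reference, here is a minimal sketch of what a single UPSERT looks like against the Route53 API using aws-sdk-go; the zone ID, record name, and node IPs are illustrative, and this is not the dns-controller's own code:

package main

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/route53"
)

func main() {
    svc := route53.New(session.Must(session.NewSession()))

    // An UPSERT replaces the record set in place, so its values are counted
    // once against the per-request limits instead of twice (once for the
    // DELETE of the old values and once for the CREATE of the new ones).
    _, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
        HostedZoneId: aws.String("Z30ETTSQ28ABDA"), // illustrative zone ID
        ChangeBatch: &route53.ChangeBatch{
            Changes: []*route53.Change{
                {
                    Action: aws.String(route53.ChangeActionUpsert),
                    ResourceRecordSet: &route53.ResourceRecordSet{
                        Name: aws.String("myservice.example.com."), // illustrative record
                        Type: aws.String(route53.RRTypeA),
                        TTL:  aws.Int64(60),
                        ResourceRecords: []*route53.ResourceRecord{
                            {Value: aws.String("10.44.38.112")}, // one value per node
                            {Value: aws.String("10.44.39.10")},
                        },
                    },
                },
            },
        },
    })
    if err != nil {
        panic(err)
    }
}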

justinsb added a commit to justinsb/kops that referenced this issue Nov 15, 2017
Will use half the operations vs REMOVE + ADD.

Issue kubernetes#3121
@justinsb
Member

Great suggestion on using upsert, @bcorijn... I implemented upsert in the upstream DNS library, but I guess I never actually managed to start using it in kops (it took a while to merge). Upsert is #3859

@justinsb
Member

I had a go at a more complete fix in #3860 as well. I think we want to vendor the dnsprovider library, which makes it fairly intrusive - but I feel like we should have vendored dnsprovider a long time ago so we could fix this stuff faster.

@justinsb
Member

We do have a fix to use upsert now, which will be in the next 1.8 release (#3859). I think #3860 is too invasive for (this stage of) the 1.8 release cycle. So I'm going to point this at 1.9 instead, but keep it as blocks-release.

justinsb added this to the 1.9 milestone Nov 18, 2017
@bcorijn
Contributor Author

bcorijn commented Nov 20, 2017

Good to hear this got into 1.8; moving to an upsert should already push the limit a lot higher.

@chrislovecnm
Contributor

So this is blocks-next for 1.9 now?

@egeland

egeland commented Nov 23, 2017

This is affecting us, too - on k8s 1.7.10

@chrislovecnm
Contributor

This is addressed in kops 1.8.x, with more changes coming in 1.9.x.

k8s-github-robot pushed a commit that referenced this issue Dec 15, 2017
Automatic merge from submit-queue.

Copy dnsprovider into our code, implement route53 batching

Fixes #3121
@egeland

egeland commented Feb 22, 2018

@justinsb We just tried the 1.8.0 dns-controller image in our k8s 1.7.10 cluster, and it's still throwing a change-batch limit error, this time against the RDATA character limit:

W0222 00:24:42.376639       1 dnscontroller.go:303] error applying DNS changeset for zone example.com.::Z3JCJUN63OZMY6: InvalidChangeBatch: RDATA character limit of 32000 exceeded.
	status code: 400, request id: c5b9c830-1766-11e8-b533-994079acce2a
W0222 00:24:42.376676       1 dnscontroller.go:119] Unexpected error in DNS controller, will retry: error applying DNS changeset for zone example.com.::Z3JCJUN63OZMY6: InvalidChangeBatch: RDATA character limit of 32000 exceeded.
	status code: 400, request id: c5b9c830-1766-11e8-b533-994079acce2a

Definitely running the image:

kubectl --namespace=kube-system get deploy,rs,pod -owide |grep dns-cont
deploy/dns-controller        1         1         1            1           263d      dns-controller            kope/dns-controller:1.8.0                                                                                                                                                  k8s-app=dns-controller

rs/dns-controller-2358905950        1         1         1         26m       dns-controller            kope/dns-controller:1.8.0                                                                                                                                                  k8s-app=dns-controller,pod-template-hash=2358905950
rs/dns-controller-3330613807        0         0         0         263d      dns-controller            kope/dns-controller:1.6.1                                                                                                                                                  k8s-app=dns-controller,pod-template-hash=3330613807
rs/dns-controller-3628016177        0         0         0         103d      dns-controller            kope/dns-controller:1.7.1                                                                                                                                                  k8s-app=dns-controller,pod-template-hash=3628016177
rs/dns-controller-670048740         0         0         0         263d      dns-controller            kope/dns-controller:1.6.1                                                                                                                                                  k8s-app=dns-controller,pod-template-hash=670048740
rs/dns-controller-967320038         0         0         0         103d      dns-controller            kope/dns-controller:1.7.1                                                                                                                                                  k8s-app=dns-controller,pod-template-hash=967320038

po/dns-controller-2358905950-8g3cs                                           1/1       Running   0          26m       10.44.38.112    ip-10-44-38-112.ap-southeast-2.compute.internal

@egeland

egeland commented Feb 22, 2018

We built dns-controller from master and dropped the batch size from the hardcoded 900 to 100 (we tried 500, but that failed), and it worked.

@pwillie is working on a PR to make the batch size an argument
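For anyone curious, here is a rough sketch of what batching a change set amounts to; splitChanges and the numbers are illustrative, not the actual dns-controller implementation:

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/route53"
)

// splitChanges chunks a large change set into batches of at most batchSize
// changes; each batch would then be submitted as its own ChangeBatch via
// ChangeResourceRecordSets.
func splitChanges(changes []*route53.Change, batchSize int) [][]*route53.Change {
    var batches [][]*route53.Change
    for len(changes) > 0 {
        n := batchSize
        if n > len(changes) {
            n = len(changes)
        }
        batches = append(batches, changes[:n])
        changes = changes[n:]
    }
    return batches
}

func main() {
    // Build a dummy change set of 2,500 UPSERTs and show how many batches result.
    var changes []*route53.Change
    for i := 0; i < 2500; i++ {
        changes = append(changes, &route53.Change{Action: aws.String(route53.ChangeActionUpsert)})
    }
    fmt.Println(len(splitChanges(changes, 100)), "batches") // prints "25 batches"
}

Note that Route53's per-request limits (1,000 ResourceRecord elements and 32,000 characters of record data) count the values inside the changes rather than the changes themselves, so record sets carrying one value per node can still overflow a batch of 500 changes, which would explain why only the smaller batch size worked here.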

@pwillie
Contributor

pwillie commented Feb 22, 2018

Pull request here: #4496

@bcorijn
Contributor Author

bcorijn commented Feb 22, 2018

Yeah, I ran into it again myself a few days ago on 1.8. The PR looks good as a way to give people who suffer from this a workaround!

@gertjan-carbon

Just ran the 1.9.0-alpha.2 controller and still ran into this.

@kumudt

kumudt commented Apr 10, 2018

Can this be cherry-picked to 1.8.x? Or is there a way to fix this temporarily on the 1.8.0 dns-controller?

@gertjan-carbon

gertjan-carbon commented Apr 10, 2018

By adding the flag --route53-batch-size=100.
However, I'm running the 1.9.0 dns-controller instead of the 1.8.0 (I just changed the image in the YAML that was deployed by kops).
This solved the issue for us.

@kumudt

kumudt commented Apr 10, 2018

For me, setting it to 50 worked. Thanks.
