DNS-Controller hitting AWS changeset limit after 1.7 upgrade #3121

Closed
bcorijn opened this issue Aug 2, 2017 · 19 comments

@bcorijn
Contributor

bcorijn commented Aug 2, 2017

Hi,

I tried rolling my cluster from 1.6.4 to 1.7.2 with kops 1.7.0 an hour or so ago, starting with 2 out of my 3 masters. I noticed that after doing this my kubectl commands became very slow, something that in the past had been a symptom of the api.* records in Route53 being incorrect. Checking them indeed showed that the new IPs had not been recorded.
The next suspect was a faulty dns-controller, but I saw it had been updated to v1.7.1 and had started correctly. The logs, however, do show something interesting: it is trying to replace all my record sets with new IPs. I believe this is already a first bug, since the only change was to my masters; there should be no changes to these records, as they only include nodes?
Besides that, it seems that the number of nodes times the number of records I have exceeds the AWS limit, which causes the following error:

I0802 15:02:52.300886       1 dnscontroller.go:301] applying DNS changeset for zone example.com.::Z30ETTSQ28ABDA
W0802 15:02:52.648670       1 dnscontroller.go:303] error applying DNS changeset for zone example.com.::Z30ETTSQ28ABDA: InvalidChangeBatch: Number of records limit of 1000 exceeded.
	status code: 400, request id: a8247983-7793-11e7-96b3-7d2348af7e8f
W0802 15:02:52.648712       1 dnscontroller.go:119] Unexpected error in DNS controller, will retry: error applying DNS changeset for zone example.com.::Z30ETTSQ28ABDA: InvalidChangeBatch: Number of records limit of 1000 exceeded.

So none of my records are being updated currently. Rolling dns-controller back to 1.6.1 hits the same issue; I presume I just never had this many records to be changed before...

@shavo007
Contributor

shavo007 commented Aug 8, 2017

@bcorijn where are those logs located?

@bcorijn
Contributor Author

bcorijn commented Aug 8, 2017

@shavo007 It's the log output from the dns-controller pod.

I managed to fix it by deleting some annotations, which dropped me below 1000 records in the changeset. Upping the logging to v8 showed me that it thought all my nodes had been modified and every record needed to be replaced, which results in a DELETE + CREATE action for each node/annotation combination.

@bcorijn
Contributor Author

bcorijn commented Aug 29, 2017

This just happened again, this time without any update. I did add a new IG with one instance, though, which I guess triggered all records to be refreshed to add another IP to them.
Happy to share logs on this, but since they contain a lot of information about my environment, I'd prefer not to just post them here publicly...
You can clearly see, though, that it goes through all services that have an annotation and adds a new A record (containing all currently existing nodes) to the changeset batch, which in the end counts as one DELETE and one CREATE for each A record.
So as soon as #annotated-services * #nodes * 2 > 1000, you hit this problem whenever a node gets added or deleted.
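To make that concrete with illustrative numbers: in a 30-node cluster with 20 annotated services, each annotated record set holds 30 values, so replacing them all as a DELETE plus a CREATE puts 20 × 30 × 2 = 1,200 record changes into a single batch, well over Route53's 1,000-record limit.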

@bcorijn
Contributor Author

bcorijn commented Sep 29, 2017

Looking into the AWS Route53 API for some other projects, I think the DELETE + CREATE pair could actually be replaced by a single UPSERT, which would already cut the number of changes in half?
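For reference, here is a minimal sketch of what a single UPSERT looks like against the Route53 API using aws-sdk-go; the zone ID, record name, and node IPs are illustrative, and this is not the dns-controller's own code:

package main

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/route53"
)

func main() {
    svc := route53.New(session.Must(session.NewSession()))

    // An UPSERT replaces the record set in place, so its values are counted
    // once against the per-request limits instead of twice (once for the
    // DELETE of the old values and once for the CREATE of the new ones).
    _, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
        HostedZoneId: aws.String("Z30ETTSQ28ABDA"), // illustrative zone ID
        ChangeBatch: &route53.ChangeBatch{
            Changes: []*route53.Change{
                {
                    Action: aws.String(route53.ChangeActionUpsert),
                    ResourceRecordSet: &route53.ResourceRecordSet{
                        Name: aws.String("myservice.example.com."), // illustrative record
                        Type: aws.String(route53.RRTypeA),
                        TTL:  aws.Int64(60),
                        ResourceRecords: []*route53.ResourceRecord{
                            {Value: aws.String("10.44.38.112")}, // one value per node
                            {Value: aws.String("10.44.39.10")},
                        },
                    },
                },
            },
        },
    })
    if err != nil {
        panic(err)
    }
}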

justinsb added a commit to justinsb/kops that referenced this issue Nov 15, 2017
Will use half the operations vs REMOVE + ADD.

Issue kubernetes#3121
@justinsb
Member

Great suggestion on using upsert, @bcorijn... I implemented upsert in the upstream DNS library, but I guess I never actually managed to start using it in kops (it took a while to merge). Upsert is #3859

@justinsb
Member

I had a go at a more complete fix in #3860 as well. I think we want to vendor the dnsprovider library, which makes it fairly intrusive - but I feel like we should have vendored dnsprovider a long time ago so we could fix this stuff faster.

@justinsb
Member

We do have a fix to use upsert now, which will be in the next 1.8 release (#3859). I think #3860 is too invasive for (this stage of) the 1.8 release cycle. So I'm going to point this at 1.9 instead, but keep it as blocks-release.

justinsb added this to the 1.9 milestone Nov 18, 2017
@bcorijn
Contributor Author

bcorijn commented Nov 20, 2017

Good to hear this got into 1.8; moving to an upsert should already push the limit a lot higher.

@chrislovecnm
Contributor

So this is blocks-next for 1.9 now?

@egeland

egeland commented Nov 23, 2017

This is affecting us, too - on k8s 1.7.10

@chrislovecnm
Contributor

This is addressed in kops 1.8.x, with more changes coming in 1.9.x.

k8s-github-robot pushed a commit that referenced this issue Dec 15, 2017
Automatic merge from submit-queue.

Copy dnsprovider into our code, implement route53 batching

Fixes #3121
@egeland

egeland commented Feb 22, 2018

@justinsb We just tried the 1.8.0 dns-controller image in our k8s 1.7.10 cluster, and it's still throwing a change-batch limit error, this time against the RDATA character limit:

W0222 00:24:42.376639       1 dnscontroller.go:303] error applying DNS changeset for zone example.com.::Z3JCJUN63OZMY6: InvalidChangeBatch: RDATA character limit of 32000 exceeded.
	status code: 400, request id: c5b9c830-1766-11e8-b533-994079acce2a
W0222 00:24:42.376676       1 dnscontroller.go:119] Unexpected error in DNS controller, will retry: error applying DNS changeset for zone example.com.::Z3JCJUN63OZMY6: InvalidChangeBatch: RDATA character limit of 32000 exceeded.
	status code: 400, request id: c5b9c830-1766-11e8-b533-994079acce2a

Definitely running the image:

kubectl --namespace=kube-system get deploy,rs,pod -owide |grep dns-cont
deploy/dns-controller        1         1         1            1           263d      dns-controller            kope/dns-controller:1.8.0                                                                                                                                                  k8s-app=dns-controller

rs/dns-controller-2358905950        1         1         1         26m       dns-controller            kope/dns-controller:1.8.0                                                                                                                                                  k8s-app=dns-controller,pod-template-hash=2358905950
rs/dns-controller-3330613807        0         0         0         263d      dns-controller            kope/dns-controller:1.6.1                                                                                                                                                  k8s-app=dns-controller,pod-template-hash=3330613807
rs/dns-controller-3628016177        0         0         0         103d      dns-controller            kope/dns-controller:1.7.1                                                                                                                                                  k8s-app=dns-controller,pod-template-hash=3628016177
rs/dns-controller-670048740         0         0         0         263d      dns-controller            kope/dns-controller:1.6.1                                                                                                                                                  k8s-app=dns-controller,pod-template-hash=670048740
rs/dns-controller-967320038         0         0         0         103d      dns-controller            kope/dns-controller:1.7.1                                                                                                                                                  k8s-app=dns-controller,pod-template-hash=967320038

po/dns-controller-2358905950-8g3cs                                           1/1       Running   0          26m       10.44.38.112    ip-10-44-38-112.ap-southeast-2.compute.internal

@egeland

egeland commented Feb 22, 2018

We built dns-controller from master and dropped the batch size from the hardcoded 900 to 100 (we tried 500, but that failed), and it worked.

@pwillie is working on a PR to make the batch size an argument
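For anyone curious, here is a rough sketch of what batching a change set amounts to; splitChanges and the numbers are illustrative, not the actual dns-controller implementation:

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/route53"
)

// splitChanges chunks a large change set into batches of at most batchSize
// changes; each batch would then be submitted as its own ChangeBatch via
// ChangeResourceRecordSets.
func splitChanges(changes []*route53.Change, batchSize int) [][]*route53.Change {
    var batches [][]*route53.Change
    for len(changes) > 0 {
        n := batchSize
        if n > len(changes) {
            n = len(changes)
        }
        batches = append(batches, changes[:n])
        changes = changes[n:]
    }
    return batches
}

func main() {
    // Build a dummy change set of 2,500 UPSERTs and show how many batches result.
    var changes []*route53.Change
    for i := 0; i < 2500; i++ {
        changes = append(changes, &route53.Change{Action: aws.String(route53.ChangeActionUpsert)})
    }
    fmt.Println(len(splitChanges(changes, 100)), "batches") // prints "25 batches"
}

Note that Route53's per-request limits (1,000 ResourceRecord elements and 32,000 characters of record data) count the values inside the changes rather than the changes themselves, so record sets carrying one value per node can still overflow a batch of 500 changes, which would explain why only the smaller batch size worked here.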

@pwillie
Contributor

pwillie commented Feb 22, 2018

Pull request here: #4496

@bcorijn
Contributor Author

bcorijn commented Feb 22, 2018

Yeah, I ran into it again myself a few days ago on 1.8. The PR looks good as a way to give people who suffer from this a workaround!

@gertjan-carbon

Just ran the 1.9.0-alpha.2 controller and still ran into this.

@kumudt

kumudt commented Apr 10, 2018

Can this be cherry-picked to 1.8.x? Or is there a way to fix this temporarily on the 1.8.0 dns-controller?

@gertjan-carbon

gertjan-carbon commented Apr 10, 2018

By adding the flag --route53-batch-size=100.
However, I'm running the 1.9.0 dns-controller instead of the 1.8.0 (I just changed the image in the YAML that was deployed by kops).
This solved the issue for us.

@kumudt

kumudt commented Apr 10, 2018

For me, setting it to 50 worked. Thanks.
