[BUGFIX] do not break when TCP DNS answer exceeds 64k #3948
Conversation
This avoids breaking service discovery when a service has a large number of instances (works with SRV and A* records). Fixes hashicorp#3850
Added a debug log to give information about truncation.
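A minimal, illustrative sketch (not Consul's exact code) of the idea behind this patch: drop records from the end of the answer until msg.Len() reports a size below the 64 KB DNS-over-TCP limit, and tell the caller whether anything was dropped so it can emit the debug log mentioned above. The package name, constant, and trimAnswers helper are made up for this sketch.

```go
package dnsutil

import "github.com/miekg/dns"

const maxTCPAnswerSize = 65535 // hard size limit for a DNS message over TCP

// trimAnswers removes trailing Answer (and matching Extra) records until the
// message fits; it returns true when the response had to be truncated.
func trimAnswers(msg *dns.Msg) bool {
	truncated := false
	for len(msg.Answer) > 0 && msg.Len() > maxTCPAnswerSize {
		msg.Answer = msg.Answer[:len(msg.Answer)-1]
		if len(msg.Extra) > 0 {
			msg.Extra = msg.Extra[:len(msg.Extra)-1]
		}
		truncated = true
	}
	return truncated
}
```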
// Beyond 2500 records, performance gets bad
// Limit the number of records at once, anyway, it won't fit in 64k
// For SRV Records, the max is around 500 records, for A, less than 2k
Needs to apply the right max for SRV records; you can get that from the Qtype field: req.Question[0].Qtype == dns.TypeSRV
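A hypothetical helper sketching the suggestion above: pick the record cap from the question's Qtype. The limits mirror the numbers quoted in the code comment ("around 500" for SRV, "less than 2k" for A) and are not the exact values used by Consul; the package name and maxRecordsFor are illustrative.

```go
package dnsutil

import "github.com/miekg/dns"

const (
	maxSRVRecords = 500  // SRV answers carry more data, so fewer fit in 64k
	maxARecords   = 2000 // A/AAAA answers are smaller
)

// maxRecordsFor returns the per-type record cap for the incoming request.
func maxRecordsFor(req *dns.Msg) int {
	if len(req.Question) > 0 && req.Question[0].Qtype == dns.TypeSRV {
		return maxSRVRecords
	}
	return maxARecords
}
```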
DONE
agent/dns_test.go (Outdated)
service, question, protocol, numServices, out)
t.Fatalf("Should have truncate:=%v for %s", shouldBeTruncated, info)
}
})
Also verify that the size of the answer is as expected when it was truncated
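A rough sketch of the extra check requested here (not the assertion actually added to agent/dns_test.go): pack the truncated reply and make sure it really fits under the 64 KB TCP limit while still carrying records. verifyTruncatedSize is a hypothetical helper name.

```go
package agent

import (
	"testing"

	"github.com/miekg/dns"
)

// verifyTruncatedSize packs the truncated response and checks it stays under
// the DNS-over-TCP size limit without dropping every record.
func verifyTruncatedSize(t *testing.T, out *dns.Msg) {
	packed, err := out.Pack()
	if err != nil {
		t.Fatalf("failed to pack truncated response: %v", err)
	}
	if len(packed) > 65535 {
		t.Fatalf("truncated response is still too large: %d bytes", len(packed))
	}
	if len(out.Answer) == 0 {
		t.Fatalf("truncated response should still contain some records")
	}
}
```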
DONE
Stricter unit tests that check whether truncation was OK.
See hashicorp#3977. While trying to further improve hashicorp#3948 (this pull request is still valid since we are not using compression to compute the result anyway), I saw a strange behaviour in the dns library: msg.Len() and len(msg.Pack()) disagree on the message length. Thus, the computed size of the DNS response is wrong, because Consul relies on msg.Len() instead of the result of Pack(). This is linked to miekg/dns#453 and a fix has been provided with miekg/dns#454. Would it be possible to upgrade miekg/dns to a more recent version? Consul might for instance upgrade to a post-1.0 release such as https://github.com/miekg/dns/releases/tag/v1.0.4
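A small standalone sketch of the discrepancy described above: for the same message, msg.Len() and the length of the bytes returned by msg.Pack() can disagree (especially with Compress enabled), which is what miekg/dns#453 and miekg/dns#454 address. The service and node names below are made up.

```go
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

func main() {
	msg := new(dns.Msg)
	msg.SetQuestion("web.service.consul.", dns.TypeSRV)
	msg.Compress = true

	// Add a few SRV records so label compression actually has something to do.
	for i := 0; i < 3; i++ {
		rr, err := dns.NewRR(fmt.Sprintf(
			"web.service.consul. 30 IN SRV 1 1 8080 node%d.node.dc1.consul.", i))
		if err != nil {
			panic(err)
		}
		msg.Answer = append(msg.Answer, rr)
	}

	packed, err := msg.Pack()
	if err != nil {
		panic(err)
	}
	// On affected versions of the library these two values can differ, which
	// makes any truncation decision based on msg.Len() unreliable.
	fmt.Printf("msg.Len() = %d, len(msg.Pack()) = %d\n", msg.Len(), len(packed))
}
```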
As soon as #3978 is merged, I'll be able to re-enable compression of the response so that more results can be returned without reaching the 64k limit. That is not possible for now, since Len() mis-computes the size of Pack() due to a bug in the DNS library. I propose to re-add compression in another pull request after #3978 is merged. This patch already works well and avoids returning no response at all for large services.
@preetapan Ok, I thought I could improve this patch by using compression while computing the answer (basically, returning more than 64K of uncompressed data and therefore more SRV records), which is why I created the DNS upgrade issue (#3977). Unfortunately, even in dns 1.0.4 the bug still exists in the DNS library, as reported here with a test case: miekg/dns#657. It means this patch is the best we can do until the DNS library is fixed. In the long term, it would be possible to improve performance a bit by performing a dichotomic (binary) search for the optimal number of records, but I will submit a subsequent patch for that once the compression computation is fixed in the DNS library itself. So this patch already works well and allows Consul to answer DNS queries even when there are many service instances (you hit the limit very quickly if you have long DNS prefixes; in our clusters, around 100 nodes are enough to break DNS responses).
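A hypothetical sketch of the "dichotomic search" optimization mentioned above (not part of this PR): instead of dropping records one by one, binary-search the largest answer count whose size still fits under the 64 KB limit. Package name, constant, and truncateBinary are illustrative.

```go
package dnsutil

import "github.com/miekg/dns"

const maxTCPAnswerSize = 65535

// truncateBinary keeps the largest prefix of the Answer section whose size,
// as reported by msg.Len(), stays within the limit.
func truncateBinary(msg *dns.Msg) {
	answers := msg.Answer
	lo, hi := 0, len(answers)
	for lo < hi {
		mid := (lo + hi + 1) / 2
		msg.Answer = answers[:mid]
		if msg.Len() <= maxTCPAnswerSize {
			lo = mid // the first mid answers still fit, try to keep more
		} else {
			hi = mid - 1 // too big, keep fewer
		}
	}
	msg.Answer = answers[:lo]
}
```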
This all seems to work as expected. Responses get truncated to acceptable lengths prior to DNS label compression. This resulted in actual message sizes ranging from 25K to 37K in my testing although the service names involved should be able to be heavily compressed.
The new tests work as expected, as did some of my own manual testing.
@mkeeler a patch is in progress on the DNS side: miekg/dns#657. This patch in Consul also dramatically improves response time for very large clusters, since it truncates to a maximum number of records. The next optimization is a dichotomic search for the max number of entries, which should improve performance further. I have a patch ready for that as well; I planned to push it once the compression issue is fixed, since the max number of entries will grow, but I can submit it before the DNS fix is merged if you want.
@pierresouchay No need to go ahead and submit the other PR. I was just commenting to note the current behavior. Originally I was expecting to see ~64k packets on the wire, but Wireshark was showing me much smaller ones. I figured out why, but thought I would note it. It all looks good to me and does help response times. I am checking with @preetapan to make sure she didn't want a second look before merging this.
LGTM!