
[BUGFIX] do not break when TCP DNS answer exceeds 64k #3948

Merged (7 commits) on Mar 30, 2018

Conversation

pierresouchay
Contributor

This avoids breaking discovery when a service has a large number of instances (works with SRV and A-type records).

Fixes #3850


// Beyond 2500 records, performance gets bad
// Limit the number of records at once, anyway, it won't fit in 64k
// For SRV Records, the max is around 500 records, for A, less than 2k
Contributor

Needs to apply the right max for SRV records; you can get that from the Qtype field: req.Question[0].Qtype == dns.TypeSRV.
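A minimal Go sketch of that suggestion (the constants and the helper name are illustrative, not the actual values or code from this diff):

```go
package agent

import "github.com/miekg/dns"

// Illustrative limits only; the real values live in the PR's diff.
const (
	maxARecords   = 2000 // A answers stay under ~64k up to roughly this count
	maxSRVRecords = 500  // SRV answers are larger, so the cap must be lower
)

// maxRecordsForQuery is a hypothetical helper showing the reviewer's
// suggestion: pick the record cap from the question type.
func maxRecordsForQuery(req *dns.Msg) int {
	if len(req.Question) > 0 && req.Question[0].Qtype == dns.TypeSRV {
		return maxSRVRecords
	}
	return maxARecords
}
```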

Contributor Author

DONE

service, question, protocol, numServices, out)
t.Fatalf("Should have truncate:=%v for %s", shouldBeTruncated, info)
}
})
Contributor

Also verify that the size of the answer is as expected when it was truncated
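For example, the test could pack the truncated response and assert that it still fits in a single TCP DNS message (a sketch only, assuming out is the *dns.Msg response built in the test above):

```go
// Sketch of the extra assertion (assumes out is the *dns.Msg response above).
wire, err := out.Pack()
if err != nil {
	t.Fatalf("could not pack truncated response: %v", err)
}
if len(wire) > 65535 {
	t.Fatalf("truncated response is still too large: %d bytes", len(wire))
}
```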

Contributor Author

DONE

pierresouchay added a commit to pierresouchay/consul that referenced this pull request Mar 21, 2018
See hashicorp#3977

While trying to further improve hashicorp#3948 (that pull request is still valid, since we are not using compression to compute the result anyway), I noticed a strange behaviour in the dns library: msg.Len() and len(msg.Pack()) disagree on the message length.

As a result, the size calculation of the DNS response is wrong, because Consul relies on msg.Len() instead of the result of Pack().

This is linked to miekg/dns#453 and a fix has been provided with miekg/dns#454.

Would it be possible to upgrade miekg/dns to a more recent version?

Consul might for instance upgrade to a post-1.0 release such as https://github.com/miekg/dns/releases/tag/v1.0.4
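A small Go sketch (not part of this PR) of how the two values can be compared with the miekg/dns API; with many answers appended, Len() and the packed length can diverge as described above:

```go
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

// Minimal sketch: msg.Len() is an estimate of the message size, while
// Pack() produces the bytes actually sent on the wire.
func main() {
	msg := new(dns.Msg)
	msg.SetQuestion("web.service.consul.", dns.TypeSRV)
	// In a real response, many SRV/A records would be appended to msg.Answer here.

	estimated := msg.Len()
	wire, err := msg.Pack()
	if err != nil {
		panic(err)
	}
	fmt.Printf("msg.Len()=%d len(msg.Pack())=%d\n", estimated, len(wire))
}
```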
@pierresouchay
Contributor Author

As soon as #3978 is merged, I'll be able to re-enable compression of responses so that more results can be returned without hitting the 64k limit.

That is not possible for now, since Len() miscomputes the size that Pack() produces, due to a bug in the DNS library.

I propose to re-add compression in another pull request after #3978 is merged.

This patch already works well and prevents large services from getting no response at all.

preetapan pushed a commit that referenced this pull request Mar 28, 2018
@pierresouchay
Contributor Author

@preetapan OK, I thought I could improve this patch by using compression while computing the answer (basically, returning more than 64K of uncompressed data and therefore more SRV records), which is why I created the DNS upgrade issue (#3977). Unfortunately, even in dns 1.0.4 the bug still exists in the DNS library, as reported here with a test case: miekg/dns#657

It means this patch is the best we can do until the DNS library is fixed.

In the long term, it would be possible to improve performance a bit by performing a dichotomic search for the optimal number of records, but I will submit a subsequent patch for that once the compression computation is fixed in the DNS library itself (a rough sketch of the idea follows this comment).

So this patch is already working well and allows Consul to answer DNS queries even when there are many service instances (you hit the limit very quickly if you have long DNS prefixes; in our clusters, around 100 nodes are enough to break DNS responses).
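For reference, the dichotomic (binary) search mentioned above could look roughly like this; a sketch only, assuming truncation simply keeps the first N answers, with a hypothetical function name:

```go
package agent

import "github.com/miekg/dns"

// largestAnswerCountThatFits finds the largest answer count whose packed
// size still fits in maxSize. Illustrative sketch, not the PR's code.
func largestAnswerCountThatFits(resp *dns.Msg, maxSize int) int {
	lo, hi, best := 1, len(resp.Answer), 0
	for lo <= hi {
		mid := (lo + hi) / 2
		trimmed := resp.Copy() // work on a copy so resp stays intact
		trimmed.Answer = trimmed.Answer[:mid]
		wire, err := trimmed.Pack()
		if err == nil && len(wire) <= maxSize {
			best = mid // mid records fit; try to keep more
			lo = mid + 1
		} else {
			hi = mid - 1 // too large (or unpackable); keep fewer
		}
	}
	return best
}
```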

@mkeeler mkeeler self-requested a review March 30, 2018 16:05
@mkeeler mkeeler left a comment (Member)

This all seems to work as expected. Responses get truncated to acceptable lengths prior to DNS label compression. This resulted in actual message sizes ranging from 25K to 37K in my testing although the service names involved should be able to be heavily compressed.

The new tests work as expected, and so did my own manual testing.

@pierresouchay
Contributor Author

@mkeeler A patch is in progress on the DNS side: miekg/dns#657. This patch in Consul also dramatically improves response time for very large clusters, since it truncates to a maximum number of records. The next optimization is a dichotomic search for the maximum number of entries, which should improve performance further. I have a patch ready for that as well; I planned to push it once the compression issue is fixed, since the maximum number of entries will grow, but I can submit it before the DNS fix is merged if you want.

@mkeeler
Member

mkeeler commented Mar 30, 2018

@pierresouchay No need to go ahead and submit the other PR. I was just commenting to note the current behavior. Originally I was expecting to see ~64k packets on the wire, but Wireshark was showing me much smaller ones. I figured out why, but thought I would note it. It all looks good to me and does help response times. I am checking with @preetapan to make sure she didn't want a second look before merging this.

@preetapan preetapan left a comment (Contributor)

LGTM!
