DoH/DoT/TCP-based lookups and connection re-use #439

@phillip-stephens phillip-stephens commented Sep 10, 2024

Adds "name server stickiness" to lookups: Resolvers will prioritize queries destined for the name server they are already connected to, which avoids having to re-handshake TCP/TLS/HTTPS.

Changes

  • added persistent TCP connections; previously each connection was created as a one-off and immediately thrown away
    • As long as a nameserver matches the IP/Port of the existing TCP connection, it'll be re-used
  • refactored wireLookup into wireLookupTCP and wireLookupUDP so the TCP variant could have connection info
  • added a priority queue and a global queue for each worker. Workers will prioritize work from their priority queue (whose items all share a single nameserver), but if no work is available there, they'll context switch to connecting to another nameserver from the global queue.
  • Edge Cases
    • AXFR - AXFR is unique in that if the user doesn't provide a name server, it first does an NS lookup for that domain. This means we should not suggest a NameServer for these lookups. It's a little gross, but I added a check inside the worker thread to discard the name server "suggestion" if we're doing AXFR. If a user specifies a name server (as opposed to our suggestion), it shouldn't be thrown away.
    • To handle the case where we have more name servers (and ordinarily more WorkerPools) than workers, I capped the number of pools at the --threads count. Using more name servers than threads will result in a performance penalty, since we can no longer send lookups for the same name server to the same workers.
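
To illustrate the re-use rule from the Changes list, here is a minimal sketch, not the code in this branch. The names connCache, get, and pipeDial are hypothetical, and the dialer is a stand-in built on net.Pipe so the example runs without a network; a real worker would dial TCP instead.

```go
package main

import (
	"fmt"
	"net"
)

// connCache holds at most one persistent connection for a worker.
// Hypothetical type for illustration; not zdns's actual structure.
type connCache struct {
	dial  func(addr string) (net.Conn, error) // injected so the sketch needs no network
	conn  net.Conn
	addr  string // IP:port the cached connection points at
	dials int    // number of fresh connections opened
}

// get returns the cached connection when it already points at addr,
// and otherwise closes it and opens a new one.
func (c *connCache) get(addr string) (net.Conn, error) {
	if c.conn != nil && c.addr == addr {
		return c.conn, nil // same IP/port: re-use, no new handshake
	}
	if c.conn != nil {
		c.conn.Close()
		c.conn, c.addr = nil, ""
	}
	conn, err := c.dial(addr)
	if err != nil {
		return nil, err
	}
	c.conn, c.addr = conn, addr
	c.dials++
	return conn, nil
}

// pipeDial is a stand-in dialer: it returns one end of an in-memory pipe
// and ignores addr. A real dialer would call net.Dial("tcp", addr).
func pipeDial(addr string) (net.Conn, error) {
	client, _ := net.Pipe()
	return client, nil
}

func main() {
	cache := &connCache{dial: pipeDial}
	servers := []string{"1.1.1.1:53", "1.1.1.1:53", "8.8.8.8:53", "8.8.8.8:53"}
	for _, ns := range servers {
		if _, err := cache.get(ns); err != nil {
			fmt.Println("dial failed:", err)
			return
		}
	}
	// Four lookups, but only two connections were opened.
	fmt.Println("lookups:", len(servers), "dials:", cache.dials)
}
```

Keying on the target address is what makes consecutive lookups to the same name server cheap; any switch to a different address pays for a new handshake.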

Overview

In #431 (which this branch is based on; I wanted this logic reviewed before merging into it, to break up the larger feature), I added functionality for DoH and DoT connections so that a given resolver would only re-handshake if the nameserver was different from the remote address on its existing connection.

The issue is that in ExternalLookup, if the user didn't provide a nameserver, a random one would be selected. Let's look at an example to see how this causes us to unnecessarily tear down connections:

./zdns google.com yahoo.com eBay.com --threads=2 --name-servers=1.1.1.1,8.8.8.8 --tls
And let's say that after the random NS selection, this gives us

In this toy example, thread #0 would tear down its TLS connection and re-establish one to 8.8.8.8 even though thread #1 already has a connection to 8.8.8.8.

Load Imbalance

As another design consideration, not all external resolvers behave equally.

I measured the resolution time for 7k queries and which resolver each query was sent to.

        IP        mean          max         std  count
0  1.0.0.1   94.638474  3996.846985  243.672404   1752
1  1.1.1.1  112.375685  5456.484652  313.006790   1781
2  8.8.4.4   35.020819  2615.063632  101.810439   1732
3  8.8.8.8   31.811647  1253.615863   70.430397   1711

This led to the threads responsible for Google queries sitting idle while the Cloudflare ones were busy working.
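
A rough way to see the size of the imbalance is to approximate the cumulative time spent on each resolver as mean × count, using the table above. This is only a back-of-the-envelope sketch: the names resolverStats and totalTime are hypothetical, the table's time unit is not stated (the ratio is unit-free), and it assumes queries to a resolver are handled one after another.

```go
package main

import "fmt"

// resolverStats mirrors one row of the measurement table.
type resolverStats struct {
	ip    string
	mean  float64 // mean resolution time, in the table's unit
	count int     // number of queries sent to this resolver
}

// totalTime approximates cumulative time on a resolver as mean * count.
func totalTime(s resolverStats) float64 {
	return s.mean * float64(s.count)
}

func main() {
	stats := []resolverStats{
		{ip: "1.0.0.1", mean: 94.638474, count: 1752},
		{ip: "1.1.1.1", mean: 112.375685, count: 1781},
		{ip: "8.8.4.4", mean: 35.020819, count: 1732},
		{ip: "8.8.8.8", mean: 31.811647, count: 1711},
	}
	cloudflare := totalTime(stats[0]) + totalTime(stats[1])
	google := totalTime(stats[2]) + totalTime(stats[3])
	fmt.Printf("Cloudflare/Google cumulative time ratio: %.1f\n", cloudflare/google)
}
```

Under those assumptions the Cloudflare resolvers account for roughly three times the cumulative resolution time of the Google ones, even though each received a similar number of queries.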

Design

To address both re-using TCP connections and dealing with load imbalance, this PR implements a Priority and Global work queue.

A new inputDeMultiplexor chooses an external NS for each input line and passes it to the assigned priority queue. Each priority queue is for queries to a specific name server, e.g. @1.1.1.1. If the priority queue is full, the demultiplexer will block on both the Priority and Global queues. In this way, we prioritize sending work to the threads which have a pre-existing connection to the name server, but also avoid large work imbalance issues.

Similarly, threads will prefer to read from their Priority queue; only if it is empty do they block on both the Priority and Global queues.
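
The two-queue behaviour described above maps naturally onto Go's select statement. The following is a minimal sketch under stated assumptions, not the code in this branch: query, route, and next are hypothetical names, and the queues are plain buffered channels that are never closed.

```go
package main

import "fmt"

// query is a hypothetical work item: a name plus its suggested name server.
type query struct {
	name       string
	nameServer string
}

// route is the demultiplexer side. It first tries the name server's
// priority queue without blocking; if that queue is full, it blocks
// until either the priority or the global queue has room.
func route(q query, priority, global chan<- query) {
	select {
	case priority <- q:
		return
	default: // priority queue is full
	}
	select {
	case priority <- q:
	case global <- q:
	}
}

// next is the worker side. It prefers the priority queue and only
// blocks on both queues when the priority queue is empty.
// fromPriority is false when the item came from the global queue,
// which is the case where a worker may need to re-handshake.
func next(priority, global <-chan query) (q query, fromPriority bool) {
	select {
	case q = <-priority:
		return q, true
	default: // nothing waiting in the priority queue
	}
	select {
	case q = <-priority:
		return q, true
	case q = <-global:
		return q, false
	}
}

func main() {
	priority := make(chan query, 1) // queue for @1.1.1.1, deliberately tiny
	global := make(chan query, 4)

	for _, name := range []string{"google.com", "yahoo.com", "ebay.com"} {
		route(query{name: name, nameServer: "1.1.1.1"}, priority, global)
	}
	for i := 0; i < 3; i++ {
		q, fromPriority := next(priority, global)
		fmt.Printf("%s fromPriority=%v\n", q.name, fromPriority)
	}
}
```

With a priority queue of capacity 1, only the first query lands in it; the other two overflow to the global queue, where any idle worker could pick them up.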

Tradeoff: Work Imbalance vs. Connection Re-use

Every time a worker takes a work item from the Global queue, it will have to re-handshake. However, without this Global/Priority queue split, workers with Priority on a very fast nameserver will sit idle when they could be doing work too.

I ran an experiment to check different points on this spectrum:

  • main - no TCP connection re-use
  • this branch - first try to pull from Priority, then block on either Global/Priority
  • 10 ms. wait - try to pull from Priority for 10 ms., then check either Global/Priority
  • 1 s. wait - same as above, but waiting 1 s.

As the blocking time increases, the odds that a thread will need to take an item from the Global queue to load-balance decrease. This reduces the number of new TCP handshakes but increases the runtime.
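
The "wait" conditions can be sketched as a worker that waits on its Priority queue alone for a bounded time before also accepting Global work. This is an illustrative sketch only; nextWithWait is a hypothetical name and plain string channels stand in for the real queues.

```go
package main

import (
	"fmt"
	"time"
)

// nextWithWait waits up to wait for an item on the priority queue.
// If none arrives in that window, it blocks on both queues and takes
// whichever item is available first. fromPriority reports the source.
func nextWithWait(priority, global <-chan string, wait time.Duration) (item string, fromPriority bool) {
	timer := time.NewTimer(wait)
	defer timer.Stop()

	// Phase 1: priority queue only, bounded by the timer.
	select {
	case item = <-priority:
		return item, true
	case <-timer.C:
	}

	// Phase 2: either queue.
	select {
	case item = <-priority:
		return item, true
	case item = <-global:
		return item, false
	}
}

func main() {
	priority := make(chan string, 1)
	global := make(chan string, 1)
	global <- "yahoo.com" // only Global work is available

	start := time.Now()
	item, fromPriority := nextWithWait(priority, global, 10*time.Millisecond)
	fmt.Printf("%s fromPriority=%v waited=%v\n",
		item, fromPriority, time.Since(start).Round(time.Millisecond))
}
```

A longer wait gives Priority work more time to arrive, so fewer items are taken from the Global queue, at the cost of the worker sitting idle for up to that long.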

[image: experiment results]

3 runs per condition.
7,000 domains run with "A", "--verbosity=3", "--threads=100", "--tcp-only", "--name-servers=1.1.1.1,1.0.0.1,8.8.8.8,8.8.4.4".

Unaffected

--iterative lookups will ignore this suggestion and choose a random root server. Since we don't support --tls or --https with --iterative anyway, this isn't a concern.

Performance

Using the benchmark (7k domains), I edited the command to use ./zdns A --name-servers=1.0.0.1,1.1.1.1,8.8.8.8,8.8.4.4 --threads=100 (external lookups).

  • main branch
    • Normal, UDP-based - 8.31 s. / 14,006 packets on lab VM (varies between 8-10s)
    • TCP-based - 9.82 s. / 70,012 packets
  • This branch
    • UDP-based - 6.60 s. / 14,004 packets
    • TCP-based - 10.29 s. / 44,226 packets
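
For a quick read of these numbers, the relative changes can be computed directly from the figures above. The helper pctChange is a hypothetical name, and the per-domain figure simply divides the packet count by the 7,000 domains in the benchmark.

```go
package main

import "fmt"

// pctChange returns the percentage change from before to after.
func pctChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	const domains = 7000.0

	// TCP packets per domain on main vs. this branch.
	fmt.Printf("TCP packets per domain: main %.1f, branch %.1f\n",
		70012/domains, 44226/domains)

	// Relative changes, this branch compared to main.
	fmt.Printf("TCP packets: %+.1f%%\n", pctChange(70012, 44226))
	fmt.Printf("TCP runtime: %+.1f%%\n", pctChange(9.82, 10.29))
	fmt.Printf("UDP runtime: %+.1f%%\n", pctChange(8.31, 6.60))
}
```

In other words, TCP traffic drops from about 10 packets per domain to about 6.3 (roughly 37% fewer packets), while TCP runtime is about 5% higher and UDP runtime about 21% lower in this single comparison.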

Testing

  • Tested --no-recycle-sockets --tcp-only to be sure that we're not creating persistent TCP connections if the user doesn't want that

@phillip-stephens phillip-stephens marked this pull request as ready for review September 11, 2024 20:14
@phillip-stephens phillip-stephens requested a review from a team as a code owner September 11, 2024 20:14
@phillip-stephens phillip-stephens changed the title DRAFT - DoH/DoT/TCP-based lookups and connection re-use DoH/DoT/TCP-based lookups and connection re-use Sep 11, 2024
@phillip-stephens

After talking with Zakir, we can make this quite a bit simpler if we just re-use the existing connection in ExternalLookup if the user/CLI doesn't suggest a new one. Closing this but leaving branch in case we ever want to re-visit. #445 has the new approach
