Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong CNAME reply time calculation (too fast) #1232

Closed
r2ixuz opened this issue Oct 27, 2021 · 26 comments
Closed

Wrong CNAME reply time calculation (too fast) #1232

r2ixuz opened this issue Oct 27, 2021 · 26 comments

Comments

@r2ixuz
Copy link

r2ixuz commented Oct 27, 2021

For example if I resolve the A record for "alive.github.com" it points to an CNAME record. Which is shown in the Pi-hole reply column as too fast: 13ms in this example where alone the RTT to Google DNS is for me at its best 24ms.

This bug is new since "Docker Tag 2021.10.1 Pi-hole v5.6 FTL v5.11 Web Interface v5.8".

@yubiuser yubiuser transferred this issue from pi-hole/web Oct 27, 2021
@DL6ER
Copy link
Member

DL6ER commented Oct 27, 2021

13ms in this example where alone the RTT to Google DNS is for me at its best 24ms.

How did you measure the RTT to Google's DNS server? As ping does not utilize UDP but ICMP, there might be a difference. We've seen ISP caching some DNS stuff on their own so - while a ping would indeed go to the Google servers - the known DNS query may in fact be answered a lot closer to you.

Have you verified this against a few manual dig queries?

We can enable the PCAP tracing embedded in Pi-hole if you want to investigate with us further.

For this, create a new file /etc/dnsmasq.d/99-pcap.conf with the following content:

dumpfile=/tmp/record.pcap
dumpmask=0xFFFF

When you restart pihole-FTL, it will immediately start recording. You can either share your recording here or we discuss further steps so you can analyze your file offline and we can compare the measured time here to the time displayed in the Query Log.

@r2ixuz
Copy link
Author

r2ixuz commented Oct 27, 2021

I know how the reply should be normally. Besides I use Smokeping to test multiple DNS dig probes against the upstreams I use.

Furthermore I did several dig test over the last couple of months and no one was faster than 20ms. And I can see this behaviour only since the last stable update as mentioned before. There are even sometimes responses under 10ms...

This behaviour occurs ONLY for CNAME response. A records does have normal repsonse times.

@r2ixuz
Copy link
Author

r2ixuz commented Oct 27, 2021

<<>> DiG 9.16.15 <<>> acme-v02.api.letsencrypt.org @dns2
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32383
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;acme-v02.api.letsencrypt.org.	IN	A

;; ANSWER SECTION:
acme-v02.api.letsencrypt.org. 7017 IN	CNAME	prod.api.letsencrypt.org.
prod.api.letsencrypt.org. 117	IN	CNAME	ca80a1adb12a4fbdac5ffcbc944e9a61.pacloudflare.com.
ca80a1adb12a4fbdac5ffcbc944e9a61.pacloudflare.com. 117 IN A 172.65.32.248

;; Query time: 26 msec

Bildschirmfoto 2021-10-27 um 21 01 58

@r2ixuz
Copy link
Author

r2ixuz commented Oct 27, 2021

Please don't bother with the different server names "dns2" and "Google". Both are the same local DoH instance which points only to Google DNS. On Pi-hole I use a "Local DNS Record" which points "dns2" IP to "Google" so it is better understandable in the Query Log.

@DL6ER
Copy link
Member

DL6ER commented Oct 27, 2021

Long-lived (and cached) CNAMEs with short-lived (and already expired) children need to be forwarded partially to get the final reply. Pi-hole measures the reply time from receiving the query from the client until the cached record was found. Then we start again measuring the delay of the new query (for the short-lived query) to the upstream. Finally, both are added and give the overall delay. This seems to be broken somehow in your case.

If you could do the pcap-recording, we'd exactly see how many milliseconds there are between

  1. query received from client
  2. sub-query sent upstream
  3. sub-query received from upstream
  4. complete reply sent to the client

and this should reveal what we are looking at here.

@r2ixuz
Copy link
Author

r2ixuz commented Oct 27, 2021

I think it has something todo with your last change in this stable release regarding CNAME deep inspection and my dnsmasq setting "all-servers". The other DNS upstream I use is normally way faster than the Google one. Maybe due to that it is getting mixed somehow.

How can I interpret this pcap file myself. For data privacy reasons I would not like to upload it, sorry.

@yubiuser
Copy link
Member

How can I interpret this pcap file myself.

https://www.wireshark.org/

@DL6ER
Copy link
Member

DL6ER commented Oct 27, 2021

Another idea is adding DEBUG_QUERIES=true to /etc/pihole/pihole-FTL.conf (create if it does not exist). After a restart, query details will be logged in /var/log/pihole-FTL.log with millisecond-accurate timestamps (even if the 100µs digit is missing).

How can I interpret this pcap file myself.

https://www.wireshark.org/

You can also check out tcpdump if you prefer a terminal application. But it is a lot harder to use this one compared to the graphical tool suggested by @yubiuser

@r2ixuz
Copy link
Author

r2ixuz commented Oct 27, 2021

Hm, I can see only "Standard query response" at Wireshark for one suspicious "too fast response" which comes from the usually faster other upstream. So the Google DNS does not appear with a response in Wireshark for that query even though the Pi-hole's Query Log states that Google DNS was fastest.

So now I understand where Pi-hole is getting this fast response time from. I still think it has something todo with the "all-servers" setting.

@DL6ER
Copy link
Member

DL6ER commented Oct 27, 2021

So now I understand where Pi-hole is getting this fast response time from. I still think it has something todo with the "all-servers" setting.

But why shouldn't this be correct?

@r2ixuz
Copy link
Author

r2ixuz commented Oct 27, 2021

Because when using the "all-servers" setting over the past years Pi-hole was not able to show which one was fastest of my both upstreams. It showed always the same upstream.

Pi-hole is only to able to make difference since, let me guess, the last 2 or 3 stable releases.

@DL6ER
Copy link
Member

DL6ER commented Oct 27, 2021

Yes, you are right. This was a change in one of the more recent releases (but not this one). Case closed?

@r2ixuz
Copy link
Author

r2ixuz commented Oct 27, 2021

Well no, because the shown server is still wrong for CNAMEs. Maybe I was not clear enough. I understand that the time is being shown is correct, but Pi-hole still shows the wrong upstream name/IP in its query log in such a case.

@DL6ER
Copy link
Member

DL6ER commented Oct 27, 2021

Okay, so this is a different problem.

It'll be helpful if you could isolate a CNAME where this went wrong in the pihole-FTL.log file (with debug logging enabled, see above) and provide it here. You are obviously free to obfuscate IP addresses due to privacy reasons, however, please mark somehow what was obfuscated.

@r2ixuz
Copy link
Author

r2ixuz commented Oct 27, 2021

[2021-10-27 22:26:12.805 25051M] **** new UDP IPv6 query[A] query "acme-v02.api.letsencrypt.org" from eth0:fd00::2e91:abff:fe26:e911#36597 (ID 53, FTL 30840, src/dnsmasq/forward.c:1575)
[2021-10-27 22:26:12.805 25051M] acme-v02.api.letsencrypt.org is not known
[2021-10-27 22:26:12.805 25051M] DNS cache: fd00::2e91:abff:fe26:e911/acme-v02.api.letsencrypt.org is not blocked
[2021-10-27 22:26:12.805 25051M] **** got cache reply: acme-v02.api.letsencrypt.org is (CNAME) (ID 53, src/dnsmasq/rfc1035.c:1502)
[2021-10-27 22:26:12.805 25051M] Set reply to CNAME (3) in src/dnsmasq_interface.c:2039
[2021-10-27 22:26:12.805 25051M] **** forwarded acme-v02.api.letsencrypt.org to 192.168.1.251#53 (ID 53, src/dnsmasq/forward.c:520)
[2021-10-27 22:26:12.805 25051M] **** forwarded acme-v02.api.letsencrypt.org to 192.168.1.252#53 (ID 53, src/dnsmasq/forward.c:520)
[2021-10-27 22:26:12.816 25051M] FTL_CNAME called with: src = (null), dst = acme-v02.api.letsencrypt.org, id = 53
[2021-10-27 22:26:12.816 25051M] acme-v02.api.letsencrypt.org is known as not to be blocked
[2021-10-27 22:26:12.816 25051M] Query 53: CNAME acme-v02.api.letsencrypt.org
[2021-10-27 22:26:12.816 25051M] **** got upstream reply from 192.168.1.251#53: acme-v02.api.letsencrypt.org is (CNAME) (ID 53, src/dnsmasq/rfc1035.c:782)
[2021-10-27 22:26:12.816 25051M] FTL_CNAME called with: src = acme-v02.api.letsencrypt.org, dst = prod.api.letsencrypt.org, id = 53
[2021-10-27 22:26:12.817 25051M] prod.api.letsencrypt.org is not known
[2021-10-27 22:26:12.817 25051M] DNS cache: fd00::2e91:abff:fe26:e911/prod.api.letsencrypt.org is not blocked
[2021-10-27 22:26:12.817 25051M] Query 53: CNAME acme-v02.api.letsencrypt.org ---> prod.api.letsencrypt.org
[2021-10-27 22:26:12.817 25051M] **** got upstream reply from 192.168.1.251#53: prod.api.letsencrypt.org is (CNAME) (ID 53, src/dnsmasq/rfc1035.c:782)
[2021-10-27 22:26:12.817 25051M] FTL_CNAME called with: src = prod.api.letsencrypt.org, dst = ca80a1adb12a4fbdac5ffcbc944e9a61.pacloudflare.com, id = 53
[2021-10-27 22:26:12.817 25051M] ca80a1adb12a4fbdac5ffcbc944e9a61.pacloudflare.com is not known
[2021-10-27 22:26:12.817 25051M] DNS cache: fd00::2e91:abff:fe26:e911/ca80a1adb12a4fbdac5ffcbc944e9a61.pacloudflare.com is not blocked
[2021-10-27 22:26:12.817 25051M] Query 53: CNAME prod.api.letsencrypt.org ---> ca80a1adb12a4fbdac5ffcbc944e9a61.pacloudflare.com
[2021-10-27 22:26:12.817 25051M] **** got upstream reply from 192.168.1.251#53: ca80a1adb12a4fbdac5ffcbc944e9a61.pacloudflare.com is 172.65.32.248 (ID 53, src/dnsmasq/rfc1035.c:893)

Bildschirmfoto 2021-10-27 um 22 28 11

So here you can see: Clearly dns1/192.168.1.251/Cloudflare is answering all queries but the Query Log shows that dns2/192.168.1.252/Google answered fastest which is wrong.

@DL6ER
Copy link
Member

DL6ER commented Oct 27, 2021

Thanks for the log snippet, this revealed where the problem is: The check if another upstream replied than the last one we forwarded a query to is skipped when the reply type is already known. This is the case here due to

[2021-10-27 22:26:12.805 25051M] Set reply to CNAME (3) in src/dnsmasq_interface.c:2039

This should be fixed by #1235

I triggered building a special docker container you could use to check if the bugfix is sufficient for you (tag fix_cached_cname_upstream_dest, it needs some time to be built until it can be downloaded). The fix for the other bug is already contained therein.

@r2ixuz
Copy link
Author

r2ixuz commented Oct 27, 2021

Thank you. I will test it tomorrow.

@r2ixuz
Copy link
Author

r2ixuz commented Oct 28, 2021

This Docker Image fixed it. Great job, thanks!

@r2ixuz r2ixuz closed this as completed Oct 28, 2021
@PromoFaux
Copy link
Member

Reopening as this bug is not yet fixed. Please keep in mind that docker tag will not be updated

awaiting merge of #1235

@PromoFaux PromoFaux reopened this Oct 28, 2021
@r2ixuz
Copy link
Author

r2ixuz commented Oct 28, 2021

@DL6ER Is it possible that this fix is also needed for SERVFAIL responses when using "all-servers"? Seems to be the case too. Doesn't really bother me, just for your information.

@DL6ER
Copy link
Member

DL6ER commented Oct 28, 2021

It's always needed with all-servers.

@r2ixuz
Copy link
Author

r2ixuz commented Oct 28, 2021

With the Docker image you provided for me I have the same "problem" for SERVFAIL responses like it was with CNAMES before your fix. So SERVFAIL responses have wrong response times (too fast) displayed in the Query Log. Your fix does not work there.

@DL6ER
Copy link
Member

DL6ER commented Oct 28, 2021

Thanks for testing and mentioning this. Upstream error handling is differently than successful reply handling and, in the former case, detecting which server was the earliest is missing so queries always show the list server that we sent the query to. I'll check this out over the next days and come back to you

@DL6ER
Copy link
Member

DL6ER commented Oct 31, 2021

@r2ixuz The custom docker tag fix_cached_cname_upstream_dest will be updated in about half an hour with another fix (#1237) which should also cover the SERVFAIL/REFUSED/... replies from multiple upstreams.

@r2ixuz
Copy link
Author

r2ixuz commented Oct 31, 2021

Looks good. Your fix is working! Thanks.

@DL6ER
Copy link
Member

DL6ER commented Dec 22, 2021

The next version of FTL has been released. Please update and run

pihole checkout master

to get back on-track. The fix/feature branch you switched to will not receive any further updates.

Thanks for helping us to make Pi-hole better for us all!

If you have any issues, please either reopen this ticket or (preferably) create a new ticket describing the issues in further detail and only reference this ticket. This will help us to help you best.

@DL6ER DL6ER closed this as completed Dec 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants