systemd-resolved is very slow when DNS servers do not reply to ipv6 lookups #1737

Open
stephaje opened this issue Dec 1, 2024 · 14 comments

@stephaje

stephaje commented Dec 1, 2024

Your system information

  • Steam client version: 1731990050
  • SteamOS version: 3.6.20
  • Opted into Steam client beta?: No
  • Opted into SteamOS beta?: No
  • Have you checked for updates in Settings > System?: Yes

Please describe your issue in as much detail as possible:

On some networks, systemd-resolved is incapable of resolving DNS requests for FQDNs that lack an AAAA DNS record. This causes inconsistent network connectivity, where some games and services may operate perfectly, while others claim that they cannot make necessary connections.

For example, while trying to play Void Crew on the Deck, the following can be observed in systemd-resolved:

(deck@steamdeck ~)$ sudo journalctl -u systemd-resolved -f
Nov 30 14:45:32 steamdeck systemd-resolved[539]: Firing regular transaction 55746 for <player-auth.services.api.unity.com IN AAAA> scope dns on wlan0/* (validate=yes).
Nov 30 14:45:32 steamdeck systemd-resolved[539]: Using feature level UDP+EDNS0 for transaction 55746.
Nov 30 14:45:32 steamdeck systemd-resolved[539]: Using DNS server 192.168.1.1 for transaction 55746.
Nov 30 14:45:32 steamdeck systemd-resolved[539]: Sending query packet with id 55746 of size 63.
Nov 30 14:45:37 steamdeck systemd-resolved[539]: Timeout reached on transaction 55746.
Nov 30 14:45:37 steamdeck systemd-resolved[539]: Retrying transaction 55746, after switching servers.
Nov 30 14:45:37 steamdeck systemd-resolved[539]: Firing regular transaction 55746 for <player-auth.services.api.unity.com IN AAAA> scope dns on wlan0/* (validate=yes).
Nov 30 14:45:37 steamdeck systemd-resolved[539]: Using feature level UDP+EDNS0 for transaction 55746.
Nov 30 14:45:37 steamdeck systemd-resolved[539]: Sending query packet with id 55746 of size 63.
Nov 30 14:45:42 steamdeck systemd-resolved[539]: Timeout reached on transaction 55746.
Nov 30 14:45:42 steamdeck systemd-resolved[539]: Retrying transaction 55746, after switching servers.
Nov 30 14:45:42 steamdeck systemd-resolved[539]: Firing regular transaction 55746 for <player-auth.services.api.unity.com IN AAAA> scope dns on wlan0/* (validate=yes).
Nov 30 14:45:42 steamdeck systemd-resolved[539]: Using feature level UDP+EDNS0 for transaction 55746.
Nov 30 14:45:42 steamdeck systemd-resolved[539]: Sending query packet with id 55746 of size 63.
Nov 30 14:45:47 steamdeck systemd-resolved[539]: Timeout reached on transaction 55746.
Nov 30 14:45:47 steamdeck systemd-resolved[539]: Retrying transaction 55746, after switching servers.

This loop continues until the request times out:

(deck@steamdeck ~)$ resolvectl query player-auth.services.api.unity.com
player-auth.services.api.unity.com: resolve call failed: Query timed out

A ping of the same FQDN will resolve after 2+ minutes.

This appears to only happen with FQDNs that lack AAAA records.

Disabling systemd-resolved and replacing /etc/resolv.conf with a nameserver entry that points at the network's actual DNS server address instead of the loopback address fixes this issue. Void Crew works, and ping resolves the IP address in <2 seconds.

A longer version of my initial investigation can be found here: https://steamcommunity.com/app/1675200/discussions/1/4629233379372497278/?tscn=1733024973#c4629233379372509830

Steps for reproducing this issue:

  1. Run resolvectl query for an FQDN without an AAAA record

@fledermaus

I'm not seeing a repro here, so we may need to debug this on the specific setup(s) where you see the problem.

I assume from the logs above that you've cranked the log level for systemd-resolved up to debug (this is fine).

  • Have you made any config changes other than to the debug level?
  • Do you see a line that looks something like this in your full logs?
    Added NODATA cache entry for player-auth.services.api.unity.com IN AAAA 29s

For comparison, my resolvectl status looks like this:

Global
           Protocols: +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
    resolv.conf mode: foreign
Fallback DNS Servers: 1.1.1.1#cloudflare-dns.com 9.9.9.9#dns.quad9.net 8.8.8.8#dns.google 2606:4700:4700::1111#cloudflare-dns.com
                      2620:fe::9#dns.quad9.net 2001:4860:4860::8888#dns.google
          DNS Domain: lan

Link 3 (wlan0)
    Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.1.1
       DNS Servers: 192.168.1.1
        DNS Domain: lan

Link 5 (enp4s0f3u1u4c2)
    Current Scopes: none
         Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported

And the lookup for the FQDN in question results in the following log lines:

Got message type=method_call sender=:1.415 destination=org.freedesktop.resolve1 path=/org/freedesktop/resolve1 interface=org.freedesktop.resolve1.Manager member=ResolveHostname  cookie=2 reply_cookie=0 signature=isit error-name=n/a error-message=n/a
idn2_lookup_u8: player-auth.services.api.unity.com → player-auth.services.api.unity.com
Sent message type=method_call sender=n/a destination=org.freedesktop.DBus path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=GetConnectionUnixProcessID cookie=22 reply_cookie=0 signature=s error-name=n/a error-message=n/a
Got message type=method_return sender=org.freedesktop.DBus destination=:1.409 path=n/a interface=n/a member=n/a  cookie=4294967295 reply_cookie=22 signature=u error-name=n/a error-message=n/a
D-Bus hostname resolution request from client PID 29565 (resolvectl) with UID 1000
Looking up RR for player-auth.services.api.unity.com IN A.
Looking up RR for player-auth.services.api.unity.com IN AAAA.
Sent message type=method_call sender=n/a destination=org.freedesktop.DBus path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=AddMatch cookie=23 reply_cookie=0 signature=s error-name=n/a error-message=n/a
Sent message type=method_call sender=n/a destination=org.freedesktop.DBus path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=GetNameOwner cookie=24 reply_cookie=0 signature=s error-name=n/a error-message=n/a
Got message type=method_return sender=org.freedesktop.DBus destination=:1.409 path=n/a interface=n/a member=n/a  cookie=4294967295 reply_cookie=24 signature=s error-name=n/a error-message=n/a
Removing cache entry for p2p-lhr1.discovery.steamserver.net IN A (expired 11s ago)
Cache miss for player-auth.services.api.unity.com IN AAAA
Firing regular transaction 27392 for <player-auth.services.api.unity.com IN AAAA> scope dns on wlan0/* (validate=yes).
Using feature level UDP+EDNS0 for transaction 27392.
Using DNS server 192.168.1.1 for transaction 27392.
Announcing packet size 1472 in egress EDNS(0) packet.
Emitting UDP, link MTU is 1500, socket MTU is 0, minimal MTU is 40
Sending query packet with id 27392 of size 63.
Cache miss for player-auth.services.api.unity.com IN A
Firing regular transaction 36896 for <player-auth.services.api.unity.com IN A> scope dns on wlan0/* (validate=yes).
Using feature level UDP+EDNS0 for transaction 36896.
Using DNS server 192.168.1.1 for transaction 36896.
Announcing packet size 1472 in egress EDNS(0) packet.
Emitting UDP, link MTU is 1500, socket MTU is 0, minimal MTU is 40
Sending query packet with id 36896 of size 63.
Got message type=method_return sender=org.freedesktop.DBus destination=:1.409 path=n/a interface=n/a member=n/a  cookie=4294967295 reply_cookie=23 signature= error-name=n/a error-message=n/a
Match type='signal',sender='org.freedesktop.DBus',path='/org/freedesktop/DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',arg0=':1.415' successfully installed.
Received dns UDP packet of size 133, ifindex=3, ttl=0, fragsize=0, sender=192.168.1.1, destination=192.168.1.101
Processing incoming packet of size 133 on transaction 27392 (rcode=SUCCESS).
Added NODATA cache entry for player-auth.services.api.unity.com IN AAAA 29s
Regular transaction 27392 for <player-auth.services.api.unity.com IN AAAA> on scope dns on wlan0/* now complete with <success> from network (unsigned; non-confidential).
Received dns UDP packet of size 79, ifindex=3, ttl=0, fragsize=0, sender=192.168.1.1, destination=192.168.1.101
Processing incoming packet of size 79 on transaction 36896 (rcode=SUCCESS).
Added positive unauthenticated non-confidential cache entry for player-auth.services.api.unity.com IN A 3588s on wlan0/INET/192.168.1.1
Regular transaction 36896 for <player-auth.services.api.unity.com IN A> on scope dns on wlan0/* now complete with <success> from network (unsigned; non-confidential).
Freeing transaction 27392.
Sent message type=method_return sender=n/a destination=:1.415 path=n/a interface=n/a member=n/a cookie=25 reply_cookie=2 signature=a(iiay)st error-name=n/a error-message=n/a
Sent message type=method_call sender=n/a destination=org.freedesktop.DBus path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=RemoveMatch cookie=26 reply_cookie=0 signature=s error-name=n/a error-message=n/a
Freeing transaction 36896.

Could you add your resolvectl status output and ideally your unredacted capture of the systemd-resolved journal entries corresponding to the slow lookup to this issue?

My guess is that it's something about the response that the DNS server is (or isn't) giving for the AAAA lookup that's tripping resolved up, but we'll need to figure out exactly what that is to make resolved work around it.

@fledermaus

Ah, found your resolvectl status in the Steam discussion thread, recording it here:

Global
           Protocols: +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
    resolv.conf mode: foreign
Fallback DNS Servers: 1.1.1.1#cloudflare-dns.com 9.9.9.9#dns.quad9.net 8.8.8.8#dns.google 2606:4700:4700::1111#cloudflare-dns.com
                      2620:fe::9#dns.quad9.net 2001:4860:4860::8888#dns.google

Link 2 (wlan0)
    Current Scopes: DNS LLMNR/IPv4
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.1.1
       DNS Servers: 192.168.1.1

@stephaje
Author

stephaje commented Dec 2, 2024

Glad to help; I'm on the same network for a couple more hours so I can hopefully get the rest of the info you need.

The only change I've made to configuration was the debug level, after trying to dig into this issue via systemd repo issue reports.

I'm not seeing any NODATA entries in either of my logs. I've attached the full resolved logs for a resolvectl query call (which times out) and a ping call (which eventually resolves).

ping-log.txt

resolvectl-log.txt
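
For anyone else comparing their own debug logs against the two cases discussed here, this is a minimal Python sketch that counts "Timeout reached" lines per transaction and lists any "Added NODATA cache entry" lines. The function name `summarise` and the two regular expressions are my own; the message formats are copied from the journal excerpts quoted earlier in this thread, and other systemd versions may word them differently.

```python
import re
from collections import Counter

# Message formats as they appear in the debug logs quoted in this thread.
TIMEOUT_RE = re.compile(r"Timeout reached on transaction (\d+)\.")
NODATA_RE = re.compile(r"Added NODATA cache entry for (\S+) IN (\S+)")


def summarise(lines):
    """Return (timeouts per transaction id, list of (name, rrtype) NODATA entries)."""
    timeouts = Counter()
    nodata = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[int(match.group(1))] += 1
            continue
        match = NODATA_RE.search(line)
        if match:
            nodata.append((match.group(1), match.group(2)))
    return timeouts, nodata
```

Passing it an open log file (for example one of the attachments above) should work, since iterating over a file yields lines. Repeated timeouts for one transaction with no NODATA entry for the same name would match the failing case described in this issue.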

@fledermaus

Thanks, we'll have a look at this and see if we can figure out what's going on.

FWIW we've tried on a bunch of networks so far and we don't have a repro (other than yours) yet so this is really useful.

@stephaje
Author

stephaje commented Dec 2, 2024

Yeah, I don't think I have this issue on my home network, but I'm seeing it while out of town for the holidays. I was assuming it was an ISP or network configuration issue, but https://test-ipv6.com/ reports a perfect score and that brings me to the threshold of my IPv6 knowledge/capabilities. Let me know if you need any other metadata in the next hour and I can get it for you.

@fledermaus

Ok. I don't think I need any more info. Might take a little while to fix or might be quick, depends on whether there's a tunable in the resolved code for this already.

Looks like this - when asked for an AAAA record that doesn't exist:

  • well-behaved network/DNS: says "I dunno" quickly
  • problematic setup: never answers at all

The problem appears to be that resolved is taking a really long time to realise it's been ghosted.

Chances are that test-ipv6 et al aren't picking this up because they have no way of knowing your local DNS setup gives you the silent treatment instead of saying "I don't know".
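
To tell the two behaviours apart without going through resolved, one option is to send a plain A or AAAA query straight to the DNS server and see whether anything comes back. The following is a rough Python sketch using only the standard library. The names `build_query`, `classify_reply` and `probe` are illustrative, the 2 second timeout is an arbitrary assumption, and the query carries no EDNS0 option, so it is not identical to what resolved sends.

```python
import socket
import struct

QTYPE_A = 1
QTYPE_AAAA = 28


def build_query(name: str, qtype: int, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query (RFC 1035 wire format, recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def classify_reply(reply: bytes) -> str:
    """Summarise a reply: NODATA means rcode 0 with zero answer records."""
    if len(reply) < 12:
        return "malformed"
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    rcode = flags & 0x000F
    if rcode == 0 and ancount == 0:
        return "NODATA"
    return f"rcode={rcode} answers={ancount}"


def probe(server: str, name: str, qtype: int,
          timeout: float = 2.0, port: int = 53) -> str:
    """Return 'silent' if no reply arrives before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (server, port))
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "silent"
    return classify_reply(reply)
```

Calling `probe("192.168.1.1", "player-auth.services.api.unity.com", QTYPE_AAAA)` should return "NODATA" on a well-behaved network and "silent" on a setup that never answers; the same call with `QTYPE_A` gives the comparison case. The sketch does not check that the reply id matches the query, which a real client should do.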

@fledermaus

Upstream issue:

systemd/systemd#22575

@stephaje
Author

stephaje commented Dec 2, 2024

Yeah, I saw that and also the response from someone of (seemingly some) authority who wants to label it purely a router issue. I think that's a bad take, but doubly so when considering the use of the service on a device intended to be used while traveling where users may have no control over network infrastructure.

There are other reports of this same issue for Void Crew specifically in various Steam Discussion threads, and I'm sure other games as well if it was dug into more.

Any ideas if there is an easier workaround in the meantime?

@fledermaus

I'll have to do some digging - I'll let you know if I find anything quicker/easier than your current workaround and we'll get this fixed, ideally upstream but for SteamOS at any rate.

@fledermaus

No obvious easy workaround, but I have managed to at least partially repro the problem with some slightly complicated iptables trickery to block the AAAA "don't know" responses so I can actually test possible fixes.
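
For anyone who wants a repro without touching firewall rules, a different approach from the iptables one mentioned above is a tiny stand-in DNS server that answers A queries and simply ignores AAAA queries. This is only a sketch under my own assumptions: `question_end`, `make_reply` and `serve` are made-up names, the answer is always the documentation address 192.0.2.1, and nothing here reflects how the actual fix was tested.

```python
import socket
import struct
import threading

QTYPE_A = 1
QTYPE_AAAA = 28


def question_end(packet: bytes) -> int:
    """Return the offset just past the first question (QNAME, QTYPE, QCLASS)."""
    pos = 12  # the DNS header is 12 bytes
    while packet[pos] != 0:
        pos += packet[pos] + 1
    return pos + 1 + 4


def make_reply(query: bytes, address: str = "192.0.2.1"):
    """Answer A queries with one fixed record; return None for AAAA (no reply)."""
    end = question_end(query)
    qtype = struct.unpack("!H", query[end - 4:end - 2])[0]
    if qtype == QTYPE_AAAA:
        return None  # the "silent treatment" from this issue
    query_id, question = query[:2], query[12:end]
    if qtype != QTYPE_A:
        # Any other type: success with zero answers (NODATA).
        return query_id + struct.pack("!HHHHH", 0x8180, 1, 0, 0, 0) + question
    header = query_id + struct.pack("!HHHHH", 0x8180, 1, 1, 0, 0)
    # Name pointer to offset 12, type A, class IN, TTL 60, 4 bytes of RDATA.
    answer = (b"\xc0\x0c" + struct.pack("!HHIH", QTYPE_A, 1, 60, 4)
              + socket.inet_aton(address))
    return header + question + answer


def serve(sock: socket.socket, stop: threading.Event) -> None:
    """Handle queries on an already-bound UDP socket until stop is set."""
    sock.settimeout(0.2)
    while not stop.is_set():
        try:
            query, client = sock.recvfrom(4096)
        except socket.timeout:
            continue
        try:
            reply = make_reply(query)
        except (IndexError, struct.error):
            continue  # ignore packets too short to parse
        if reply is not None:
            sock.sendto(reply, client)
```

To use it, bind a UDP socket to a local address and port and run `serve` in a thread. How to point a resolver at it depends on the test setup and is left out here; binding to port 53 normally needs elevated privileges.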

@kisak-valve changed the title from "systemd-resolved hangs for FQDNs without AAAA records" to "systemd-resolved is very slow when DNS servers do not reply to ipv6 lookups" on Dec 5, 2024

@deathblade201243

So I have an issue like this if I add DNS servers to /etc/systemd/resolved.conf, but only if TLS is enabled; otherwise they work normally. Adding a DNS server via System Settings > Connections > IPv4/IPv6 works perfectly fine, though. I don't know if this is exactly the same as this issue, but I felt it was worth mentioning.

@fledermaus

Anything TLS-related is going to be a separate issue. That would be DNS over TCP, and there's a whole other set of constraints, especially if you ask for TLS but the servers don't support it or get it wrong.

@fledermaus

Testing a possible fix internally. Not sure what the ETA will be or even if it's an acceptable fix at this point, but it does look fixable.

@fledermaus

systemd/systemd#35514
