mDNS packet flood when running multiple instances #50695

Closed
agners opened this issue May 15, 2021 · 13 comments
@agners
Member

agners commented May 15, 2021

The problem

It seems that multiple HA Core instances can cause an excessive number of mDNS packets (peaking at 1k/s in my network). I think it also needs at least one instance running Home Assistant OS with the hostname homeassistant. From what I understand, it is an interaction between systemd-resolved (running on Home Assistant OS as an mDNS resolver) and Home Assistant Core acting as an mDNS responder.

I am not sure if this is a Core issue, but it looks like it. After a quick discussion, @bdraco suggested opening an issue.

Note: For testing Home Assistant OS I often reinstall and end up with the default configuration, so this is certainly somewhat specific to my use case. But since homeassistant is the standard hostname, and quite a few people probably don't bother to change the hostname of a HAOS installation, somebody running a second installation (even just for a test) might run into the same problem.

What version of Home Assistant Core has the issue?

core-2021.5.3

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

zeroconf

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zeroconf/

Example YAML snippet

No response

Anything in the logs that might be useful for us?

This is a Wireshark snippet of the packet flood.

In that particular case, 192.168.10.10 is a regular Linux host with systemd-resolved running. 192.168.10.175 and 192.168.10.176 are Home Assistant OS installations with the hostname "homeassistant". All other IPs (188, 229) are Home Assistant OS installations with different hostnames.


602738	2721.634075959	192.168.10.10	224.0.0.251	MDNS	294	Standard query 0x0000 TXT Home._home-assistant._tcp.local, "QM" question TXT
602739	2721.634465740	192.168.10.176	224.0.0.251	MDNS	294	Standard query 0x0000 TXT Home._home-assistant._tcp.local, "QM" question TXT
602740	2721.634657760	192.168.10.188	224.0.0.251	MDNS	294	Standard query 0x0000 TXT Home._home-assistant._tcp.local, "QM" question TXT
602741	2721.634739824	192.168.10.175	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602742	2721.634816838	192.168.10.176	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602743	2721.635373051	192.168.10.175	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602744	2721.635772049	192.168.10.175	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602745	2721.637708419	192.168.10.229	224.0.0.251	MDNS	294	Standard query 0x0000 TXT Home._home-assistant._tcp.local, "QM" question TXT
602746	2721.638197767	192.168.10.176	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602747	2721.641551515	192.168.10.229	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602748	2721.648221720	192.168.10.229	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602749	2721.649120385	192.168.10.229	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602750	2721.650531351	192.168.10.229	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602751	2721.651805730	192.168.10.229	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602752	2721.652215909	192.168.10.188	224.0.0.251	MDNS	294	Standard query 0x0000 TXT Home._home-assistant._tcp.local, "QM" question TXT
602753	2721.652602323	192.168.10.229	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602754	2721.652742196	192.168.10.176	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602755	2721.655688059	192.168.10.229	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602756	2721.656877489	192.168.10.229	224.0.0.251	MDNS	288	Standard query response 0x0000 TXT, cache flush
602757	2721.691750741	192.168.10.10	224.0.0.251	MDNS	294	Standard query 0x0000 TXT Home._home-assistant._tcp.local, "QM" question TXT
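
To make a capture like the one above easier to read, here is a minimal sketch that counts queries and responses per source IP. It assumes Wireshark's plain-text export with tab-separated columns (No., Time, Source, Destination, Protocol, Length, Info); the function name `summarize_capture` is made up for illustration.

```python
from collections import Counter


def summarize_capture(lines):
    """Count mDNS queries and responses per source IP.

    Assumes tab-separated columns:
    No., Time, Source, Destination, Protocol, Length, Info.
    """
    queries, responses = Counter(), Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7 or fields[4] != "MDNS":
            continue  # skip blank lines and non-mDNS packets
        source, info = fields[2], fields[6]
        # Test for responses first: their Info text also starts
        # with "Standard query".
        if info.startswith("Standard query response"):
            responses[source] += 1
        elif info.startswith("Standard query"):
            queries[source] += 1
    return queries, responses


# Three lines taken from the capture above.
SAMPLE = [
    "602738\t2721.634075959\t192.168.10.10\t224.0.0.251\tMDNS\t294\t"
    'Standard query 0x0000 TXT Home._home-assistant._tcp.local, "QM" question TXT',
    "602741\t2721.634739824\t192.168.10.175\t224.0.0.251\tMDNS\t288\t"
    "Standard query response 0x0000 TXT, cache flush",
    "602743\t2721.635373051\t192.168.10.175\t224.0.0.251\tMDNS\t288\t"
    "Standard query response 0x0000 TXT, cache flush",
]

queries, responses = summarize_capture(SAMPLE)
```

Running this over the full export shows which hosts are asking and which are answering, which helps tell the query side from the responder side.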


Additional information

No response
@probot-home-assistant

Hey there @bdraco, mind taking a look at this issue as it's been labeled with an integration (zeroconf) you are listed as a codeowner for? Thanks!
(message by CodeOwnersMention)

@agners
Member Author

agners commented May 15, 2021

What is interesting is that the flood sometimes disappears for several minutes and then starts again.

After further investigation, it seems that hostnames are not really the issue here, but rather the fact that I am using the same instance name, Home, on all of them. Once I renamed all of them to different names, the flooding stopped and has now been gone for more than an hour.

When systemd-resolved detects a conflicting hostname, it renames itself. Maybe the same could be done with the instance name?

May 15 08:21:29 homeassistant systemd-resolved[393]: Detected conflict on homeassistant.local IN AAAA fd14:949b:c9cc::ec3                  
May 15 08:21:29 homeassistant systemd-resolved[393]: Hostname conflict, changing published hostname from 'homeassistant' to 'homeassistant2'.
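
A minimal sketch of what that could look like for instance names, mirroring the 'homeassistant' to 'homeassistant2' rename in the log above. This is only an illustration: `next_free_name` is a hypothetical helper, not something Core or systemd-resolved provides, and it assumes the set of names already seen on the network is known.

```python
def next_free_name(name, taken):
    """Return name unchanged if free, else name plus the lowest free suffix.

    Suffixes start at 2, so a conflicting "Home" becomes "Home2",
    then "Home3", and so on.
    """
    if name not in taken:
        return name
    suffix = 2
    while f"{name}{suffix}" in taken:
        suffix += 1
    return f"{name}{suffix}"


# Example: two instances named "Home" are already on the network.
print(next_free_name("Home", {"Home", "Home2"}))
```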

@thecode
Member

thecode commented May 16, 2021

I have a similar issue. I also have 2 instances using the same name, but I encountered a problem which started with this error:

2021-05-07 00:04:21 WARNING (zeroconf-Engine-242) [zeroconf] Choked at offset 32942 while unpacking b'\x00\x00\x00\x00\x00\x02\x00\x00\x00\x03\x00\x001homeassistant3 [46ebcbd293bf45359d585de0706dd53c]\x0c_workstation\x04_tcp\x05local\x00\x00\xff\x00\x01\x0ehomeassistant3\xc0P\x00\xff\x00\x010homeassistant [46ebcbd293bf45359d585de0706dd53c]\xc0>\x00\xff\x00\x01\xc0\x0c\x00\x10\x80\x01\x00\x00\x00x\x00\x01\x00\xc0\x0c\x00!\x80\x01\x00\x00\x00x\x00\x1c\x00\x00\x00\x00\x00\x00\x0ehomeassistant3\x05local\x00\xc0[\x00\x01\x80\x01\x00\x00\x00x\x00\x04\xc0\xa8\xc0d'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/zeroconf/__init__.py", line 745, in __init__
    self.read_others()
  File "/usr/local/lib/python3.8/site-packages/zeroconf/__init__.py", line 816, in read_others
    domain = self.read_name()
  File "/usr/local/lib/python3.8/site-packages/zeroconf/__init__.py", line 877, in read_name
    length = self.data[off]
IndexError: index out of range
2021-05-07 00:04:21 DEBUG (zeroconf-Engine-242) [zeroconf] Received from '192.168.1.100':5353 (socket 16): (236 bytes) 
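
The IndexError above is raised when the name reader indexes past the end of the buffer. As a rough illustration of reading a DNS name with explicit bounds checks (a standalone sketch following the RFC 1035 label and compression-pointer layout, not python-zeroconf's actual code), a malformed packet can be reported as a ValueError instead:

```python
from typing import Tuple


def read_name(data: bytes, offset: int) -> Tuple[str, int]:
    """Decode a DNS name at offset; return (name, offset after the name).

    Raises ValueError on truncated data, reserved label types,
    or a compression-pointer loop.
    """
    labels = []
    resume = None   # where parsing continues after the first pointer
    seen = set()    # pointer targets already followed
    while True:
        if offset >= len(data):
            raise ValueError(f"name runs past end of packet at {offset}")
        length = data[offset]
        if length == 0:                      # root label ends the name
            offset += 1
            break
        if length & 0xC0 == 0xC0:            # compression pointer
            if offset + 1 >= len(data):
                raise ValueError("truncated compression pointer")
            target = ((length & 0x3F) << 8) | data[offset + 1]
            if target in seen:
                raise ValueError("compression pointer loop")
            seen.add(target)
            if resume is None:
                resume = offset + 2
            offset = target
            continue
        if length & 0xC0:                    # 0x40 / 0x80 are reserved
            raise ValueError("reserved label type")
        end = offset + 1 + length
        if end > len(data):
            raise ValueError("label runs past end of packet")
        labels.append(data[offset + 1:end].decode("utf-8", "replace"))
        offset = end
    name = ".".join(labels) + "."
    return name, (resume if resume is not None else offset)
```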

The IP in the log is that of an HA OS VM, version 6.0.dev20210429.
I also started by investigating the hostname. The service name currently uses invalid characters, so in this file:
https://github.com/home-assistant/operating-system/blob/dev/buildroot-external/rootfs-overlay/usr/lib/systemd/dnssd/workstation.dnssd
I changed the name format from Name=%H [%m] to Name=%H-%m, which did not change anything.
Since I am running a dev OS version I have not opened an issue yet, but the two might be related. The log also indicates the name conflict, and running Wireshark shows that the first packet is malformed.

I discussed it briefly with @bdraco, but have not had time to investigate further yet.

@agners on which HA OS version did this issue occur? Does it also reproduce on a 5.x release?

@agners
Member Author

agners commented May 16, 2021

I am testing 6.0.rc1 (and later dev release) right now. But I am pretty sure it happened in 5.x releases already.

@bdraco
Member

bdraco commented May 16, 2021

We do have a guard against packet floods in python-zeroconf here https://github.com/jstasiak/python-zeroconf/blob/master/zeroconf/__init__.py#L1438

Most mDNS implementations don't actually pass the Bonjour conformance test (https://developer.apple.com/bonjour/) when it comes to conflicting names, so the best approach is usually to avoid the problem, if possible, by not using conflicting names. That seems difficult here, though, since we want the user to be able to reach homeassistant.local.

This does seem like a problem with systemd-resolved: I can see the behavior when I have multiple Operating System boxes on the network, but running 5 instances of Core in containers doesn't trigger the issue.

@frenck
Member

frenck commented May 17, 2021

Related to home-assistant/plugin-multicast#1 maybe?

@agners
Member Author

agners commented May 17, 2021

@bdraco afaict, the packet flood is not about the hostname, it's about the Home Assistant service announcement (see Home._home-assistant._tcp.local, with a dash in home-assistant and the instance name Home). You can disable an instance's systemd-resolved and it's still participating in the flood.

It does seem like systemd-resolved is involved, but I think it's on the query side. My guess is that systemd-resolved gets confused about who actually owns Home._home-assistant._tcp.local and starts a query. That then leads to all the responses with cache flush flags set. I have now renamed all instances to use unique names in Core (Configuration -> General, no more instances named Home) and the floods are gone.

@frenck hm, there are several reports; maybe it matches some of them, not sure. But it's not a loop between a router doing mDNS forwarding and Home Assistant. In my case there is no mDNS forwarding between VLANs enabled (in fact the instances are just on the regular LAN, no VLAN is used for them).

@thecode
Member

thecode commented May 17, 2021

I am more worried about the malformed mDNS packet sent by systemd-resolved. I can reproduce this easily on the test system I wrote about above, and Wireshark on that machine shows that the packet is sent out malformed, which may hint at memory corruption. Maybe it should be split off into a separate issue. What do you think?

@bdraco
Member

bdraco commented May 17, 2021

Looks like there are two places where we have a non-unique name.

In theory we should get a non-unique name exception when registering the second Home._home-assistant._tcp.local.

@bdraco
Member

bdraco commented May 17, 2021

There are a few unreleased zeroconf fixes that might help here, but they can't fix the DNS-SD conflict on homeassistant.local, since that comes from the systemd service.

@agners
Which version of core are your test machines running?

@agners
Member Author

agners commented May 17, 2021

FWIW, @bdraco and I debugged the issue quite a bit here.

The case I have been seeing is definitely related to duplicate instance names (in Configuration -> General). Once those are resolved, things work as expected.

However, for quite a while now (since ~October 2020) Home Assistant Core should detect duplicate instance names. This should lead to messages such as:

2021-05-17 16:59:56 ERROR (MainThread) [homeassistant.components.zeroconf] Home Assistant instance with identical name present in the local network

For some reason this message consistently fails to appear in my cases. It is probably related to the number of Zeroconf answers that need to be processed, plus maybe also the fact that this happens during startup, when the workload is high anyway. Increasing the wait time before registering the service to 10s+ (in async_check_service() in the Python zeroconf package) seems to help. Increasing the browse time beforehand, as #50784 does, will likely help too.
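
To illustrate why a longer wait can matter, here is a toy model, not the zeroconf package's logic: an announcement of our own name only counts if it arrives before the wait ends. The function `duplicate_detected` and the arrival times are made up for illustration.

```python
def duplicate_detected(own_name, arrivals, wait_seconds):
    """Return True if own_name was announced within the wait window.

    arrivals: iterable of (seconds_since_start, announced_name) pairs.
    DNS names compare case-insensitively, hence the lower() calls.
    """
    own = own_name.lower()
    return any(
        seconds <= wait_seconds and name.lower() == own
        for seconds, name in arrivals
    )


# Hypothetical timings: the conflicting answer is delayed by a busy startup.
ARRIVALS = [
    (0.4, "Kitchen._home-assistant._tcp.local."),
    (6.0, "Home._home-assistant._tcp.local."),
]
OWN = "Home._home-assistant._tcp.local."

print(duplicate_detected(OWN, ARRIVALS, wait_seconds=3))   # answer missed
print(duplicate_detected(OWN, ARRIVALS, wait_seconds=10))  # answer seen
```

With the short wait the delayed answer is never seen, so the duplicate goes unreported and the second instance registers the same name.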

@thecode yeah if the origin is systemd-resolved I would recommend checking with Wireshark to verify that the packet is indeed malformed. If so, then this needs a new issue in the OS repository.

@agners
Member Author

agners commented May 19, 2021

Retested with today's nightly; the duplicate instance now gets properly detected:

2021-05-19 20:06:51 INFO (MainThread) [homeassistant.components.zeroconf] Starting Zeroconf broadcast
2021-05-19 20:06:51 ERROR (MainThread) [homeassistant.components.zeroconf] Home Assistant instance with identical name present in the local network

This is most likely fixed with #50784, and maybe (also) by #50807.

@agners agners closed this as completed May 19, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Jun 18, 2021