-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zigbee is failing constantly, requires reloading/reinitialization #105506
Comments
Hey there @dmulcahey, @Adminiuga, @puddly, @TheJulianJES, mind taking a look at this issue as it has been labeled with an integration ( Code owner commandsCode owners of
(message by CodeOwnersMention) zha documentation |
Would you be able to enable debug logging for the ZHA integration and upload a complete log of it crashing and failing to reload? |
I did enable debug logging, and just reloaded it. Let me check if it's crashed yet... nope, still working. Will do so after it crashes. I had an older log file but I let it run too long and it became too big to upload, so just restarted debug logging again. |
So as an additional observation I have seen errors about the NCP failing to respond, watchdog failure, etc. and then I see it enter an "Initialization" state on its own, but succeeds. I have seen this twice now. I will wait for an unrecoverable error before uploading the logs but the recoverable ones seem to be happening every 5-10m or so (I assuming this is probably the Watchdog restarting it). |
Having the exact same problem! This was changed in 12.1 version: "Fix ZHA quirk ID custom entities matching all devices", could this be the problem? |
Ugh, it finally crashed but the debug log was still too big to upload (over 400MB!). It seems it appended to my old log rather than start a new one, also. I am going to manually edit it just to show the last bit. I will keep the full log around, let me know if you want me to extract any other sections. |
@mmccool In your log, did you manually reload the ZHA integration about five times? If so, can you describe why? Up until the point you reload it, the log looks normal. |
|
I've also witnessed issues with ZHA integration after upgrading to 2023.12. I will upload logs next time when it happens. Usually it breaks after restarting HASS. |
@Letalis I think you may have submitted your comment before your screenshot and log files actually finished uploading! Edit them into your original comment, the |
For the log, I did NOT manually reload it. For each of those five instances it entered the "Initialization" state on its own and restarted on its own. It succeeded restarting/reloading several times on its own until finally this self-initiated loop failed. I'm guessing these restarts were initiated by a watchdog timer. Also, I truncated the log. It did this a lot more than five times. |
Here the same problem: Logger: homeassistant.components.zha.core.cluster_handlers [0x4D68:1:0x0500]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')] |
Is this the same problem as reported in issue #105445? I also have the problem that all Zigbee and Matter devices are unavailable after upgrading HA Core to >= 2023.12.0. Zigbee, Matter and Thread integrations are permanently in a failure state. |
Same here. 66 devices / 443 entities via RaspBee II on HAOS 2023.12.1. After a few hours ZHA fails and does not recover by itself. Reload of ZHA helps instantly, but just for another few hours. Disabling ZHA's entity update polling doesn't help. CPU and memory is at bay. No WiFi on Zigbee channel. It was running flawlessly for months.
In the desperate I've enabled Source Routing, which takes off some load from the mesh/coordinator.
(configuration.yaml) For now it seems to work. I'll report, if that changes. What about the 38400 baud limit (which is just 4,8 kbytes/s), we have on the serial side (RaspBee II)? I've ordered a SMLIGHT SLZB-06M coordinator, maybe things are different with that type of hardware... |
+1 keeps getting offline and is not selfhealing, always worked until last update. Same logs as above. Now trying the option of [Citizen-2CB8A24A] Update : suggestion option did not work, only 24 devices 169 entities. |
I should also say I have a fair number of Zigbee devices: 49 devices, 376 entities. However, not so sure it's a bottleneck issue as it worked ok until recently - unless the last update made it significantly less efficient. I'm also running on a "large" CM4 (8GB) so I don't think memory is a problem. Unlike several commentors on #105445, who seem to have a very similar problem, I'm using "mainstream/supported" HW: the HA Yellow platform. Also, this issue it does try to restart (so does not stay in "Initializing" state) but tends to fail and get stuck in a "Failed to set up" state (typically after passing through "Failed to set up, retrying"...). Reloading tends to fail once, then succeeds after one retry - it does not succeed immediately. |
Just wanted to chime in... I was having problems with my SONOFF Zigbee 3.0 USB Dongle Plus on my X86 install. ZHA was failing every few hours after installing 2023.12.0. Restored my 2023.11.3 backup. Installed 2023.12.0 again... ZHA did the same thing again... Failing every few hours. Went to my 2023.11.3 backup again and ZHA has been solid on my system for days. I didn't download any logs because I thought it was something that I had done. Going to stay on 2023.11.3 for the time being. |
If you can flip back for a short while and capture logs it would be greatly appreciated |
Well, ZHA failed again this morning. Source Routing doesn't help.
|
A preventive reload would be an option (as a workaround). Is there a way to automate a reload of the integration every hour? |
Same thing has happened a few times for me, and just now. It failed and reinitialized a few times in quick succession before I manually intervened. I manually reloaded the integration and it coincidentally stopped failing. |
If I do a preventive reload or reboot of the HA Blue I have a living hell for a couple of minutes to get it stable again. |
It's not about reloading the box ("reboot"), it's just about reloading the integration. That happens very fast and might work for another hour without failure. If the integration fails, things are broken anyways. It cannot get worse. Recurring manual intervention cannot be the solution for the meantime. |
Please include debug logs in your comments that show ZHA running from startup, crashing, and not reloading. It'll really speed up figuring out the root cause of this problem. |
Please try the current beta |
I assume you mean current beta of Skyconnect firmware? |
No, Home Assistant beta: https://github.com/home-assistant/core/releases/tag/2024.1.0b3 |
Ok, thank you, will try... Once I get sober tommorow 🤣
Happy new year 🎉
niedz., 31 gru 2023, 20:55 użytkownik TheJulianJES ***@***.***>
napisał:
… No, Home Assistant beta:
https://github.com/home-assistant/core/releases/tag/2024.1.0b3
—
Reply to this email directly, view it on GitHub
<#105506 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AJS7S3OCNWMOFXOFW47H323YMG7MBAVCNFSM6AAAAABAQE5L4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZTGAZDOMZWGI>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Just installed 2024.1.b3, no joy. Now ZHA does not start at all, just loops between “Initializing”and “Failed to restart, will retry”. Will grab logs when I can. |
In my case installing HA 2024.1.b3 actually helped. Works without devices going offline for over 24h. However I have problem with one Zigbee thermometer. Since I updated my SkyConnect firmware as a process of troubleshooting this issue one thermometer looses signal and stops reporting. After restart (removing battery) it works for a while. When I move it closer it of course works all the time but thing is this thermometer was where it was for months and there was no issues with signal strengh. Now after updating firmware it kinda looks like that the signal strengh is weaker somehow?!?!? Note that thermometer doesn't connect to gateway directly but rather through smart socket as signal repeater. So it's also not like SkyConnect itself signal strengh is weaker. |
Since installing 2024.1.0, ZHA is failing every day at 00:01 (1 minute past midnight). So here I am at 00:15 uploading a log. :-/. Unfortunately logging wasn't turned on during the first failure, but I turned it on while it was cycling between initializing/ok/failed and turned off logging after a few cycles. I seem to have been able to get it to stop cycling by clicking the UI to add a device. https://1drv.ms/u/s!AnaRLop754fh0_4OT8fXjmfEJzPMBg?e=8BVFVm Here's some specific messages that might help: Logger: bellows.ezsp NCP entered failed state. Requesting APP controller restart Logger: bellows.zigbee.application Watchdog heartbeat timeout: EzspError('EZSP is not running') Logger: homeassistant.helpers.dispatcher Unable to remove unknown dispatcher <bound method GroupProbe._reprobe_group of <homeassistant.components.zha.core.discovery.GroupProbe object at 0x7fc3b4adfa50>> Logger: homeassistant.config_entries Error unloading entry HubZ Smart Home Controller - HubZ ZigBee Com Port, s/n: C1300B08 - Silicon Labs for sensor Logger: zigpy.zcl [0xC4BA:1:0x0019] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/clusters/general.py", line 2187, in _handle_cluster_request await self._handle_query_next_image( File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/clusters/general.py", line 2259, in _handle_query_next_image await self.query_next_image_response( File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/init.py", line 411, in reply return await self._endpoint.reply( ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 278, in reply return await self.device.reply( ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 483, in reply return await self.request( ^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 317, in request await send_request File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 833, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 868, in send_packet status, _ = await self._ezsp.sendUnicast( ^^^^^^^^^^^^^^^^^^^^^^ AttributeError: 'NoneType' object has no attribute 'sendUnicast' Logger: bellows.ezsp Exception running handler Logger: zigpy.application Zigbee channel 15 utilization is 80.38%! Logger: zigpy.appdb Discarding _save_attribute event Logger: zhaquirks Loaded custom quirks. Please contribute them to https://github.com/zigpy/zha-device-handlers Logger: homeassistant.components.sensor Platform zha does not generate unique IDs. ID 00:0d:6f:00:05:49:9d:53-1-2820-power_factor already exists - ignoring sensor.christmas_tree_lights_electrical_measurement_power_factor Logger: homeassistant.components.zha.core.cluster_handlers [0x74A3:1:0x0006]: async_initialize: all attempts have failed: [TimeoutError(), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running')] Also for fun, here's the full set of relevant HA Core logs that happened at that time 12:10:07 AM – (WARNING) Recorder - message first occurred at January 4, 2024 at 4:43:04 PM and shows up 158 times |
Updated to 2024.1.2 yesterday. Have not seen a full crash yet, but I am seeing the same behavior I reported above for the beta version: many frequent restarts - restarts generally succeed, but it is adding significant noise and flakiness to the system (e.g. at any given time an automation may not have access to the devices it needs to work, switches may randomly not register presses, etc), enough that I'm still leaving most of my Zigbee-based automations disabled. It would be great to see this fixed soon...
|
OK, grabbed some logs. This is for 2024.1.2 core, 2023.12.1 supervisor, and OS 11.3 on Home Assistant Yellow with Yellow Multi-PAN. I also have some images from the history of a template rule based on a zigbee device that gives a timeline. There seem to be two kinds of restarts: there is a "looping" restart after all, where it loops for about 5m, restarting a dozen times, then there is a pause, then a final restart that succeeds. Then there are "occasional" isolated restarts that happen a couple of times an hour. This log captured one of the "looping" restarts, and I'll see if I can capture an "occasional" restart later. |
Another log, with a couple of "occasional" restarts in it. Also starts out at the tail end of a "looping" restart. Also, to be clear: none of these are manual restarts. |
@puddly - curious if you need any more logs to help diagnose? I see a number of open issues all describing the same type of problems. Curiously, my midnight ZHA crashes stopped happening for a couple of nights, but then returned the last two nights in addition to a couple of crashes at random times of day. Each time I've needed to restart HA Core to get stable. |
Tossing a "me too" in this thread. Last stable version was 2023.11.x. Sometimes stuff goes offline and stays down for hours until I restart HA. Other times, devices go up/down/up/down multiple times per minute until I restart. This morning got even weirder--most (all? not sure) of my lights themselves started cycling on and off multiple times per second, all showing as "unavailable" in my dashboard. Restarting HA stopped it. Scared the heck out of my toddler. Didn't have any kind of debug logging switched on, but I couldn't find anything at all in Core/Supervisor/Host/Silicon Labs Multiprotocol logs. It's getting old having to restart HA every time I want to turn a lamp on. Happy to provide any kind of logs or do any kind of testing that could feasibly help. (Currently on 2024.1.2.) |
@mmccool In your last log, did you manually reload the integration at any point, reconfigure it via a config flow from the frontend, or anything of the sort? |
Nope. All those were automatic restarts/reloads. I just started debug logging, watched the system via the history to see what it was doing, and stopped logging when I saw enough reload events, but did not touch it otherwise. |
ZHA did crash last night and stayed down until I manually restarted. In fact I updated the OS to 2024.11.4, seeing if that would make a difference. It actually does seem a little more stable but is still glitching. |
i got same problem in my installation. attached diagnostic file from zha, ha log anche silicon labs log: config_entry-zha-84f66216a6b035b13594f4815dcb04f6.json.txt |
I've had enough. I received a new zigbee stick today and I'm migrating my 40 devices over to zigbee2mqtt. Hoping it will also address some other issues as well (contact sensors that send an open message immediately after they close).
|
OK, I followed the advice of @puddly and disabled Multi-PAN and am now using the "real" ZHA integration - and so far, it has worked! No crashes or restarts in the last 12h. So I think my issue was not, in fact, a ZHA issue per se but a Multi-PAN/Zigbee issue. Well, the instructions do say Multi-PAN is experimental... Next I'm going to try setting up a separate Sky Connect dongle for Thread (Matter). I will update the name of this issue to say "Zigbee" instead of ZHA. In my defense, "Multi-PAN" is not listed as a separate integration, but it does load different firmware. Comment: as a side effect of reverting to plain ZHA, it reflashed the firmware on the radio. At some point I was wondering if it was a hardware problem, but it's also possible that Multi-PAN was fine but the firmware got corrupted. However... I don't really feel like reinstalling Multi-PAN to test that. It would be nice if there was an option to reflash firmware, however. |
So disabling Multi-PAN has resolved my issue - my Zigbee is now stable. I have also installed a separate radio for Thread/Matter and that also seems to work. The conclusion is that my issue is now resolved, but Multi-PAN still has a problem which should probably be addressed eventually (Multi-PAN is experimental after all, although its reliability seems worse now than before). My personal recommendation is to use separate radios for Zigbee and Thread/Matter if you want reliability - or just use one or the other if you only have one radio. But it's simple to add a second radio using the SkyConnect USB kit. |
Thx for following up. |
FYI I was having a similar issue and unticking use beta in the matter server stopped it from crashing. Also as an alternative you could set up an automation for if a zigbee devices goes to unavailable, reload the ZHA integration |
I was also seeing several daily ZHA failures for a couple of weeks on my HA Yellow. In the end I disabled MultiProtocol and HA has been running stably for almost a week now. Notably without multiprotocol the network feels a lot more responsive. |
Same issue. I think mainly since the last update either 2024.1 or .2 fine after restart and guaranteed a few hours later zigbee decides are offline and sure enough the stupid bugger Yellow Multi-Pan is failing setup again. WTF! |
I turned off Multipan on my HA yellow two weeks ago and system has been stable since then. Also Zigbee devices seem to be more responsive. |
The problem
The Zigbee (
ZHAZigbee Multi-PAN) integration on my HA instance is failing, falling back into a "Failure to setup" state. Reloading it fails again, but then retries, entering an "Initialization" stage, after which it works... for a while, from a few minutes to an hour, before failing again. Occasionally the above does not work and a power cycle is needed to get it to reload.I am running on Home Assistant Yellow, and in particular am running "Yellow Multi-PAN".
NOTE: originally I posted this issue as a ZHA problem but the problem resolved after disabling Multi-PAN and using plain ZHA, so I think it is a problem specifically with Multi-PAN, not the "baseline" ZHA per se.
What version of Home Assistant Core has the issue?
core-2023.12.1
What was the last working version of Home Assistant Core?
core-2023.10
What type of installation are you running?
Home Assistant OS
Integration causing the issue
ZHA (Multi-PAN)
Link to integration documentation on our website
https://www.home-assistant.io/integrations/zha/
but with Multi-PAN enabled as per https://yellow.home-assistant.io/guides/enable-multiprotocol/
Diagnostics information
config_entry-zha-d179b6f823a03243c5e4b54db431cebc.json.txt
Example YAML snippet
No response
Anything in the logs that might be useful for us?
Additional information
I think the problems started with 2023.12.0 but I am not completely sure - it did have a tendency to crash before, but not nearly as often. Before 2023.12 it maybe happened once a week, which was annoying but semi-tolerable. Now it is basically unusable, and essentially breaks any automation that depends on a zigbee device.
I should also mention I have set up and used Matter devices but don't have any active right now. The symptoms do seem similar to #105344 and #105345 - but I have no Xbee devices.
The text was updated successfully, but these errors were encountered: