Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zigbee is failing constantly, requires reloading/reinitialization #105506

Closed
mmccool opened this issue Dec 11, 2023 · 81 comments
Closed

Zigbee is failing constantly, requires reloading/reinitialization #105506

mmccool opened this issue Dec 11, 2023 · 81 comments
Assignees

Comments

@mmccool
Copy link

mmccool commented Dec 11, 2023

The problem

The Zigbee (ZHA Zigbee Multi-PAN) integration on my HA instance is failing, falling back into a "Failure to setup" state. Reloading it fails again, but then retries, entering an "Initialization" stage, after which it works... for a while, from a few minutes to an hour, before failing again. Occasionally the above does not work and a power cycle is needed to get it to reload.

I am running on Home Assistant Yellow, and in particular am running "Yellow Multi-PAN".

NOTE: originally I posted this issue as a ZHA problem but the problem resolved after disabling Multi-PAN and using plain ZHA, so I think it is a problem specifically with Multi-PAN, not the "baseline" ZHA per se.

What version of Home Assistant Core has the issue?

core-2023.12.1

What was the last working version of Home Assistant Core?

core-2023.10

What type of installation are you running?

Home Assistant OS

Integration causing the issue

ZHA (Multi-PAN)

Link to integration documentation on our website

https://www.home-assistant.io/integrations/zha/
but with Multi-PAN enabled as per https://yellow.home-assistant.io/guides/enable-multiprotocol/

Diagnostics information

config_entry-zha-d179b6f823a03243c5e4b54db431cebc.json.txt

Example YAML snippet

No response

Anything in the logs that might be useful for us?

[0xDFC0:1:0x0001]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running')]
1:32:45 PM – (WARNING) Zigbee Home Automation - message first occurred at December 9, 2023 at 3:35:49 PM and shows up 126 times
[0x0898:1:0xff01] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 710, in send_packet raise ControllerError("ApplicationController is not running") bellows.exception.ControllerError: ApplicationController is not running The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 518, in write_attributes_safe res = await self.write_attributes(attributes, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 83, in wrapper with wrap_zigpy_exceptions(): File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__ self.gen.throw(typ, value, traceback) File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 75, in wrap_zigpy_exceptions raise HomeAssistantError(message) from exc homeassistant.exceptions.HomeAssistantError: Failed to send request: ApplicationController is not running
1:29:04 PM – (ERROR) runner.py - message first occurred at 1:27:34 PM and shows up 31 times
Discarding _save_attribute event
1:29:01 PM – (WARNING) components/zha/binary_sensor.py - message first occurred at 1:29:01 PM and shows up 19 times
[0x4C2A:1:0x0702]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
1:28:47 PM – (WARNING) Zigbee Home Automation - message first occurred at 1:28:47 PM and shows up 6 times
Unable to remove unknown dispatcher <bound method GroupProbe._reprobe_group of <homeassistant.components.zha.core.discovery.GroupProbe object at 0x7f62493690>>
1:28:46 PM – (WARNING) helpers/dispatcher.py - message first occurred at December 9, 2023 at 4:16:54 PM and shows up 12 times
Error setting up entry Yellow Multi-PAN for zha
December 9, 2023 at 4:36:40 PM – (ERROR) config_entries.py

Details from [0xDFC0:1:0x0001]:
Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/__init__.py:537
Integration: Zigbee Home Automation (documentation, issues)
First occurred: December 9, 2023 at 3:35:49 PM (126 occurrences)
Last logged: 1:32:45 PM

[0xDFC0:1:0x0001]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running')]
[0xDFC0:1:0x0500]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running')]
[0x9278:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]
[0x9278:1:0x0702]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]
[0x9278:1:0x0b04]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]

Details from [0x0898:1:0xff01]:
Logger: zigpy.zcl
Source: runner.py:188
First occurred: 1:27:34 PM (31 occurrences)
Last logged: 1:29:04 PM

[0x0898:1:0xff01] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 710, in send_packet raise ControllerError("ApplicationController is not running") bellows.exception.ControllerError: ApplicationController is not running The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 518, in write_attributes_safe res = await self.write_attributes(attributes, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 83, in wrapper with wrap_zigpy_exceptions(): File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__ self.gen.throw(typ, value, traceback) File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 75, in wrap_zigpy_exceptions raise HomeAssistantError(message) from exc homeassistant.exceptions.HomeAssistantError: Failed to send request: ApplicationController is not running
[0xB790:1:0xff01] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 710, in send_packet raise ControllerError("ApplicationController is not running") bellows.exception.ControllerError: ApplicationController is not running The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 518, in write_attributes_safe res = await self.write_attributes(attributes, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 83, in wrapper with wrap_zigpy_exceptions(): File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__ self.gen.throw(typ, value, traceback) File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 75, in wrap_zigpy_exceptions raise HomeAssistantError(message) from exc homeassistant.exceptions.HomeAssistantError: Failed to send request: ApplicationController is not running
[0x7426:1:0xff01] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 710, in send_packet raise ControllerError("ApplicationController is not running") bellows.exception.ControllerError: ApplicationController is not running The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 518, in write_attributes_safe res = await self.write_attributes(attributes, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 83, in wrapper with wrap_zigpy_exceptions(): File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__ self.gen.throw(typ, value, traceback) File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 75, in wrap_zigpy_exceptions raise HomeAssistantError(message) from exc homeassistant.exceptions.HomeAssistantError: Failed to send request: ApplicationController is not running
[0xC67C:1:0xff01] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request return await self.device.request( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 710, in send_packet raise ControllerError("ApplicationController is not running") bellows.exception.ControllerError: ApplicationController is not running The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 518, in write_attributes_safe res = await self.write_attributes(attributes, manufacturer=manufacturer) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 83, in wrapper with wrap_zigpy_exceptions(): File "/usr/local/lib/python3.11/contextlib.py", line 155, in __exit__ self.gen.throw(typ, value, traceback) File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 75, in wrap_zigpy_exceptions raise HomeAssistantError(message) from exc homeassistant.exceptions.HomeAssistantError: Failed to send request: ApplicationController is not running
[0xBC11:1:0x000a] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 411, in reply return await self._endpoint.reply( ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 278, in reply return await self.device.reply( ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 474, in reply return await self.request( ^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request await self._application.request( File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 767, in send_packet status, _ = await self._ezsp.sendUnicast( ^^^^^^^^^^^^^^^^^^^^^^ AttributeError: 'NoneType' object has no attribute 'sendUnicast'

Details from 
Logger: zigpy.appdb
Source: components/zha/binary_sensor.py:227
First occurred: 1:29:01 PM (19 occurrences)
Last logged: 1:29:01 PM

Discarding _save_attribute event

Details from 
Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/__init__.py:389
Integration: Zigbee Home Automation (documentation, issues)
First occurred: 1:28:47 PM (6 occurrences)
Last logged: 1:28:47 PM

[0x4C2A:1:0x0702]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
[0x4C2A:1:0x0b04]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
[0xE85E:1:0x0006]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
[0xE85E:1:0x0b04]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
[0xE85E:1:0x0702]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/core/helpers.py", line 338, in wrapper
    return await func(cluster_handler, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 389, in async_initialize
    await self._get_attributes(
  File "/usr/src/homeassistant/homeassistant/components/zha/core/cluster_handlers/__init__.py", line 491, in _get_attributes
    read, _ = await self.cluster.read_attributes(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 524, in read_attributes
    result = await self.read_attributes_raw(to_read, manufacturer=manufacturer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/__init__.py", line 377, in request
    return await self._endpoint.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 253, in request
    return await self.device.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 299, in request
    await self._application.request(
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 824, in request
    await self.send_packet(
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 767, in send_packet
    status, _ = await self._ezsp.sendUnicast(
                      ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'sendUnicast'

Detail from 
Logger: homeassistant.helpers.dispatcher
Source: helpers/dispatcher.py:59
First occurred: December 9, 2023 at 4:16:54 PM (12 occurrences)
Last logged: 1:28:46 PM

Unable to remove unknown dispatcher <bound method GroupProbe._reprobe_group of <homeassistant.components.zha.core.discovery.GroupProbe object at 0x7f62493690>>

Details from [0x4C2A:1:0x0702]
Logger: homeassistant.config_entries
Source: config_entries.py:402
First occurred: December 9, 2023 at 4:36:40 PM (1 occurrences)
Last logged: December 9, 2023 at 4:36:40 PM

Error setting up entry Yellow Multi-PAN for zha
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 402, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/__init__.py", line 167, in async_setup_entry
    zha_gateway = await ZHAGateway.async_from_config(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 192, in async_from_config
    await instance.async_initialize()
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 210, in async_initialize
    await app.startup(auto_form=True)
  File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 223, in startup
    await self.connect()
  File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 144, in connect
    await ezsp.startup_reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 111, in startup_reset
    await self._gw.wait_for_startup_reset()
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 170, in wait_for_startup_reset
    await self._startup_reset_future
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 1003, in _read_ready__data_received
    self._protocol.data_received(data)
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 76, in data_received
    self.frame_received(frame)
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 103, in frame_received
    self.data_frame_received(data)
  File "/usr/local/lib/python3.11/site-packages/bellows/uart.py", line 124, in data_frame_received
    self._application.frame_received(self._randomize(data[1:-3]))
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/__init__.py", line 305, in frame_received
    self._protocol(data)
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/protocol.py", line 83, in __call__
    sequence, frame_id, data = self._ezsp_frame_rx(data)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/v4/__init__.py", line 33, in _ezsp_frame_rx
    return data[0], data[2], data[3:]
           ~~~~^^^
IndexError: index out of range

Additional information

I think the problems started with 2023.12.0 but I am not completely sure - it did have a tendency to crash before, but not nearly as often. Before 2023.12 it maybe happened once a week, which was annoying but semi-tolerable. Now it is basically unusable, and essentially breaks any automation that depends on a zigbee device.

I should also mention I have set up and used Matter devices but don't have any active right now. The symptoms do seem similar to #105344 and #105345 - but I have no Xbee devices.

@home-assistant
Copy link

Hey there @dmulcahey, @Adminiuga, @puddly, @TheJulianJES, mind taking a look at this issue as it has been labeled with an integration (zha) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of zha can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign zha Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


zha documentation
zha source
(message by IssueLinks)

@puddly
Copy link
Contributor

puddly commented Dec 11, 2023

Would you be able to enable debug logging for the ZHA integration and upload a complete log of it crashing and failing to reload?

@mmccool
Copy link
Author

mmccool commented Dec 11, 2023

I did enable debug logging, and just reloaded it. Let me check if it's crashed yet... nope, still working. Will do so after it crashes. I had an older log file but I let it run too long and it became too big to upload, so just restarted debug logging again.

@mmccool
Copy link
Author

mmccool commented Dec 11, 2023

So as an additional observation I have seen errors about the NCP failing to respond, watchdog failure, etc. and then I see it enter an "Initialization" state on its own, but succeeds. I have seen this twice now. I will wait for an unrecoverable error before uploading the logs but the recoverable ones seem to be happening every 5-10m or so (I assuming this is probably the Watchdog restarting it).

@tangodelta73
Copy link

tangodelta73 commented Dec 11, 2023

Having the exact same problem! This was changed in 12.1 version: "Fix ZHA quirk ID custom entities matching all devices", could this be the problem?

@mmccool
Copy link
Author

mmccool commented Dec 11, 2023

Ugh, it finally crashed but the debug log was still too big to upload (over 400MB!). It seems it appended to my old log rather than start a new one, also. I am going to manually edit it just to show the last bit. I will keep the full log around, let me know if you want me to extract any other sections.

@mmccool
Copy link
Author

mmccool commented Dec 11, 2023

@puddly
Copy link
Contributor

puddly commented Dec 11, 2023

@mmccool In your log, did you manually reload the ZHA integration about five times? If so, can you describe why? Up until the point you reload it, the log looks normal.

@Letalis
Copy link

Letalis commented Dec 11, 2023

Screenshot_20231211-220948
error_log-2.txt
Encounters this from time to time.
Especially after a reboot.
Needs to disable and activate it . Then it comes to a "normal" state. But accepting ligthcontroll. Gives failed to send signal or something like that.
Will gather logs aswell

@pimeys
Copy link

pimeys commented Dec 11, 2023

I've also witnessed issues with ZHA integration after upgrading to 2023.12. I will upload logs next time when it happens. Usually it breaks after restarting HASS.

@puddly
Copy link
Contributor

puddly commented Dec 11, 2023

@Letalis I think you may have submitted your comment before your screenshot and log files actually finished uploading! Edit them into your original comment, the Uploading ... text will be replaced once they finish uploading.

@mmccool
Copy link
Author

mmccool commented Dec 11, 2023

For the log, I did NOT manually reload it. For each of those five instances it entered the "Initialization" state on its own and restarted on its own. It succeeded restarting/reloading several times on its own until finally this self-initiated loop failed. I'm guessing these restarts were initiated by a watchdog timer.

Also, I truncated the log. It did this a lot more than five times.

@marcroca81
Copy link

marcroca81 commented Dec 12, 2023

Here the same problem:
home-assistant_zha_2023-12-12T09-17-49.702Z.log

Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/init.py:537
Integration: Zigbee Home Automation (documentation, issues)
First occurred: 10:13:23 (2 occurrences)
Last logged: 10:13:23

[0x4D68:1:0x0500]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]
[0x4D68:1:0x0406]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]

@ckadluba
Copy link

ckadluba commented Dec 12, 2023

Is this the same problem as reported in issue #105445?

I also have the problem that all Zigbee and Matter devices are unavailable after upgrading HA Core to >= 2023.12.0. Zigbee, Matter and Thread integrations are permanently in a failure state.

@Citizen-2CB8A24A
Copy link

Citizen-2CB8A24A commented Dec 12, 2023

Same here. 66 devices / 443 entities via RaspBee II on HAOS 2023.12.1. After a few hours ZHA fails and does not recover by itself. Reload of ZHA helps instantly, but just for another few hours. Disabling ZHA's entity update polling doesn't help. CPU and memory is at bay. No WiFi on Zigbee channel. It was running flawlessly for months.

 Logger: zigpy.application
Source: components/zha/core/gateway.py:210
First occurred: 10. Dezember 2023 um 19:05:15 (10 occurrences)
Last logged: 03:46:53

    Zigbee channel 15 utilization is 81.18%!
    If you are having problems joining new devices, are missing sensor updates, or have issues keeping devices joined, ensure your coordinator is away from interference sources such as USB 3.0 devices, SSDs, WiFi routers, etc.
    Zigbee channel 15 utilization is 82.35%!
    Zigbee channel 15 utilization is 76.47%!
    Zigbee channel 15 utilization is 77.65%!
Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/__init__.py:537
Integration: Zigbee Home Automation (documentation, issues)
First occurred: 10. Dezember 2023 um 19:05:27 (528 occurrences)
Last logged: 13:13:59

[0xC543:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0xB202:3:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0x1C75:1:0x0b04]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.MAC_NO_ACK: 233>'), TimeoutError(), TimeoutError(), TimeoutError()]
[0x5E7F:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), TimeoutError(), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0x3E5F:1:0x0008]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), TimeoutError()]

In the desperate I've enabled Source Routing, which takes off some load from the mesh/coordinator.

zha:
  zigpy_config:
    source_routing: true

(configuration.yaml)

For now it seems to work.

I'll report, if that changes.

What about the 38400 baud limit (which is just 4,8 kbytes/s), we have on the serial side (RaspBee II)?
This sounds like a bottleneck to me, if the whole Zigbee protocol is processed by ZHA and might became more complex now. By using ZHA the bandwidth demand on the serial interface is completely different, in comparison of using the whole Zigbee stack on the RaspBee II chip itself (Deconz approach).

I've ordered a SMLIGHT SLZB-06M coordinator, maybe things are different with that type of hardware...

@psuurbach
Copy link

psuurbach commented Dec 12, 2023

+1 keeps getting offline and is not selfhealing, always worked until last update. Same logs as above. Now trying the option of [Citizen-2CB8A24A] Update : suggestion option did not work, only 24 devices 169 entities.

@mmccool
Copy link
Author

mmccool commented Dec 12, 2023

I should also say I have a fair number of Zigbee devices: 49 devices, 376 entities. However, not so sure it's a bottleneck issue as it worked ok until recently - unless the last update made it significantly less efficient. I'm also running on a "large" CM4 (8GB) so I don't think memory is a problem. Unlike several commentors on #105445, who seem to have a very similar problem, I'm using "mainstream/supported" HW: the HA Yellow platform. Also, this issue it does try to restart (so does not stay in "Initializing" state) but tends to fail and get stuck in a "Failed to set up" state (typically after passing through "Failed to set up, retrying"...). Reloading tends to fail once, then succeeds after one retry - it does not succeed immediately.

@derekgermano
Copy link

Just wanted to chime in... I was having problems with my SONOFF Zigbee 3.0 USB Dongle Plus on my X86 install. ZHA was failing every few hours after installing 2023.12.0. Restored my 2023.11.3 backup. Installed 2023.12.0 again... ZHA did the same thing again... Failing every few hours. Went to my 2023.11.3 backup again and ZHA has been solid on my system for days. I didn't download any logs because I thought it was something that I had done. Going to stay on 2023.11.3 for the time being.

@dmulcahey
Copy link
Contributor

Just wanted to chime in... I was having problems with my SONOFF Zigbee 3.0 USB Dongle Plus on my X86 install. ZHA was failing every few hours after installing 2023.12.0. Restored my 2023.11.3 backup. Installed 2023.12.0 again... ZHA did the same thing again... Failing every few hours. Went to my 2023.11.3 backup again and ZHA has been solid on my system for days. I didn't download any logs because I thought it was something that I had done. Going to stay on 2023.11.3 for the time being.

If you can flip back for a short while and capture logs it would be greatly appreciated

@Citizen-2CB8A24A
Copy link

For now it seems to work.

I'll report, if that changes.

Well, ZHA failed again this morning. Source Routing doesn't help.
After ZHA integration reload things went back to normal again instantly.

Logger: zigpy.application
Source: components/zha/core/gateway.py:210
First occurred: 07:05:51 (2 occurrences)
Last logged: 07:05:51

Zigbee channel 15 utilization is 77.65%!
If you are having problems joining new devices, are missing sensor updates, or have issues keeping devices joined, ensure your coordinator is away from interference sources such as USB 3.0 devices, SSDs, WiFi routers, etc.
Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/__init__.py:537
Integration: Zigbee Home Automation (documentation, issues)
First occurred: 12. Dezember 2023 um 23:31:06 (51 occurrences)
Last logged: 07:06:13

[0xA305:1:0x0702]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0x67CE:2:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0xA305:1:0x0b04]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>')]
[0xC543:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), TimeoutError()]
[0x2E0A:3:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), DeliveryError('Failed to deliver packet: <TXStatus.NWK_ROUTE_DISCOVERY_FAILED: 208>'), TimeoutError()]

@Citizen-2CB8A24A
Copy link

A preventive reload would be an option (as a workaround). Is there a way to automate a reload of the integration every hour?

@markaggar
Copy link

Same thing has happened a few times for me, and just now. It failed and reinitialized a few times in quick succession before I manually intervened. I manually reloaded the integration and it coincidentally stopped failing.

@Letalis
Copy link

Letalis commented Dec 13, 2023

A preventive reload would be an option (as a workaround). Is there a way to automate a reload of the integration every hour?

If I do a preventive reload or reboot of the HA Blue I have a living hell for a couple of minutes to get it stable again.

@Citizen-2CB8A24A
Copy link

A preventive reload would be an option (as a workaround). Is there a way to automate a reload of the integration every hour?

If I do a preventive reload or reboot of the HA Blue I have a living hell for a couple of minutes to get it stable again.

It's not about reloading the box ("reboot"), it's just about reloading the integration. That happens very fast and might work for another hour without failure.

If the integration fails, things are broken anyways. It cannot get worse. Recurring manual intervention cannot be the solution for the meantime.

@puddly
Copy link
Contributor

puddly commented Dec 13, 2023

Please include debug logs in your comments that show ZHA running from startup, crashing, and not reloading. It'll really speed up figuring out the root cause of this problem.

@dmulcahey
Copy link
Contributor

Please try the current beta

@BooBoss
Copy link

BooBoss commented Dec 31, 2023

Please try the current beta

I assume you mean current beta of Skyconnect firmware?

@TheJulianJES
Copy link
Member

@BooBoss
Copy link

BooBoss commented Dec 31, 2023 via email

@mmccool
Copy link
Author

mmccool commented Jan 1, 2024

Just installed 2024.1.b3, no joy. Now ZHA does not start at all, just loops between “Initializing”and “Failed to restart, will retry”. Will grab logs when I can.

@BooBoss
Copy link

BooBoss commented Jan 2, 2024

In my case installing HA 2024.1.b3 actually helped. Works without devices going offline for over 24h. However I have problem with one Zigbee thermometer. Since I updated my SkyConnect firmware as a process of troubleshooting this issue one thermometer looses signal and stops reporting. After restart (removing battery) it works for a while. When I move it closer it of course works all the time but thing is this thermometer was where it was for months and there was no issues with signal strengh. Now after updating firmware it kinda looks like that the signal strengh is weaker somehow?!?!?

Note that thermometer doesn't connect to gateway directly but rather through smart socket as signal repeater. So it's also not like SkyConnect itself signal strengh is weaker.

@markaggar
Copy link

markaggar commented Jan 5, 2024

Since installing 2024.1.0, ZHA is failing every day at 00:01 (1 minute past midnight). So here I am at 00:15 uploading a log. :-/. Unfortunately logging wasn't turned on during the first failure, but I turned it on while it was cycling between initializing/ok/failed and turned off logging after a few cycles. I seem to have been able to get it to stop cycling by clicking the UI to add a device.

https://1drv.ms/u/s!AnaRLop754fh0_4OT8fXjmfEJzPMBg?e=8BVFVm

Here's some specific messages that might help:

Logger: bellows.ezsp
Source: runner.py:188
First occurred: 12:00:37 AM (17 occurrences)
Last logged: 12:00:37 AM

NCP entered failed state. Requesting APP controller restart

Logger: bellows.zigbee.application
Source: runner.py:188
First occurred: 12:00:37 AM (1 occurrences)
Last logged: 12:00:37 AM

Watchdog heartbeat timeout: EzspError('EZSP is not running')

Logger: homeassistant.helpers.dispatcher
Source: helpers/dispatcher.py:59
First occurred: 12:01:00 AM (4 occurrences)
Last logged: 12:01:00 AM

Unable to remove unknown dispatcher <bound method GroupProbe._reprobe_group of <homeassistant.components.zha.core.discovery.GroupProbe object at 0x7fc3b4adfa50>>

Logger: homeassistant.config_entries
Source: config_entries.py:561
First occurred: 12:01:00 AM (1 occurrences)
Last logged: 12:01:00 AM

Error unloading entry HubZ Smart Home Controller - HubZ ZigBee Com Port, s/n: C1300B08 - Silicon Labs for sensor
Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/config_entries.py", line 561, in async_unload
result = await component.async_unload_entry(hass, self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/components/sensor/init.py", line 152, in async_unload_entry
return await component.async_unload_entry(entry)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 205, in async_unload_entry
await platform.async_reset()
File "/usr/src/homeassistant/homeassistant/helpers/entity_platform.py", line 766, in async_reset
await asyncio.gather(*tasks)
File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 1297, in async_remove
await self.__async_remove_impl(force_remove)
File "/usr/src/homeassistant/homeassistant/helpers/entity.py", line 1310, in __async_remove_impl
await self.async_will_remove_from_hass()
File "/usr/src/homeassistant/homeassistant/components/zha/sensor.py", line 219, in async_will_remove_from_hass
assert self._cancel_refresh_handle
AssertionError

Logger: zigpy.zcl
Source: runner.py:188
First occurred: 12:02:08 AM (4 occurrences)
Last logged: 12:02:08 AM

[0xC4BA:1:0x0019] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/clusters/general.py", line 2187, in _handle_cluster_request await self._handle_query_next_image( File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/clusters/general.py", line 2259, in _handle_query_next_image await self.query_next_image_response( File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/init.py", line 411, in reply return await self._endpoint.reply( ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 278, in reply return await self.device.reply( ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 483, in reply return await self.request( ^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 317, in request await send_request File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 833, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 868, in send_packet status, _ = await self._ezsp.sendUnicast( ^^^^^^^^^^^^^^^^^^^^^^ AttributeError: 'NoneType' object has no attribute 'sendUnicast'
[0xE49B:1:0x0019] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/clusters/general.py", line 2187, in _handle_cluster_request await self._handle_query_next_image( File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/clusters/general.py", line 2259, in _handle_query_next_image await self.query_next_image_response( File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/init.py", line 411, in reply return await self._endpoint.reply( ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 278, in reply return await self.device.reply( ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 483, in reply return await self.request( ^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 317, in request await send_request File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 833, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 868, in send_packet status, _ = await self._ezsp.sendUnicast( ^^^^^^^^^^^^^^^^^^^^^^ AttributeError: 'NoneType' object has no attribute 'sendUnicast'

Logger: bellows.ezsp
Source: /usr/local/lib/python3.11/site-packages/bellows/ezsp/init.py:493
First occurred: 12:08:06 AM (2 occurrences)
Last logged: 12:08:07 AM

Exception running handler
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/init.py", line 491, in handle_callback
handler(*args)
File "/usr/local/lib/python3.11/site-packages/bellows/ezsp/init.py", line 226, in cb
fut.set_result(response)
asyncio.exceptions.InvalidStateError: invalid state

Logger: zigpy.application
Source: components/zha/core/gateway.py:211
First occurred: 12:02:29 AM (8 occurrences)
Last logged: 12:08:48 AM

Zigbee channel 15 utilization is 80.38%!
If you are having problems joining new devices, are missing sensor updates, or have issues keeping devices joined, ensure your coordinator is away from interference sources such as USB 3.0 devices, SSDs, WiFi routers, etc.
Zigbee channel 15 utilization is 82.35%!
Zigbee channel 15 utilization is 88.70%!

Logger: zigpy.appdb
Source: components/zha/binary_sensor.py:227
First occurred: 12:01:59 AM (16 occurrences)
Last logged: 12:09:02 AM

Discarding _save_attribute event

Logger: zhaquirks
Source: components/zha/init.py:128
First occurred: 12:01:12 AM (17 occurrences)
Last logged: 12:09:05 AM

Loaded custom quirks. Please contribute them to https://github.com/zigpy/zha-device-handlers

Logger: homeassistant.components.sensor
Source: helpers/entity_platform.py:610
Integration: Sensor (documentation, issues)
First occurred: 12:01:59 AM (595 occurrences)
Last logged: 12:09:17 AM

Platform zha does not generate unique IDs. ID 00:0d:6f:00:05:49:9d:53-1-2820-power_factor already exists - ignoring sensor.christmas_tree_lights_electrical_measurement_power_factor
Platform zha does not generate unique IDs. ID 00:0d:6f:00:0a:79:23:74-1-2820-rms_current already exists - ignoring sensor.master_bath_bidet_electricalmeasurementrmscurrent
Platform zha does not generate unique IDs. ID 00:0d:6f:00:0a:79:23:74-1-2820-rms_voltage already exists - ignoring sensor.master_bath_bidet_electricalmeasurementrmsvoltage
Platform zha does not generate unique IDs. ID 00:0d:6f:00:0a:79:23:74-1-2820-ac_frequency already exists - ignoring sensor.master_bath_bidet_electricalmeasurementfrequency
Platform zha does not generate unique IDs. ID 00:0d:6f:00:0a:79:23:74-1-2820-power_factor already exists - ignoring sensor.master_bath_bidet_electricalmeasurementpowerfactor

Logger: homeassistant.components.zha.core.cluster_handlers
Source: components/zha/core/cluster_handlers/init.py:537
Integration: Zigbee Home Automation
First occurred: 12:02:39 AM (13 occurrences)
Last logged: 12:09:43 AM

[0x74A3:1:0x0006]: async_initialize: all attempts have failed: [TimeoutError(), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running')]
[0x030F:1:0x0006]: async_initialize: all attempts have failed: [TimeoutError(), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running')]
[0x0377:1:0x0001]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]
[0x0377:1:0x0006]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]
[0x0377:1:0x0b04]: async_initialize: all attempts have failed: [DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>'), DeliveryError('Failed to deliver message: <EmberStatus.DELIVERY_FAILED: 102>')]

Also for fun, here's the full set of relevant HA Core logs that happened at that time

12:10:07 AM – (WARNING) Recorder - message first occurred at January 4, 2024 at 4:43:04 PM and shows up 158 times
Error doing job: Task exception was never retrieved
12:09:49 AM – (ERROR) core.py - message first occurred at January 4, 2024 at 4:43:54 PM and shows up 117 times
[0x74A3:1:0x0006]: async_initialize: all attempts have failed: [TimeoutError(), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running'), ControllerError('ApplicationController is not running')]
12:09:43 AM – (WARNING) Zigbee Home Automation - message first occurred at 12:02:39 AM and shows up 13 times
Failed to set state for sensor.sensors_unavailable, fall back to unknown
12:09:19 AM – (ERROR) helpers/entity.py - message first occurred at 12:01:12 AM and shows up 204 times
.Refresh Front Door Lock: Already running
12:09:17 AM – (WARNING) Automation - message first occurred at January 4, 2024 at 4:47:43 PM and shows up 19 times
Platform zha does not generate unique IDs. ID 00:0d:6f:00:05:49:9d:53-1-2820-power_factor already exists - ignoring sensor.christmas_tree_lights_electrical_measurement_power_factor
12:09:05 AM – (WARNING) helpers/script.py - message first occurred at 12:01:12 AM and shows up 70 times
Loaded custom quirks. Please contribute them to https://github.com/zigpy/zha-device-handlers
12:09:05 AM – (WARNING) components/zha/init.py - message first occurred at 12:01:12 AM and shows up 17 times
12:09:05 AM – (WARNING) Group - message first occurred at 12:01:00 AM and shows up 34 times
[0xDA8F:1:0x0405]: 'async_initialize' stage failed: 'NoneType' object has no attribute 'sendUnicast'
12:09:04 AM – (WARNING) Zigbee Home Automation - message first occurred at 12:02:08 AM and shows up 500 times
Discarding _save_attribute event
12:09:02 AM – (WARNING) components/zha/binary_sensor.py - message first occurred at 12:01:59 AM and shows up 16 times
Zigbee channel 15 utilization is 80.38%!
12:08:48 AM – (WARNING) components/zha/core/gateway.py - message first occurred at 12:02:29 AM and shows up 8 times
Exception running handler
12:08:07 AM – (ERROR) /usr/local/lib/python3.11/site-packages/bellows/ezsp/init.py - message first occurred at 12:08:06 AM and shows up 2 times
Update of media_player.tv_family_room is taking over 10 seconds
12:07:47 AM – (WARNING) helpers/entity.py - message first occurred at January 4, 2024 at 4:30:57 PM and shows up 20 times
Timeout sending report to Alexa for light.washing_machine_outlet
12:03:03 AM – (ERROR) Automation - message first occurred at 12:03:03 AM and shows up 2 times
[0xC4BA:1:0x0019] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/clusters/general.py", line 2187, in _handle_cluster_request await self._handle_query_next_image( File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/clusters/general.py", line 2259, in _handle_query_next_image await self.query_next_image_response( File "/usr/local/lib/python3.11/site-packages/zigpy/zcl/init.py", line 411, in reply return await self._endpoint.reply( ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/endpoint.py", line 278, in reply return await self.device.reply( ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 483, in reply return await self.request( ^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/zigpy/device.py", line 317, in request await send_request File "/usr/local/lib/python3.11/site-packages/zigpy/application.py", line 833, in request await self.send_packet( File "/usr/local/lib/python3.11/site-packages/bellows/zigbee/application.py", line 868, in send_packet status, _ = await self._ezsp.sendUnicast( ^^^^^^^^^^^^^^^^^^^^^^ AttributeError: 'NoneType' object has no attribute 'sendUnicast'
12:02:08 AM – (ERROR) runner.py - message first occurred at 12:02:08 AM and shows up 4 times
Updating roborock camera took longer than the scheduled update interval 0:00:05
12:01:12 AM – (WARNING) Camera
Connection closed: Cannot write to closing transport
12:01:12 AM – (WARNING) runner.py - message first occurred at January 4, 2024 at 4:49:02 PM and shows up 2 times
Error unloading entry HubZ Smart Home Controller - HubZ ZigBee Com Port, s/n: C1300B08 - Silicon Labs for sensor
12:01:00 AM – (ERROR) config_entries.py
Unable to remove unknown dispatcher <bound method GroupProbe._reprobe_group of <homeassistant.components.zha.core.discovery.GroupProbe object at 0x7fc3b4adfa50>>
12:01:00 AM – (WARNING) helpers/dispatcher.py - message first occurred at 12:01:00 AM and shows up 4 times
Watchdog heartbeat timeout: EzspError('EZSP is not running')
12:00:37 AM – (WARNING) runner.py
NCP entered failed state. Requesting APP controller restart
12:00:37 AM – (ERROR) runner.py - message first occurred at 12:00:37 AM and shows up 17 times
[Onkyo Family Room(10.0.0.183):8009] Heartbeat timeout, resetting connection
12:00:37 AM – (WARNING) /usr/local/lib/python3.11/site-packages/pychromecast/socket_client.py

@mmccool
Copy link
Author

mmccool commented Jan 8, 2024

Updated to 2024.1.2 yesterday. Have not seen a full crash yet, but I am seeing the same behavior I reported above for the beta version: many frequent restarts - restarts generally succeed, but it is adding significant noise and flakiness to the system (e.g. at any given time an automation may not have access to the devices it needs to work, switches may randomly not register presses, etc), enough that I'm still leaving most of my Zigbee-based automations disabled. It would be great to see this fixed soon...

On the other hand it doesn't seem to be looping the restart anymore, it just restarts frequently and randomly.

@mmccool
Copy link
Author

mmccool commented Jan 8, 2024

OK, grabbed some logs. This is for 2024.1.2 core, 2023.12.1 supervisor, and OS 11.3 on Home Assistant Yellow with Yellow Multi-PAN.

I also have some images from the history of a template rule based on a zigbee device that gives a timeline. There seem to be two kinds of restarts: there is a "looping" restart after all, where it loops for about 5m, restarting a dozen times, then there is a pause, then a final restart that succeeds. Then there are "occasional" isolated restarts that happen a couple of times an hour. This log captured one of the "looping" restarts, and I'll see if I can capture an "occasional" restart later.
home-assistant_zha_2024-01-08T17-02-05.169Z.zip
Screenshot 2024-01-08 131306
Screenshot 2024-01-08 131013

@mmccool
Copy link
Author

mmccool commented Jan 8, 2024

Another log, with a couple of "occasional" restarts in it. Also starts out at the tail end of a "looping" restart. Also, to be clear: none of these are manual restarts.
home-assistant_zha_2024-01-08T18-37-36.262Z.zip

@markaggar
Copy link

@puddly - curious if you need any more logs to help diagnose? I see a number of open issues all describing the same type of problems. Curiously, my midnight ZHA crashes stopped happening for a couple of nights, but then returned the last two nights in addition to a couple of crashes at random times of day. Each time I've needed to restart HA Core to get stable.

@dselzle
Copy link

dselzle commented Jan 12, 2024

Tossing a "me too" in this thread. Last stable version was 2023.11.x. Sometimes stuff goes offline and stays down for hours until I restart HA. Other times, devices go up/down/up/down multiple times per minute until I restart. This morning got even weirder--most (all? not sure) of my lights themselves started cycling on and off multiple times per second, all showing as "unavailable" in my dashboard. Restarting HA stopped it. Scared the heck out of my toddler. Didn't have any kind of debug logging switched on, but I couldn't find anything at all in Core/Supervisor/Host/Silicon Labs Multiprotocol logs. It's getting old having to restart HA every time I want to turn a lamp on. Happy to provide any kind of logs or do any kind of testing that could feasibly help. (Currently on 2024.1.2.)

@puddly
Copy link
Contributor

puddly commented Jan 12, 2024

@mmccool In your last log, did you manually reload the integration at any point, reconfigure it via a config flow from the frontend, or anything of the sort?

@mmccool
Copy link
Author

mmccool commented Jan 12, 2024

@mmccool In your last log, did you manually reload the integration at any point, reconfigure it via a config flow from the frontend, or anything of the sort?

Nope. All those were automatic restarts/reloads. I just started debug logging, watched the system via the history to see what it was doing, and stopped logging when I saw enough reload events, but did not touch it otherwise.

@mmccool
Copy link
Author

mmccool commented Jan 12, 2024

ZHA did crash last night and stayed down until I manually restarted. In fact I updated the OS to 2024.11.4, seeing if that would make a difference. It actually does seem a little more stable but is still glitching.

@rcremasco
Copy link

i got same problem in my installation.
zigbee home automation integration is not stable and restart frequently (more when adding devices)

attached diagnostic file from zha, ha log anche silicon labs log:

config_entry-zha-84f66216a6b035b13594f4815dcb04f6.json.txt
silicon labs multiprocotol.log
home-assistant_zha_2024-01-15T06-16-25.843Z.log

@markaggar
Copy link

markaggar commented Jan 15, 2024 via email

@mmccool
Copy link
Author

mmccool commented Jan 17, 2024

OK, I followed the advice of @puddly and disabled Multi-PAN and am now using the "real" ZHA integration - and so far, it has worked! No crashes or restarts in the last 12h. So I think my issue was not, in fact, a ZHA issue per se but a Multi-PAN/Zigbee issue.

Well, the instructions do say Multi-PAN is experimental...

Next I'm going to try setting up a separate Sky Connect dongle for Thread (Matter).

I will update the name of this issue to say "Zigbee" instead of ZHA. In my defense, "Multi-PAN" is not listed as a separate integration, but it does load different firmware.

Comment: as a side effect of reverting to plain ZHA, it reflashed the firmware on the radio. At some point I was wondering if it was a hardware problem, but it's also possible that Multi-PAN was fine but the firmware got corrupted. However... I don't really feel like reinstalling Multi-PAN to test that. It would be nice if there was an option to reflash firmware, however.

@mmccool mmccool changed the title ZHA is failing constantly, requires reloading/reinitialization Zigbee is failing constantly, requires reloading/reinitialization Jan 17, 2024
@mmccool
Copy link
Author

mmccool commented Jan 24, 2024

So disabling Multi-PAN has resolved my issue - my Zigbee is now stable. I have also installed a separate radio for Thread/Matter and that also seems to work.

The conclusion is that my issue is now resolved, but Multi-PAN still has a problem which should probably be addressed eventually (Multi-PAN is experimental after all, although its reliability seems worse now than before).

My personal recommendation is to use separate radios for Zigbee and Thread/Matter if you want reliability - or just use one or the other if you only have one radio. But it's simple to add a second radio using the SkyConnect USB kit.

@dmulcahey
Copy link
Contributor

So disabling Multi-PAN has resolved my issue - my Zigbee is now stable. I have also installed a separate radio for Thread/Matter and that also seems to work.

The conclusion is that my issue is now resolved, but Multi-PAN still has a problem which should probably be addressed eventually (Multi-PAN is experimental after all, although its reliability seems worse now than before).

My personal recommendation is to use separate radios for Zigbee and Thread/Matter if you want reliability - or just use one or the other if you only have one radio. But it's simple to add a second radio using the SkyConnect USB kit.

Thx for following up.

@Shadoxity
Copy link

FYI I was having a similar issue and unticking use beta in the matter server stopped it from crashing.

Also as an alternative you could set up an automation for if a zigbee devices goes to unavailable, reload the ZHA integration

@stevegroom
Copy link

I was also seeing several daily ZHA failures for a couple of weeks on my HA Yellow. In the end I disabled MultiProtocol and HA has been running stably for almost a week now. Notably without multiprotocol the network feels a lot more responsive.

@hesspoint
Copy link

Same issue. I think mainly since the last update either 2024.1 or .2 fine after restart and guaranteed a few hours later zigbee decides are offline and sure enough the stupid bugger Yellow Multi-Pan is failing setup again. WTF!

@stevegroom
Copy link

I turned off Multipan on my HA yellow two weeks ago and system has been stable since then. Also Zigbee devices seem to be more responsive.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests