-
Notifications
You must be signed in to change notification settings - Fork 302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EZSP does not implement concurrency correctly #770
Comments
Your search results indicate that the EZSP adapter does not use the "concurrent" parameter... but this does not mean that the adapter does not work in concurrency mode. |
OK, so two things.
For example, in ezsp.js, you create an unbounded queue: However in this driver.js, you create a hard-coded bounded queue: So as I said, you don't have any bounded queues that are actually being used. Everything goes as fast as possible from what I can tell. |
On further examination, I do see this:
The adapter code seems to be using the driver's hard-coded queue ... though I'm not sure how the adapter code fits in with the driver code. But it could mean I'm wrong, and the queue IS used, but only via. the adapter code? Also, I'm not sure how the concurrency is supposed to work, but docs say that it's supposed to wait for a reply to free up a queue slot. Not sure if that's happening or if this is just doing fire and forget. If it's doing fire and forget (ie. not consuming that queue slot) then the queue does nothing. If it's waiting for the response to free up the queue slot, then cool. I don't know where that code is or how to check it. I'm mentioning this as I assume you know where to check that, and could do so in seconds to verify that works correctly. All I know is that, if I try to control too many zigbee devices at once (for example a bunch of blinds, lights, etc) then not all the intended actions take place. |
This seems inconsistent. The unicast command to send a frame to an endpoint doesn't use the queue at all: (group and all do): In other words.
Ideally, the driver.js itself should ensure that any API it provides uses the concurrency it provides, and not rely on external callers (ie. ezspAdapter.js) to invoke it's concurrency externally. But that's now how the code is currently designed. |
Yes, the queue created in the driver is used a little higher - in the adapter level. This is an architectural feature: previously there was a queue at the adapter level, but I removed it. I recently removed the use of a queue when sending messages to the device It manifested itself like this: the user pressed the brightness increase button 5 times, but the first message to the device had delivery problems and blocked other messages. After a few seconds, the timeout handler will fire and report an error in the first message. Then the remaining messages in the queue for that device timed out. As a result, no messages were delivered to the device. I don't understand your problem a little. There is parallelism when sending messages - present. Waiting for a response when sending messages - present. What do you want to influence? |
By removing the queue for the device (in #727), you have effectively allowed unlimited concurrency. The problem is not no parallelization, it's too much! Meaning that, rather than limiting it to 8 (the hard-coded queue limit) concurrent actions, if I tried to send a command to 80 devices, it would try and send all 80 at once - which will fail for most of them. I'm not worried so much about concurrency of messages sent to the SAME device, I'm more worried about concurrency across devices. Right now, because of it's unbounded nature, I have to carefully construct my scenes on Home Assistant, and manually add delays, or things won't work. For example, consider this scene.
And during the holidays, also:
This is 2 group messages (subject to the concurrency), and 15 unicast messages. So that's 4 group messages and 23 unicast messages. The group messages abide by concurrency limits, but as the unicast messages don't, they not only all try to send at the same time, the group messages ALSO won't get rate limited because the unicast messages don't use the rate limiter (queue). This essentially means 27 messages will alll go out at once. And a bunch of those will be dropped. So yeah,
I can't really do this myself, as the way zigbee2mqtt is deployed in my environment, it's in a docker container I can't really edit reliably. The container shuts down on z2m restart (meaning that I can't log into it and edit the files). Additionally, mounting in source code does not work (everything fails). Plus I can't specify a custom image to use, as I'm using TrueNAS for my Z2M, and it HELM deployments, which don't let me specify the image. Thanks. |
The TL:DR is, that I should not have to hand-craft my HA scenes to try and ensure I don't get dropped zigbee messages - that's what this concurrency is supposed to BE for. But right now, any significant zigbee scene with the EZSP driver will cause some of the actions to fail because too many messages are sent at once, and some of them are dropped. |
For group command (its broadcast in the 802.15.4) its one underlying limiting in the 802.15.4 that is protecting form broadcast storm that can killing the network. They is doing it be have max 8 broadcast in 9 seconds (its one FIFO with the broadcast table with timer). If trying sending more broadcast all router SHALL ignoring them silent. Also the old way is sending the broadcast 3 times for being sure its being OK broadcasted (counting as one FIFO position) but good Zigbee 3 routers is listening if if all neighbors is re sending it and if OK then is sending only 1 time or 2 (passive acking). If tuning this like Z2M have doing in the TI firmware so its possible sending more broadcast then its only the first level of routers getting the frame and they is not relaying it to the second level then its being blocked. By doing so its working if all router have direct RF contact with the coordinator but its braking the network completely and also the router discovery that is using broadcast is being broke or getting strange problems and then normal uni cast commands is not working well then the network cant finding devices in its network. If broadcast is working good in the network and have good routers source routing is also working great that is making uni cast working better but if not its better disabling source routing in the network then its being more stable at least with larger EZSP networks (advanced ZHA users experience with around 200 devices). The unicast problem i think you 2 is better digging in it but in the end its need broadcast working OK for working or the network cant finding its devices and its being stucked and blocking the transmitting in the network. |
@MattWestb I believe this issue is specifically about overwhelming the EZSP device with requests - specifically unicast requests, though I imagine it doesn't matter whether the request is unicast or broadcast. I'm not saying broadcast is not working correctly. But if you look at my situation, I have an HA scene which results in 23 unicast and 4 broadcast (group) messages. Right now, due to the lack of concurrency limiting in Z2M for the EZSP driver, it will attempt to send all of them back to back (as fast as it can), without any concurrency limiting. THAT will cause the dropped messages. I don't have any problem communicating to any group or individual node on it's own as a one-off action. |
Assuming this can be closed now since #771 is merged, thanks @kirovilya ! |
@corporategoth try the dev/edge version of z2m and the ability to configure the "concurrent" parameter. write whether it worked or not |
Based on simple greps (in zigbee-herdsman/adapter):
grep -ri utils_1.Queue *
grep -ri concurr *
The EZSP adapter does not implement any kind of concurrency. This means not only will the concurrency option in the config do nothing, but it will always try to send messages as fast as possible over the EZSP adapter.
This would be a large contributor to the issue I see of, when I am manipulating a lot of devices, some of them never get the memo to do the thing.
This could also possibly explain why some other crashes are happening with EZSP - if it's not implementing concurrency, then some messages may never get out to the network, which will result in timeouts (possibly causing the interview crash, crash on startup due to timeouts, and much more).
Concurrency should be implemented for EZSP like it is for other adapters.
The text was updated successfully, but these errors were encountered: