Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EZSP does not implement concurrency correctly #770

Closed
corporategoth opened this issue Oct 10, 2023 · 11 comments
Closed

EZSP does not implement concurrency correctly #770

corporategoth opened this issue Oct 10, 2023 · 11 comments

Comments

@corporategoth
Copy link

Based on simple greps (in zigbee-herdsman/adapter):
grep -ri utils_1.Queue *

deconz/adapter/deconzAdapter.js:        this.queue = new utils_1.Queue(concurrent);
ezsp/driver/ezsp.js:        this.queue = new utils_1.Queue();
z-stack/znp/znp.js:        this.queue = new utils_1.Queue();
z-stack/adapter/zStackAdapter.js:        this.queue = new utils_1.Queue(concurrent);
zigate/driver/zigate.js:        this.queue = new utils_1.Queue(1);
zigate/adapter/zigateAdapter.js:        this.queue = new utils_1.Queue(concurrent);

grep -ri concurr *

deconz/adapter/deconzAdapter.js:        const concurrent = this.adapterOptions && this.adapterOptions.concurrent ?
deconz/adapter/deconzAdapter.js:            this.adapterOptions.concurrent : 2;
deconz/adapter/deconzAdapter.js:        this.queue = new utils_1.Queue(concurrent);
tstype.d.ts:    concurrent?: number;
z-stack/adapter/zStackAdapter.js:        const concurrent = this.adapterOptions && this.adapterOptions.concurrent ?
z-stack/adapter/zStackAdapter.js:            this.adapterOptions.concurrent :
z-stack/adapter/zStackAdapter.js:        debug(`Adapter concurrent: ${concurrent}`);
z-stack/adapter/zStackAdapter.js:        this.queue = new utils_1.Queue(concurrent);
zigate/adapter/zigateAdapter.js:        const concurrent = this.adapterOptions && this.adapterOptions.concurrent ?
zigate/adapter/zigateAdapter.js:            this.adapterOptions.concurrent : 2;
zigate/adapter/zigateAdapter.js:        debug.log(`Adapter concurrent: ${concurrent}`);
zigate/adapter/zigateAdapter.js:        this.queue = new utils_1.Queue(concurrent);

The EZSP adapter does not implement any kind of concurrency. This means not only will the concurrency option in the config do nothing, but it will always try to send messages as fast as possible over the EZSP adapter.

This would be a large contributor to the issue I see of, when I am manipulating a lot of devices, some of them never get the memo to do the thing.
This could also possibly explain why some other crashes are happening with EZSP - if it's not implementing concurrency, then some messages may never get out to the network, which will result in timeouts (possibly causing the interview crash, crash on startup due to timeouts, and much more).

Concurrency should be implemented for EZSP like it is for other adapters.

@kirovilya
Copy link
Contributor

Your search results indicate that the EZSP adapter does not use the "concurrent" parameter... but this does not mean that the adapter does not work in concurrency mode.
For example https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/driver.ts#L98C19

@corporategoth
Copy link
Author

OK, so two things.

  1. the concurrency SHOULD be able to be changed, as I said, I have a problem on my network where messages get dropped, so I need to reduce the concurrency. So this ticket is still valid.

  2. and more importantly, I see you constructing that queue, as you say - but not actually using it.

For example, in ezsp.js, you create an unbounded queue:
https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/ezsp.ts#L255
And then use it:
https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/ezsp.ts#L553

However in this driver.js, you create a hard-coded bounded queue:
https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/ezsp.ts#L553
But it's never used, everything just directly calls through to ezsp to execute, eg.
https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/driver.ts#L497

So as I said, you don't have any bounded queues that are actually being used. Everything goes as fast as possible from what I can tell.
I might be wrong, JS is not my primary language, and I acknowledge I suck at JS. But from my view, even the single bounded queue you create is unused, and where you DO use a queue, it's unbounded (and not configurable).

@corporategoth
Copy link
Author

On further examination, I do see this:

adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        debug('sendZclFrameToEndpointInternal %s:%i/%i (%i,%i,%i)', ieeeAddr, networkAddress, endpoint, responseAttempt, dataRequestAttempt, this.driver.queue.count());
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {

The adapter code seems to be using the driver's hard-coded queue ... though I'm not sure how the adapter code fits in with the driver code. But it could mean I'm wrong, and the queue IS used, but only via. the adapter code?
It still should be configurable though, so I can tweak it so I don't get dropped signals.

Also, I'm not sure how the concurrency is supposed to work, but docs say that it's supposed to wait for a reply to free up a queue slot. Not sure if that's happening or if this is just doing fire and forget. If it's doing fire and forget (ie. not consuming that queue slot) then the queue does nothing. If it's waiting for the response to free up the queue slot, then cool. I don't know where that code is or how to check it.

I'm mentioning this as I assume you know where to check that, and could do so in seconds to verify that works correctly. All I know is that, if I try to control too many zigbee devices at once (for example a bunch of blinds, lights, etc) then not all the intended actions take place.

@corporategoth
Copy link
Author

This seems inconsistent. The unicast command to send a frame to an endpoint doesn't use the queue at all:
https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/adapter/ezspAdapter.ts#L457

(group and all do):
https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/adapter/ezspAdapter.ts#L486
https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/adapter/ezspAdapter.ts#L504

In other words.

  1. Unicast calls don't use a queue at all.
  2. The queue inside the EZSP class is unbounded, meaning effectively there is no queue at all (it doesn't really do much, other than turn the sync? serial call into an async one by introducing a timeout).
  3. This means anything invoking execCommand from the EZSP class, or via. the Driver that doesn't use the driver's queue will have no concurrency protection at all
  4. There are a lot of execCommands originating from the EZSP class itself, NOT via. the driver (ie. that have 0 concurrency):
driver/ezsp.js:        const result = await this.execCommand("version", { desiredProtocolVersion: version });
driver/ezsp.js:            await this.execCommand("version", { desiredProtocolVersion: result.protocolVersion });
driver/ezsp.js:        const result = await this.execCommand("networkInit");
driver/ezsp.js:        const result = await this.execCommand("leaveNetwork");
driver/ezsp.js:        const ret = await this.execCommand('setConfigurationValue', { configId: configId, value: value });
driver/ezsp.js:        const ret = await this.execCommand('getConfigurationValue', { configId: configId });
driver/ezsp.js:        const ret = await this.execCommand('getMulticastTableEntry', { index: index });
driver/ezsp.js:        const ret = await this.execCommand('setMulticastTableEntry', { index: index, value: entry });
driver/ezsp.js:        const ret = await this.execCommand('setInitialSecurityState', { state: entry });
driver/ezsp.js:        const ret = await this.execCommand('getCurrentSecurityState');
driver/ezsp.js:        const ret = await this.execCommand('setValue', { valueId, value });
driver/ezsp.js:        const ret = await this.execCommand('getValue', { valueId });
driver/ezsp.js:        const ret = await this.execCommand('setPolicy', { policyId: policyId, decisionId: value });
driver/ezsp.js:    async execCommand(name, params = null) {
driver/ezsp.js:        const v = await this.execCommand("formNetwork", { parameters: params });
driver/ezsp.js:        return this.execCommand('sendUnicast', {
driver/ezsp.js:        return this.execCommand('sendMulticast', {
driver/ezsp.js:        const res = await this.execCommand('setConcentrator', {
driver/ezsp.js:            await this.execCommand('setSourceRouteDiscoveryMode', { mode: 1 });
driver/ezsp.js:        return this.execCommand('sendBroadcast', {
driver/ezsp.js:            await this.execCommand('nop');
  1. Since the driver doesn't use it's own concurrency queue, and it's reliant on the caller to do so, anything that doesn't use the concurrency queue itself will have no concurrency protection.

Ideally, the driver.js itself should ensure that any API it provides uses the concurrency it provides, and not rely on external callers (ie. ezspAdapter.js) to invoke it's concurrency externally. But that's now how the code is currently designed.
And the omission of the Unicast messages to use the driver's concurrency means that any unicast message is not concurrency protected. Ideally, if the driver enforced concurrency instead of relying on the caller to use it, this could not have happened.

@kirovilya
Copy link
Contributor

Yes, the queue created in the driver is used a little higher - in the adapter level. This is an architectural feature: previously there was a queue at the adapter level, but I removed it.
Technically, the driver queue should be in the adapter, but I didn't move it.
Yes, currently the queue is not used when sending a message directly to the device.

I recently removed the use of a queue when sending messages to the device
#727
This removed the delay and waiting for a message to be sent to one device. It became possible to send up to 8 messages in parallel and wait for their execution.

It manifested itself like this: the user pressed the brightness increase button 5 times, but the first message to the device had delivery problems and blocked other messages. After a few seconds, the timeout handler will fire and report an error in the first message. Then the remaining messages in the queue for that device timed out. As a result, no messages were delivered to the device.
After the queue is deleted (in the PR link above), all messages are sent. Even if the first message does not reach the device, it will be able to receive the remaining messages. As a result, a timeout error will only occur for those messages that did not reach the device, and not for all of them.

I don't understand your problem a little. There is parallelism when sending messages - present. Waiting for a response when sending messages - present. What do you want to influence?
You can try to return the queue when sending messages, you can try to increase/decrease the number of parallel messages and check how it works for you.

@corporategoth
Copy link
Author

corporategoth commented Oct 10, 2023

By removing the queue for the device (in #727), you have effectively allowed unlimited concurrency. The problem is not no parallelization, it's too much!

Meaning that, rather than limiting it to 8 (the hard-coded queue limit) concurrent actions, if I tried to send a command to 80 devices, it would try and send all 80 at once - which will fail for most of them.

I'm not worried so much about concurrency of messages sent to the SAME device, I'm more worried about concurrency across devices. Right now, because of it's unbounded nature, I have to carefully construct my scenes on Home Assistant, and manually add delays, or things won't work.

For example, consider this scene.

  • Close all my blinds (9 of them) - these can't be grouped, as creating a group for my blinds and attempting to use close on it doesn't work. So this is 9 unicast messages.
  • Turn on my driveway lights (a group, containing 18 lights)
  • Turn on my porch lights (a group containing 6 lights)
  • Turn on my side sensor lights (unicast)
  • Turn on my back sensor lights (unicast)
  • Turn on my garden lights (2 unicast)
  • Turn on my tree illumination (2 unicast messages)
  • Turn on the light in my free library

And during the holidays, also:

  • Turn on my holiday lights (4 unicast messages)
  • Set the color of my front garden lights (2 unicast)
  • Set the color of my driveway lights (group)
  • Set the color of my porch lights (group)
  • Turn on my 2 more of my post outlets (2 unicast messages).

This is 2 group messages (subject to the concurrency), and 15 unicast messages.
In the holidays, add another 2 group messages and 8 unicast messages.

So that's 4 group messages and 23 unicast messages. The group messages abide by concurrency limits, but as the unicast messages don't, they not only all try to send at the same time, the group messages ALSO won't get rate limited because the unicast messages don't use the rate limiter (queue).

This essentially means 27 messages will alll go out at once. And a bunch of those will be dropped.

So yeah,

  1. Please re-introduce concurrency effecting Unicast messages
  2. Please allow for that concurrency to be configurable
  3. Please change the concurrency to ensure that nothing (including messages sent by the ezsp.js code itself) can bypass it.

I can't really do this myself, as the way zigbee2mqtt is deployed in my environment, it's in a docker container I can't really edit reliably. The container shuts down on z2m restart (meaning that I can't log into it and edit the files). Additionally, mounting in source code does not work (everything fails). Plus I can't specify a custom image to use, as I'm using TrueNAS for my Z2M, and it HELM deployments, which don't let me specify the image.

Thanks.

@corporategoth
Copy link
Author

The TL:DR is, that I should not have to hand-craft my HA scenes to try and ensure I don't get dropped zigbee messages - that's what this concurrency is supposed to BE for. But right now, any significant zigbee scene with the EZSP driver will cause some of the actions to fail because too many messages are sent at once, and some of them are dropped.

@corporategoth corporategoth changed the title EZSP does not implement concurrency EZSP does not implement concurrency correctly Oct 10, 2023
@MattWestb
Copy link

For group command (its broadcast in the 802.15.4) its one underlying limiting in the 802.15.4 that is protecting form broadcast storm that can killing the network. They is doing it be have max 8 broadcast in 9 seconds (its one FIFO with the broadcast table with timer). If trying sending more broadcast all router SHALL ignoring them silent. Also the old way is sending the broadcast 3 times for being sure its being OK broadcasted (counting as one FIFO position) but good Zigbee 3 routers is listening if if all neighbors is re sending it and if OK then is sending only 1 time or 2 (passive acking).

If tuning this like Z2M have doing in the TI firmware so its possible sending more broadcast then its only the first level of routers getting the frame and they is not relaying it to the second level then its being blocked. By doing so its working if all router have direct RF contact with the coordinator but its braking the network completely and also the router discovery that is using broadcast is being broke or getting strange problems and then normal uni cast commands is not working well then the network cant finding devices in its network.

If broadcast is working good in the network and have good routers source routing is also working great that is making uni cast working better but if not its better disabling source routing in the network then its being more stable at least with larger EZSP networks (advanced ZHA users experience with around 200 devices).

The unicast problem i think you 2 is better digging in it but in the end its need broadcast working OK for working or the network cant finding its devices and its being stucked and blocking the transmitting in the network.

@corporategoth
Copy link
Author

@MattWestb I believe this issue is specifically about overwhelming the EZSP device with requests - specifically unicast requests, though I imagine it doesn't matter whether the request is unicast or broadcast.

I'm not saying broadcast is not working correctly. But if you look at my situation, I have an HA scene which results in 23 unicast and 4 broadcast (group) messages. Right now, due to the lack of concurrency limiting in Z2M for the EZSP driver, it will attempt to send all of them back to back (as fast as it can), without any concurrency limiting. THAT will cause the dropped messages.

I don't have any problem communicating to any group or individual node on it's own as a one-off action.

@Koenkk
Copy link
Owner

Koenkk commented Oct 12, 2023

Assuming this can be closed now since #771 is merged, thanks @kirovilya !

@Koenkk Koenkk closed this as completed Oct 12, 2023
@kirovilya
Copy link
Contributor

@corporategoth try the dev/edge version of z2m and the ability to configure the "concurrent" parameter. write whether it worked or not

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants