[BUG] [ReadHandler] Regression in report engine caused by #28104 #28434

plan44 · 2023-08-01T08:56:14Z

Reproduction steps

Commit c9ce7f2 from PR #28104 causes a regression in subscription handling, which is 100% reproducible in my (Linux Bridge) commissioned into a AppleTV:

build with c9ce7f2 -> controller endlessly retries subscribing
build with preceeding commit fe4c679 -> subscriptions work ok

Bug prevalence

Always

GitHub hash of the SDK that was being used

c9ce7f2

Platform

core

Platform Version(s)

n/a

Anything else?

For some reason, the report engine keeps generating a IM:ReportData message that is rejected by the controller (AppleTV) with INVALID_SUBSCRIPTION. This causes the contoller to send a new IM:SubscribeRequest in a new session, which again causes the invalid IM:ReportData, forever (not good for the flash storage).

@bzbarsky-apple comments in #28104:

[...] Something is very broken here... This message:
[1690835066.750730][3483:3483] CHIP:EM: <<< [E:12134i S:7747 M:65373677] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
is wrong. It's happening before the subscription has been set up, but not on the subscription-setup exchange, which is why the other side is rejecting it: it carries a subscription id which is not for a subscription that "exists" yet.

Here's a log excerpt of the repeating part:
(full log as attachment)

....
[1690835059.326804][3483:3483] CHIP:EM: >>> [E:12133i S:7746 M:156214298 (Ack:190299662)] (S) 
[c9ce7f2 - reporting regression.txt](https://github.com/project-chip/connectedhomeip/files/12222101/c9ce7f2.-.reporting.regression.txt)
Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835059.327958][3483:3483] CHIP:IM: Received status response, status is 0x7d (INVALID_SUBSCRIPTION)

[1690835059.331730][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-eModLd)
[1690835059.480666][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835059.502327][3483:3483] CHIP:EM: <<< [E:12133i S:7746 M:190299665 (Ack:156214298)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835059.503050][3483:3483] CHIP:IN: (S) Sending msg 190299665 on secure session with LSID: 7746
[1690835059.504382][3483:3483] CHIP:EM: >>> [E:34561r S:7746 M:156214299 (Ack:190299663)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835059.505544][3483:3483] CHIP:IM: Received status response, status is 0x00 (SUCCESS)
[1690835059.506646][3483:3483] CHIP:EM: <<< [E:34561r S:7746 M:190299666 (Ack:156214299)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:04 (IM:SubscribeResponse)
[1690835059.507473][3483:3483] CHIP:IN: (S) Sending msg 190299666 on secure session with LSID: 7746
[1690835059.510134][3483:3483] CHIP:EM: >>> [E:34561r S:7746 M:156214300 (Ack:190299666)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835065.946340][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355875] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[1690835065.946659][3483:3483] CHIP:IN: CASE Server received Sigma1 message . Starting handshake. EC 0x6be5b0
[1690835065.946778][3483:3483] CHIP:SC: Received Sigma1 msg
[1690835065.949108][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618995 (Ack:49355875)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:33 (SecureChannel:CASE_Sigma2Resume)
[1690835065.949478][3483:3483] CHIP:IN: (U) Sending msg 132618995 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835065.952904][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355876 (Ack:132618995)] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:40 (SecureChannel:StatusReport)
[1690835065.953837][3483:3483] CHIP:SC: Success status report received. Session was established
[1690835065.956631][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-oKjjhg)
[1690835066.095636][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.097282][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-DkPbMl)
[1690835066.234423][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.236145][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-GnkEcB)
[1690835066.413213][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.413715][3483:3483] CHIP:SC: SecureSession[0x779419e0, LSID:7747]: State change 'kEstablishing' --> 'kActive'
[1690835066.413871][3483:3483] CHIP:IN: CASE Session established to peer: <146752179F0AE79A, 1>
[1690835066.414342][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618996 (Ack:49355876)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.414691][3483:3483] CHIP:IN: (U) Sending msg 132618996 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835066.416382][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.418659][3483:3483] CHIP:DMG: Registered a ReadHandler that will schedule a report between system Timestamp: 41430293 and system Timestamp 41430293.
[1690835066.423998][3483:3483] CHIP:DMG: Final negotiated min/max parameters: Min = 0s, Max = 2s
[1690835066.452039][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-BhAaJd)
[1690835066.745570][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.750730][3483:3483] CHIP:EM: <<< [E:12134i S:7747 M:65373677] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
[1690835066.751488][3483:3483] CHIP:IN: (S) Sending msg 65373677 on secure session with LSID: 7747
[1690835066.756634][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373678 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
[1690835066.757360][3483:3483] CHIP:IN: (S) Sending msg 65373678 on secure session with LSID: 7747
[1690835066.759274][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.760237][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373679 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.762392][3483:3483] CHIP:IN: (S) Sending msg 65373679 on secure session with LSID: 7747
[1690835066.763714][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355876 (Ack:132618995)] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:40 (SecureChannel:StatusReport)
[1690835066.764682][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618997 (Ack:49355876)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.765626][3483:3483] CHIP:IN: (U) Sending msg 132618997 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835066.767370][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.768331][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373680 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.769167][3483:3483] CHIP:IN: (S) Sending msg 65373680 on secure session with LSID: 7747
[1690835066.770940][3483:3483] CHIP:EM: >>> [E:12134i S:7747 M:113090552 (Ack:65373677)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835066.772102][3483:3483] CHIP:IM: Received status response, status is 0x7d (INVALID_SUBSCRIPTION)

[1690835066.776877][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-GJJBnm)
[1690835067.094920][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
.... repeating endlessly

c9ce7f2 - reporting regression.txt

The text was updated successfully, but these errors were encountered:

lpbeliveau-silabs · 2023-08-01T14:11:40Z

@plan44
I am setting up an environment to reproduce the bug. In the meantime, I am seing a lot of:

[1690832821.160153][3436:3436] CHIP:DMG: Refresh subscribe timer sync after 2 seconds

Logs in the file, which is weird given that this log has been removed in this specific PR.

Have you tried a clean rebuild of the bridge app? The fact that this log is still happening leads me to believe something might not have been updated properly.

…n of ReportScheduler into the ReadHandler and IM engine. (#28104) - this commit causes endless re-subscriptions and thus flash wear - is being investigated in project-chip/connectedhomeip#28434

plan44 · 2023-08-02T07:36:50Z

@lpbeliveau-silabs

Have you tried a clean rebuild of the bridge app?

I thought I had, but apparently not, my bad, sorry. I did now (deleting the out dir to make absolutely sure), but it does not make a difference for the problem observed (that log line is gone, though).

Please find the new log attached.

c9ce7f2 - reporting regression - rebuilt.txt

lpbeliveau-silabs · 2023-08-04T18:50:37Z

So far I wasn't able to replicate the behavior with the linux example bridge app on a raspberrypi 4. I initially build the linux bridge app and commissioned it with an applet TV on c9ce7f2. This didn't show the behavior in the logs here so I tried replicating the subscription resumption shown in the logs.

The steps used to try to replicate with resumption were :
(Since the log displays a subscription resumption, I assumed the bridge was commissioned on fe4c679 initially)

Bootstrap and build the bridge app upon commit fe4c679
Run the app
Commissioned the bridge app with an AppleTv
Checkout c9ce7f2
Rebuild the app (clean rebuild)
Relaunch the app
Subscription resumed without causing the issue

From the log lines:

[1690960515.480095][2155:2155] CHIP:DMG: Registered a ReadHandler that will schedule a report between system Timestamp: 1614906 and system Timestamp 1614906.
[1690960515.485213][2155:2155] CHIP:DMG: Final negotiated min/max parameters: Min = 0s, Max = 2s
[1690960515.510347][2155:2155] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-FAEDKa)
[1690960515.649570][2155:2155] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690960515.689856][2155:2155] CHIP:DMG: Error retrieving data from clusterId: 0x0000_0028, err = src/system/TLVPacketBufferBackingStore.cpp:88: CHIP Error 0x0000000B: No memory

I am suspecting that the flash memory is already full from the previous failures to re-establish the subscription, thus causing the failure in the new log.

Would you mind trying to clear the persistant memory to see if that might help solving the issue?

Also, I can't seem to reconstruct the proper chain of event? When you built the first time with c9ce7f2, were you resuming a subscription as I assumed? If not, what exactly were the steps that led you to this?

plan44 · 2023-08-05T08:56:58Z

Thanks for the investigation.

[1690960515.689856][2155:2155] CHIP:DMG: Error retrieving data from clusterId: 0x0000_0028, err = src/system/TLVPacketBufferBackingStore.cpp:88: CHIP Error 0x0000000B: No memory

I am suspecting that the flash memory is already full from the previous failures to re-establish the subscription, thus causing the failure in the new log.

This is not related to flash (BTW, it technically couldn't be, because the persistent data store on Linux is just a file, with no size limits except for the disk size, which is huge).

This error just signals that the report response needs to be chunked. It happens at the TLV level, bubbles up to Engine.cpp:189 where any error is logged, but then in case it is a out-of-writer-space error (CHIP_ERROR_NO_MEMORYor CHIP_ERROR_BUFFER_TOO_SMALL) it will be reset and just causes chunking the report response. This is normal for large reports that can't be sent in a single message.

Now this could be the difference between your and my setup: while my bridge is based on the bridge-example, in the meantime it manages a larger number of active endpoints with more data to report, so chunked reports are happening. Probably, in the reduced bridge-example setup, there's no chunking happening at all?

plan44 · 2023-08-05T18:47:02Z

Responding to your other points:

(Since the log displays a subscription resumption, I assumed the bridge was commissioned on fe4c679 initially)

Correct.

Would you mind trying to clear the persistant memory to see if that might help solving the issue?

I am a bit reluctant to do that with the bridge I found this problem with - because given that nobody else has complained so far, this seems pretty much an edge case. But on that setup, it is a 100% A-B testable one - if I run the build from c9ce7f2, the app enters the resubscription loop, if I run the build from fe4c679 with the same persistent memory data everything runs normal. I did this back and forth several times. I'd hate to loose that A-B testability and maybe the problem itself, until it becomes more clear what is actually happening.

So I'd rather try to reproduce the problem on a second bridge, with a similar set of dynamic endpoints. I'll try that on Monday and will post what I find.

lpbeliveau-silabs · 2023-08-08T15:06:45Z

Thanks for the insight. The chunking is definitely worth investigating, our CI is testing for Chunking capability but this seems to be a corner case where we have re-subscribe + chunking, which might need more testing on our CI.

In the meantime another bug was Identified where the readHandler is scheduling itself with a min and a max of 0:

[1690960515.480095][2155:2155] CHIP:DMG: Registered a ReadHandler that will schedule a report between system Timestamp: 1614906 and system Timestamp 1614906.

I am reworking the scheduler/readHandler to prevent that. In the event that it doesn't fix your problem I would definitely need to setup a re-subcription configuration with chunking so please let me know if you are able to reproduce this bug on a second bridge.

lpbeliveau-silabs · 2023-08-10T19:21:17Z

@plan44 Now that #28536 has been merged, can you check if the endless retries still occur on commit d62d80b?

plan44 · 2023-08-11T16:14:15Z

@lpbeliveau-silabs Unfortunately, #28536 does not solve the problem.

I did a A-B test with full rebuilds of both bf0b45a (which includes d62d80b) and fe4c679.

bf0b45a still enters the same resubscription loop as c9ce7f2 did.
fe4c679 works fine

I repeated the A-B test several times, it is still 100% reproducible.

Please find two logs attached - this is the same device, with no other changes than once using a build off bf0b45a (loops) and one off fe4c679 (works ok).

Comparing the two, I see that things start to go differently only after report chunking gets enabled (at line 423 in works - fe4c679.txt and at line 500 in loops - bf0b45a.txt)

Maybe the A-B logs (with details enabled) can already reveal something.

I will now also commission a second bridge with the same set of devices and see what happens there (and post the results here)

loops - bf0b45a.txt
works - fe4c679.txt

plan44 · 2023-08-11T17:45:55Z

@lpbeliveau-silabs As promised, now I commissioned a second bridge with (nearly) the same set of devices.

It shows exactly the same A-B behaviour as the original unit.

What I did:

started the bridge built from bf0b45a with all matter persistent storage empty (KVS file deleted)
commissioned it into Apple Home (same as with the original bridge), works ok but then already starts looping -> testbridge-setup - loops - bf0b45a.txt
quit and restarted the bf0b45a bridge to see how resumption looks, loops the same way the original setup did -> testbridge-restart - loops - bf0b45a.txt
only then I started the bridge built from fe4c679, no change otherwise, starts up and works ok without looping -> testbridge-restart - works - fe4c679.txt

testbridge-setup - loops - bf0b45a.txt
testbridge-restart - loops - bf0b45a.txt
testbridge-restart - works - fe4c679.txt

This clearly shows that the issue cannot be stale persistent data (there is none in the first run, and still the resubscription loop happens). I hope this helps to narrow down the cause for this problem. Just let me know when you need more specific information.

woody-apple · 2023-08-11T19:33:25Z

Adding cert blocker so I can track this for now.

lpbeliveau-silabs · 2023-08-14T18:29:48Z

@plan44 Thank you for providing the logs, there clearly seems to have some chunking related problems. This might be due to a change in behavior where in the previous implementation, chunking would result in cancelling the subscription refresh timer without restarting it were-as now it doesn't cancel it, which might be why we see this problem in your case as your subscription resumption takes longer than the max interval to expire, which might lead to a timer generating a report ahead of the subscription being completed.

Weirdly enough, the functioning subscription resumption ends in two failures to establish a case session:

[2023-08-11 19:27:28.213 E] 🔴 CHIP:DMG ➡️ Failed to establish CASE for subscription-resumption with error 'src/protocols/secure_channel/CASESession.cpp:553: CHIP Error 0x00000032: Timeout'
[2023-08-11 19:27:28.238 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [AwaitingDestruction]
[2023-08-11 19:27:28.242 E] 🔴 CHIP:DMG ➡️ Failed to establish CASE for subscription-resumption with error 'src/protocols/secure_channel/CASESession.cpp:553: CHIP Error 0x00000032: Timeout'
[2023-08-11 19:27:28.267 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [AwaitingDestruction]
[2023-08-11 19:27:29.639 N] ✅ CHIP:DMG ➡️ Refresh subscribe timer sync after 2 seconds
[2023-08-11 19:27:29.640 D] 🛠️ CHIP:DMG ➡️ Building Reports for ReadHandler with LastReportGeneration = 53 DirtyGeneration = 0
[2023-08-11 19:27:29.644 D] 🛠️ CHIP:DMG ➡️ Fetched 0 events
[2023-08-11 19:27:29.645 D] 🛠️ CHIP:DMG ➡️ Sending report (payload has 11 bytes)...

were-as the looping one only has one failure at that point:

[2023-08-11 19:23:29.704 E] 🔴 CHIP:DMG ➡️ Failed to establish CASE for subscription-resumption with error 'src/protocols/secure_channel/CASESession.cpp:553: CHIP Error 0x00000032: Timeout'
[2023-08-11 19:23:29.728 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [AwaitingDestruction]
[2023-08-11 19:23:33.418 N] ✅ CHIP:EM ➡️ >>> [E:16023r S:0 M:133719369] (U) Msg RX from 0:625A153B6942F5F7 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[2023-08-11 19:23:33.419 D] 🛠️ CHIP:EM ➡️ Handling via exchange: 16023r, Delegate: 0x6dece0
[2023-08-11 19:23:33.421 N] ✅ CHIP:IN ➡️ CASE Server received Sigma1 message . Starting handshake. EC 0x77e46df0
[2023-08-11 19:23:33.422 N] ✅ CHIP:SC ➡️ Received Sigma1 msg
[2023-08-11 19:23:33.423 D] 🛠️ CHIP:SC ➡️ Peer assigned session key ID 62236
[2023-08-11 19:23:33.427 N] ✅ CHIP:EM ➡️ <<< [E:16023r S:0 M:7425603 (Ack:133719369)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:33 (SecureChannel:CASE_Sigma2Resume)
[2023-08-11 19:23:33.429 N] ✅ CHIP:IN ➡️ (U) Sending msg 7425603 to IP address 'UDP:192.168.59.123%eth0:49437'
[2023-08-11 19:23:33.430 D] 🛠️ CHIP:SC ➡️ Sent Sigma2Resume msg
[2023-08-11 19:23:33.441 N] ✅ CHIP:EM ➡️ >>> [E:16023r S:0 M:133719370 (Ack:7425603)] (U) Msg RX from 0:625A153B6942F5F7 [0000] --- Type 0000:40 (SecureChannel:StatusReport)
[2023-08-11 19:23:33.442 D] 🛠️ CHIP:EM ➡️ Found matching exchange: 16023r, Delegate: 0x6decfc

And proceeds to receive another sigma 1.

lpbeliveau-silabs · 2023-08-15T18:12:07Z

I have raised my number of devices on the bridge, generated some chunking on commissioning and re-subscription and I am still unable to reproduce the issue.

Would it be possible for you to provide the code / executable or at least the endpoint/cluster configurations of your app so I can try to reproduce it more reliably locally?

Also, what is the version of your apple tv? I am using tvOS 16.6.

I have some hypothesis as to what is happening, it seems the new ReportScheduler ends up not sending a report at max interval once the subscription is done, which might be why the appleTV keeps sending sigma1 over and over, but I cannot reproduce it locally so it is hard to figure out exactly why this is happening in this specific case.

On your side, could you try reproducing the issue using chiptool? If this fails with the same intervals, could you try increasing the intervals to5s or 10s to see if it confirms the timing hypothesis?

plan44 · 2023-08-16T17:42:24Z

Would it be possible for you to provide the code / executable [...]

The bridge project is open source, http://github.com/plan44/p44mbrd.

I don't know if that helps, it has evolved from the original bridge-app quite a bit.

I am fully aware that this is hard to debug, and also that the problem is most likely caused by my particular setup of the overall app, which certainly has slightly different timing than the original bridge app had. In particular, the setup is strictly single-threaded which the standard linux examples are not (these need a separate thread in SystemLayerImplSelect). The DNS-SD part is using Avahi, not minimal-mDNS.

Still, I cannot see anything actually wrong in that setup, just a bit different from the example.

Can you imagine anything in your report engine changes that would rely on a specific order of events being processed, that might not be possible in a strictly single-threaded setup?

or at least the endpoint/cluster configurations of your app so I can try to reproduce it more reliably locally?

I just did the same A-B test on a different hardware platform (RaspberryPi B, vs MediaTek MT7688 before) with a reduced number of devices (just root EP0, bridge EP1 and two "Extended Color Light" (0x010d) EPs).

The A-B behaviour is exactly the same.

So I doubt it is something caused only by a particular EP configuration only.

Also, what is the version of your apple tv? I am using tvOS 16.6.

Same here: tvOS 16.6 (20M73)

Let me know if you have ideas what I could do to help narrow this down.

plan44 · 2023-08-17T16:11:19Z

I did some more digging in the Rpi A-B logs and a new log from the original bridge app I built from bf0b45a.

In the failing case, the first regular IM:ReportData scheduled after the Subscription Interaction has concluded, fires much too late (7 seconds rather than the intended max 2 seconds later).
In the working case (bridge-app example or my app built from fe4c679, the report is sent in time. This seems to make the difference.
Because of that delay, the AppleTV considers the subscription failed, and apparently the SecureSession too, and has already restarted with a new session/subscription by the time the original ReadHandler finally fires and generates a report for the now stale subscription.
AppleTV responds with INVALID_SUBSCRIPTION to the delayed report because it has already given up on that subscription.
The new Subscription Interaction concludes successfully, but then the first regular report (due 2 secs later) gets delayed again so much that the AppleTV gives up, etc. etc.

So the question is why that report gets delayed so much in the bf0b45a RE, and not in the fe4c679 one. I'm trying to track this down here right now, as it is most likely somehow related to my particular setup and don't want to waste more of your time before I understand that detail.

Still, I wonder, on a larger scale, if handling a relatively minor report delay every time with a quite drastic and resource intensive (flash wear) SecureSession restart in 5 second intervals forever is a good idea. Is this mandated by the specs (did not find it so far) or just Apple's choice of implementation?

…roject-chip#28434) This fixes a ReportEngine problem that was caused by libev-based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain).

plan44 · 2023-08-17T22:17:01Z

@lpbeliveau-silabs I found the problem, finally 😅!

All of it was caused by the report timer firing a few milliseconds too early.

Yes, it had to do with my setup, which is that it uses a libev mainloop and timers (similar to how Darwin uses dispatch). So my fault, basically 😞

Here's what happened (log from the Rpi bf0b45a build, with some extra instrumentation log lines (‼️ marked):

[2023-08-17 22:31:42.477 N] ✅ CHIP:EM  ➡️ <<< [E:5300r S:36346 M:90461106 (Ack:108617764)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
[2023-08-17 22:31:42.482 N] ✅ CHIP:IN  ➡️ (S) Sending msg 90461106 on secure session with LSID: 36346
[2023-08-17 22:31:42.485 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [AwaitingReportResponse]
[2023-08-17 22:31:42.487 D] 🛠️ CHIP:DMG ➡️ <RE> ReportsInFlight = 1 with readHandler 12, RE has no more messages
[2023-08-17 22:31:42.489 D] 🛠️ CHIP:DMG ➡️ All ReadHandler-s are clean, clear GlobalDirtySet
[2023-08-17 22:31:42.545 N] ✅ CHIP:EM  ➡️ >>> [E:5300r S:36346 M:108617765 (Ack:90461106)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[2023-08-17 22:31:42.547 D] 🛠️ CHIP:EM  ➡️ Found matching exchange: 5300r, Delegate: 0xb6bba0fc
[2023-08-17 22:31:42.548 D] 🛠️ CHIP:EM  ➡️ Rxd Ack; Removing MessageCounter:90461106 from Retrans Table on exchange 5300r
[2023-08-17 22:31:42.550 N] ✅ CHIP:IM  ➡️ Received status response, status is 0x00 (SUCCESS)
[2023-08-17 22:31:42.553 N] ✅ CHIP:EM  ➡️ <<< [E:5300r S:36346 M:90461107 (Ack:108617765)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:04 (IM:SubscribeResponse)
[2023-08-17 22:31:42.555 N] ✅ CHIP:IN  ➡️ (S) Sending msg 90461107 on secure session with LSID: 36346
[2023-08-17 22:31:42.561 N] ✅ CHIP:DMG ➡️ Registered a ReadHandler that will schedule a report between system Timestamp: 525396849 and system Timestamp 525398849.
[2023-08-17 22:31:42.564 D] 🛠️ CHIP:DMG ➡️ <RE> OnReportConfirm: NumReports = 0
[2023-08-17 22:31:42.565 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [CanStartReporting]
[2023-08-17 22:31:42.567 N] ✅ CHIP:DMG ➡️ ‼️ ReadHandlerNode::IsReportableNow()=0 : canstart=1, dirty=0, now=525396855, mMinTimestamp=525396849 (-6), mMaxTimestamp=525398849 (+1994), mSyncTimestamp=525398849 (+1994)
[2023-08-17 22:31:42.570 N] ✅ CHIP:DMG ➡️ ‼️ Starting Report Timer for node/context = 0x0xb6b40a10 with timeout = 1994
[2023-08-17 22:31:42.573 N] ✅ CHIP:EM  ➡️ >>> [E:5300r S:36346 M:108617766 (Ack:90461107)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[2023-08-17 22:31:42.574 D] 🛠️ CHIP:EM  ➡️ Found matching exchange: 5300r, Delegate: 0
[2023-08-17 22:31:42.576 D] 🛠️ CHIP:EM  ➡️ Rxd Ack; Removing MessageCounter:90461107 from Retrans Table on exchange 5300r
[2023-08-17 22:31:43.934 N] ✅ CHIP:DL  ➡️ Re-trying resolve
[2023-08-17 22:31:44.540 N] ✅ CHIP:DMG ➡️ ‼️ TimerFired() called on node = 0x0xb6b40a10
[2023-08-17 22:31:44.540 N] ✅ CHIP:DMG ➡️ ‼️ ReadHandlerNode::IsReportableNow()=0 : canstart=1, dirty=0, now=525398828, mMinTimestamp=525396849 (-1979), mMaxTimestamp=525398849 (+21), mSyncTimestamp=525398849 (+21)
[2023-08-17 22:31:44.541 D] 🛠️ CHIP:DMG ➡️ All ReadHandler-s are clean, clear GlobalDirtySet
[2023-08-17 22:31:48.947 N] ✅ CHIP:DL  ➡️ Re-trying resolve
[2023-08-17 22:31:49.004 N] ✅ CHIP:EM  ➡️ >>> [E:5301r S:0 M:202841598] (U) Msg RX from 0:FF325B00396BEFBC [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[2023-08-17 22:31:49.004 D] 🛠️ CHIP:EM  ➡️ Handling via exchange: 5301r, Delegate: 0x24a9e0
[2023-08-17 22:31:49.005 N] ✅ CHIP:IN  ➡️ CASE Server received Sigma1 message . Starting handshake. EC 0xb6cddd20

the timeout is calculated to be 1994 ("Starting Report Timer for node/context...")
but the timer actually fires 21mS earlier already ("ReadHandlerNode::IsReportableNow()=0 ... mMaxTimestamp=525398849 (+21))
which causes the first incremental report not to be sent, and no new timer set up.
so nothing happens until the AppleTV considers the subscription dead, and restarts with a new subscription.
at that time, ReadHandlerNode::IsReportableNow() of the still pending readhandler is called again, and generates the report that should have been triggered by the timer, many seconds too late.

So it boils down to the ReportEngine absolutely relying on timers NEVER firing even only a bit too early. Which is a reasonable assumption, and apparently true for other timer implementations, but not libev's.

As adding libev was my contribution (#24232), I should have realized that. My bad, sorry.
I just posted a fix: #28740

On the other hand, with timers always comes the question of tolerance. If implemenations may rely on absolutely none in the negative direction, it maybe should be stated in the header for System::Layer::StartTimer().

…ly too early (project-chip#28434) This fixes a ReportEngine problem that was caused by libev-based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp

…roject-chip#28434) This fixes a ReportEngine problem that was caused by libev based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp

…ly too early (project-chip#28434) This fixes a ReportEngine problem that was caused by libev-based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp

bzbarsky-apple · 2023-08-28T18:16:32Z

which causes the first incremental report not to be sent, and no new timer set up.

That is definitely a bug in the scheduler.

Which is a reasonable assumption

No, it's not. This assumption definitely does not hold on some platforms. Filed #28929. This is a must-fix for 1.2, @lpbeliveau-silabs

…roject-chip#28434) This fixes a ReportEngine problem that was caused by libev based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp

) * SystemLayerImplSelect: libev: avoid timers firing slightly too early (#28434) This fixes a ReportEngine problem that was caused by libev based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp * SystemLayerImplSelect: libev: eliminate I/O wakeup thread libev based regular builds do not need a separate I/O wakeup thread, so the wakeup thread can be eliminated in CHIP_SYSTEM_CONFIG_USE_LIBEV case. In normal operation, libev based builds also never call Signal(), so if it is still called it now emits a error log message, to indicate something might be wrong in the setup. However we keep Signal() and related LayerSocketsLoop methods, following the Darwin dispatch implementation, as fallback to select-based event handling seems to be needed for some I/O tests. As noted in a comment to the original libev PR [1], eventually, the select() based mainloop and external mainloop based solutions like Darwin Dispatch and libev should be detangled and extracted into separate classes, adapting all the tests that somehow rely on the select() fallback. As a single self-funded developer however, I cannot possibly be expected to solve this for Apple ;-) [1] #24232 (review) * SystemLayerImplSelect: reword comment: early firing timers can happen - the fix prevents them in normal libev case - however the caller MUST NOT rely on timers *never* firing a bit early

…#28434) (project-chip#28740) * SystemLayerImplSelect: libev: avoid timers firing slightly too early (project-chip#28434) This fixes a ReportEngine problem that was caused by libev based timers firing slightly (in the range of 20mS) too early, because libev by default uses the time when events started processing as the "now" reference for relative timers for efficiency reasons. To ensure timers cannot fire early, timer setup must compensate for any difference between ev_now() which is the time events started processing and ev_time(), which is the actual current time (but is a bit less efficient to obtain). # Conflicts: # src/system/SystemLayerImplSelect.cpp * SystemLayerImplSelect: libev: eliminate I/O wakeup thread libev based regular builds do not need a separate I/O wakeup thread, so the wakeup thread can be eliminated in CHIP_SYSTEM_CONFIG_USE_LIBEV case. In normal operation, libev based builds also never call Signal(), so if it is still called it now emits a error log message, to indicate something might be wrong in the setup. However we keep Signal() and related LayerSocketsLoop methods, following the Darwin dispatch implementation, as fallback to select-based event handling seems to be needed for some I/O tests. As noted in a comment to the original libev PR [1], eventually, the select() based mainloop and external mainloop based solutions like Darwin Dispatch and libev should be detangled and extracted into separate classes, adapting all the tests that somehow rely on the select() fallback. As a single self-funded developer however, I cannot possibly be expected to solve this for Apple ;-) [1] project-chip#24232 (review) * SystemLayerImplSelect: reword comment: early firing timers can happen - the fix prevents them in normal libev case - however the caller MUST NOT rely on timers *never* firing a bit early

plan44 added bug Something isn't working needs triage labels Aug 1, 2023

plan44 mentioned this issue Aug 1, 2023

[ReadHandler] Integration of ReportScheduler into the ReadHandler and IM engine. #28104

Merged

lpbeliveau-silabs self-assigned this Aug 1, 2023

bzbarsky-apple added spec Mismatch between spec and implementation v1.2 labels Aug 1, 2023

This was referenced Aug 4, 2023

[ReadHandler] Removed Scheduling of report from OnReadHandlerCreated #28536

Merged

[ReadHandler] Add test for early report emission on ReadHandler creation #28535

Open

woody-apple self-assigned this Aug 11, 2023

woody-apple added the cert blocker label Aug 11, 2023

cjandhyala added this to the 1.2 milestone Aug 14, 2023

plan44 mentioned this issue Aug 17, 2023

SystemLayerImplSelect: libev: avoid timers firing early (#28434) #28740

Merged

plan44 closed this as completed Aug 18, 2023

bzbarsky-apple mentioned this issue Aug 28, 2023

[BUG] Report scheduler assumes timers can never fire early #28929

Closed

lpbeliveau-silabs mentioned this issue Aug 29, 2023

[ReadHandler] Handler Reportable if timer fired #28949

Merged

mustafayuce33 mentioned this issue Sep 6, 2023

Error retrieving data from clusterId - No memory #29087

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] [ReadHandler] Regression in report engine caused by #28104 #28434

[BUG] [ReadHandler] Regression in report engine caused by #28104 #28434

plan44 commented Aug 1, 2023

lpbeliveau-silabs commented Aug 1, 2023 •

edited

Loading

plan44 commented Aug 2, 2023

lpbeliveau-silabs commented Aug 4, 2023

plan44 commented Aug 5, 2023

plan44 commented Aug 5, 2023

lpbeliveau-silabs commented Aug 8, 2023

lpbeliveau-silabs commented Aug 10, 2023 •

edited

Loading

plan44 commented Aug 11, 2023

plan44 commented Aug 11, 2023

woody-apple commented Aug 11, 2023

lpbeliveau-silabs commented Aug 14, 2023 •

edited

Loading

lpbeliveau-silabs commented Aug 15, 2023 •

edited

Loading

plan44 commented Aug 16, 2023

plan44 commented Aug 17, 2023

plan44 commented Aug 17, 2023

bzbarsky-apple commented Aug 28, 2023

[BUG] [ReadHandler] Regression in report engine caused by #28104 #28434

[BUG] [ReadHandler] Regression in report engine caused by #28104 #28434

Comments

plan44 commented Aug 1, 2023

Reproduction steps

Bug prevalence

GitHub hash of the SDK that was being used

Platform

Platform Version(s)

Anything else?

lpbeliveau-silabs commented Aug 1, 2023 • edited Loading

plan44 commented Aug 2, 2023

lpbeliveau-silabs commented Aug 4, 2023

plan44 commented Aug 5, 2023

plan44 commented Aug 5, 2023

lpbeliveau-silabs commented Aug 8, 2023

lpbeliveau-silabs commented Aug 10, 2023 • edited Loading

plan44 commented Aug 11, 2023

plan44 commented Aug 11, 2023

woody-apple commented Aug 11, 2023

lpbeliveau-silabs commented Aug 14, 2023 • edited Loading

lpbeliveau-silabs commented Aug 15, 2023 • edited Loading

plan44 commented Aug 16, 2023

plan44 commented Aug 17, 2023

plan44 commented Aug 17, 2023

bzbarsky-apple commented Aug 28, 2023

lpbeliveau-silabs commented Aug 1, 2023 •

edited

Loading

lpbeliveau-silabs commented Aug 10, 2023 •

edited

Loading

lpbeliveau-silabs commented Aug 14, 2023 •

edited

Loading

lpbeliveau-silabs commented Aug 15, 2023 •

edited

Loading