Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] [ReadHandler] Regression in report engine caused by #28104 #28434

Closed
plan44 opened this issue Aug 1, 2023 · 16 comments · Fixed by #28740
Closed

[BUG] [ReadHandler] Regression in report engine caused by #28104 #28434

plan44 opened this issue Aug 1, 2023 · 16 comments · Fixed by #28740
Assignees
Labels
bug Something isn't working cert blocker needs triage spec Mismatch between spec and implementation v1.2
Milestone

Comments

@plan44
Copy link
Contributor

plan44 commented Aug 1, 2023

Reproduction steps

Commit c9ce7f2 from PR #28104 causes a regression in subscription handling, which is 100% reproducible in my (Linux Bridge) commissioned into a AppleTV:

  • build with c9ce7f2 -> controller endlessly retries subscribing
  • build with preceeding commit fe4c679 -> subscriptions work ok

Bug prevalence

Always

GitHub hash of the SDK that was being used

c9ce7f2

Platform

core

Platform Version(s)

n/a

Anything else?

For some reason, the report engine keeps generating a IM:ReportData message that is rejected by the controller (AppleTV) with INVALID_SUBSCRIPTION. This causes the contoller to send a new IM:SubscribeRequest in a new session, which again causes the invalid IM:ReportData, forever (not good for the flash storage).

@bzbarsky-apple comments in #28104:

[...] Something is very broken here... This message:

[1690835066.750730][3483:3483] CHIP:EM: <<< [E:12134i S:7747 M:65373677] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)

is wrong. It's happening before the subscription has been set up, but not on the subscription-setup exchange, which is why the other side is rejecting it: it carries a subscription id which is not for a subscription that "exists" yet.

Here's a log excerpt of the repeating part:
(full log as attachment)

....
[1690835059.326804][3483:3483] CHIP:EM: >>> [E:12133i S:7746 M:156214298 (Ack:190299662)] (S) 
[c9ce7f2 - reporting regression.txt](https://github.com/project-chip/connectedhomeip/files/12222101/c9ce7f2.-.reporting.regression.txt)
Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835059.327958][3483:3483] CHIP:IM: Received status response, status is 0x7d (INVALID_SUBSCRIPTION)

[1690835059.331730][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-eModLd)
[1690835059.480666][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835059.502327][3483:3483] CHIP:EM: <<< [E:12133i S:7746 M:190299665 (Ack:156214298)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835059.503050][3483:3483] CHIP:IN: (S) Sending msg 190299665 on secure session with LSID: 7746
[1690835059.504382][3483:3483] CHIP:EM: >>> [E:34561r S:7746 M:156214299 (Ack:190299663)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835059.505544][3483:3483] CHIP:IM: Received status response, status is 0x00 (SUCCESS)
[1690835059.506646][3483:3483] CHIP:EM: <<< [E:34561r S:7746 M:190299666 (Ack:156214299)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:04 (IM:SubscribeResponse)
[1690835059.507473][3483:3483] CHIP:IN: (S) Sending msg 190299666 on secure session with LSID: 7746
[1690835059.510134][3483:3483] CHIP:EM: >>> [E:34561r S:7746 M:156214300 (Ack:190299666)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835065.946340][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355875] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[1690835065.946659][3483:3483] CHIP:IN: CASE Server received Sigma1 message . Starting handshake. EC 0x6be5b0
[1690835065.946778][3483:3483] CHIP:SC: Received Sigma1 msg
[1690835065.949108][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618995 (Ack:49355875)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:33 (SecureChannel:CASE_Sigma2Resume)
[1690835065.949478][3483:3483] CHIP:IN: (U) Sending msg 132618995 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835065.952904][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355876 (Ack:132618995)] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:40 (SecureChannel:StatusReport)
[1690835065.953837][3483:3483] CHIP:SC: Success status report received. Session was established
[1690835065.956631][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-oKjjhg)
[1690835066.095636][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.097282][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-DkPbMl)
[1690835066.234423][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.236145][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-GnkEcB)
[1690835066.413213][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.413715][3483:3483] CHIP:SC: SecureSession[0x779419e0, LSID:7747]: State change 'kEstablishing' --> 'kActive'
[1690835066.413871][3483:3483] CHIP:IN: CASE Session established to peer: <146752179F0AE79A, 1>
[1690835066.414342][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618996 (Ack:49355876)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.414691][3483:3483] CHIP:IN: (U) Sending msg 132618996 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835066.416382][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.418659][3483:3483] CHIP:DMG: Registered a ReadHandler that will schedule a report between system Timestamp: 41430293 and system Timestamp 41430293.
[1690835066.423998][3483:3483] CHIP:DMG: Final negotiated min/max parameters: Min = 0s, Max = 2s
[1690835066.452039][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-BhAaJd)
[1690835066.745570][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690835066.750730][3483:3483] CHIP:EM: <<< [E:12134i S:7747 M:65373677] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
[1690835066.751488][3483:3483] CHIP:IN: (S) Sending msg 65373677 on secure session with LSID: 7747
[1690835066.756634][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373678 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
[1690835066.757360][3483:3483] CHIP:IN: (S) Sending msg 65373678 on secure session with LSID: 7747
[1690835066.759274][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.760237][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373679 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.762392][3483:3483] CHIP:IN: (S) Sending msg 65373679 on secure session with LSID: 7747
[1690835066.763714][3483:3483] CHIP:EM: >>> [E:34562r S:0 M:49355876 (Ack:132618995)] (U) Msg RX from 0:0512CA591F6CAE18 [0000] --- Type 0000:40 (SecureChannel:StatusReport)
[1690835066.764682][3483:3483] CHIP:EM: <<< [E:34562r S:0 M:132618997 (Ack:49355876)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.765626][3483:3483] CHIP:IN: (U) Sending msg 132618997 to IP address 'UDP:[fe80::cb0:4d11:63e1:f960%eth0]:51521'
[1690835066.767370][3483:3483] CHIP:EM: >>> [E:34563r S:7747 M:113090551] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:03 (IM:SubscribeRequest)
[1690835066.768331][3483:3483] CHIP:EM: <<< [E:34563r S:7747 M:65373680 (Ack:113090551)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[1690835066.769167][3483:3483] CHIP:IN: (S) Sending msg 65373680 on secure session with LSID: 7747
[1690835066.770940][3483:3483] CHIP:EM: >>> [E:12134i S:7747 M:113090552 (Ack:65373677)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[1690835066.772102][3483:3483] CHIP:IM: Received status response, status is 0x7d (INVALID_SUBSCRIPTION)

[1690835066.776877][3483:3483] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-GJJBnm)
[1690835067.094920][3483:3483] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
.... repeating endlessly 

c9ce7f2 - reporting regression.txt

@plan44 plan44 added bug Something isn't working needs triage labels Aug 1, 2023
@lpbeliveau-silabs lpbeliveau-silabs self-assigned this Aug 1, 2023
@bzbarsky-apple bzbarsky-apple added spec Mismatch between spec and implementation v1.2 labels Aug 1, 2023
@lpbeliveau-silabs
Copy link
Contributor

lpbeliveau-silabs commented Aug 1, 2023

@plan44
I am setting up an environment to reproduce the bug. In the meantime, I am seing a lot of:

[1690832821.160153][3436:3436] CHIP:DMG: Refresh subscribe timer sync after 2 seconds

Logs in the file, which is weird given that this log has been removed in this specific PR.

Have you tried a clean rebuild of the bridge app? The fact that this log is still happening leads me to believe something might not have been updated properly.

plan44 added a commit to plan44/p44mbrd that referenced this issue Aug 1, 2023
…n of ReportScheduler into the ReadHandler and IM engine. (#28104)

- this commit causes endless re-subscriptions and thus flash wear
- is being investigated in project-chip/connectedhomeip#28434
@plan44
Copy link
Contributor Author

plan44 commented Aug 2, 2023

@lpbeliveau-silabs

Have you tried a clean rebuild of the bridge app?

I thought I had, but apparently not, my bad, sorry. I did now (deleting the out dir to make absolutely sure), but it does not make a difference for the problem observed (that log line is gone, though).

Please find the new log attached.

c9ce7f2 - reporting regression - rebuilt.txt

@lpbeliveau-silabs
Copy link
Contributor

So far I wasn't able to replicate the behavior with the linux example bridge app on a raspberrypi 4. I initially build the linux bridge app and commissioned it with an applet TV on c9ce7f2. This didn't show the behavior in the logs here so I tried replicating the subscription resumption shown in the logs.

The steps used to try to replicate with resumption were :
(Since the log displays a subscription resumption, I assumed the bridge was commissioned on fe4c679 initially)

  • Bootstrap and build the bridge app upon commit fe4c679
  • Run the app
  • Commissioned the bridge app with an AppleTv
  • Checkout c9ce7f2
  • Rebuild the app (clean rebuild)
  • Relaunch the app
  • Subscription resumed without causing the issue

From the log lines:

[1690960515.480095][2155:2155] CHIP:DMG: Registered a ReadHandler that will schedule a report between system Timestamp: 1614906 and system Timestamp 1614906.
[1690960515.485213][2155:2155] CHIP:DMG: Final negotiated min/max parameters: Min = 0s, Max = 2s
[1690960515.510347][2155:2155] CHIP:DL: writing settings to file (/flash/p44mbrd_kvs-FAEDKa)
[1690960515.649570][2155:2155] CHIP:DL: renamed tmp file to file (/flash/p44mbrd_kvs)
[1690960515.689856][2155:2155] CHIP:DMG: Error retrieving data from clusterId: 0x0000_0028, err = src/system/TLVPacketBufferBackingStore.cpp:88: CHIP Error 0x0000000B: No memory

I am suspecting that the flash memory is already full from the previous failures to re-establish the subscription, thus causing the failure in the new log.

Would you mind trying to clear the persistant memory to see if that might help solving the issue?

Also, I can't seem to reconstruct the proper chain of event? When you built the first time with c9ce7f2, were you resuming a subscription as I assumed? If not, what exactly were the steps that led you to this?

@plan44
Copy link
Contributor Author

plan44 commented Aug 5, 2023

Thanks for the investigation.

[1690960515.689856][2155:2155] CHIP:DMG: Error retrieving data from clusterId: 0x0000_0028, err = src/system/TLVPacketBufferBackingStore.cpp:88: CHIP Error 0x0000000B: No memory

I am suspecting that the flash memory is already full from the previous failures to re-establish the subscription, thus causing the failure in the new log.

This is not related to flash (BTW, it technically couldn't be, because the persistent data store on Linux is just a file, with no size limits except for the disk size, which is huge).

This error just signals that the report response needs to be chunked. It happens at the TLV level, bubbles up to Engine.cpp:189 where any error is logged, but then in case it is a out-of-writer-space error (CHIP_ERROR_NO_MEMORYor CHIP_ERROR_BUFFER_TOO_SMALL) it will be reset and just causes chunking the report response. This is normal for large reports that can't be sent in a single message.

Now this could be the difference between your and my setup: while my bridge is based on the bridge-example, in the meantime it manages a larger number of active endpoints with more data to report, so chunked reports are happening. Probably, in the reduced bridge-example setup, there's no chunking happening at all?

@plan44
Copy link
Contributor Author

plan44 commented Aug 5, 2023

Responding to your other points:

(Since the log displays a subscription resumption, I assumed the bridge was commissioned on fe4c679 initially)

Correct.

Would you mind trying to clear the persistant memory to see if that might help solving the issue?

I am a bit reluctant to do that with the bridge I found this problem with - because given that nobody else has complained so far, this seems pretty much an edge case. But on that setup, it is a 100% A-B testable one - if I run the build from c9ce7f2, the app enters the resubscription loop, if I run the build from fe4c679 with the same persistent memory data everything runs normal. I did this back and forth several times. I'd hate to loose that A-B testability and maybe the problem itself, until it becomes more clear what is actually happening.

So I'd rather try to reproduce the problem on a second bridge, with a similar set of dynamic endpoints. I'll try that on Monday and will post what I find.

@lpbeliveau-silabs
Copy link
Contributor

Thanks for the insight. The chunking is definitely worth investigating, our CI is testing for Chunking capability but this seems to be a corner case where we have re-subscribe + chunking, which might need more testing on our CI.

In the meantime another bug was Identified where the readHandler is scheduling itself with a min and a max of 0:

[1690960515.480095][2155:2155] CHIP:DMG: Registered a ReadHandler that will schedule a report between system Timestamp: 1614906 and system Timestamp 1614906.

I am reworking the scheduler/readHandler to prevent that. In the event that it doesn't fix your problem I would definitely need to setup a re-subcription configuration with chunking so please let me know if you are able to reproduce this bug on a second bridge.

@lpbeliveau-silabs
Copy link
Contributor

lpbeliveau-silabs commented Aug 10, 2023

@plan44 Now that #28536 has been merged, can you check if the endless retries still occur on commit d62d80b?

@plan44
Copy link
Contributor Author

plan44 commented Aug 11, 2023

@lpbeliveau-silabs Unfortunately, #28536 does not solve the problem.

I did a A-B test with full rebuilds of both bf0b45a (which includes d62d80b) and fe4c679.

  • bf0b45a still enters the same resubscription loop as c9ce7f2 did.
  • fe4c679 works fine

I repeated the A-B test several times, it is still 100% reproducible.

Please find two logs attached - this is the same device, with no other changes than once using a build off bf0b45a (loops) and one off fe4c679 (works ok).

Comparing the two, I see that things start to go differently only after report chunking gets enabled (at line 423 in works - fe4c679.txt and at line 500 in loops - bf0b45a.txt)

Maybe the A-B logs (with details enabled) can already reveal something.

I will now also commission a second bridge with the same set of devices and see what happens there (and post the results here)

loops - bf0b45a.txt
works - fe4c679.txt

@plan44
Copy link
Contributor Author

plan44 commented Aug 11, 2023

@lpbeliveau-silabs As promised, now I commissioned a second bridge with (nearly) the same set of devices.

It shows exactly the same A-B behaviour as the original unit.

What I did:

  • started the bridge built from bf0b45a with all matter persistent storage empty (KVS file deleted)
  • commissioned it into Apple Home (same as with the original bridge), works ok but then already starts looping -> testbridge-setup - loops - bf0b45a.txt
  • quit and restarted the bf0b45a bridge to see how resumption looks, loops the same way the original setup did -> testbridge-restart - loops - bf0b45a.txt
  • only then I started the bridge built from fe4c679, no change otherwise, starts up and works ok without looping -> testbridge-restart - works - fe4c679.txt

testbridge-setup - loops - bf0b45a.txt
testbridge-restart - loops - bf0b45a.txt
testbridge-restart - works - fe4c679.txt

This clearly shows that the issue cannot be stale persistent data (there is none in the first run, and still the resubscription loop happens). I hope this helps to narrow down the cause for this problem. Just let me know when you need more specific information.

@woody-apple
Copy link
Contributor

Adding cert blocker so I can track this for now.

@cjandhyala cjandhyala added this to the 1.2 milestone Aug 14, 2023
@lpbeliveau-silabs
Copy link
Contributor

lpbeliveau-silabs commented Aug 14, 2023

@plan44 Thank you for providing the logs, there clearly seems to have some chunking related problems. This might be due to a change in behavior where in the previous implementation, chunking would result in cancelling the subscription refresh timer without restarting it were-as now it doesn't cancel it, which might be why we see this problem in your case as your subscription resumption takes longer than the max interval to expire, which might lead to a timer generating a report ahead of the subscription being completed.

Weirdly enough, the functioning subscription resumption ends in two failures to establish a case session:

[2023-08-11 19:27:28.213 E] 🔴 CHIP:DMG ➡️ Failed to establish CASE for subscription-resumption with error 'src/protocols/secure_channel/CASESession.cpp:553: CHIP Error 0x00000032: Timeout'
[2023-08-11 19:27:28.238 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [AwaitingDestruction]
[2023-08-11 19:27:28.242 E] 🔴 CHIP:DMG ➡️ Failed to establish CASE for subscription-resumption with error 'src/protocols/secure_channel/CASESession.cpp:553: CHIP Error 0x00000032: Timeout'
[2023-08-11 19:27:28.267 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [AwaitingDestruction]
[2023-08-11 19:27:29.639 N] ✅ CHIP:DMG ➡️ Refresh subscribe timer sync after 2 seconds
[2023-08-11 19:27:29.640 D] 🛠️ CHIP:DMG ➡️ Building Reports for ReadHandler with LastReportGeneration = 53 DirtyGeneration = 0
[2023-08-11 19:27:29.644 D] 🛠️ CHIP:DMG ➡️ Fetched 0 events
[2023-08-11 19:27:29.645 D] 🛠️ CHIP:DMG ➡️ Sending report (payload has 11 bytes)...

were-as the looping one only has one failure at that point:

[2023-08-11 19:23:29.704 E] 🔴 CHIP:DMG ➡️ Failed to establish CASE for subscription-resumption with error 'src/protocols/secure_channel/CASESession.cpp:553: CHIP Error 0x00000032: Timeout'
[2023-08-11 19:23:29.728 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [AwaitingDestruction]
[2023-08-11 19:23:33.418 N] ✅ CHIP:EM ➡️ >>> [E:16023r S:0 M:133719369] (U) Msg RX from 0:625A153B6942F5F7 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[2023-08-11 19:23:33.419 D] 🛠️ CHIP:EM ➡️ Handling via exchange: 16023r, Delegate: 0x6dece0
[2023-08-11 19:23:33.421 N] ✅ CHIP:IN ➡️ CASE Server received Sigma1 message . Starting handshake. EC 0x77e46df0
[2023-08-11 19:23:33.422 N] ✅ CHIP:SC ➡️ Received Sigma1 msg
[2023-08-11 19:23:33.423 D] 🛠️ CHIP:SC ➡️ Peer assigned session key ID 62236
[2023-08-11 19:23:33.427 N] ✅ CHIP:EM ➡️ <<< [E:16023r S:0 M:7425603 (Ack:133719369)] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:33 (SecureChannel:CASE_Sigma2Resume)
[2023-08-11 19:23:33.429 N] ✅ CHIP:IN ➡️ (U) Sending msg 7425603 to IP address 'UDP:192.168.59.123%eth0:49437'
[2023-08-11 19:23:33.430 D] 🛠️ CHIP:SC ➡️ Sent Sigma2Resume msg
[2023-08-11 19:23:33.441 N] ✅ CHIP:EM ➡️ >>> [E:16023r S:0 M:133719370 (Ack:7425603)] (U) Msg RX from 0:625A153B6942F5F7 [0000] --- Type 0000:40 (SecureChannel:StatusReport)
[2023-08-11 19:23:33.442 D] 🛠️ CHIP:EM ➡️ Found matching exchange: 16023r, Delegate: 0x6decfc

And proceeds to receive another sigma 1.

@lpbeliveau-silabs
Copy link
Contributor

lpbeliveau-silabs commented Aug 15, 2023

I have raised my number of devices on the bridge, generated some chunking on commissioning and re-subscription and I am still unable to reproduce the issue.

Would it be possible for you to provide the code / executable or at least the endpoint/cluster configurations of your app so I can try to reproduce it more reliably locally?

Also, what is the version of your apple tv? I am using tvOS 16.6.

I have some hypothesis as to what is happening, it seems the new ReportScheduler ends up not sending a report at max interval once the subscription is done, which might be why the appleTV keeps sending sigma1 over and over, but I cannot reproduce it locally so it is hard to figure out exactly why this is happening in this specific case.

On your side, could you try reproducing the issue using chiptool? If this fails with the same intervals, could you try increasing the intervals to5s or 10s to see if it confirms the timing hypothesis?

@plan44
Copy link
Contributor Author

plan44 commented Aug 16, 2023

Would it be possible for you to provide the code / executable [...]

The bridge project is open source, http://github.com/plan44/p44mbrd.

I don't know if that helps, it has evolved from the original bridge-app quite a bit.

I am fully aware that this is hard to debug, and also that the problem is most likely caused by my particular setup of the overall app, which certainly has slightly different timing than the original bridge app had. In particular, the setup is strictly single-threaded which the standard linux examples are not (these need a separate thread in SystemLayerImplSelect). The DNS-SD part is using Avahi, not minimal-mDNS.

Still, I cannot see anything actually wrong in that setup, just a bit different from the example.

Can you imagine anything in your report engine changes that would rely on a specific order of events being processed, that might not be possible in a strictly single-threaded setup?

or at least the endpoint/cluster configurations of your app so I can try to reproduce it more reliably locally?

I just did the same A-B test on a different hardware platform (RaspberryPi B, vs MediaTek MT7688 before) with a reduced number of devices (just root EP0, bridge EP1 and two "Extended Color Light" (0x010d) EPs).

The A-B behaviour is exactly the same.

So I doubt it is something caused only by a particular EP configuration only.

Also, what is the version of your apple tv? I am using tvOS 16.6.

Same here: tvOS 16.6 (20M73)

Let me know if you have ideas what I could do to help narrow this down.

@plan44
Copy link
Contributor Author

plan44 commented Aug 17, 2023

I did some more digging in the Rpi A-B logs and a new log from the original bridge app I built from bf0b45a.

  • In the failing case, the first regular IM:ReportData scheduled after the Subscription Interaction has concluded, fires much too late (7 seconds rather than the intended max 2 seconds later).

  • In the working case (bridge-app example or my app built from fe4c679, the report is sent in time. This seems to make the difference.

  • Because of that delay, the AppleTV considers the subscription failed, and apparently the SecureSession too, and has already restarted with a new session/subscription by the time the original ReadHandler finally fires and generates a report for the now stale subscription.

  • AppleTV responds with INVALID_SUBSCRIPTION to the delayed report because it has already given up on that subscription.

  • The new Subscription Interaction concludes successfully, but then the first regular report (due 2 secs later) gets delayed again so much that the AppleTV gives up, etc. etc.

So the question is why that report gets delayed so much in the bf0b45a RE, and not in the fe4c679 one. I'm trying to track this down here right now, as it is most likely somehow related to my particular setup and don't want to waste more of your time before I understand that detail.

Still, I wonder, on a larger scale, if handling a relatively minor report delay every time with a quite drastic and resource intensive (flash wear) SecureSession restart in 5 second intervals forever is a good idea. Is this mandated by the specs (did not find it so far) or just Apple's choice of implementation?

plan44 added a commit to plan44/connectedhomeip that referenced this issue Aug 17, 2023
…roject-chip#28434)

This fixes a ReportEngine problem that was caused by libev-based timers
firing slightly (in the range of 20mS) too early, because libev by
default uses the time when events started processing as the "now"
reference for relative timers for efficiency reasons.

To ensure timers cannot fire early, timer setup must compensate for
any difference between ev_now() which is the time events started
processing and ev_time(), which is the actual current time (but is
a bit less efficient to obtain).
@plan44
Copy link
Contributor Author

plan44 commented Aug 17, 2023

@lpbeliveau-silabs I found the problem, finally 😅!

All of it was caused by the report timer firing a few milliseconds too early.

Yes, it had to do with my setup, which is that it uses a libev mainloop and timers (similar to how Darwin uses dispatch). So my fault, basically 😞

Here's what happened (log from the Rpi bf0b45a build, with some extra instrumentation log lines (‼️ marked):

[2023-08-17 22:31:42.477 N] ✅ CHIP:EM  ➡️ <<< [E:5300r S:36346 M:90461106 (Ack:108617764)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:05 (IM:ReportData)
[2023-08-17 22:31:42.482 N] ✅ CHIP:IN  ➡️ (S) Sending msg 90461106 on secure session with LSID: 36346
[2023-08-17 22:31:42.485 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [AwaitingReportResponse]
[2023-08-17 22:31:42.487 D] 🛠️ CHIP:DMG ➡️ <RE> ReportsInFlight = 1 with readHandler 12, RE has no more messages
[2023-08-17 22:31:42.489 D] 🛠️ CHIP:DMG ➡️ All ReadHandler-s are clean, clear GlobalDirtySet
[2023-08-17 22:31:42.545 N] ✅ CHIP:EM  ➡️ >>> [E:5300r S:36346 M:108617765 (Ack:90461106)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0001:01 (IM:StatusResponse)
[2023-08-17 22:31:42.547 D] 🛠️ CHIP:EM  ➡️ Found matching exchange: 5300r, Delegate: 0xb6bba0fc
[2023-08-17 22:31:42.548 D] 🛠️ CHIP:EM  ➡️ Rxd Ack; Removing MessageCounter:90461106 from Retrans Table on exchange 5300r
[2023-08-17 22:31:42.550 N] ✅ CHIP:IM  ➡️ Received status response, status is 0x00 (SUCCESS)
[2023-08-17 22:31:42.553 N] ✅ CHIP:EM  ➡️ <<< [E:5300r S:36346 M:90461107 (Ack:108617765)] (S) Msg TX to 1:146752179F0AE79A [B49B] --- Type 0001:04 (IM:SubscribeResponse)
[2023-08-17 22:31:42.555 N] ✅ CHIP:IN  ➡️ (S) Sending msg 90461107 on secure session with LSID: 36346
[2023-08-17 22:31:42.561 N] ✅ CHIP:DMG ➡️ Registered a ReadHandler that will schedule a report between system Timestamp: 525396849 and system Timestamp 525398849.
[2023-08-17 22:31:42.564 D] 🛠️ CHIP:DMG ➡️ <RE> OnReportConfirm: NumReports = 0
[2023-08-17 22:31:42.565 D] 🛠️ CHIP:DMG ➡️ IM RH moving to [CanStartReporting]
[2023-08-17 22:31:42.567 N] ✅ CHIP:DMG ➡️ ‼️ ReadHandlerNode::IsReportableNow()=0 : canstart=1, dirty=0, now=525396855, mMinTimestamp=525396849 (-6), mMaxTimestamp=525398849 (+1994), mSyncTimestamp=525398849 (+1994)
[2023-08-17 22:31:42.570 N] ✅ CHIP:DMG ➡️ ‼️ Starting Report Timer for node/context = 0x0xb6b40a10 with timeout = 1994
[2023-08-17 22:31:42.573 N] ✅ CHIP:EM  ➡️ >>> [E:5300r S:36346 M:108617766 (Ack:90461107)] (S) Msg RX from 1:146752179F0AE79A [B49B] --- Type 0000:10 (SecureChannel:StandaloneAck)
[2023-08-17 22:31:42.574 D] 🛠️ CHIP:EM  ➡️ Found matching exchange: 5300r, Delegate: 0
[2023-08-17 22:31:42.576 D] 🛠️ CHIP:EM  ➡️ Rxd Ack; Removing MessageCounter:90461107 from Retrans Table on exchange 5300r
[2023-08-17 22:31:43.934 N] ✅ CHIP:DL  ➡️ Re-trying resolve
[2023-08-17 22:31:44.540 N] ✅ CHIP:DMG ➡️ ‼️ TimerFired() called on node = 0x0xb6b40a10
[2023-08-17 22:31:44.540 N] ✅ CHIP:DMG ➡️ ‼️ ReadHandlerNode::IsReportableNow()=0 : canstart=1, dirty=0, now=525398828, mMinTimestamp=525396849 (-1979), mMaxTimestamp=525398849 (+21), mSyncTimestamp=525398849 (+21)
[2023-08-17 22:31:44.541 D] 🛠️ CHIP:DMG ➡️ All ReadHandler-s are clean, clear GlobalDirtySet
[2023-08-17 22:31:48.947 N] ✅ CHIP:DL  ➡️ Re-trying resolve
[2023-08-17 22:31:49.004 N] ✅ CHIP:EM  ➡️ >>> [E:5301r S:0 M:202841598] (U) Msg RX from 0:FF325B00396BEFBC [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[2023-08-17 22:31:49.004 D] 🛠️ CHIP:EM  ➡️ Handling via exchange: 5301r, Delegate: 0x24a9e0
[2023-08-17 22:31:49.005 N] ✅ CHIP:IN  ➡️ CASE Server received Sigma1 message . Starting handshake. EC 0xb6cddd20
  • the timeout is calculated to be 1994 ("Starting Report Timer for node/context...")
  • but the timer actually fires 21mS earlier already ("ReadHandlerNode::IsReportableNow()=0 ... mMaxTimestamp=525398849 (+21))
  • which causes the first incremental report not to be sent, and no new timer set up.
  • so nothing happens until the AppleTV considers the subscription dead, and restarts with a new subscription.
  • at that time, ReadHandlerNode::IsReportableNow() of the still pending readhandler is called again, and generates the report that should have been triggered by the timer, many seconds too late.

So it boils down to the ReportEngine absolutely relying on timers NEVER firing even only a bit too early. Which is a reasonable assumption, and apparently true for other timer implementations, but not libev's.

As adding libev was my contribution (#24232), I should have realized that. My bad, sorry.
I just posted a fix: #28740

On the other hand, with timers always comes the question of tolerance. If implemenations may rely on absolutely none in the negative direction, it maybe should be stated in the header for System::Layer::StartTimer().

plan44 added a commit to plan44/connectedhomeip that referenced this issue Aug 17, 2023
…ly too early (project-chip#28434)

This fixes a ReportEngine problem that was caused by libev-based timers
firing slightly (in the range of 20mS) too early, because libev by
default uses the time when events started processing as the "now"
reference for relative timers for efficiency reasons.

To ensure timers cannot fire early, timer setup must compensate for
any difference between ev_now() which is the time events started
processing and ev_time(), which is the actual current time (but is
a bit less efficient to obtain).

# Conflicts:
#	src/system/SystemLayerImplSelect.cpp
@plan44 plan44 closed this as completed Aug 18, 2023
plan44 added a commit to plan44/connectedhomeip that referenced this issue Aug 22, 2023
…roject-chip#28434)

This fixes a ReportEngine problem that was caused by libev based timers
firing slightly (in the range of 20mS) too early, because libev by
default uses the time when events started processing as the "now"
reference for relative timers for efficiency reasons.

To ensure timers cannot fire early, timer setup must compensate for
any difference between ev_now() which is the time events started
processing and ev_time(), which is the actual current time (but is
a bit less efficient to obtain).

# Conflicts:
#	src/system/SystemLayerImplSelect.cpp
plan44 added a commit to plan44/connectedhomeip that referenced this issue Aug 22, 2023
…ly too early (project-chip#28434)

This fixes a ReportEngine problem that was caused by libev-based timers
firing slightly (in the range of 20mS) too early, because libev by
default uses the time when events started processing as the "now"
reference for relative timers for efficiency reasons.

To ensure timers cannot fire early, timer setup must compensate for
any difference between ev_now() which is the time events started
processing and ev_time(), which is the actual current time (but is
a bit less efficient to obtain).

# Conflicts:
#	src/system/SystemLayerImplSelect.cpp
@bzbarsky-apple
Copy link
Contributor

which causes the first incremental report not to be sent, and no new timer set up.

That is definitely a bug in the scheduler.

Which is a reasonable assumption

No, it's not. This assumption definitely does not hold on some platforms. Filed #28929. This is a must-fix for 1.2, @lpbeliveau-silabs

plan44 added a commit to plan44/connectedhomeip that referenced this issue Aug 29, 2023
…roject-chip#28434)

This fixes a ReportEngine problem that was caused by libev based timers
firing slightly (in the range of 20mS) too early, because libev by
default uses the time when events started processing as the "now"
reference for relative timers for efficiency reasons.

To ensure timers cannot fire early, timer setup must compensate for
any difference between ev_now() which is the time events started
processing and ev_time(), which is the actual current time (but is
a bit less efficient to obtain).

# Conflicts:
#	src/system/SystemLayerImplSelect.cpp
mergify bot pushed a commit that referenced this issue Aug 29, 2023
)

* SystemLayerImplSelect: libev: avoid timers firing slightly too early (#28434)

This fixes a ReportEngine problem that was caused by libev based timers
firing slightly (in the range of 20mS) too early, because libev by
default uses the time when events started processing as the "now"
reference for relative timers for efficiency reasons.

To ensure timers cannot fire early, timer setup must compensate for
any difference between ev_now() which is the time events started
processing and ev_time(), which is the actual current time (but is
a bit less efficient to obtain).

# Conflicts:
#	src/system/SystemLayerImplSelect.cpp

* SystemLayerImplSelect: libev: eliminate I/O wakeup thread

libev based regular builds do not need a separate I/O wakeup thread, so
the wakeup thread can be eliminated in CHIP_SYSTEM_CONFIG_USE_LIBEV case.

In normal operation, libev based builds also never call Signal(), so if it
is still called it now emits a error log message, to indicate something might
be wrong in the setup.

However we keep Signal() and related LayerSocketsLoop methods, following the
Darwin dispatch implementation, as fallback to select-based event handling
seems to be needed for some I/O tests.

As noted in a comment to the original libev PR [1], eventually, the select() based
mainloop and external mainloop based solutions like Darwin Dispatch and libev
should be detangled and extracted into separate classes, adapting all the tests
that somehow rely on the select() fallback. As a single self-funded
developer however, I cannot possibly be expected to solve this for Apple ;-)

[1] #24232 (review)

* SystemLayerImplSelect: reword comment: early firing timers can happen

- the fix prevents them in normal libev case
- however the caller MUST NOT rely on timers *never* firing a bit early
HunsupJung pushed a commit to HunsupJung/connectedhomeip that referenced this issue Oct 23, 2023
…#28434) (project-chip#28740)

* SystemLayerImplSelect: libev: avoid timers firing slightly too early (project-chip#28434)

This fixes a ReportEngine problem that was caused by libev based timers
firing slightly (in the range of 20mS) too early, because libev by
default uses the time when events started processing as the "now"
reference for relative timers for efficiency reasons.

To ensure timers cannot fire early, timer setup must compensate for
any difference between ev_now() which is the time events started
processing and ev_time(), which is the actual current time (but is
a bit less efficient to obtain).

# Conflicts:
#	src/system/SystemLayerImplSelect.cpp

* SystemLayerImplSelect: libev: eliminate I/O wakeup thread

libev based regular builds do not need a separate I/O wakeup thread, so
the wakeup thread can be eliminated in CHIP_SYSTEM_CONFIG_USE_LIBEV case.

In normal operation, libev based builds also never call Signal(), so if it
is still called it now emits a error log message, to indicate something might
be wrong in the setup.

However we keep Signal() and related LayerSocketsLoop methods, following the
Darwin dispatch implementation, as fallback to select-based event handling
seems to be needed for some I/O tests.

As noted in a comment to the original libev PR [1], eventually, the select() based
mainloop and external mainloop based solutions like Darwin Dispatch and libev
should be detangled and extracted into separate classes, adapting all the tests
that somehow rely on the select() fallback. As a single self-funded
developer however, I cannot possibly be expected to solve this for Apple ;-)

[1] project-chip#24232 (review)

* SystemLayerImplSelect: reword comment: early firing timers can happen

- the fix prevents them in normal libev case
- however the caller MUST NOT rely on timers *never* firing a bit early
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cert blocker needs triage spec Mismatch between spec and implementation v1.2
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants