Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Matter.framework] When the device is entering suspended state and at… #35973

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

vivien-apple
Copy link
Contributor

… the same time _ensureSubscriptionForExistingDelegates is called with an error from getting a session it crashes with _os_unfair_lock_recursive_abort since it is called on the same thread and the error handling code also try to get a lock

Problem

While attempting to reproduce a specific memory leak with Darwin-framework-tool, I encountered a crash caused by a recursive lock issue.

In essence, this occurs because, when the tool is suspended, directlyGetSessionForNode returns synchronously. However, the code that triggered this behavior already held the lock, and the code managing subscription errors also attempts to acquire the same lock.

When this happens, the thread states are as follows:

thread #1, queue = 'com.apple.main-thread'
    frame #0: 0x000000018c3a3c7c libsystem_kernel.dylib`__ulock_wait2 + 8
    frame #1: 0x000000018c409e4c libsystem_platform.dylib`_os_unfair_lock_lock_slow + 180
    frame #2: 0x00000001056849a4 Matter`std::__1::lock_guard<os_unfair_lock_s>::lock_guard(this=0x000000016fdfd968, lock=0x0000000137f04e48) at MTRUnfairLock.h:29:64
    frame #3: 0x000000010568089c Matter`std::__1::lock_guard<os_unfair_lock_s>::lock_guard(this=0x000000016fdfd968, lock=0x0000000137f04e48) at MTRUnfairLock.h:29:62
    frame #4: 0x000000010647c658 Matter`-[MTRDevice_Concrete controllerSuspended](self=0x0000000137f04e40, _cmd="controllerSuspended") at MTRDevice_Concrete.mm:4132:21
    frame #5: 0x00000001059e6390 Matter`-[MTRDeviceController suspend](self=0x0000000148033c00, _cmd="suspend") at MTRDeviceController.mm:256:9
    frame #6: 0x000000010007e130 darwin-framework-tool`CHIPCommandBridge::MaybeSetUpStack(this=0x0000000137e20c20) at CHIPCommandBridge.mm:249:5
    frame #7: 0x000000010007d134 darwin-framework-tool`CHIPCommandBridge::Run(this=0x0000000137e20c20) at CHIPCommandBridge.mm:101:5
    frame #8: 0x000000010003ce44 darwin-framework-tool`Commands::RunCommand(this=0x000000016fdfea90, argc=10, argv=0x000000016fdfed20, interactive=false, interactiveStorageDirectory=0x000000016fdfe938, interactiveAdvertiseOperational=false) at Commands.cpp:331:21
    frame #9: 0x000000010003bf18 darwin-framework-tool`Commands::Run(this=0x000000016fdfea90, argc=10, argv=0x000000016fdfed20) at Commands.cpp:178:11
    frame #10: 0x000000010016e89c darwin-framework-tool`main(argc=10, argv=0x000000016fdfed20) at main.mm:55:29
    frame #11: 0x000000018c054274 dyld`start + 2840

* thread #2, queue = 'org.csa-iot.matter.workqueue', stop reason = EXC_BREAKPOINT (code=1, subcode=0x18c40f04c)
    frame #0: 0x000000018c40f04c libsystem_platform.dylib`_os_unfair_lock_recursive_abort + 36
    frame #1: 0x000000018c409ecc libsystem_platform.dylib`_os_unfair_lock_lock_slow + 308
    frame #2: 0x00000001056849a4 Matter`std::__1::lock_guard<os_unfair_lock_s>::lock_guard(this=0x000000016fe85b60, lock=0x0000000137f04e48) at MTRUnfairLock.h:29:64
    frame #3: 0x000000010568089c Matter`std::__1::lock_guard<os_unfair_lock_s>::lock_guard(this=0x000000016fe85b60, lock=0x0000000137f04e48) at MTRUnfairLock.h:29:62
    frame #4: 0x000000010645abcc Matter`-[MTRDevice_Concrete _handleSubscriptionError:](self=0x0000000137f04e40, _cmd="_handleSubscriptionError:", error=domain: "MTRErrorDomain" - code: 6) at MTRDevice_Concrete.mm:1082:21
  * frame #5: 0x0000000106467b70 Matter`__51-[MTRDevice_Concrete _setupSubscriptionWithReason:]_block_invoke.214(.block_descriptor=0x000000016fe86590, exchangeManager=0x0000000000000000, session=0x000000016fe86380, error=domain: "MTRErrorDomain" - code: 6, retryDelay=0x0000000000000000) at MTRDevice_Concrete.mm:2400:32
    frame #6: 0x0000000106228ccc Matter`-[MTRDeviceController_Concrete directlyGetSessionForNode:completion:](self=0x0000000148033c00, _cmd="directlyGetSessionForNode:completion:", nodeID=305414945, completion=0x00000001064679c8) at MTRDeviceController_Concrete.mm:1421:9
    frame #7: 0x00000001064674ec Matter`-[MTRDevice_Concrete _setupSubscriptionWithReason:](self=0x0000000137f04e40, _cmd="_setupSubscriptionWithReason:", reason="delegate is set and scheduled subscription is happening") at MTRDevice_Concrete.mm:2393:5
    frame #8: 0x0000000106458b80 Matter`__62-[MTRDevice_Concrete _ensureSubscriptionForExistingDelegates:]_block_invoke_2(.block_descriptor=0x0000600003e2cf60) at MTRDevice_Concrete.mm:763:21
    frame #9: 0x0000000106229f5c Matter`__72-[MTRDeviceController_Concrete asyncDispatchToMatterQueue:errorHandler:]_block_invoke(.block_descriptor=0x0000600003e2cf30, (null)=0x0000000148037e00) at MTRDeviceController_Concrete.mm:1513:9
    frame #10: 0x0000000106229d54 Matter`__79-[MTRDeviceController_Concrete asyncGetCommissionerOnMatterQueue:errorHandler:]_block_invoke(.block_descriptor=0x000060000252c600) at MTRDeviceController_Concrete.mm:1506:9
    frame #11: 0x000000018c2228f8 libdispatch.dylib`_dispatch_call_block_and_release + 32
    frame #12: 0x000000018c224658 libdispatch.dylib`_dispatch_client_callout + 20
    frame #13: 0x000000018c22bc60 libdispatch.dylib`_dispatch_lane_serial_drain + 744
    frame #14: 0x000000018c22c79c libdispatch.dylib`_dispatch_lane_invoke + 432
    frame #15: 0x000000018c2377e8 libdispatch.dylib`_dispatch_root_queue_drain_deferred_wlh + 288
    frame #16: 0x000000018c237034 libdispatch.dylib`_dispatch_workloop_worker_thread + 540
    frame #17: 0x000000018c3d33d8 libsystem_pthread.dylib`_pthread_wqthread + 288

… the same time _ensureSubscriptionForExistingDelegates is called with an error from getting a session it crashes with _os_unfair_lock_recursive_abort since it is called on the same thread and the error handling code also try to get a lock
@vivien-apple vivien-apple self-assigned this Oct 8, 2024
@vivien-apple vivien-apple requested a review from a team as a code owner October 8, 2024 17:34
Copy link

Review changes with SemanticDiff.

Copy link

github-actions bot commented Oct 8, 2024

PR #35973: Size comparison from 0a2e58d to 491bcb7

Full report (88 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, efr32, esp32, linux, nrfconnect, nxp, psoc6, qpg, stm32, telink, tizen)
platform target config section 0a2e58d 491bcb7 change % change
bl602 lighting-app bl602 FLASH 1280606 1280606 0 0.0
RAM 95808 95808 0 0.0
bl602+mfd FLASH 1308654 1308654 0 0.0
RAM 95464 95464 0 0.0
bl602+rpc FLASH 1335282 1335282 0 0.0
RAM 103784 103784 0 0.0
bl702 lighting-app bl702 FLASH 945032 945032 0 0.0
RAM 15105 15105 0 0.0
bl702+mfd FLASH 957486 957486 0 0.0
RAM 14769 14769 0 0.0
bl702+rpc FLASH 1051190 1051190 0 0.0
RAM 23661 23661 0 0.0
bl706-eth FLASH 658164 658164 0 0.0
RAM 24745 24745 0 0.0
bl706-wifi FLASH 824960 824960 0 0.0
RAM 13965 13965 0 0.0
bl702l lighting-app bl702l FLASH 961968 961968 0 0.0
RAM 16804 16804 0 0.0
bl702l+mfd FLASH 974730 974730 0 0.0
RAM 16468 16468 0 0.0
cc13x4_26x4 lighting-app LP_EM_CC1354P10_6 FLASH 829332 829332 0 0.0
RAM 123292 123292 0 0.0
lock-ftd LP_EM_CC1354P10_6 FLASH 814656 814656 0 0.0
RAM 125172 125172 0 0.0
pump-app LP_EM_CC1354P10_6 FLASH 761220 761220 0 0.0
RAM 113664 113664 0 0.0
pump-controller-app LP_EM_CC1354P10_6 FLASH 745448 745448 0 0.0
RAM 113856 113856 0 0.0
cc32xx air-purifier CC3235SF_LAUNCHXL FLASH 616954 616954 0 0.0
RAM 205396 205396 0 0.0
lock CC3235SF_LAUNCHXL FLASH 657058 657058 0 0.0
RAM 205548 205548 0 0.0
cyw30739 light CYW30739B2-P5-EVK-01 unknown 2040 2040 0 0.0
FLASH 678301 678301 0 0.0
RAM 78508 78508 0 0.0
CYW30739B2-P5-EVK-02 unknown 2040 2040 0 0.0
FLASH 698153 698153 0 0.0
RAM 81140 81140 0 0.0
CYW30739B2-P5-EVK-03 unknown 2040 2040 0 0.0
FLASH 698153 698153 0 0.0
RAM 81140 81140 0 0.0
CYW930739M2EVB-02 unknown 2040 2040 0 0.0
FLASH 655081 655081 0 0.0
RAM 73576 73576 0 0.0
light-switch CYW30739B2-P5-EVK-01 unknown 2040 2040 0 0.0
FLASH 614637 614637 0 0.0
RAM 71468 71468 0 0.0
CYW30739B2-P5-EVK-02 unknown 2040 2040 0 0.0
FLASH 634273 634273 0 0.0
RAM 74020 74020 0 0.0
CYW30739B2-P5-EVK-03 unknown 2040 2040 0 0.0
FLASH 634273 634273 0 0.0
RAM 74020 74020 0 0.0
lock CYW30739B2-P5-EVK-01 unknown 2040 2040 0 0.0
FLASH 633949 633949 0 0.0
RAM 74516 74516 0 0.0
CYW30739B2-P5-EVK-02 unknown 2040 2040 0 0.0
FLASH 653665 653665 0 0.0
RAM 77068 77068 0 0.0
CYW30739B2-P5-EVK-03 unknown 2040 2040 0 0.0
FLASH 653665 653665 0 0.0
RAM 77068 77068 0 0.0
thermostat CYW30739B2-P5-EVK-01 unknown 2040 2040 0 0.0
FLASH 609253 609253 0 0.0
RAM 68604 68604 0 0.0
CYW30739B2-P5-EVK-02 unknown 2040 2040 0 0.0
FLASH 629113 629113 0 0.0
RAM 71236 71236 0 0.0
CYW30739B2-P5-EVK-03 unknown 2040 2040 0 0.0
FLASH 629113 629113 0 0.0
RAM 71236 71236 0 0.0
efr32 lighting-app BRD4187C FLASH 933860 933860 0 0.0
RAM 135876 135876 0 0.0
lock-app BRD2605a FLASH 740248 740240 -8 -0.0
RAM 230240 230240 0 0.0
BRD4338a FLASH 740896 740888 -8 -0.0
RAM 230272 230272 0 0.0
window-app BRD4187C FLASH 1017264 1017264 0 0.0
RAM 127804 127804 0 0.0
esp32 all-clusters-app c3devkit DRAM 94080 94080 0 0.0
FLASH 1538432 1538432 0 0.0
IRAM 82538 82538 0 0.0
m5stack DRAM 115032 115032 0 0.0
FLASH 1548898 1548898 0 0.0
IRAM 117039 117039 0 0.0
linux air-purifier-app debug unknown 4688 4688 0 0.0
FLASH 2780049 2780049 0 0.0
RAM 129488 129488 0 0.0
all-clusters-app debug unknown 5528 5528 0 0.0
FLASH 6084046 6084046 0 0.0
RAM 512416 512416 0 0.0
all-clusters-minimal-app debug unknown 5424 5424 0 0.0
FLASH 5417296 5417296 0 0.0
RAM 241200 241200 0 0.0
bridge-app debug unknown 5408 5408 0 0.0
FLASH 4750770 4750770 0 0.0
RAM 218384 218384 0 0.0
chip-tool debug unknown 5960 5960 0 0.0
FLASH 12673576 12673576 0 0.0
RAM 563410 563410 0 0.0
chip-tool-ipv6only arm64 unknown 20736 20736 0 0.0
FLASH 11307312 11307312 0 0.0
RAM 613728 613728 0 0.0
fabric-admin debug unknown 5792 5792 0 0.0
FLASH 11019545 11019545 0 0.0
RAM 563066 563066 0 0.0
fabric-bridge-app debug unknown 4632 4632 0 0.0
FLASH 4573966 4573966 0 0.0
RAM 205048 205048 0 0.0
lighting-app debug+rpc+ui unknown 6056 6056 0 0.0
FLASH 5691361 5691361 0 0.0
RAM 228456 228456 0 0.0
lock-app debug unknown 5344 5344 0 0.0
FLASH 4800332 4800332 0 0.0
RAM 204472 204472 0 0.0
ota-provider-app debug unknown 4720 4720 0 0.0
FLASH 4429756 4429756 0 0.0
RAM 198192 198192 0 0.0
ota-requestor-app debug unknown 4656 4656 0 0.0
FLASH 4568570 4568570 0 0.0
RAM 202760 202760 0 0.0
shell debug unknown 4216 4216 0 0.0
FLASH 3110109 3110109 0 0.0
RAM 159280 159280 0 0.0
thermostat-no-ble arm64 unknown 9448 9448 0 0.0
FLASH 4318336 4318336 0 0.0
RAM 242888 242888 0 0.0
tv-app debug unknown 5624 5624 0 0.0
FLASH 6028229 6028229 0 0.0
RAM 585952 585952 0 0.0
tv-casting-app debug unknown 5208 5208 0 0.0
FLASH 10870845 10870845 0 0.0
RAM 651888 651888 0 0.0
nrfconnect all-clusters-app nrf52840dk_nrf52840 FLASH 914236 914236 0 0.0
RAM 142199 142199 0 0.0
nrf7002dk_nrf5340_cpuapp FLASH 885036 885036 0 0.0
RAM 140338 140338 0 0.0
all-clusters-minimal-app nrf52840dk_nrf52840 FLASH 847652 847652 0 0.0
RAM 141093 141093 0 0.0
nxp contact k32w0+release FLASH 581992 581992 0 0.0
RAM 70784 70784 0 0.0
k32w1+release FLASH 596696 596696 0 0.0
RAM 63008 63008 0 0.0
mcxw71+release FLASH 596456 596456 0 0.0
RAM 63008 63008 0 0.0
light k32w0+release FLASH 618556 618556 0 0.0
RAM 70256 70256 0 0.0
k32w1+release FLASH 682696 682696 0 0.0
RAM 48648 48648 0 0.0
mcxw71+release FLASH 682696 682696 0 0.0
RAM 48648 48648 0 0.0
lock k32w1+release FLASH 705040 705040 0 0.0
RAM 67148 67148 0 0.0
mcxw71+release FLASH 705048 705048 0 0.0
RAM 67148 67148 0 0.0
psoc6 all-clusters cy8ckit_062s2_43012 FLASH 1646132 1646132 0 0.0
RAM 210888 210888 0 0.0
all-clusters-minimal cy8ckit_062s2_43012 FLASH 1552180 1552180 0 0.0
RAM 207688 207688 0 0.0
light cy8ckit_062s2_43012 FLASH 1467652 1467652 0 0.0
RAM 200688 200688 0 0.0
lock cy8ckit_062s2_43012 FLASH 1464724 1464724 0 0.0
RAM 225048 225048 0 0.0
qpg lighting-app qpg6105+debug FLASH 660328 660328 0 0.0
RAM 105236 105236 0 0.0
lock-app qpg6105+debug FLASH 618340 618340 0 0.0
RAM 99704 99704 0 0.0
stm32 light STM32WB5MM-DK FLASH 481624 481624 0 0.0
RAM 144684 144684 0 0.0
telink air-quality-sensor-app tlsr9528a_retention FLASH 620726 620726 0 0.0
RAM 50488 50488 0 0.0
all-clusters-app tlsr9118bdk40d FLASH 687924 687924 0 0.0
RAM 148328 148328 0 0.0
all-clusters-minimal-app tlsr9528a FLASH 780916 780916 0 0.0
RAM 110280 110280 0 0.0
bridge-app tlsr9258a FLASH 680822 680822 0 0.0
RAM 91144 91144 0 0.0
contact-sensor-app tlsr9528a_retention FLASH 620490 620490 0 0.0
RAM 50440 50440 0 0.0
light-switch-app-ota-shell-factory-data tlsr9528a FLASH 708432 708432 0 0.0
RAM 73780 73780 0 0.0
lighting-app-ota-factory-data tlsr9118bdk40d FLASH 625428 625428 0 0.0
RAM 144308 144308 0 0.0
lighting-app-ota-rpc-factory-data-4mb tlsr9518adk80d FLASH 811188 811188 0 0.0
RAM 98940 98940 0 0.0
lock-app-dfu tlsr9528a FLASH 656396 656396 0 0.0
RAM 66500 66500 0 0.0
ota-requestor-app tlsr9258a FLASH 696800 696800 0 0.0
RAM 90736 90736 0 0.0
pump-app-usb tlsr9518adk80d FLASH 634124 634124 0 0.0
RAM 55316 55316 0 0.0
pump-controller-app tlsr9518adk80d FLASH 611314 611314 0 0.0
RAM 52560 52560 0 0.0
shell tlsr9518adk80d FLASH 467872 467872 0 0.0
RAM 68168 68168 0 0.0
smoke_co_alarm-app tlsr9528a_retention FLASH 627640 627640 0 0.0
RAM 52160 52160 0 0.0
temperature-measurement-app-mars-ota tlsr9518adk80d FLASH 653446 653446 0 0.0
RAM 56108 56108 0 0.0
thermostat tlsr9518adk80d FLASH 638046 638046 0 0.0
RAM 52952 52952 0 0.0
window-covering tlsr9118bdk40d FLASH 524184 524184 0 0.0
RAM 97284 97284 0 0.0
tizen all-clusters-app arm unknown 4904 4904 0 0.0
FLASH 1725884 1725884 0 0.0
RAM 89076 89076 0 0.0
chip-tool-ubsan arm unknown 10456 10456 0 0.0
FLASH 17544418 17544418 0 0.0
RAM 7642364 7642364 0 0.0

if (self.suspended) {
MTR_LOG_ERROR("%@ suspended: can't get session for node %016llX-%016llx (%llu)", self, self.compressedFabricID.unsignedLongLongValue, nodeID, nodeID);
// TODO: Can we do a better error here?
completion(nullptr, chip::NullOptional, [MTRError errorForCHIPErrorCode:CHIP_ERROR_INCORRECT_STATE], nil);
dispatch_async(_chipWorkQueue, ^{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the thing is... asyncGetCommissionerOnMatterQueue can also sync-call completion, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to either fix whoever is calling directlyGetSessionForNode to handle the fact that they might get a sync-call in whatever state they are in to the error handler, or change the API documentation for getSessionForNode and the implementations of both that and directlyGetSessionForNode to guarantee async callbacks.

We used to not be able to use the Matter queue there, because without a running controller we potentially had no running queue. But now that we never shut the queue down (right?) maybe we can consistently deliver the errors on that queue....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

5 participants