Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Failed to reach the CommissioningStageComplete if network is unreachable during commissioning progress #30041

Closed
lion6420 opened this issue Oct 27, 2023 · 6 comments · Fixed by #30068
Labels
bug Something isn't working needs triage

Comments

@lion6420
Copy link

lion6420 commented Oct 27, 2023

Reproduction steps

We use our own matter controller, and are testing our error handling function. Therefore, during the following step, the chip-all-cluster-app will be shutdown and network will be cut off purposely.

  1. Controller (ubuntu PC) connect to raspberry-pi (chip-all-cluster-app) by ethernet
  2. Pairing device by mdns discovery (example command: ./chip-tool pairing onnetwork 1 20202021)
  3. During the commissioning progress, shutdown the chip-all-cluster-app and cut off the ethernet connection purposely
  4. The commission progress will failed and the call back function OnCommissioningComplete is not called.

Commissioning without network cut off: Log without network cutdown.txt

[1698369771.645721][149159:149161] CHIP:CTL: Going from commissioning step 'SendPAICertificateRequest' with lastErr = 'src/app/CommandSender.cpp:263: CHIP Error 0x00000032: Timeout' -> 'Cleanup'
[1698369771.645731][149159:149161] CHIP:CTL: Performing next commissioning step 'Cleanup' with completion status = 'src/app/CommandSender.cpp:263: CHIP Error 0x00000032: Timeout'
[1698369771.645734][149159:149161] CHIP:CTL: Expiring failsafe on proxy 0x7fdd9c0021c0
[1698369771.645757][149159:149161] CHIP:DMG: ICR moving to [AddingComm]
[1698369771.645762][149159:149161] CHIP:DMG: ICR moving to [AddedComma]
[1698369771.645814][149159:149161] CHIP:EM: <<< [E:59831i S:44553 M:10708539] (S) Msg TX to 0:FFFFFFFB00000000 [0000] --- Type 0001:08 (IM:InvokeCommandRequest)
[1698369771.645830][149159:149161] CHIP:IN: (S) Sending msg 10708539 on secure session with LSID: 44553
[1698369771.646094][149159:149161] CHIP:DMG: ICR moving to [CommandSen]
[1698369771.646140][149159:149161] CHIP:DMG: ICR moving to [AwaitingDe]
[1698369772.002093][149159:149161] CHIP:EM: Retransmitting MessageCounter:10708539 on exchange 59831i Send Cnt 1
[1698369772.002208][149159:149161] CHIP:IN: (S) Sending msg 10708539 on secure session with LSID: 44553
[1698369772.387395][149159:149161] CHIP:EM: Retransmitting MessageCounter:10708539 on exchange 59831i Send Cnt 2
[1698369772.387451][149159:149161] CHIP:IN: (S) Sending msg 10708539 on secure session with LSID: 44553
[1698369772.962330][149159:149161] CHIP:EM: Retransmitting MessageCounter:10708539 on exchange 59831i Send Cnt 3
[1698369772.962434][149159:149161] CHIP:IN: (S) Sending msg 10708539 on secure session with LSID: 44553
[1698369773.887160][149159:149161] CHIP:EM: Retransmitting MessageCounter:10708539 on exchange 59831i Send Cnt 4
[1698369773.887228][149159:149161] CHIP:IN: (S) Sending msg 10708539 on secure session with LSID: 44553
[1698369775.571683][149159:149161] CHIP:EM: Failed to Send CHIP MessageCounter:10708539 on exchange 59831i sendCount: 4 max retries: 4
[1698369777.872659][149159:149161] CHIP:DMG: Time out! failed to receive invoke command response from Exchange: 59831i
[1698369777.872707][149159:149161] CHIP:CTL: Received failure response  when disarming failsafesrc/app/CommandSender.cpp:263: CHIP Error 0x00000032: Timeout

Network cut off case: LogNetworkCutdownCase.txt

[1698370306.475366][149323:149325] CHIP:CTL: Failed to perform commissioning step 10
[1698370306.475372][149323:149325] CHIP:CTL: Going from commissioning step 'SendPAICertificateRequest' with lastErr = 'src/app/CommandSender.cpp:263: CHIP Error 0x00000032: Timeout' -> 'Cleanup'
[1698370306.475386][149323:149325] CHIP:CTL: Performing next commissioning step 'Cleanup' with completion status = 'src/app/CommandSender.cpp:263: CHIP Error 0x00000032: Timeout'
[1698370306.475392][149323:149325] CHIP:CTL: Expiring failsafe on proxy 0x7ff8d40021c0
[1698370306.475424][149323:149325] CHIP:DMG: ICR moving to [AddingComm]
[1698370306.475438][149323:149325] CHIP:DMG: ICR moving to [AddedComma]
[1698370306.475507][149323:149325] CHIP:EM: <<< [E:2653i S:49873 M:139577590] (S) Msg TX to 0:FFFFFFFB00000000 [0000] --- Type 0001:08 (IM:InvokeCommandRequest)
[1698370306.475523][149323:149325] CHIP:IN: (S) Sending msg 139577590 on secure session with LSID: 49873
[1698370306.475567][149323:149325] CHIP:DMG: ICR moving to [AwaitingDe]

Is it because the DisArmSafe command failed in DeviceCommissioner::CleanupCommissioning and without handling it?

else
    {
        // If we've failed somewhere in the early stages (or we don't have a failedStage specified), we need to start from the
        // beginning. However, because some of the commands can only be sent once per arm-failsafe, we also need to force a reset on
        // the failsafe so we can start fresh on the next attempt.
        GeneralCommissioning::Commands::ArmFailSafe::Type request;
        request.expiryLengthSeconds = 0; // Expire immediately.
        request.breadcrumb          = 0;
        ChipLogProgress(Controller, "Expiring failsafe on proxy %p", proxy);
        mDeviceBeingCommissioned = proxy;
        // We actually want to do the same thing on success or failure because we're already in a failure state
        SendCommand(proxy, request, OnDisarmFailsafe, OnDisarmFailsafeFailure,
                    /* timeout = */ NullOptional);
    }

Bug prevalence

Whenever I do this

GitHub hash of the SDK that was being used

e059202

Platform

other

Platform Version(s)

No response

Anything else?

No response

@lion6420 lion6420 added bug Something isn't working needs triage labels Oct 27, 2023
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Oct 27, 2023
The commissioning state machine, when handling a state, needs to either queue up
an async process that will eventually push the state machine along or needs to
fail out and stop the commissioning process.  We had a number of state handlers
that could fail to do both if the attempt to send the message failed, which
would leave the commissioner in a stuck state where it thought it was waiting
for the async work, while the async work was not queued.

This change adds the relevant error checks to ensure we never wait for work that
has not started.

Fixes project-chip#30041
@bzbarsky-apple
Copy link
Contributor

Is it because the DisArmSafe command failed in DeviceCommissioner::CleanupCommissioning and without handling it?

Yes, exactly. Good catch. #30068 should fix this; would you be willing to verify that @lion6420?

@lion6420
Copy link
Author

Sure, I am willing to verify that. I can do the verification until I go back to the office on Monday @bzbarsky-apple .

I would like to ask about the ArmSafe. I am not quite understanding about this mechanism. Why can't the controller just delete the commssionee and make the commissioner back to the initial stage?

@bzbarsky-apple
Copy link
Contributor

@lion6420 Yes, nothing here is urgent enough to be worth worrying about over the weekend!

Why can't the controller just delete the commssionee and make the commissioner back to the initial stage?

It can, in terms its own internal state. Clearing the fail-safe is about putting the commissionee device (if it's reachable) into a state where the user can try again (perhaps with a different wireless network, if the issue was that the device could not join the IP network) immediately instead of having to wait for the fail-safe to time out.

@lion6420
Copy link
Author

@bzbarsky-apple Thanks for your reply, have a nice weekend!

@lion6420
Copy link
Author

Hi @bzbarsky-apple
I upgraded the SDK to 02e7647 (Fri Oct 27.), and verified the issue was fixed with your modification.

Controller log: HandleSendingErrorsDuringCommissioing.txt

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Oct 31, 2023
The commissioning state machine, when handling a state, needs to either queue up
an async process that will eventually push the state machine along or needs to
fail out and stop the commissioning process.  We had a number of state handlers
that could fail to do both if the attempt to send the message failed, which
would leave the commissioner in a stuck state where it thought it was waiting
for the async work, while the async work was not queued.

This change adds the relevant error checks to ensure we never wait for work that
has not started.

Fixes project-chip#30041
@bzbarsky-apple
Copy link
Contributor

@lion6420 Thank you for checking that!

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Oct 31, 2023
The commissioning state machine, when handling a state, needs to either queue up
an async process that will eventually push the state machine along or needs to
fail out and stop the commissioning process.  We had a number of state handlers
that could fail to do both if the attempt to send the message failed, which
would leave the commissioner in a stuck state where it thought it was waiting
for the async work, while the async work was not queued.

This change adds the relevant error checks to ensure we never wait for work that
has not started.

Fixes project-chip#30041
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Nov 1, 2023
The commissioning state machine, when handling a state, needs to either queue up
an async process that will eventually push the state machine along or needs to
fail out and stop the commissioning process.  We had a number of state handlers
that could fail to do both if the attempt to send the message failed, which
would leave the commissioner in a stuck state where it thought it was waiting
for the async work, while the async work was not queued.

This change adds the relevant error checks to ensure we never wait for work that
has not started.

Fixes project-chip#30041
@mergify mergify bot closed this as completed in #30068 Nov 1, 2023
mergify bot pushed a commit that referenced this issue Nov 1, 2023
The commissioning state machine, when handling a state, needs to either queue up
an async process that will eventually push the state machine along or needs to
fail out and stop the commissioning process.  We had a number of state handlers
that could fail to do both if the attempt to send the message failed, which
would leave the commissioner in a stuck state where it thought it was waiting
for the async work, while the async work was not queued.

This change adds the relevant error checks to ensure we never wait for work that
has not started.

Fixes #30041
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants