Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'sfputil firmware run' cmd needs better resilience and synchronization with PMON Xcvrd #17615

Open
snider-nokia opened this issue Dec 20, 2023 · 10 comments
Assignees
Labels
MSFT Triaged this issue has been triaged

Comments

@snider-nokia
Copy link
Contributor

Description

When 'sfputil firmware run ' command is invoked it causes the target transceiver module to reset, however Xcvrd is unaware that said operation is taking place so its threads may/will continue to attempt to access the module during the window of time prior to the module being restored to normal operational status.

Steps to reproduce the issue

  1. For the simplest case, ensure the targeted interface has a CMIS module present (which properly supports FW upgrade procedure) and that the applicable interface is admin down (interface can also be set to admin up, which is even more likely to cause the problem to occur).
  2. Issue the 'sfputil firmware run ' command to induce the target transceiver module to reset and begin running its inactive (or active, when supported) firmware load.
  3. Wait for interface to stabilize and then run the command again until such time as the 'sfputil firmware run ' command fails with a traceback.

Describe the results you received

Exactly what is described in paragraph 3 above.

Describe the results you expected

Optimally, transceiver module accesses would/should not fail after this command is issued. But, there is no guarantee that accesses to the module will complete successfully until such time as the module stabilizes post-reset.

Additional information you deem important

Two detailed annotated samples are provided below to show the progression of events involved here.

In the first sample, the 'sfputil firmware run' command appears to work fine and indicates a successful completion status. Even in this case, though, PMON Xcvrd is experiencing failures when simultaneously attempting to access the protagonist module.

In the second sample, the 'sfputil firmware run' command fails with a traceback when it issues a module read operation that doesn't complete successfully and the platform specific code returns value None (as specified by sfp_base.py, and due to the failed access).

PMON Xcvrd threads should not be attempting to access a module that has this command issued to it until such time as the module is understood to be operating normally again (and is prepared to sink accesses). As the transceiver subsystem architecture stands now, these Xcvrd threads may try to provision/de-provision the module datapath, solicit DDM/DOM data, or interact with the module in other ways.

Investigation and sample runs were conducted using Acacia ZR module target with the FW versions shown (at interface Ethernet80):
image

CMIS spec indicates that module behavior during the associated 'Run FW Image' reset is 'vendor and technology dependent', thus there can be no assumption that module can be accessed prior to quiescing post-reset. Spec further indicates that CMD 0041h: 'Firmware Management Features' is (should be) used to query firmware command performance attributes (for example, how long it may take maximally to execute commands).

image

Consistent with the above, the Acacia ZR/ZR+ documentation that we have here states that 'Before issuing/using FW download [including 0109h: Run Image], the host should issue CMD 0041h to familiarize itself with the features supported, and in particular the max timeout values.'

It may be necessary to engage module vendor(s) in order to understand the specific access restrictions in this area (during the associated 'Run Image' reset period), as it would appear that module behavior can/does change with different FW versions and also when contrasting non-hitless with hitless upgrade.

1st annotated sample run:

image

Sfputil firmware run cmd and xcvrd dysfunction sample 1

2nd annotated sample run:

image

Sfputil firmware run cmd and xcvrd dysfunction sample 2

Output of show version

Using 202205 branch...

admin@ixre-egl-board40:~$ show version

SONiC Software Version: SONiC.HEAD.600988-msft-2205-ndk-c854a6a2
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-18-2-amd64
Build commit: c854a6a2
Build date: Mon Dec 18 22:37:27 UTC 2023
Built by: gitlab-runner@sonic-bld2

Platform: x86_64-nokia_ixr7250e_36x400g-r0
HwSKU: Nokia-IXR7250E-36x400G
ASIC: broadcom
ASIC Count: 2
Serial Number: EAG2-02-052
Model Number: N/A
Hardware Revision: 56
Uptime: 18:56:46 up 1 day,  2:49,  1 user,  load average: 1.14, 1.32, 1.36
Date: Wed 20 Dec 2023 18:56:46

Docker images:
REPOSITORY                    TAG                                  IMAGE ID       SIZE
docker-orchagent              HEAD.600988-msft-2205-ndk-c854a6a2   90241935b030   406MB
docker-orchagent              latest                               90241935b030   406MB
docker-fpm-frr                HEAD.600988-msft-2205-ndk-c854a6a2   df6176897cc2   418MB
docker-fpm-frr                latest                               df6176897cc2   418MB
docker-teamd                  HEAD.600988-msft-2205-ndk-c854a6a2   a2d7a8d56c83   389MB
docker-teamd                  latest                               a2d7a8d56c83   389MB
docker-macsec                 latest                               5394fbb21224   391MB
docker-syncd-brcm-dnx         HEAD.600988-msft-2205-ndk-c854a6a2   6353695a531f   718MB
docker-syncd-brcm-dnx         latest                               6353695a531f   718MB
docker-gbsyncd-broncos        HEAD.600988-msft-2205-ndk-c854a6a2   d9ed266637ba   419MB
docker-gbsyncd-broncos        latest                               d9ed266637ba   419MB
docker-gbsyncd-credo          HEAD.600988-msft-2205-ndk-c854a6a2   ca22f0ad248b   392MB
docker-gbsyncd-credo          latest                               ca22f0ad248b   392MB
docker-dhcp-relay             latest                               f4f260277d3f   380MB
docker-snmp                   HEAD.600988-msft-2205-ndk-c854a6a2   21e34abe3852   422MB
docker-snmp                   latest                               21e34abe3852   422MB
docker-platform-monitor       HEAD.600988-msft-2205-ndk-c854a6a2   2a0cbbb6e240   460MB
docker-platform-monitor       latest                               2a0cbbb6e240   460MB
docker-router-advertiser      HEAD.600988-msft-2205-ndk-c854a6a2   6f17ed3df048   372MB
docker-router-advertiser      latest                               6f17ed3df048   372MB
docker-lldp                   HEAD.600988-msft-2205-ndk-c854a6a2   39dcda273ae9   381MB
docker-lldp                   latest                               39dcda273ae9   381MB
docker-mux                    HEAD.600988-msft-2205-ndk-c854a6a2   3c503b426df1   384MB
docker-mux                    latest                               3c503b426df1   384MB
docker-database               HEAD.600988-msft-2205-ndk-c854a6a2   c87f15bd9176   372MB
docker-database               latest                               c87f15bd9176   372MB
docker-sonic-telemetry        HEAD.600988-msft-2205-ndk-c854a6a2   66ddf943a2fe   453MB
docker-sonic-telemetry        latest                               66ddf943a2fe   453MB
docker-nat                    HEAD.600988-msft-2205-ndk-c854a6a2   0ca4b951be64   322MB
docker-nat                    latest                               0ca4b951be64   322MB
docker-sflow                  HEAD.600988-msft-2205-ndk-c854a6a2   13b652bc661e   320MB
docker-sflow                  latest                               13b652bc661e   320MB
docker-sonic-mgmt-framework   HEAD.600988-msft-2205-ndk-c854a6a2   f8f1ca49f557   449MB
docker-sonic-mgmt-framework   latest                               f8f1ca49f557   449MB

Additional comments

  • sfputil should not fail with a traceback when/if platform specific code returns value None (as prescribed by sfp_base.py) from read_eeprom method when a module read operation fails.

  • There is some measure of synchronization warranted with Xcvrd whereby Xcvrd threads are not attempting to access a module which is in parallel having this 109h: Run Image command executed against it (and is thus being reset).

@snider-nokia
Copy link
Contributor Author

@prgeor, @mihirpat1, @judyjoseph, Please have a look at this. This issue is directly related to the investigation conducted in the context of https://github.com/Nokia-ION/ndk/issues/28.

@rlhui
Copy link
Contributor

rlhui commented Dec 24, 2023

@snider-nokia - please create all sonic issues in buildimage repo, thanks.

@rlhui rlhui transferred this issue from sonic-net/sonic-utilities Dec 24, 2023
@prgeor
Copy link
Contributor

prgeor commented Dec 26, 2023

@prgeor, @mihirpat1, @judyjoseph, Please have a look at this. This issue is directly related to the investigation conducted in the context of https://github.com/Nokia-ION/ndk/issues/28.

@snider-nokia i don't have access to ndk issue

@prgeor prgeor added MSFT Triaged this issue has been triaged labels Dec 26, 2023
@prgeor
Copy link
Contributor

prgeor commented Dec 26, 2023

@snider-nokia , do you still see the issue if the interface is in "admin" down? Ideally firmware download should be done after the link is isolated.

@snider-nokia
Copy link
Contributor Author

@snider-nokia , do you still see the issue if the interface is in "admin" down? Ideally firmware download should be done after the link is isolated.

Yes, as indicated in the original writeup, the problem does indeed still occur if the interface is admin down. Xcvrd is still attempting to interact with the associated module even when the interface is admin down.

NDK issue #28 discusses 'sfputil firmware run' command resulting in an explosion and traceback, @prgeor. Hopefully @judyjoseph and/or @mihirpat1 can provide you with a snapshot (or direct access) to that issue...

@prgeor
Copy link
Contributor

prgeor commented Dec 27, 2023

@snider-nokia to isolate this issue from Nokia, this issue should be reproducible on another platform? If so, I will check with Mihir.

As per my understanding CDB command 0109h DOEST not reset the I2C management interface. The reset being talked about is w.r.t the CDB instance. So after 0109h, the CDB instance is busy until the transaction/CMD is complete so during this period of time, host should not do any more CDB transaction, but the i2c management interface is still available for normal DOM polling.

image

To rule out this is Nokia specific, I think I can use the same Acaacia 400G ZR module and do multiple i2c reads of pages during firmware commit. Will update here.

@prgeor
Copy link
Contributor

prgeor commented Dec 27, 2023

*commit -> Activation

@arlakshm
Copy link
Contributor

arlakshm commented May 8, 2024

@prgeor can please share an update on this issue

@arlakshm
Copy link
Contributor

arlakshm commented May 8, 2024

@bmridul @kenneth-arista for viz..

@rlhui
Copy link
Contributor

rlhui commented Sep 4, 2024

@prgeor any suggestion on this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MSFT Triaged this issue has been triaged
Projects
Status: No status
Development

No branches or pull requests

5 participants