'sfputil firmware run' cmd needs better resilience and synchronization with PMON Xcvrd #17615
Comments
@prgeor, @mihirpat1, @judyjoseph, please have a look at this. This issue is directly related to the investigation conducted in the context of https://github.com/Nokia-ION/ndk/issues/28.
@snider-nokia - please create all sonic issues in buildimage repo, thanks.
@snider-nokia I don't have access to the ndk issue.
@snider-nokia, do you still see the issue if the interface is admin down? Ideally, firmware download should be done after the link is isolated.
Yes, as indicated in the original writeup, the problem does indeed still occur if the interface is admin down. Xcvrd is still attempting to interact with the associated module even when the interface is admin down. NDK issue #28 discusses the 'sfputil firmware run' command resulting in an explosion and traceback, @prgeor. Hopefully @judyjoseph and/or @mihirpat1 can provide you with a snapshot of (or direct access to) that issue...
@snider-nokia to isolate this issue from Nokia, this issue should be reproducible on another platform? If so, I will check with Mihir. As per my understanding, CDB command 0109h does NOT reset the I2C management interface. The reset being talked about is w.r.t. the CDB instance. So after 0109h, the CDB instance is busy until the transaction/CMD is complete, and during this period the host should not do any more CDB transactions, but the I2C management interface is still available for normal DOM polling. To rule out that this is Nokia specific, I think I can use the same Acacia 400G ZR module and do multiple I2C reads of pages during firmware commit. Will update here.
*commit -> Activation
@prgeor can you please share an update on this issue?
@bmridul @kenneth-arista for viz..
@prgeor any suggestion on this issue?
Description
When the 'sfputil firmware run' command is invoked, it causes the target transceiver module to reset. However, Xcvrd is unaware that this operation is taking place, so its threads may continue to attempt to access the module during the window of time before the module is restored to normal operational status.
Steps to reproduce the issue
Describe the results you received
Exactly what is described in paragraph 3 above.
Describe the results you expected
Optimally, transceiver module accesses would/should not fail after this command is issued. But, there is no guarantee that accesses to the module will complete successfully until such time as the module stabilizes post-reset.
Additional information you deem important
Two detailed annotated samples are provided below to show the progression of events involved here.
In the first sample, the 'sfputil firmware run' command appears to work fine and indicates a successful completion status. Even in this case, though, PMON Xcvrd is experiencing failures when simultaneously attempting to access the protagonist module.
In the second sample, the 'sfputil firmware run' command fails with a traceback when it issues a module read operation that doesn't complete successfully and the platform specific code returns value None (as specified by sfp_base.py, and due to the failed access).
PMON Xcvrd threads should not be attempting to access a module that has this command issued to it until such time as the module is understood to be operating normally again (and is prepared to sink accesses). As the transceiver subsystem architecture stands now, these Xcvrd threads may try to provision/de-provision the module datapath, solicit DDM/DOM data, or interact with the module in other ways.
Investigation and sample runs were conducted using Acacia ZR module target with the FW versions shown (at interface Ethernet80):
The CMIS spec indicates that module behavior during the associated 'Run FW Image' reset is 'vendor and technology dependent', so it cannot be assumed that the module can be accessed before it has quiesced post-reset. The spec further indicates that CMD 0041h: 'Firmware Management Features' should be used to query firmware command performance attributes (for example, the maximum time a command may take to execute).
Consistent with the above, the Acacia ZR/ZR+ documentation that we have here states that 'Before issuing/using FW download [including 0109h: Run Image], the host should issue CMD 0041h to familiarize itself with the features supported, and in particular the max timeout values.'
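As a rough illustration of waiting out the reset window rather than assuming the module is accessible, the sketch below polls a caller-supplied probe until it succeeds or a deadline passes. The names `wait_until_readable` and `probe` are hypothetical and not part of sfputil or Xcvrd; `timeout_s` is assumed to be derived by the caller from the maximum timeout values reported by CMD 0041h (decoding that reply is not shown here).

```python
import time


def wait_until_readable(probe, timeout_s, interval_s=0.5, sleep=time.sleep):
    """Poll probe() until it returns data or timeout_s elapses.

    probe is a hypothetical zero-argument callable that performs one cheap
    module read and returns None (or raises OSError) while the module is
    still resetting. timeout_s is assumed to come from the module's
    advertised maximum command duration.
    Returns True once a read succeeds, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe() is not None:
                return True
        except OSError:
            # Treat a bus error the same as "no data yet" and keep waiting.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A caller would typically pass something like `lambda: sfp.read_eeprom(0, 1)` as the probe, and only resume normal accesses once the function returns True.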
It may be necessary to engage module vendor(s) in order to understand the specific access restrictions in this area (during the associated 'Run Image' reset period), as it would appear that module behavior can/does change with different FW versions and also when contrasting non-hitless with hitless upgrade.
1st annotated sample run:
2nd annotated sample run:
Output of
show version
Using 202205 branch...
Additional comments
sfputil should not fail with a traceback if platform specific code returns None (as prescribed by sfp_base.py) from the read_eeprom method when a module read operation fails.
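One way the caller side could guard against this is sketched below. It is a minimal illustration, not the actual sfputil code: `read_eeprom_checked` and `ModuleReadError` are hypothetical names, and the only assumption about the platform object is that it exposes `read_eeprom(offset, num_bytes)` returning a bytearray or None, as described above.

```python
class ModuleReadError(Exception):
    """Hypothetical error type for a module read that returned no data."""


def read_eeprom_checked(sfp, offset, num_bytes):
    """Read num_bytes at offset and fail with a clear error instead of None.

    sfp is any object with read_eeprom(offset, num_bytes) that returns a
    bytearray on success or None on failure.
    """
    data = sfp.read_eeprom(offset, num_bytes)
    if data is None or len(data) != num_bytes:
        raise ModuleReadError(
            "read of {} byte(s) at offset {} failed".format(num_bytes, offset)
        )
    return bytes(data)
```

The CLI layer could then catch `ModuleReadError` in one place and print a short message with a non-zero exit status, instead of letting a `TypeError` on None surface as a traceback.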
Some measure of synchronization with Xcvrd is warranted, so that Xcvrd threads do not attempt to access a module that is concurrently having the 0109h: Run Image command executed against it (and is thus being reset).
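Purely as a sketch of what such synchronization might look like, the example below uses a per-port advisory file lock, since sfputil and Xcvrd run as separate processes. Everything here is an assumption for illustration: the lock directory, the file naming, and the helper names `module_exclusive` and `module_is_busy` do not exist in SONiC, and the real design might instead use a database flag or another mechanism.

```python
import contextlib
import fcntl
import os
import tempfile

# Assumption: a directory visible to both the CLI and the daemon.
LOCK_DIR = tempfile.gettempdir()


def _lock_path(port):
    return os.path.join(LOCK_DIR, "xcvr_fw_{}.lock".format(port))


@contextlib.contextmanager
def module_exclusive(port):
    """Hold an exclusive advisory lock on port for the duration of the block."""
    fd = os.open(_lock_path(port), os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder remains
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def module_is_busy(port):
    """Return True if another holder currently has the exclusive lock."""
    fd = os.open(_lock_path(port), os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        return False
    finally:
        os.close(fd)
```

In this sketch, the firmware command would wrap the Run Image step and the subsequent settle time in `with module_exclusive("Ethernet80"):`, while each polling thread would check `module_is_busy(port)` and skip that port for the current cycle if it returns True.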