-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLS Silverstone: enable the init sequence for all of the QSFPDD CMIS4… #95
CLS Silverstone: enable the init sequence for all of the QSFPDD CMIS4… #95
Conversation
… modules Signed-off-by: Dante Su <dante.su@broadcom.com>
PLEASE DO NOT approve this review until the Broadcom QA finished the internal tests, I'll later send an all-clear flag once it's done. This PR is early access to review what are the proposed changes to address the CMIS4 issues, while the new test image has yet generated, the DUT with corresponding changes manually applied are verified, please refer to the attached log for details. |
@@ -1431,7 +1431,7 @@ static int smbus_access(struct i2c_adapter *adapter, u16 addr, | |||
goto Done; | |||
} | |||
|
|||
#if 1 /* 100 kHz */ | |||
#if 0 /* 100 kHz */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, this is unlikely to be the root cause of I2C failure reported yesterday, however it's the only thing under our control without having the Celestica engaged, that's why I did this.
Please note it's always working on our test unit no matter if high-speed (BIT6-0x40) is enabled or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we turn off the high speed mode, how slow would it be? I don't feel we should do this unless we feel it was necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- We will have ~30% delays for each I2C operations when the BIT6 (high-speed) is cleared
- Based on the latest test report from Broadcom QA, slowing down the clock does not improve the transceiver compatibility.
And hence, I'll revert this change.
Note: None of the QSFPDD optics is available at my side, so I'm counting on QA for the verifications
e.g.
Dante,
It appears Eoptolink 3.0 broke. It took ~17 minutes for link up after remove/insert
Please see the logs below
Regards,
-kevin
......
Nov 25 21:47:25.015491 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0
Nov 25 21:47:25.015491 sonic INFO pmon#sfputil.py: PORT 31: Unable to initialize the module
Nov 25 21:48:06.595530 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0
Nov 25 21:48:06.595530 sonic INFO pmon#sfputil.py: PORT 31: Unable to initialize the module
Nov 25 21:49:36.731519 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0
Nov 25 21:49:36.731519 sonic INFO pmon#sfputil.py: PORT 31: Unable to initialize the module
Nov 25 21:50:11.383533 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0
Nov 25 21:50:11.383533 sonic INFO pmon#sfputil.py: PORT 31: Unable to initialize the module
Nov 25 21:50:39.103534 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0
Nov 25 21:50:39.103534 sonic INFO pmon#sfputil.py: PORT 31: Unable to initialize the module
Nov 25 21:51:34.543512 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0
# its module state could get stuck at ModuleLowPwr upon reinsertion, and a software | ||
# reset is necessary to get it recovered. | ||
buf = self._read_eeprom_devid(x, self.IDENTITY_EEPROM_ADDR, 85, 1) | ||
if (buf is not None) and (int(buf[0], 16) != 0x03): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to address the DAC failure below:
pmon#sfputil.py: write failed: [Errno 110] Connection timed out
Some of the DAC (Passive Copper) do not support software reset, and hence such a write attempt will fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of DAC, should we even add failure counter and come here, See line 553:
self.mod_failure[x] += 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another generic question, it seems we are keep monitoring all the settings and do reset/init etc all the time. It is a bit concerning if we that could be disruptive. Thinking maybe we could only do these steps when we got the optical insert event, and finish the bring up process once. of course, we will do it again if insert happens again. Let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
We'll actually do nothing in the case of DAC, so I'm not sure why we need to keep the counter advanced? Do we want to have any other automatic recovery for DAC?
-
The reason why this is not handled in the insertion event is that, the I2C read could fail and it was observed during the QA test last week. However, we could add a inited flag to make sure this is only performed once and only once, let me update the code logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1- that is exactly my point, we should not advance the counter and then avoid the write here, we should eliminate the counter advance in the code earlier so it won't even be a failure case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy that, and thanks, I'll update the code accordingly
rev = int(buf[1], 16) | ||
if rev >= 0x40: | ||
return self._init_cmis4_module(port_num, xcvr, self.hwsku) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the CMIS3 modules we have are working fine with 4x100G in its default application mode, and hence this init sequence will only be performed on CMIS4 modules.
Note: The definition of the datapath initialiation slightly differs from CMIS3 to CMIS4, and hence we can't use _init_cmis4_module() on CMIS3 modules.
@@ -450,7 +469,7 @@ def _init_cmis_module_custom(self, port_num, xcvr, hwsku): | |||
self._write_byte(port_num, self.IDENTITY_EEPROM_ADDR, 0x10, 143, 0xff) | |||
# Initialize datapath | |||
self._write_byte(port_num, self.IDENTITY_EEPROM_ADDR, 0x10, 128, 0x00) | |||
time.sleep(0.5) | |||
time.sleep(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UT result shows that we need a few seconds for the actual configuration status correctly reflected on the register, but we don't really want to wait too long in the xcvrd, so I slightly enlarge the delay from 500ms to 1sec, hope this could reduce the redudant errors in the syslog
return True | ||
|
||
log_info("PORT {0}: {1}: _init_cmis_module_custom".format(port_num, xcvr)) | ||
def _get_application_code(self, port_num, host_intf): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no guarantee that 100G PAM4 mode is always the application 2, and hence we'll need to decode the application advertisement and figure the correct application code
…nd revert the clock of FPGA-I2C to high-speed Signed-off-by: Dante Su <dante.su@broadcom.com>
@@ -437,7 +439,7 @@ def _init_cmis4_module(self, port_num, xcvr, hwsku): | |||
log_info("PORT {0}: {1}: _init_cmis4_module".format(port_num, xcvr)) | |||
|
|||
# Allow 1s for software reset | |||
self._write_byte(port_num, self.IDENTITY_EEPROM_ADDR, -1, 26, 0x08) | |||
self._write_byte(port_num, self.IDENTITY_EEPROM_ADDR, -1, 26, 0x58) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure the module will be placed in LowPower mode
buf = self._read_eeprom_devid(port_num, self.IDENTITY_EEPROM_ADDR, self._page_to_flat(202, 0x11), 4) | ||
err = "".join(buf) | ||
err = '00000000' | ||
for t in range(10): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eoptolink CMIS4 took 6 seconds for config status to get populated to the register, and hence 10 sec should be sufficient
# Monitoring the QSFP-DD initialization state, and reinitiate it if necessary | ||
# monitor the QSFP-DD initialization state, and reinitiate it if necessary | ||
if self.mod_inited[x]: | ||
continue | ||
buf = self._read_eeprom_devid(x, self.IDENTITY_EEPROM_ADDR, self._page_to_flat(202, 0x11), 4) | ||
if buf is None: | ||
continue | ||
err = "".join(buf) | ||
if err != '11111111': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for a record, this checker is necessary to prevent redundant CMIS4 init upon warm-reboot, otherwise significant packet drops will be observed.
@@ -1431,7 +1431,7 @@ static int smbus_access(struct i2c_adapter *adapter, u16 addr, | |||
goto Done; | |||
} | |||
|
|||
#if 0 /* 100 kHz */ | |||
#if 1 /* 100 kHz */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert the clock to high-speed, unless it's unlikely to be the root cause, and we actually have this enabled in ICOS, and I never got any issues for this
…ecoder Signed-off-by: Dante Su <dante.su@broadcom.com>
It's a known issue that Eoptolink CMIS3 requires a software reset after re-insertion, and a burst of software reset messages observed when its state != ModuleReady for 5 seconds. And hence, we'll have the following changes in this commit 1. Update the checker from (state != ModuleReady) to (state == ModuleLowPwr), which is exactly the behavior observed on the Eoptolink CMIS3, and its state will become ModulePwrDn after reset. 2. Add 1-second delay to the software reset, and hence it's impossible to have a burst of reset at a time 3. Ensure the failure counter gets reset when its state != ModuleLowPwr Signed-off-by: Dante Su <dante.su@broadcom.com>
As you mentioned, please send the all clear flag once the QA is done. |
Although the hotfix in the last commit 'CLS Silverstone: sfputil.py: fix Eoptolink CMIS3 support' has already verified by the QA, we still need to generate a new build with this change, and I'm working on it, will let you know once it's ready. |
After the internal discussion with my manager, Babu, it sounds like you're not using the binary from Broadcom, and you will use only the image rebuilt from the source package. In this case, you don't need to wait for the new build(image) from my end, please approve and merge the changes if no any concern |
) **sonic-platform-common:** Commits on Jul 07, 2020 Changes in fan and psu base classes (1.0 platform API) related to pdd… 17292e4 Commits on Jul 10, 2020 update get_pcie_check() to use sysfs instead of lspci output (#95) d4eb804 Commits on Jul 11, 2020 [eeprom] Fix UnboundLocalError (#93) 7c8bed1 Commits on Jul 14, 2020 [Transceiver] Add parser for QSFP-DD cable type and dictionaries for … be1cc24 **sonic-platform-daemons:** Commits on Jul 06, 2020 [psud] Store PSU temperature and voltage information to database (#61) ef9716a Commits on Jul 14, 2020 [xcvrd] Add support for QSFP-DD cables (#66) c530587 Commits on Jul 18, 2020 Initial version of pcied (#60) e665ee8 [xcvrd] Return non-zero error code on SFP error (#67) 4f42a79 Commits on Jul 19, 2020 [README.md] Add LTGM badges (#69) 23757a3 Remove unused imports (#70) 029d5a5 Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>
…c-utilities] Update submodules (sonic-net#5435) Update Python-based submodules to include updated dependencies in setup.py files. * src/sonic-platform-common 14c6e53...7255d3a (2): > [setup.py] Update dependenices (sonic-net#120) > [sfputilbase|sfputilhelper] fix value unpack error due to port_alias_asic_map in portconfig.py file (sonic-net#116) * src/sonic-platform-daemons e1842b2...1aaffcc (2): > Add 'wheel' package to 'setup_requires' list (#95) > [thermalctld] Fix issue: thermalctld should be auto restarted when being killed (#94) * src/sonic-snmpagent 1a2b62a...5957460 (1): > [setup.py] Add 'wheel' to 'setup_requires' (sonic-net#160) * src/sonic-utilities 2244d7b...2f79ac1 (4): > [show] Add missing import to fix `show ip interfaces` (sonic-net#1126) > Fixed config load_minigrpah not working for Multi-asic platfroms. (sonic-net#1123) > Add CLI for configuring synchronous mode (sonic-net#1094) > [setup.py] Update dependencies (sonic-net#1125)
… modules
Signed-off-by: Dante Su dante.su@broadcom.com
- What I did
- How I did it
See What I did
- How to verify it
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)