Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HLD for getting error status of the SPF modules #12

Closed
wants to merge 9 commits into from

Conversation

stephenxs
Copy link
Owner

Initial version

Signed-off-by: Stephen Sun stephens@nvidia.com

Initial version

Signed-off-by: Stephen Sun <stephens@nvidia.com>
doc/mellanox/sfp-error-status.md Outdated Show resolved Hide resolved
doc/mellanox/sfp-error-status.md Outdated Show resolved Hide resolved
doc/mellanox/sfp-error-status.md Outdated Show resolved Hide resolved
doc/mellanox/sfp-error-status.md Outdated Show resolved Hide resolved
doc/mellanox/sfp-error-status.md Outdated Show resolved Hide resolved
doc/mellanox/sfp-error-status.md Outdated Show resolved Hide resolved
doc/mellanox/sfp-error-status.md Outdated Show resolved Hide resolved
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@thuffmlx
Copy link

OK, great stuff, but having a return status of 0 mean "POWER BUDGET EXCEEDED" and return code 255 mean "No error" breaks a very common convention of 0 return code means "No errors". Recommend that you change this.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs
Copy link
Owner Author

OK, great stuff, but having a return status of 0 mean "POWER BUDGET EXCEEDED" and return code 255 mean "No error" breaks a very common convention of 0 return code means "No errors". Recommend that you change this.

Thank you for your comments.
The error codes and error descriptions, like 0 for “power budget exceeded” and 255 for “reserved”, are taken from the SDK API (which takes from the PRM).
As the feature is Mellanox only and this is an internal interface inside platform API, I don’t tend to change them for now.
But if we are going to make this kind of feature a common one and the error code and interface will be exposed to the users, I’ll take it into consideration.

@jleveque
Copy link

jleveque commented May 18, 2021

Is there a way we can move this logic to xcvrd? This way we can keep in line with the SONiC ecosystem (e.g., xcvrd is the only application responsible for communicating with transceivers), and it also opens up the ability for other vendors to implement this if they have similar support on their platforms.

@stephenxs
Copy link
Owner Author

stephenxs commented May 19, 2021

Is there a way we can move this logic to xcvrd? This way we can keep in line with the SONiC ecosystem (e.g., xcvrd is the only application responsible for communicating with transceivers), and it also opens up the ability for other vendors to implement this if they have similar support on their platforms.

I think there are two approaches to achieve it:

  1. Periodically:
    • xcvrd to fetch the error status periodically and save the possible error status into STATE_DB
    • CLI to fetch error status from STATE_DB each time the command is called.
  2. Synchronous communication:
    • A dedicated database table, probably in STATE_DB, is introduced for communication between the backend, xcvrd, and the frontend, the CLI, or other possible UI.
    • xcvrd listens to an update on the database table. On receiving the message, it fetches the error status and stores it into STATE_DB
    • CLI updates the database table, indicating it wants xcvrd to fetch the error code for some SFP modules and listens to the table for the reply. It will display the error status to the user once it receives the update from the table.
    • Proper timeout mechanism should be introduced as well.

As this is a command for debugging, it may have real-time requirements. So the second approach is preferred.

Meanwhile, I'm not sure whether this is the scope of this design or another design should be opened for this

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs closed this Jun 2, 2021
@stephenxs stephenxs deleted the sfp-error-status branch December 22, 2021 11:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants