-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ssd] Allow individual vendor parsers to handle errors #252
Conversation
@alexrallen What is the type of the disk? Can you please share how to test and what is the output? do we have any subsequent changes following this change? |
The code I am changing is related to the "generic" disk code. To test this you will need to have a switch where SMART is throwing a non-zero return code. Then just attempt to get the ssd health through the "show" command. No related changes, this code I am removing was redundant, It does not need to be replaced as the error handling is elsewhere. |
@alexrallen The previous change actually make the underlying problem come out to the surface. And we noticed the ssdutil throw exceptions when the disk type is hdd. Instead of hiding the problem, it could be better to find the root cause of the exceptions. |
@sujinmkang I am not sure what you are asking here, can you clarify? The change I am proposing here is so that a non-zero return code from The precise case this is an issue with is when the SSD fails a smartctl check, the user should still know about it. I am not seeing how your change is related, can you elaborate a bit further? |
@alexrallen What is the exit code from the malfunctioning SSD? What is the output of smatctl? My change is to fix the python stack trace when the disk type is not ssd and smartctl cannot retrieve the health data and returns none. |
The exit code is non-zero and smartctl throws an error that it cannot read specific health metrics from the SSD (I don't have the specific message available) Otherwise smartctl gives all of the normal output which can be parsed. |
@sujinmkang I looked over your PR closely. Your PR will only cover instances when the device is not of type SSD. My change will cover cases in which it IS an SSD but there is a SMART error with it. Both of these changes should be able to be implemented I believe. |
@alexrallen I think this change will also cover the non SSD disk failure case. |
Description Removed a return None in the command execution helper function _execute_shell in the SSD class so that command output is still parsed in the event of a non-zero exit code. Motivation and Context If there is an issue found with a SSD SMART utilities may return a non-zero error code and we would still want to parse the output so we should not bail. In addition, with the return None the parsing functions throw stack traces during execution. When removed, the parsing functions are able to elegantly handle missing information and the fields are left as N/A so there is no need for handling here. How Has This Been Tested? Tested manually on a SN2010 with a malfunctioning SSD which was initially throwing stack traces and now handles the error elegantly reporting the health as N/A a much more informative error than a python stack trace.
Description Removed a return None in the command execution helper function _execute_shell in the SSD class so that command output is still parsed in the event of a non-zero exit code. Motivation and Context If there is an issue found with a SSD SMART utilities may return a non-zero error code and we would still want to parse the output so we should not bail. In addition, with the return None the parsing functions throw stack traces during execution. When removed, the parsing functions are able to elegantly handle missing information and the fields are left as N/A so there is no need for handling here. How Has This Been Tested? Tested manually on a SN2010 with a malfunctioning SSD which was initially throwing stack traces and now handles the error elegantly reporting the health as N/A a much more informative error than a python stack trace.
Update sonic-platform-common submodule to pickup new commits: 01512ec [SSD]Enhance ssd_generic with more error handling to avoid python crash sonic-net/sonic-platform-common#271 ac3e7f1 [y_cable][Broadcom] update the BRCM y_cable driver to release 2.0 sonic-net/sonic-platform-common#263 573717a [Credo][Ycable] Fix Credo firmware download API download_firmware flag sonic-net/sonic-platform-common#269 a844f18 [xcvr] Add get_module_fw_info method to XcvrApi class. sonic-net/sonic-platform-common#267 35bad16 [sfputil]Refactoring read_porttab_mappings sonic-net/sonic-platform-common#264 83c4345 [SSD Generic] Add support for parsing nvme ssd model, health and temperature sonic-net/sonic-platform-common#265 5da31e1 [ycable][credo] Fix the is_link_active API for Credo Ycable sonic-net/sonic-platform-common#260 931c6ea [Y-Cable][Credo] add theading locker to support thread-safe calling, add SKU check for download_firmware API. sonic-net/sonic-platform-common#222 ff3aa75 Fix SFF8472 Enhanced Options sonic-net/sonic-platform-common#259 a8a83e9 [ssd] Allow individual vendor parsers to handle errors sonic-net/sonic-platform-common#252 Signed-off-by: Kebo Liu <kebol@nvidia.com>
Description
Removed a
return None
in the command execution helper function_execute_shell
in the SSD class so that command output is still parsed in the event of a non-zero exit code.Motivation and Context
If there is an issue found with a SSD SMART utilities may return a non-zero error code and we would still want to parse the output so we should not bail.
In addition, with the
return None
the parsing functions throw stack traces during execution. When removed, the parsing functions are able to elegantly handle missing information and the fields are left asN/A
so there is no need for handling here.How Has This Been Tested?
Tested manually on a SN2010 with a malfunctioning SSD which was initially throwing stack traces and now handles the error elegantly reporting the health as
N/A
a much more informative error than a python stack trace.