Set fault beacon on drive failure #2375
Comments
Have a look at https://raw.githubusercontent.com/FransUrbo/scripts/master/GetDiskInfo.sh (it's huge and kludgy, and I'm trying to rewrite it in Perl, which would make it a little cleaner, but ... :). I do a lot of stuff like that... I'd love to help (should be able to whip something up quite quickly - I'm bored :), but I don't have anything like an enclosure.
@FransUrbo That looks like a handy script! I believe @nedbass figured out how to map the device to an enclosure and slot. Ned, could you add the procedure to this issue so we don't lose track of it? Then we can work it into the scripts where appropriate.
Here's something we use to blink our lights, including the mapping of device to enclosure and slot:
@chrisrd Nice! Thanks for posting this. |
I've got some enclosures with (LSI) SAS expanders, which are visible in |
I'm working on implementing this feature. identify_failed_drive.feature.txt provides a detailed feature description. |
@cvoltz Your description looks solid, except that I disagree that the UID light should be on. I would think only the fault light should be controlled. What is the advantage of adjusting both lights in lock-step? I think the UID should be left for administrator use. |
@rlaager The UID could certainly be dropped, but there is potential value in a large configuration. With the UID + disk FAULT LED, customers know which chassis AND which disk a bit more easily. |
@cvoltz I'm working on pretty much the same thing at LLNL. Have you had any luck with using libstoragemgmt to blink the LEDs? Have you tried it for multipath devices as well? |
I updated the feature description to remove the references to the UID lights on the drive. |
For anyone following this issue: you may want to check out the latest master source, which now has improved infrastructure for generically managing a device's SES LEDs, 1bbd877. The zedlet's environment will now contain a This infrastructure is still being worked on, but any feedback or testing on a wider range of configurations and hardware would be welcome.
- Fix autoreplace behaviour on statechange-led.sh script.

  ZED sends the following events on an auto-replace:

  1. statechange: Disk goes UNAVAIL->ONLINE
  2. statechange: Disk goes ONLINE->UNAVAIL
  3. vdev_attach: Disk goes ONLINE

  Events 1-2 happen when ZED first attempts to do an auto-online. When that fails, ZED then tries an auto-replace, generating the vdev_attach event in #3.

  In the previous code, statechange-led was only looking at the UNAVAIL->ONLINE transition to turn off the LED. It ignored the #2 ONLINE->UNAVAIL transition, assuming it was just the "old" VDEV going offline. This is problematic, as a drive can go from ONLINE->UNAVAIL when it's malfunctioning, and we don't want to ignore that.

  This new patch correctly turns on the fault LED every time a drive becomes UNAVAIL. It also monitors vdev_attach events to trigger turning off the LED when an auto-replaced disk comes online.

- Remove unnecessary libdevmapper warning with --with-config=kernel

  This fixes an unnecessary libdevmapper warning when building --with-config=kernel. Kernel code does not use libdevmapper, so the warning is not needed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #2375
Closes #5312
Closes #5331
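The LED policy described in the commit message above can be sketched as a small decision function. This is only an illustration, not the code from statechange-led.sh: the function name `led_action` and the way the event subclass and vdev state are passed in as plain strings are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper: decide what to do with the fault LED for one ZED event.
#   $1: event subclass ("statechange" or "vdev_attach")
#   $2: new vdev state (e.g. "ONLINE", "UNAVAIL")
# Prints 1 (turn LED on), 0 (turn LED off), or nothing (leave LED untouched).
led_action() {
    case "$1:$2" in
        # Any transition to UNAVAIL lights the fault LED, including
        # the ONLINE->UNAVAIL step seen during a failed auto-online.
        statechange:UNAVAIL) echo 1 ;;
        # A disk coming ONLINE, via statechange or after an
        # auto-replace (vdev_attach), clears the fault LED.
        statechange:ONLINE|vdev_attach:ONLINE) echo 0 ;;
    esac
}
```

Feeding it the three auto-replace events in order gives off, on, off, so the LED ends up cleared once the replacement disk is online, while a drive that simply drops to UNAVAIL stays lit.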
For JBOD-style configurations it's desirable to set the device's fault beacon on drive failures. This can be done either through the proc interface or through the sg_ses utilities. The relevant zed scripts (cmd/zed/zed.d/io-spare.sh) should be updated to take advantage of this functionality. The tricky bit is to map the device name to an enclosure and slot location.
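One way to do the device-to-slot mapping is to walk the Linux enclosure class in sysfs, where each slot directory links to its SCSI device and exposes a `fault` attribute. The sketch below assumes that layout (`/sys/class/enclosure/<enclosure>/<slot>/device/block/<dev>`) is present on the system; the function names and the `SYSFS_ROOT` override are hypothetical and exist only so the lookup can be tried against a fake tree. It does not handle multipath (dm-*) devices.

```shell
#!/bin/sh
# Hypothetical override of the sysfs mount point, for testing on a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the enclosure slot directory that holds block device $1 (e.g. "sdc").
# Returns non-zero if the device is not found in any enclosure.
find_slot_dir() {
    dev="$1"
    for blk in "$SYSFS_ROOT"/class/enclosure/*/*/device/block/"$dev"; do
        if [ -e "$blk" ]; then
            # Strip the trailing /device/block/<dev> to get the slot directory.
            printf '%s\n' "${blk%/device/block/*}"
            return 0
        fi
    done
    return 1
}

# Write $2 (1 = on, 0 = off) to the fault attribute of the slot holding $1.
set_fault_led() {
    slot_dir=$(find_slot_dir "$1") || return 1
    printf '%s\n' "$2" > "$slot_dir/fault"
}
```

For example, `set_fault_led sdc 1` would light the fault beacon for the slot holding sdc, and `set_fault_led sdc 0` would clear it. Writing to the real sysfs attribute requires root.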