Prioritise output by criticality #70
Comments
Hi @peternewman |
Yes I am. It'll be a little while until I can do so, and I've replaced the failed drive, but I should be able to get the historic output from Nagios. It was correctly reporting the overall status of the check as Critical, and listing the associated faults and level with each drive, but it was listing as sdb, sdc, sda (i.e. broken first, but just by letter/discovery order, not respective criticality). |
Usage was: Output was: Reformatted into bullets to make my point clearer, it was like this:
What I wanted was:
i.e. sorted (prioritised) by the return status you'd get by checking each drive individually. |
The "problem" here is that the priority sorting already happens. CRITICAL drives are shown before OK drives. You can see this, as the critical drives You could work around this by using By the way: Although handy for quick checks, I'm not using |
I think it does. I'm pretty certain when I checked them individually that sdb was WARNING and sdc was CRITICAL, it's just it doesn't currently use the subtlety of that info, just the binary good/bad state.
As above, I'm more interested in the general WARNING/CRITICAL ordering.
Thanks for the heads up. I figured I'd start by getting a check in place across all the machines with software RAID (and hence no proper disk monitoring) and go from there. I'm always rather nervous with manual config like that, as it becomes rather easy to miss a drive if a machine has more disks than expected. Fortunately for me, the data is pretty transient, so I'm really just interested in knowing the drive has failed, or is about to fail, rebuilding things and carrying on. |
Yes, this should definitely happen. So if you do a manual check of sdb right now, is it CRITICAL or WARNING? |
That's certainly what I'd like, I don't see any code to do so currently (I'm not sure if you're saying you think it should, or agreeing it's a feature to implement): Lines 694 to 696 in 956f236
And e.g.: Lines 664 to 665 in 956f236
versus Lines 677 to 678 in 956f236
Yeah that works as expected:
|
@peternewman can you please try with the 6.11 branch? |
Thanks @Napsty . I've swapped my failed drive now unfortunately, so would need to fake it by making an existing warning a critical. I do see one big issue though: Lines 703 to 704 in 7eecae6
You'll only ever get warning messages out, as you're not concatenating the two joins together, just setting $status_string twice! |
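The overwrite described above can be reproduced in a few lines. This is a minimal Python sketch of the pattern only, not the plugin's Perl code; the function names and the sample messages are hypothetical.

```python
critical_msgs = ["Reallocated_Sector_Ct is below threshold"]  # hypothetical message
warning_msgs = ["Current_Pending_Sector is non-zero"]         # hypothetical message

def build_status_buggy(critical, warning):
    # The second assignment replaces the first, so critical messages are lost.
    status_string = ", ".join(critical)
    status_string = ", ".join(warning)
    return status_string

def build_status_fixed(critical, warning):
    # Concatenate the lists before joining so critical messages lead the output.
    return ", ".join(critical + warning)

print(build_status_buggy(critical_msgs, warning_msgs))
print(build_status_fixed(critical_msgs, warning_msgs))
```

With the buggy version only the warning text survives; the fixed version emits the critical text first, followed by the warning text.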
Thx for pointing that out. Should be fixed now with commit d3a85e9 |
This still doesn't fix it in global mode unfortunately @Napsty . Note how /dev/sdc where I fudged being under the threshold to generate my test critical is listed after /dev/sdb which only has warnings:
It does in single device mode though (N.B. I've changed to a different drive here and a different threshold), i.e. errors are now correctly listed before criticals on a per drive basis:
I think in your current model you've got See also related #71 to improve the current formatting in global mode. |
Hi @peternewman . Can you try it with the newest check_smart.pl from the 6.11 branch please: |
Commit 5dbacc7 now also adds an internal "notice" status for attributes appearing as "less than threshold". Before the commit, attributes would show up in their lookup order, even when different thresholds are given:
After the commit, the "Reallocated_Sector_Ct" is moved to the end of the output:
5dbacc7 also splits the "not_okay" drives into "critical" and "warning" drives (as suggested by you). The critical drives (first) and the warning drives (second) are then merged back together into the "not_okay" drives. This should ensure that critical drives appear first in the output. |
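The split-and-merge step described for the "not_okay" drives can be sketched roughly as follows. This is an illustrative Python version, not the Perl from commit 5dbacc7; the tuple layout and the function name `order_not_okay` are assumptions.

```python
def order_not_okay(drives):
    """Return the non-OK drives with CRITICAL ones ahead of WARNING ones.

    `drives` is a list of (device, status) tuples in discovery order
    (an assumed data shape, chosen for illustration).
    """
    critical = [d for d in drives if d[1] == "CRITICAL"]
    warning = [d for d in drives if d[1] == "WARNING"]
    # Merging critical first keeps discovery order within each group.
    return critical + warning

drives = [("/dev/sda", "OK"), ("/dev/sdb", "WARNING"), ("/dev/sdc", "CRITICAL")]
not_okay = order_not_okay(drives)
print([device for device, _ in not_okay])
```

Because each group is built by a single pass over the input, drives with the same status stay in their original discovery order.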
Fixed in #72 |
I set up check_smart on a system which was already sick and had the following statuses:
sda - Okay
sdb - Warning - unrecoverable errors
sdc - Critical - due to die soon
It would be nice if the output listed sdc, sdb, sda so you know what to prioritise.
I had a quick look, and I think something like adding to a hash of arrays based on the local level, then joining them back up would do the trick, but didn't get a chance to implement it at the time.
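The "hash of arrays" idea could look something like the sketch below. It is written in Python for illustration rather than in the plugin's Perl, and the severity order, the `prioritise` name and the input shape are all assumptions.

```python
from collections import defaultdict

# Assumed ordering, most urgent first; adjust to taste.
SEVERITY_ORDER = ["CRITICAL", "WARNING", "UNKNOWN", "OK"]

def prioritise(results):
    """Group devices by status level, then join the groups by severity.

    `results` is a list of (device, level) tuples in discovery order.
    Levels outside SEVERITY_ORDER are not expected and would be dropped.
    """
    by_level = defaultdict(list)
    for device, level in results:
        by_level[level].append(device)

    ordered = []
    for level in SEVERITY_ORDER:
        ordered.extend(by_level.get(level, []))
    return ordered

print(prioritise([("sda", "OK"), ("sdb", "WARNING"), ("sdc", "CRITICAL")]))
```

For the three drives in this report, that yields sdc, sdb, sda, which is the ordering requested.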