-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Smart Input Plugin Extra Attributes #6079
Smart Input Plugin Extra Attributes #6079
Conversation
Small update: https://nvmexpress.org/wp-content/uploads/NVM-Express-Management-Interface-1.1-Ratified.pdf This document seems to correlate well r.e. the meanings of the critical warning flags for nvme. |
First off I'd like to say that a lot of paid software don't have support like this even when showing the money. Greatly appreciated guys! I'm certainly no expert on s.m.a.r.t. data, but from my experience they are quite a wild west, a lot of manufacturer specific stuff and seems impossible to find up to date information on them. On Samsung's page the only document I can find is this, and it's clearly outdated for the NVME drives: That NVME specification is probably as good as it gets for a standardized model, but assuming all vendors stick to it might be optimistic. Smartctl doesn't display the ID's for the attributes on my NVME drive. Regarding 3. I have no idea whether those attributes are used anymore with NVME drives, but they used to be key attributes to monitor with sata/ide drives, so if they are still valid they would be a great addition. |
Hey @coccobill1 👋 Thank you for the kind words. That was my first week at Influx and first thing contributed as an employee. So it meant a lot 🙇 That document you linked it super helpful as well. Cross-referencing all these different docs helps to find some sensible answer. |
Update: refactored the collection in order to make it more extensible. New matches for SAS and NVME attributes can be quickly added via the |
Update: I added the following critical warning fields with boolean values:
|
plugins/inputs/smart/smart.go
Outdated
continue | ||
} | ||
acc.AddFields("smart_attribute", map[string]interface{}{ | ||
"raw_value": flags&flag > 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We won't be add this as a bool since we already have it as an integer, and InfluxDB won't allow us to have mixed types. Since we are attempting to model the NVMe log as smart_attributes
, I would just do it like this and document or link to the values in the README (see net_response for an example):
smart_attribute,name=critical_warning raw_value=0i
Often with enum style data we do something like this with a tag + integer field, but I don't think we should do it here since it doesn't fit into the existing model very well:
smart_attribute,name=critical\ warning,warning=reliability\ degraded warning_code=0i
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielnelson ahh I see what you mean! Do you think having it as an integer is a little hard to interpret? Since it represents so many states / combinations of flags? i.e. 8 means just read only
, but 9 means read only
and available spare threshold exceeded
. I suppose no one is screaming for this warning fidelity at the moment, so I am probably just splitting hairs 😂 . They can atleast do something like alert when any critical warning occurs with a simple >0 check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, this isn't an enum it's a bit set, so it wouldn't really work to pair it with a human readable name=foo tag. I also agree, it's pretty difficult to deal with a large list of integers. Still because this is a smart_attribute, I think we should start with just the single field, but read on for more discussion of booleans and bit sets.
One reason we have often favored integer values over booleans is that Grafana/Chronograf aren't very good at displaying booleans outside of tables. UI support is always changing though, so it may be worth looking into what support there is now. Many of the outputs also do not support booleans, and require users to use the enum
processor.
I think there are 2 things that a user would be interested in with a bitset: is the item in an error state (non-zero), and what flags are set. Checking for an error state is not always something that makes sense depending on the semantics of the bit set, but I think it often does. If we were going to encode a bit set for InfluxDB, we could do it as multiple fields, one for the bit set encoded an integer, and one for each flag in the bit set:
foo critical_warning=3i,critical_warning_spare_threshold=true,critical_warning_temperature_above_or_under_threshold=true
However, this might be tricky to query, since you would need to grab all bit set fields and check each one to see if it is set. Probably only something you want to do if you are checking for a particular flag. It occurs to me that the set type idea for tags we discussed could help here, imagine:
foo,critical_warning=spare_threshold,critical_warning=temperature_above_or_under_threshold critical_warning=3i
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I like it. I will simplify critical warning to just an int for now then.
This PR attempts to deliver most of #6041
Currently this PR supports:
Outstanding Questions:
Documentation for smart attributes seem to be a little elusive to me. Can anyone point me in the direction of the attributes id's for all the above attributes. Perhaps you know @coccobill1 ?
Critical Warning
appears to be a bit flag printed as a hex value. I have struggled to find decent documentation around the meaning of all these attributes. Maybe someone can point me in the best direction of better docs. I have this https://www.intel.com/content/dam/support/us/en/documents/solid-state-drives/Intel_SSD_Smart_Attrib_for_PCIe.pdf for example.Given we know the definitive meaning (and that there is one) for what each flag means, do we have any suggestions on how to communicate that best to telegraf? cc @danielnelson
What I mean is, for example, would we want to communicate each flags as individual distinct fields. Each field with a boolean value (on or off)?
e.g. The reading
0x01
given the first bit in the byte representsavailable spare is below threshold
(as I have currently taken from the shared intel doc above) would lead to a field likeCritical_Warning_Available_Spare_Below_Threshold
with araw_value=1
?I could implement them all as just counts and fabricate a test and hope they exist as described here if people are happy with this.
I could be completely missing the point here 😂 so any advice welcomed.
Required for all PRs: