-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement option to output information as OpenMetrics time series #308
Conversation
note, I've chosen to split the version information out into a dedicated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good job untangling that code, it's hard to find your way in there, and you seem to have found the right spots!
but the metrics exposition formats need a little work. i think once that's done, it's ready to merge, as far as i'm concerned.
it would be helpful to show sample outputs, at least in the commit log, that show the previous nagios and plain batch mode still works as well... the openmetrics output could be useful in the README or manpage directly as well.
needrestart
Outdated
|
||
my $ometric_now = time(); | ||
print "# HELP needrestart_timestamp when, in unix timestamp, needrestart was last updated\n"; | ||
print "# TYPE needrestart_timestamp gauge\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i always get confused by this, but according to this guide, i think this should be suffixed with _seconds
.
i also, maybe we need to be more explicit here and call this needrestart_last_update_timestamp_seconds
?
but then again, whey do we have that metric at all? this guide advises against it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh interesting I should take a better look at the prometheus documentation about writing exporters, there seems to be a couple of good hints in there that I haven't yet integrated.
now that you mention it, it seems indeed to be redundant.. the timing info will already be there in the time series. so let's drop that metric. I'll push this change in a couple minutes
needrestart
Outdated
if ($opt_k) { | ||
print "# HELP needrestart_kernel_info information about the kernel\n"; | ||
print "# TYPE needrestart_kernel_info info\n"; | ||
print "needrestart_kernel_info{running=$ometric_kernel_values{krunning},expected=$ometric_kernel_values{kexpected}} $ometric_kernel_values{kstatus}\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to be sure, this would look like needrestart_kernel_info{running=6.9.7,expected=6.9.7} 1
on success and needrestart_kernel_info{running=6.9.6,expected=6.9.7} 0
on failure, am i right?
maybe we should have actual sample outputs of this somewhere to clarify end-users on how they can actually alert on this stuff (or even better, point people at our alerts, once we have them... ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no currently it will look like needrestart_kernel_info{running=6.9.7,expected=6.9.7} current
on success
and needrestart_kernel_info{running=6.9.6,expected=6.9.7} obsolete
on failure.
that detail was something that I was really not sure about how to output.. outputting a binary value could be interesting in the case of the kernel since there's currently only two possible states. however, for microcode, there are three different states, unknown
, current
and obsolete
. so to keep things consistent, I decided to output the state "name" for both.
should the kernel_info metric be a bool instead?
needrestart
Outdated
my $ometric_ucode_status = ("unknown", "current", "obsolete")[$ucode_result]; | ||
print "# HELP needrestart_ucode_info information about the CPU microcode\n"; | ||
print "# TYPE needrestart_ucode_info info\n"; | ||
print "needrestart_ucode_info{running=$ometric_ucode_current,expected=$ometric_ucode_expected} $ometric_ucode_status\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, that i'm pretty sure won't work. my perl is also rusty, but i feel this would look like:
needrestart_ucode_info{running=0x0,expected=0x1} obsolete
on failure... but prometheus won't know what to do with the string "obsolete", that's not a metric value (an integer), it's a string! OpenMetrics specifically allows only integers and floats, see this section...
you'll need something that reflects that status a little better. i would suggest using 1
for success and 0
for failure, and having the status as a label, for example see those three possible cases:
needrestart_ucode_info{running=0x0,expected=0x1,status=obsolete} 0
needrestart_ucode_info{running=,expected=0x1,status=unknown} 0
needrestart_ucode_info{running=0x0,expected=0x0,status=current} 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wow, I should have read this comment before answering above. ...I now understand that in my reading of that openmetrics document before sending this PR I did not understand what it was trying to communicate.. why is it structured in reverse like that? blah I mean if they had presented the ABNF form before explaining what each part consists of it would have made things so much less confusing.
ok well I'll fix the values then. I'll have to figure how to represent the states for microcode though. I could conflate unknown
and obsolete
into the boolean for failed (e.g. 0), but then we're losing information in that case. It could be an index (so in the codebase, simply the value of ucode_result).. I'll think about this for some time and send a change soon
This new `-o` option will make needrestart output information in a format that can be scraped by Prometheus or any other daemon that ingests OpenMetrics format. The -l, -w and -k options can be used in combination with -o in order to choose what information gets exported. Note that the combination of options -ol needs root access in order to correctly determine which services use outdated libraries. The kernel and microcode statuses are output as StateSet type metrics since there are more than one states for each one. This way users can track the state with more granularity and for example decide to ignore "unknown" microcode state or "version_upgrade" (e.g. non ABI-compatible upgrade) kernel state. For kernel and microcode, there's one Info type metric each that informs of the currently running vs. the expected newer version. (Closes: liske#291)
I've reworked this branch to incorporate feedback from @anarcat and other details found while reading the OpenMetrics spec more in depth and I've just force-pushed the result. what's to be expected as being changed in this force-push;
|
Thanks! |
awesome, thanks for the fixes |
This new
-o
option will make needrestart output information in a format that can be scraped by Prometheus or any other daemon that ingests OpenMetrics format.The -l, -w and -k options can be used in combination with -o in order to choose what information gets exported.
(Closes: #291)