Implement option to output information as OpenMetrics time series #308

lelutin · 2024-07-30T21:09:26Z

This new -o option will make needrestart output information in a format that can be scraped by Prometheus or any other daemon that ingests OpenMetrics format.

The -l, -w and -k options can be used in combination with -o in order to choose what information gets exported.

(Closes: #291)

lelutin · 2024-07-30T21:22:18Z

note, I've chosen to split the version information out into a dedicated ...build_info time series. This makes it possible to graph the timestamp normally without risking discontinuity if needrestart and/or perl is upgraded and thus the labels change, and the dedicated metric can be used just for aggregation if needs be.
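To illustrate the discontinuity concern: a time series is identified by its metric name plus its complete label set, so a version label that changes on upgrade starts a new series. The following is a minimal Python sketch of that idea only; the metric and label names are hypothetical and are not necessarily the ones needrestart emits.

```python
def series_identity(name, labels):
    """A series is identified by its metric name plus its complete label set."""
    return (name, frozenset(labels.items()))


# Variant A: the version is attached as a label to the metric being graphed.
# Upgrading changes the label value, so a new series begins.
before_a = series_identity("example_status", {"version": "3.6"})
after_a = series_identity("example_status", {"version": "3.7"})

# Variant B: versions live only in a dedicated *_build_info series.
# The graphed metric keeps the same identity across the upgrade.
before_b = series_identity("example_status", {})
after_b = series_identity("example_status", {})
build_before = series_identity("example_build_info", {"version": "3.6"})
build_after = series_identity("example_build_info", {"version": "3.7"})

print(before_a == after_a)  # False: the graph would show a break
print(before_b == after_b)  # True: the graph stays continuous
```

Only the dedicated build_info series changes identity on upgrade, which is acceptable because it is meant for joins and aggregation rather than for graphing over time.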

anarcat

good job untangling that code, it's hard to find your way in there, and you seem to have found the right spots!

but the metrics exposition formats need a little work. i think once that's done, it's ready to merge, as far as i'm concerned.

it would be helpful to show sample outputs, at least in the commit log, showing that the previous nagios and plain batch modes still work as well... the openmetrics output could be useful in the README or manpage directly as well.

anarcat · 2024-07-31T06:18:01Z

needrestart

+
+ my $ometric_now = time();
+ print "# HELP needrestart_timestamp when, in unix timestamp, needrestart was last updated\n";
+ print "# TYPE needrestart_timestamp gauge\n";


i always get confused by this, but according to this guide, i think this should be suffixed with _seconds.

also, maybe we need to be more explicit here and call this needrestart_last_update_timestamp_seconds?

but then again, why do we have that metric at all? this guide advises against it...

lelutin

oh interesting, I should take a better look at the prometheus documentation about writing exporters; there seem to be a couple of good hints in there that I haven't integrated yet.

now that you mention it, it seems indeed to be redundant.. the timing info will already be there in the time series. so let's drop that metric. I'll push this change in a couple minutes

anarcat · 2024-07-31T06:21:26Z

needrestart

+ if ($opt_k) {
+ print "# HELP needrestart_kernel_info information about the kernel\n";
+ print "# TYPE needrestart_kernel_info info\n";
+ print "needrestart_kernel_info{running=$ometric_kernel_values{krunning},expected=$ometric_kernel_values{kexpected}} $ometric_kernel_values{kstatus}\n";


just to be sure, this would look like needrestart_kernel_info{running=6.9.7,expected=6.9.7} 1 on success and needrestart_kernel_info{running=6.9.6,expected=6.9.7} 0 on failure, am i right?

maybe we should have actual sample outputs of this somewhere to clarify for end users how they can actually alert on this stuff (or even better, point people at our alerts, once we have them... ;)

lelutin

no, currently it will look like needrestart_kernel_info{running=6.9.7,expected=6.9.7} current on success

and needrestart_kernel_info{running=6.9.6,expected=6.9.7} obsolete on failure.

that detail was something that I was really not sure about how to output.. outputting a binary value could be interesting in the case of the kernel since there's currently only two possible states. however, for microcode, there are three different states, unknown, current and obsolete. so to keep things consistent, I decided to output the state "name" for both.

should the kernel_info metric be a bool instead?
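For reference, the OpenMetrics text format expects label values to be double-quoted strings and sample values to be numbers; Info samples in particular always carry the value 1. Below is a minimal Python sketch of a helper that renders one valid sample line. The helper names are hypothetical and this is not needrestart's Perl code, only an illustration of the line shape.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline inside a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_sample(name, labels, value):
    """Render one sample line: name{label="value",...} <number>."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"sample value must be a number, got {value!r}")
    label_text = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
    )
    return f"{name}{{{label_text}}} {value}"


line = format_sample(
    "needrestart_kernel_info", {"running": "6.9.6", "expected": "6.9.7"}, 1
)
print(line)
# needrestart_kernel_info{running="6.9.6",expected="6.9.7"} 1
```

Passing a state name such as "current" as the value raises a TypeError here, which mirrors the constraint that a state has to be expressed through labels or a separate metric rather than through the sample value.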

anarcat · 2024-07-31T06:25:54Z

needrestart

+ my $ometric_ucode_status = ("unknown", "current", "obsolete")[$ucode_result];
+ print "# HELP needrestart_ucode_info information about the CPU microcode\n";
+ print "# TYPE needrestart_ucode_info info\n";
+ print "needrestart_ucode_info{running=$ometric_ucode_current,expected=$ometric_ucode_expected} $ometric_ucode_status\n";


okay, that i'm pretty sure won't work. my perl is also rusty, but i feel this would look like:

needrestart_ucode_info{running=0x0,expected=0x1} obsolete

on failure... but prometheus won't know what to do with the string "obsolete", that's not a metric value (an integer), it's a string! OpenMetrics specifically allows only integers and floats, see this section...

you'll need something that reflects that status a little better. i would suggest using 1 for success and 0 for failure, and having the status as a label, for example see those three possible cases:

needrestart_ucode_info{running=0x0,expected=0x1,status=obsolete} 0
needrestart_ucode_info{running=,expected=0x1,status=unknown} 0
needrestart_ucode_info{running=0x0,expected=0x0,status=current} 1
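The point about string values can be checked mechanically. Here is a rough Python sketch (a hypothetical helper, not part of needrestart or of any Prometheus library) that tests whether the value field of a sample line parses as a number. It assumes the line has no trailing timestamp or exemplar, and Python's float() accepts a few spellings that the spec does not, so treat it as a sanity check rather than a validator.

```python
def sample_value_is_numeric(line):
    """Return True if the last space-separated field parses as a number.

    Rough check only: assumes no trailing timestamp or exemplar.
    """
    value = line.rsplit(" ", 1)[-1]
    try:
        float(value)
    except ValueError:
        return False
    return True


print(sample_value_is_numeric(
    'needrestart_ucode_info{running="0x0",expected="0x1"} obsolete'
))  # False
print(sample_value_is_numeric(
    'needrestart_ucode_info{running="0x0",expected="0x1",status="obsolete"} 0'
))  # True
```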

lelutin

oh wow, I should have read this comment before answering above. ...I now understand that when I read that openmetrics document before sending this PR, I did not understand what it was trying to communicate.. why is it structured in reverse like that? blah, I mean if they had presented the ABNF form before explaining what each part consists of, it would have made things so much less confusing.

ok well I'll fix the values then. I'll have to figure out how to represent the states for microcode though. I could conflate unknown and obsolete into the boolean for failed (e.g. 0), but then we're losing information in that case. It could be an index (so in the codebase, simply the value of ucode_result).. I'll think about this for some time and send a change soon
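One way to represent three states without losing information is the OpenMetrics StateSet type: one sample per possible state, with value 1 for the active state and 0 for the others, where the label holding the state name is named after the metric family. The following Python sketch shows that shape; the family name is hypothetical and the code is an illustration, not the implementation that was merged.

```python
def stateset_lines(family, states, active):
    """Render a StateSet: one sample per possible state, 1 for the active one.

    In the OpenMetrics text format the label carrying the state name is
    named after the metric family itself.
    """
    if active not in states:
        raise ValueError(f"unknown state: {active!r}")
    lines = [f"# TYPE {family} stateset"]
    for state in states:
        flag = 1 if state == active else 0
        lines.append(f'{family}{{{family}="{state}"}} {flag}')
    return lines


# Hypothetical family name; the state names come from the diff under review.
UCODE_STATES = ("unknown", "current", "obsolete")
for line in stateset_lines("example_ucode_status", UCODE_STATES, "obsolete"):
    print(line)
```

With this layout an alert can match on a single state (for example only "obsolete") and ignore "unknown", which the boolean and index encodings discussed above would make harder.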

This new `-o` option will make needrestart output information in a format that can be scraped by Prometheus or any other daemon that ingests the OpenMetrics format.

The -l, -w and -k options can be used in combination with -o in order to choose what information gets exported. Note that the combination of options -ol needs root access in order to correctly determine which services use outdated libraries.

The kernel and microcode statuses are output as StateSet type metrics since each of them has more than one possible state. This way users can track the state with more granularity and, for example, decide to ignore the "unknown" microcode state or the "version_upgrade" (e.g. non ABI-compatible upgrade) kernel state.

For kernel and microcode, there's one Info type metric each that reports the currently running version vs. the expected newer version.

(Closes: liske#291)

lelutin · 2024-07-31T23:18:55Z

I've reworked this branch to incorporate feedback from @anarcat and other details found while reading the OpenMetrics spec more in depth and I've just force-pushed the result.

what changed in this force-push:

- the -l option in combination with -o now requires root access, otherwise the metric for outdated libraries was always set to 0 since an unprivileged user can't dig around all running processes to find out this information
- the code and metrics now talk about "outdated libraries" and "processes with outdated libraries", which should be more precise for what we're exposing
- the metrics were renamed to be more descriptive of what they expose
- the man page now has some description of which metrics will be exposed and their types
- the timestamp metric was removed
- the info-type metrics for kernel and microcode, which were wrongly typed and implemented, now use the StateSet type to expose all of the possible states
- there's an additional info-type metric for kernel and microcode in order to expose the current and expected versions. This is instead of attaching those versions as labels to all entries in the StateSet metrics. it reduces useless duplication and still makes it possible to group by those metrics if needed in order to filter or add those labels to results
- the metadata for all MetricFamilies was reordered as per the OpenMetrics recommendation (type, then help -- unit was not used)
- a trailing # EOF\n was added to properly end the exposition as defined by OpenMetrics
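The last two points (metadata order and the trailing EOF marker) can be sketched as follows. This is a minimal Python illustration with hypothetical function and metric names, not the Perl code from this branch: each family emits its TYPE line, then its HELP line, then its samples, and the whole exposition ends with "# EOF" followed by a newline.

```python
def render_family(name, metric_type, help_text, samples):
    """Render one metric family: TYPE first, then HELP, then its samples."""
    lines = [f"# TYPE {name} {metric_type}", f"# HELP {name} {help_text}"]
    lines.extend(samples)
    return lines


def render_exposition(families):
    """Join all families and terminate the exposition with '# EOF'."""
    lines = []
    for family in families:
        lines.extend(render_family(*family))
    lines.append("# EOF")
    return "\n".join(lines) + "\n"


# Hypothetical gauge used only to show the layout.
output = render_exposition([
    (
        "example_outdated_processes",
        "gauge",
        "Processes using outdated libraries",
        ["example_outdated_processes 3"],
    ),
])
print(output, end="")
```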

liske · 2024-08-09T14:26:27Z

Thanks!

anarcat · 2024-08-11T00:50:57Z

awesome, thanks for the fixes

anarcat suggested changes Jul 31, 2024


lelutin force-pushed the openmetrics branch from ceb980b to 07fb744 Compare July 31, 2024 23:04

lelutin force-pushed the openmetrics branch from 07fb744 to b372b17 Compare July 31, 2024 23:07

liske added the enhancement label Aug 9, 2024

liske added this to the v3.7 milestone Aug 9, 2024

liske merged commit 08c0421 into liske:master Aug 9, 2024

lelutin mentioned this pull request Aug 28, 2024

OpenMetrics output is not taken as valid by prometheus #310

Closed

lelutin deleted the openmetrics branch August 28, 2024 21:03
