Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement option to output information as OpenMetrics time series #308

Merged
merged 1 commit into from
Aug 9, 2024

Conversation

lelutin
Copy link

@lelutin lelutin commented Jul 30, 2024

This new -o option will make needrestart output information in a format that can be scraped by Prometheus or any other daemon that ingests OpenMetrics format.

The -l, -w and -k options can be used in combination with -o in order to choose what information gets exported.

(Closes: #291)

@lelutin
Copy link
Author

lelutin commented Jul 30, 2024

note, I've chosen to split the version information out into a dedicated ...build_info time series. This makes it possible to graph the timestamp normally without risking discontinuity if needrestart and/or perl is upgraded and thus the labels change, and the dedicated metric can be used just for aggregation if needs be.

Copy link
Contributor

@anarcat anarcat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good job untangling that code, it's hard to find your way in there, and you seem to have found the right spots!

but the metrics exposition formats need a little work. i think once that's done, it's ready to merge, as far as i'm concerned.

it would be helpful to show sample outputs, at least in the commit log, that show the previous nagios and plain batch mode still works as well... the openmetrics output could be useful in the README or manpage directly as well.

needrestart Outdated

my $ometric_now = time();
print "# HELP needrestart_timestamp when, in unix timestamp, needrestart was last updated\n";
print "# TYPE needrestart_timestamp gauge\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i always get confused by this, but according to this guide, i think this should be suffixed with _seconds.

i also, maybe we need to be more explicit here and call this needrestart_last_update_timestamp_seconds?

but then again, whey do we have that metric at all? this guide advises against it...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh interesting I should take a better look at the prometheus documentation about writing exporters, there seems to be a couple of good hints in there that I haven't yet integrated.

now that you mention it, it seems indeed to be redundant.. the timing info will already be there in the time series. so let's drop that metric. I'll push this change in a couple minutes

needrestart Outdated
if ($opt_k) {
print "# HELP needrestart_kernel_info information about the kernel\n";
print "# TYPE needrestart_kernel_info info\n";
print "needrestart_kernel_info{running=$ometric_kernel_values{krunning},expected=$ometric_kernel_values{kexpected}} $ometric_kernel_values{kstatus}\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to be sure, this would look like needrestart_kernel_info{running=6.9.7,expected=6.9.7} 1 on success and needrestart_kernel_info{running=6.9.6,expected=6.9.7} 0 on failure, am i right?

maybe we should have actual sample outputs of this somewhere to clarify end-users on how they can actually alert on this stuff (or even better, point people at our alerts, once we have them... ;)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no currently it will look like needrestart_kernel_info{running=6.9.7,expected=6.9.7} current on success

and needrestart_kernel_info{running=6.9.6,expected=6.9.7} obsolete on failure.

that detail was something that I was really not sure about how to output.. outputting a binary value could be interesting in the case of the kernel since there's currently only two possible states. however, for microcode, there are three different states, unknown, current and obsolete. so to keep things consistent, I decided to output the state "name" for both.

should the kernel_info metric be a bool instead?

needrestart Outdated
my $ometric_ucode_status = ("unknown", "current", "obsolete")[$ucode_result];
print "# HELP needrestart_ucode_info information about the CPU microcode\n";
print "# TYPE needrestart_ucode_info info\n";
print "needrestart_ucode_info{running=$ometric_ucode_current,expected=$ometric_ucode_expected} $ometric_ucode_status\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, that i'm pretty sure won't work. my perl is also rusty, but i feel this would look like:

needrestart_ucode_info{running=0x0,expected=0x1} obsolete

on failure... but prometheus won't know what to do with the string "obsolete", that's not a metric value (an integer), it's a string! OpenMetrics specifically allows only integers and floats, see this section...

you'll need something that reflects that status a little better. i would suggest using 1 for success and 0 for failure, and having the status as a label, for example see those three possible cases:

needrestart_ucode_info{running=0x0,expected=0x1,status=obsolete} 0
needrestart_ucode_info{running=,expected=0x1,status=unknown} 0
needrestart_ucode_info{running=0x0,expected=0x0,status=current} 1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wow, I should have read this comment before answering above. ...I now understand that in my reading of that openmetrics document before sending this PR I did not understand what it was trying to communicate.. why is it structured in reverse like that? blah I mean if they had presented the ABNF form before explaining what each part consists of it would have made things so much less confusing.

ok well I'll fix the values then. I'll have to figure how to represent the states for microcode though. I could conflate unknown and obsolete into the boolean for failed (e.g. 0), but then we're losing information in that case. It could be an index (so in the codebase, simply the value of ucode_result).. I'll think about this for some time and send a change soon

This new `-o` option will make needrestart output information in a
format that can be scraped by Prometheus or any other daemon that
ingests OpenMetrics format.

The -l, -w and -k options can be used in combination with -o in order to
choose what information gets exported.

Note that the combination of options -ol needs root access in order to
correctly determine which services use outdated libraries.

The kernel and microcode statuses are output as StateSet type metrics
since there are more than one states for each one. This way users can
track the state with more granularity and for example decide to ignore
"unknown" microcode state or "version_upgrade" (e.g. non ABI-compatible
upgrade) kernel state.
For kernel and microcode, there's one Info type metric each that informs
of the currently running vs. the expected newer version.

(Closes: liske#291)
@lelutin
Copy link
Author

lelutin commented Jul 31, 2024

I've reworked this branch to incorporate feedback from @anarcat and other details found while reading the OpenMetrics spec more in depth and I've just force-pushed the result.

what's to be expected as being changed in this force-push;

  • the -l option in combination with -o now requires root access, otherwise the metric for outdated libraries was always set to 0 since an unprivileged user can't dig around all running processes to find out this information
  • the code and metrics now talk about "outdated libraries" and "processes with outdated libraries", which should be more precise for what we're exposing
  • the metrics were renamed to be more descriptive of what they expose
  • the man page now has some description of which metrics will be exposed and their types
  • the timestamp metric was removed
  • the info-type metrics for kernel and microcode, which were wrongly typed and implemented, now use the StateSet type to expose all of the possible states
  • there's an additional info-type metric for kernel and microcode in order to expose the current and expected versions. This is instead of attaching those versions as labels to all entries in the StateSet metrics. it reduces useless duplication and still makes it possible to group by those metrics if needed in order to filter or add those labels to results
  • the metadata for all MetricFamilies were reordered as per the OpenMetric recommendation (type, then help -- unit was not used)
  • a trailing # EOF\n was added to properly end the exposition as defined by OpenMetrics

@liske liske added this to the v3.7 milestone Aug 9, 2024
@liske liske merged commit 08c0421 into liske:master Aug 9, 2024
@liske
Copy link
Owner

liske commented Aug 9, 2024

Thanks!

@anarcat
Copy link
Contributor

anarcat commented Aug 11, 2024

awesome, thanks for the fixes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

prometheus metrics output
3 participants