add zpool_influxdb command #10786

Merged 2 commits into openzfs:master on Oct 9, 2020

Conversation

richardelling (Contributor)

A zpool_influxdb command is introduced to ease the collection of zpool statistics into the InfluxDB
time-series database. Examples are given on how to integrate with the telegraf statistics aggregator,
a companion to influxdb. Finally, a grafana dashboard template is included to show how pool latency
distributions can be visualized in a ZFS + telegraf + influxdb + grafana environment.

Motivation and Context

InfluxDB is one of the premier open-source time-series databases. There exist methods to get
simple zpool properties and zfs performance data from /proc into influxdb via telegraf. However,
the pool specifics are not readily available in /proc; rather, ZFS admins have relied on the zpool
command. Unfortunately, the zpool command is intended for humans and cannot be parsed easily.
zpool_influxdb can be considered a replacement for the zpool command, one whose output is intended
to be parsed for influxdb.

Description

In many ways, zpool_influxdb can be considered a userland replacement for parseable zpool output.
Unlike the zpool command, which reads all of the pool configuration, health, and performance data
and then shows only a very small subset of the information, zpool_influxdb comprehensively presents
all of the information in one pass.

It is also possible to look at the output of the zpool_influxdb command directly. It just isn't
intended to be human-friendly, so if you are a human, use the zpool command instead.
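
For readers unfamiliar with the approach, here is a minimal sketch, not the code in this PR, of how
a tool like this can iterate pools with libzfs and emit InfluxDB line-protocol points on stdout. The
measurement name (zpool_stats) and the single field shown are illustrative assumptions; the actual
command exports far more, including the latency histograms mentioned above.

/*
 * Minimal sketch (not the PR's code): iterate pools with libzfs and
 * print one InfluxDB line-protocol point per pool. Measurement and
 * field names here are illustrative. Link with -lzfs -lnvpair.
 */
#include <stdio.h>
#include <libzfs.h>

static int
print_pool_point(zpool_handle_t *zhp, void *data)
{
	nvlist_t *config = zpool_get_config(zhp, NULL);
	uint64_t guid = 0;

	(void) data;
	(void) nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_GUID, &guid);

	/* line protocol: measurement,tag=value field=value */
	printf("zpool_stats,name=%s guid=%lluu\n",
	    zpool_get_name(zhp), (unsigned long long)guid);

	zpool_close(zhp);
	return (0);
}

int
main(void)
{
	libzfs_handle_t *g_zfs = libzfs_init();

	if (g_zfs == NULL)
		return (1);
	(void) zpool_iter(g_zfs, print_pool_point, NULL);
	libzfs_fini(g_zfs);
	return (0);
}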

How Has This Been Tested?

This PR includes new ZTS tests.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@codecov

codecov bot commented Aug 23, 2020

Codecov Report

Merging #10786 into master will decrease coverage by 37.26%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           master   #10786       +/-   ##
===========================================
- Coverage   79.76%   42.50%   -37.27%     
===========================================
  Files         395      365       -30     
  Lines      125039   116223     -8816     
===========================================
- Hits        99742    49402    -50340     
- Misses      25297    66821    +41524     
Flag Coverage Δ
#kernel 7.04% <ø> (-73.38%) ⬇️
#user 47.53% <ø> (-17.96%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
include/sys/xvattr.h 0.00% <0.00%> (-100.00%) ⬇️
module/zfs/objlist.c 0.00% <0.00%> (-100.00%) ⬇️
module/zfs/pathname.c 0.00% <0.00%> (-100.00%) ⬇️
include/sys/zfs_znode.h 0.00% <0.00%> (-100.00%) ⬇️
include/sys/dmu_redact.h 0.00% <0.00%> (-100.00%) ⬇️
include/sys/zfs_project.h 0.00% <0.00%> (-100.00%) ⬇️
include/sys/dmu_traverse.h 0.00% <0.00%> (-100.00%) ⬇️
module/os/linux/zfs/mmp_os.c 0.00% <0.00%> (-100.00%) ⬇️
include/os/linux/spl/sys/uio.h 0.00% <0.00%> (-100.00%) ⬇️
include/os/linux/spl/sys/proc.h 0.00% <0.00%> (-100.00%) ⬇️
... and 285 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b15b8d...6791c0c.

@behlendorf added the Status: Code Review Needed label (Ready for review and testing) on Aug 24, 2020
@ghost left a comment

This seems like a cool idea. I take it the output of this command is InfluxDB line format data that could be piped directly into a curl command. Could you add a command line argument for extra tags? For example if I want to also add a hostname tag, something like --extra-tags hostname=$(hostname).

}
#endif

#endif /* ZFS_ZPOOL_INFLUXDB_H */

Can't we do without this header?

Contributor Author

I'm using it as a convenient place to define SUPPORT_UINT64. But perhaps it is better to reverse that logic and allow override for not supporting uint64. Thoughts?


I think either way can be done without a header.

If you see this being a regularly used configuration option, I would add something to ./configure for it instead. Something that detects the influxdb version by default and can be outsmarted by --with-influxdb=2 or the like.

This doesn't actually have a build dependency on influxdb, so falling back to the newest version if not present would seem appropriate. With that in mind, inverting the logic to be an opt-in for compat with the older version makes sense.
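
For illustration, a hedged sketch of what the inverted, opt-in compatibility logic could look like;
the macro names are assumptions, not taken from this PR. In line protocol, integer fields carry an
'i' suffix and unsigned integer fields a 'u' suffix, so the compat knob only has to change the suffix:

/*
 * Hypothetical sketch of the opt-in discussed above: default to native
 * unsigned 64-bit fields (suffix 'u') and let an explicit define fall
 * back to signed fields (suffix 'i') for old InfluxDB servers that lack
 * uint64 support. Macro names are illustrative.
 */
#ifdef NO_SUPPORT_UINT64
#define	IFX_U64_SUFFIX	"i"	/* older InfluxDB: emit as signed integer */
#else
#define	IFX_U64_SUFFIX	"u"	/* newer InfluxDB: native unsigned fields */
#endif

/* usage: printf("alloc=%llu%s", (unsigned long long)alloc, IFX_U64_SUFFIX); */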

Contributor Author

The change for unsigned support is actually several years old now. When using telegraf, which is the preferred method, unsigned ints are handled properly and we don't really need a recompile. I'd like to treat this as a "don't go back to the ice age".

@richardelling (Contributor Author)

> This seems like a cool idea. I take it the output of this command is InfluxDB line format data that could be piped directly into a curl command. Could you add a command line argument for extra tags? For example if I want to also add a hostname tag, something like --extra-tags hostname=$(hostname).

Excellent idea! I've used that on other collectors I've written. I'll add it.
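
As a rough sketch of how such an option could be handled (the option name and helper below are
hypothetical, not necessarily what was merged), the extra tags can simply be appended to the
line-protocol tag set:

/*
 * Hypothetical sketch: split a --tags style argument such as
 * "hostname=myhost,rack=r12" and append each key=value pair to the
 * line-protocol tag set. Not the merged implementation.
 */
#include <stdio.h>
#include <string.h>

static void
print_extra_tags(char *tags)
{
	char *pair, *last = NULL;

	for (pair = strtok_r(tags, ",", &last); pair != NULL;
	    pair = strtok_r(NULL, ",", &last)) {
		/* tags precede the space that separates tags from fields */
		printf(",%s", pair);
	}
}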

@behlendorf (Contributor) left a comment

This is pretty cool. It'll be nice to have a better way to collect this kind of data.

Review comments on cmd/zpool_influxdb/README.md and cmd/zpool_influxdb/zpool_influxdb.c

@behlendorf (Contributor) left a comment

@richardelling would it make sense to rename this monitoring utility to something a little more generic? That would allow us to extend it to other possible output formats without confusion. Or make it slightly less confusing if things other than influxdb find it useful. Maybe zstat or zmonitor?


[tests/functional/zpool_influxdb]
tests = 'zpool_influxdb'
tags = ['functional', 'metrics']
Contributor

It'd probably make sense to move these tests under tests/functional/cli_user/ and run them as a normal unprivileged user. (user =)

@richardelling (Contributor Author), Sep 21, 2020

good idea, will do

@richardelling (Contributor Author)

@behlendorf checkstyle/deploy is failing, would a rebase fix it?

@gmelikov (Member)

@richardelling I've restarted it manually

@gmelikov (Member)

@richardelling yep, rebase may fix it with apt-get update in a541f7d#diff-d72df87130a70a1aa96ef2bf6b1ad454

@richardelling (Contributor Author)

> @richardelling would it make sense to rename this monitoring utility to something a little more generic? That would allow us to extend it to other possible output formats without confusion. Or make it slightly less confusing if things other than influxdb find it useful. Maybe zstat or zmonitor?

At first glance, there is some merit to that idea. However, the two prevailing metrics styles, prometheus and influx, are very different in how they print metrics. Approximately half of the code is around printing; much less is the actual data collection, which is almost trivial by comparison.

They are also very different in that influx is most often used in a push model (push metrics to a database HTTP endpoint) while prometheus is used in a pull model (the database collects metrics from a ZFS-node-based HTTP endpoint). Obviously, including an HTTP API service is more tedious to get right, especially in C. Today, both zpool_influxdb and zpool_prometheus are available from my public repo. My plan is to get zpool_influxdb in and then update zpool_prometheus with a built-in HTTP server and contribute that separately.

That said, a better architecture on the ZFS side is to relocate the spa config and spa stats into kstats. That will expose silly limitations in last century's kstat design. So maybe that is a task for the future.
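
For illustration only, here is the same made-up pool metric rendered in the two formats, which shows
why the printing paths diverge even though the collected data is identical (names and values are
invented for this sketch):

/*
 * Illustrative only: one hypothetical pool metric in each format.
 */
#include <stdio.h>

int
main(void)
{
	/* InfluxDB line protocol: tags and fields on one line, pushed out. */
	printf("zpool_stats,name=tank,state=ONLINE alloc=1234u,free=5678u\n");

	/* Prometheus exposition format: one sample per line, scraped. */
	printf("# TYPE zpool_alloc_bytes gauge\n");
	printf("zpool_alloc_bytes{name=\"tank\",state=\"ONLINE\"} 1234\n");
	return (0);
}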

@behlendorf (Contributor)

> At first glance, there is some merit to that idea. However, the two prevailing metrics styles, prometheus and influx, are very different in how they print metrics.

That makes sense, I just wanted to get your thoughts. If the need arises we can always do this in the future.

> That said, a better architecture on the ZFS side is to relocate the spa config and spa stats into kstats.

Yes, or perhaps something a bit more flexible than kstats. But I agree that's a job for another day.

If you can rebase this and resolve that one last bit of feedback this looks ready to merge.

@ghost left a comment

You'll want to add cmd/zpool_influxdb/.gitignore containing /zpool_influxdb as well.

Review comment on tests/runfiles/common.run
@IvanVolosyuk (Contributor)

Is it possible to add a screenshot to the pull request to show what the result will look like? I did something similar for netdata (a data collector which shows space distribution in a pool between filesystems/snapshots). I wonder if it is much superior to my experiments.

@richardelling (Contributor Author)

@IvanVolosyuk the size of datasets is not available in the pool configuration we read here. For dataset information, the various data collectors and agents already exist to deliver that info via kstats.

A future project that would be useful is to convert the pool stats into kstats. Until then, this approach works better than screen-scraping zpool command output.

A zpool_influxdb command is introduced to ease the collection
of zpool statistics into the InfluxDB time-series database.
Examples are given on how to integrate with the telegraf
statistics aggregator, a companion to influxdb.

Finally, a grafana dashboard template is included to show
how pool latency distributions can be visualized in a
ZFS + telegraf + influxdb + grafana environment.

Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
@behlendorf added the Status: Accepted label (Ready to integrate, reviewed, tested) and removed the Status: Code Review Needed label (Ready for review and testing) on Oct 5, 2020
@behlendorf merged commit e9527d4 into openzfs:master on Oct 9, 2020
ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Oct 15, 2020
This was requested but forgotten in openzfs#10786

Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
@ghost mentioned this pull request Oct 15, 2020
behlendorf pushed a commit that referenced this pull request Oct 16, 2020
This was requested but forgotten in #10786.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #11071

jsai20 (Mar 30, 2021), sempervictus (May 31, 2021), and tonyhutter (Jun 11, 2021) later pushed commits to their respective forks (jsai20/zfs, sempervictus/zfs, tonyhutter/zfs) that referenced this pull request, repeating the two commits above (Closes openzfs#10786 and openzfs#11071).

Labels
Status: Accepted (Ready to integrate, reviewed, tested)

4 participants