-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metrics aggregation capabilities #26
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be easier to review if the last commit was split up a bit and the comments improved.
The patchset brought in optimises histograms by not outputting them if they aren't used. I wonder if we always want that. Maybe it should be configurable. |
57f55f8
to
b45c02b
Compare
Didn't you remove that in this commit (the |
Looks like it. But it was also printing the headers twice. Lets consider this later. |
I split it up and expanded on the commit messages in this force push. |
b45c02b
to
2db661d
Compare
2db661d
to
1729479
Compare
1729479
to
c8de0ae
Compare
This patch add support for the summary type on the metrics layer. A summary is a different kind of histogram, it's buckets are percentile so the reporting layer (i.e. Prometheus for example) would know to report it correctly. Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a missing part to how histograms are being aggregated, it needs to aggregate the sum and count as well. Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Aggregate labels are a mechanism for reporting aggregated results. Most commonly it allows to report one histogram per node instead of per shard. This patch adds an option to mark a metric with a vector of labels. That vector will be part of the metric meta-data so the reporting layer would be able to aggregate over it. Skip when empty, means that metrics that are not in used, will not be reported. A common scenario is that user register a metrics, but that metrics is never used. The most common case is histogram and summary but it it can also happen with counters. This patch adds an option to mark a metric with skip_when_empty. When done so, if a metric was never used (true for histogram, counters and summary) it will not be reported. Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds multiple functionality to Prometheus reporting: 1. Add summary reporting. Summaries are used in seastar to report aggregated percentile information (for example p95 and p99) The main usage is to report per-shard summary of a latency histograms. 2. Support aggregated metrics. With an aggregated metrics, Prometheus would aggregate multiple metrics based on labels and would report the result. Usually this would be for reporting a single latency histogram per node instead of per shard. But it could be used for counters and gauge as well. 3. Skip empty counters, histograms and summaries. It's a common practice to register lots of metrics even if they are not being used. Histograms have a huge effect on performance, so not reporting an empty histogram is a great performance boost both for the application and for the Prometheus server. This is true for Summaries and Counters as well, marking a metrics with skip_when_empty would mean Prometheus will not report those metrics. 4. As an optimization, the stringstream that is used per metric is reused and clear insted of recreated. Signed-off-by: Amnon Heiman <amnon@scylladb.com>
c8de0ae
to
3f35820
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with this. Sent some minor nits to the mailing list.
This PR contains a set of patches from the seastar mailing list (patch set and repo). They add aggregations
support to the seastar metrics subsystem. This PR used to be based on an older version of the upstream patch set,
but Ben fed back our findings and the new version matches our needs as is.
Mailing List Cover Letter
Motivation:
Histograms are the Prometheus killer. They are big to send over the
network and are big to store in Prometheus.
Histograms' size is always an issue, but with Seastar, it becomes even trickier.
Typically, each shard would collect its histograms, so the overall data
is multiplied by the number of shards.
This series addresses the need to report quantile information like latency
without generating massive metrics reports.
A summary is a Prometheus metric type that holds a quantile summary (i.e.
p95, p99).
The downside of summaries is that they cannot be aggregated, which is
needed for a distributed system (i.e., calculate the p99 latency of a cluster).
The series adds four tools for Prometheus performance:
optional services. It is now possible to mark those metrics as
skip_when_empty and they will not be reported if they were never used.
metrics instead of per shard. For example, for multi-nodes quantile calculation,
we need a per-node histogram. It is now possible to mark a registered
metric for aggregation. The metrics layer will aggregate this
metric based on a list of labels. (Typically, this will be by shard,
but it could be any other combination of labels).
iteration.
force push:
sm::label
strong type instead ofstd::string
for specifying aggregation labelsforce push:
aggregation_labels
argument from metric creation functions (make_<metric_type>
).force push:
write_counter
function in the commit it's introducedforce push:
sm::label
in setter of aggregation labelsforce push:
equivalent to what we had on this branch before (it also allows for skipping reporting of
empty metrics configurably).