-
Notifications
You must be signed in to change notification settings - Fork 604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minimal primary metrics endpoint #5165
Minimal primary metrics endpoint #5165
Conversation
c459892
to
61fa813
Compare
src/v/pandaproxy/server.cc
Outdated
@@ -137,7 +139,7 @@ void server::route(server::route_t r) { | |||
// NOTE: this pointer will be owned by data member _routes of | |||
// ss::httpd:server. seastar didn't use any unique ptr to express that. | |||
auto* handler = new handler_adaptor( | |||
_pending_reqs, _ctx, std::move(r.handler), r.path_desc); | |||
_pending_reqs, _ctx, std::move(r.handler), r.path_desc, _server_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the rest proxy we get "pandaproxy", It might be preferable to make it "rest_proxy". I'm not sure of the implication of renaming the server - I think it also affects a field in the rpc metrics, which may or may not be desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By changing the server name, the older metrics endpoint would refer to pandaproxy under both names (pandaproxy and rest_proxy). This is because the server
object metrics use the server name as a label. I think that would be confusing.
I feel like I'm missing some context here in order to make a decision. In general, do we want to rename pandaproxy
to rest_proxy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the case. Once upon a time Pandaproxy was it's own thing, and needed branding, but now it's a part of Redpanda proper, it's better to have it be named what it does.
But _server_name
is piped through to the old metrics, which we shouldn't change, so I think we'll need a new name for the new metrics, whilst keeping the old name for the old metrics.
I didn't quite parse that: does it mean that e.g. redpanda_kafka_write_bytes is not rolled up to a per-node value (if not, what are its labels?)? |
Let's take
And internally aggregates:
The Prometheus scrape will add something like Sometimes the data doesn't exist at the broker levelTo get topic bytes written, the query should be Sometimes there's no point in adding a broker level metric that increases cardinalityTo get the throughput published to the broker: |
@VladLazar can we change the endpoint to |
@VladLazar can you give guidance as to what is the end-result of reduction size/cardinality/etc what is the measurable improvement. how to think about this for cloud for example - i.e.: how many bytes does this save. or how much is cardinality reduced. some high level guidance is helpful. |
Sure, the metric cardinality reduction happens in a separate PR. This PR just adds the new metric endpoint. There's a bit of detail in the aggregation PR cover letter, but in brief I've seen cardinality be reduced by several orders of magnitude (somewhere between 4 and 6 times). Most of the aggregations that were introduced are done by shard, which means that the magnitude of the reduction increases with the number of shards and decreases for the other dimensions (e.g. topics, partitions). |
61fa813
to
59dc009
Compare
bb55bf4
to
2ae501f
Compare
Can you add some release notes, and move the force push comments above them. |
I wonder if it's worth having another configuration parameter, we have |
The only metrics related config we have is |
Oh yeah, it's the other way around, We can leave it to another PR if it's an open question. |
For me the question is whether |
Can you do a pass of the descriptions, so that they make sense as documentation. E.g.: |
src/v/pandaproxy/server.cc
Outdated
@@ -137,7 +139,7 @@ void server::route(server::route_t r) { | |||
// NOTE: this pointer will be owned by data member _routes of | |||
// ss::httpd:server. seastar didn't use any unique ptr to express that. | |||
auto* handler = new handler_adaptor( | |||
_pending_reqs, _ctx, std::move(r.handler), r.path_desc); | |||
_pending_reqs, _ctx, std::move(r.handler), r.path_desc, _server_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the case. Once upon a time Pandaproxy was it's own thing, and needed branding, but now it's a part of Redpanda proper, it's better to have it be named what it does.
But _server_name
is piped through to the old metrics, which we shouldn't change, so I think we'll need a new name for the new metrics, whilst keeping the old name for the old metrics.
6f9d222
to
76d8545
Compare
|
2748eb7
to
005794d
Compare
No outstanding comments from me! Delighted to have this landing as our new home for higher level metrics + I think the code for only outputting cluster stats from the leader is neat. I think others had more recent comments than I did, so will leave it for them to ✔️ it formally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great
005794d
to
910ea77
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, assuming tests pass
Signed-off-by: Ben Pope <ben@redpanda.com>
This patch adds a new 'metrics.h' header that exposes some utilites for interacting with the seastar metrics subsystem: * public_metrics_handle: the seastar internal implementation handle for metrics exposed on the '/public_metrics' endpoint * report_default_histogram: converts an HDR histogram to a seastar histogram with pre-configured defaults
Expose another prometheus endpoint on route '/public_metrics'. This endpoint should include a limited set of metrics, aggregated by default that's useful for external and internal users who lack context into the innerworkings of the project. Signed-off-by: Ben Pope <ben@redpanda.com>
This patch adds a configuration option that allows users to disable the public metric endpoints. It also updates the description of both 'disable_metrics' and 'disable_public_metric' to make it clear which metrics endpoint will be affected.
This patch disables the public metrics for the redpanda fixture used in the cluster unit tests. These simulated nodes use the same internal metrics implementation which means that normal metric registration will cause 'double_registration' errors to be thrown unless metrics are disabled.
This patch adds request latency metrics for pandaproxy and the schema registry. * repdanda_rest_proxy_request_latency_seconds * redpanda_schema_registry_request_latency_seconds * Description: latency of requests processed by the respective server which includes the time spent waiting for resources to become available * Labels: shard, operation * Aggregation: shard, operation This patch updates the server constructor to accept the name of the metrics group under which to publish metrics for the new endpoint. The server name, which is already an argument to the constructor, cannot be used for this because we wish to publish 'pandaproxy' metrics under the 'rest_proxy' name.
66e8f57
to
c71b09a
Compare
This patch adds partition and topic level endpoint to the new prometheus endpoint (exposed at "/public_metrics"). The following metrics were added: Partition Level Metrics: * redpanda_kafka_max_offset: * Description: latest committed offset for the partition (i.e. the offset of the last message safely persisted on most replicas) * Labels: namespace, topic, partition, shard * Aggregation: shard * redpanda_kafka_under_replicated_replicas: * Description: the number of under replicated replicas for the partition * Labels: namespace, topic, partition, shard * Aggregation: shard Topic Level Metrics: * redpanda_kafka_request_bytes_total: * Description: total number of bytes consumed/produced per topic * Labels: namespace, topic, partition, request ("produce" | "consume") * Aggregation: shard, partition * redpanda_kafka_replicas: * Description: number of configured replicas per topic * Labels: namespace, topic, partition * Aggregation: shard, partition Broker Level Metrics: * redpanda_kafka_request_latency_seconds: * Description: latency of produce/consume requests per broker (measures from the moment a request is initiated on the partition to the moment when the response is fulfilled) * Labels: shard * Aggregation: shard
This patch introduces a new probe that which is embedded in the controller object. Only one instance is created per broker and it matches the lifetime of the controller object. These metrics are reported only from the controller leader (one node per cluster). The following cluster level metrics were added: * redpanda_cluster_brokers: * Description: number of configured brokers in the cluster * Label: shard * Aggreation: shard * redpanda_cluster_topics: * Description: number of user created topics in the cluster * Label: shard * Aggreation: shard * redpanda_cluster_partitions: * Description: number of partitions managed by the cluster (note that replicas are not counted as partitions and that the partitions of the controller topic are included) * Label: shard * Aggregation: shard * redpanda_cluster_unavailable_partitions: * Description: number of unavailable partitions in the cluster (i.e. partitions that lack quorum among replicants) * Label: shard * Aggregation: shard
This patch extends the metric querying interface in the redpanda service to allow for fetching metrics from the newly added public endpoint (/public_metrics).
This patch extends the MetricsCheck utility by exposing the metrics endpoint to scrape the metrics from in the interface. As a result, the class can be used to test the metrics exposed by new "/public_metrics" endpoint.
This patch adds tests for the cluster level metrics exposed by the new metrics endpoint.
c71b09a
to
334d5c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice stuff
Cover letter
Currently, Redpanda exposes its full set of metrics in Prometheus format on the
/metrics
endpoint.A large number of metrics are exposed on this endpoint (e.g. 6275 for a broker with no user topics
added), many of which are only useful to someone with context regarding the inner working of
Redpanda.
To address the, issues with the
/metrics
endpoint, this set of patches adds a new endpointlower in cardinality which is meant for external and internal users that don't require such a
high level of granularity.
A list of the metrics that have been added to the new endpoint (exposed on
/public_metrics
)can be found below.
Partition Level Metrics:
offset of the last message safely persisted on most replicas)
partition
Topic Level Metrics:
request ("produce" | "consume")
Broker Level Metrics:
(measures from the moment a request is initiated on the partition to
the moment when the response is fulfilled)
For a few desirable metrics, the decision was made to obtain them
by aggregating in PromQL in order to reduce the amount of work performed on scrapes:
force push:
force push:
for_each_leader
and by introducing utility function/public_metrics
ssx/metrics.h
headerforce-push
force push:
force push:
disable_metrics
config option disable only the/metrics
endpoint metricsdisable_public_metrics
to disable registering metrics onthe
/public_metrics
endpoint.rest_proxy
name for thepublic_metrics
endpoint. Nothing changes for the
metrics
endpoint.ssx
in a namespace:ssx::metrics
.force push:
force push:
force push:
force push:
force push:
force push:
force push:
Release notes
Features
/public_metrics
, that exposes a small set ofwell curated metrics. See the documentation for the full list of published metrics. The
disable_public_metrics
configuration option disables registering metrics for the
/public_metrics
endpoint. The olderdisable_metrics
configuration option skips registration of metrics for the
/metrics
endpoint.