
Karapace metrics #652

Closed
libretto wants to merge 26 commits

Conversation

@libretto (Contributor) commented Jun 9, 2023

This add-on extends the StatsD metrics available for monitoring the Karapace server. The following additional metrics are now included:

connections_active
request_size_avg
request_size_max
response_size_avg
response_size_max
master_slave_role
request_error_rate
response_rate
response_byte_rate
latency_max
latency_avg

The aiohttp library does not provide native support for metrics, so we gather the required data by collecting statistics from the request handler, aiohttp logs, and other parts of the Karapace code. We plan to extend this module with additional metrics.
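For illustration, a minimal sketch of the kind of request-handler instrumentation described above, written as an aiohttp middleware. The statsd_client object and its gauge()/timing() helpers, as well as the metric names, are placeholders for this sketch rather than the exact code in this PR:

import time

from aiohttp import web

def make_metrics_middleware(statsd_client):
    # statsd_client is assumed to expose gauge() and timing() helpers.
    @web.middleware
    async def metrics_middleware(request: web.Request, handler):
        start = time.monotonic()
        response = await handler(request)
        elapsed_ms = (time.monotonic() - start) * 1000
        # Raw per-request observations; aggregation is left to the metrics backend.
        statsd_client.gauge("request-size", request.content_length or 0)
        statsd_client.gauge("response-size", response.content_length or 0)
        statsd_client.timing("latency", elapsed_ms)
        return response

    return metrics_middleware

The middleware would then be registered with web.Application(middlewares=[make_metrics_middleware(client)]).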

My objective is to incorporate metrics that are similar to what can be found in competitor products. To achieve this, I am utilizing the calculation methods from kafka.metrics.Metrics class.

Karapace competitors use sampled metrics and compute avg, min, and max on the application side in order to support pull-based statistics services. This model is commonly used with platforms like Prometheus, and it is also the preferred choice for our client. At present, however, Karapace only supports StatsD. I anticipate that customers who actively rely on pull-based statistics will require support for such services, so integrating them into our metrics class would be necessary to meet their needs.

@libretto libretto requested review from a team as code owners June 9, 2023 07:17
@libretto (Contributor, Author) commented Jun 9, 2023

@tvainika I just fixed more issues and added as many metrics as possible that are similar to those in competing products. I also now use the native kafka metrics classes.

@aiven-anton (Contributor) commented:

Continuing discussion from original PR. If possible, in the future I'd ask that you please avoid closing and re-opening separate PRs like this, as it makes it quite hard to follow progress and review. If you need to "start over", please feel free to force push to your feature branch instead.


Karapace competitors use sampled metrics and perform calculations for avg, min, max on the application side in order to facilitate Pull-based statistics services. This particular model is commonly utilized in platforms like Prometheus [...]

This does not sound accurate to me. In my experience with Prometheus, it is much more common to avoid calculations on the application side and expose metrics in their raw form. Averages and other aggregations are then produced by issuing PromQL queries against the Prometheus service.

It seems reasonable to take that route here as well, and for instance expose sent_response_bytes and received_request_bytes instead. Together with a request count, that should be everything required to reliably calculate aggregations in Prometheus or similar.
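For illustration only (not code from this PR), raw counters of that kind could look as follows, here using the third-party prometheus_client package; the metric names follow the suggestion above:

from prometheus_client import Counter

# Raw, monotonically increasing counters; no averages or rates are computed in the application.
received_request_bytes = Counter("received_request_bytes", "Total bytes received in requests")
sent_response_bytes = Counter("sent_response_bytes", "Total bytes sent in responses")
request_count = Counter("request_count", "Total number of handled requests")

def observe_request(request_size: int, response_size: int) -> None:
    received_request_bytes.inc(request_size)
    sent_response_bytes.inc(response_size)
    request_count.inc()

# Average request size over the last 5 minutes is then a PromQL query, e.g.:
#   rate(received_request_bytes_total[5m]) / rate(request_count_total[5m])
# (prometheus_client appends the _total suffix to counter names on the /metrics endpoint.)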

(5 resolved review threads on karapace/karapacemetrics.py)
@libretto (Contributor, Author) commented Jun 9, 2023

Continuing discussion from original PR. If possible, in the future I'd ask that you please avoid closing and re-opening separate PRs like this [...]

This does not sound accurate to me. In my experience when using Prometheus, it's much more common to avoid calculations on the application side, and expose metrics in their raw forms. [...]

It seems reasonable to take that route here as well, and for instance rather expose sent_response_bytes, received_request_bytes. [...]

My client requires support for metrics that are already provided in competitor projects, and they also anticipate the option to integrate with Prometheus in the future, so I have implemented the required support accordingly. Test results using Graphite, which I used for testing purposes, show the expected data output. For Prometheus integration we have two options: using the statsd-prometheus module or developing our own converter to send aggregated data to Prometheus. I believe Prometheus is capable of handling aggregated data, which would allow us to aggregate the metrics as required.

(3 resolved review threads on karapace/karapacemetrics.py)
@aiven-anton (Contributor) commented Jun 12, 2023

My client requires support for metrics that are already supported in competitor projects. Additionally, they anticipate the option to integrate with Prometheus in the future. [...]

@libretto Aggregated metrics are out of scope for Karapace; we are not going to accept PRs that implement them, since we do not want to carry that maintenance burden. If you need 1-to-1 metrics compatibility, I'd suggest implementing this in a separate application where you can implement the exact aggregations you need.

That said, we would consider a PR that adds to Karapace a Prometheus endpoint for exposing metrics. That should probably be implemented in a separate PR though.

@aiven-anton (Contributor) left a comment:

See prior comments.

@libretto (Contributor, Author) commented Jun 12, 2023

My client requires support for metrics that are already supported in competitor projects. [...]

@libretto Aggregated metrics are out of scope for Karapace, we are not going to accept PRs that implement that since we do not want to carry that maintenance burden. [...]

That said, we would consider a PR that adds to Karapace a Prometheus endpoint for exposing metrics. [...]

@aiven-anton Are you in favor of allowing record calls to the KarapaceMetrics() class, but with the requirement of removing aggregation from the class? Does this mean you are not concerned about the performance impact of the 10-11 stats calls made per Karapace request to StatsD or Prometheus?

@aiven-anton (Contributor) commented:

Are you in favor of allowing record calls to the KarapaceMetrics() class, but with the requirement of removing aggregation from the class? [...]

What I am saying is that we should not calculate averages, maximums, and rates, and instead only expose raw metrics. This means I think the proposed metrics should be changed like this (see the sketch after this list):

  • Replace request_size_avg and request_size_max with something like request_bytes_received.
  • Replace response_byte_rate, response_size_avg and response_size_max with something like response_bytes_sent.
  • Replace request_error_rate with a counter.
  • Replace latency_max and latency_avg with something like latency_total.
  • Introduce a request counter.
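A rough sketch of that mapping, assuming the stats client exposes a StatsD-style counter helper; the increase() method name and the exact metric names are assumptions for illustration, not necessarily the existing Karapace StatsClient API:

def record_request(stats_client, request_bytes: int, response_bytes: int,
                   latency_ms: float, error: bool) -> None:
    # Raw counters only; averages, maxima, and rates are left to the metrics backend.
    stats_client.increase("request_count", 1)
    stats_client.increase("request_bytes_received", request_bytes)
    stats_client.increase("response_bytes_sent", response_bytes)
    stats_client.increase("latency_total", int(latency_ms))
    if error:
        stats_client.increase("request_error_count", 1)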

@libretto (Contributor, Author) commented Jun 16, 2023

@aiven-anton Hello, I've changed the code to send only raw statistics to the stats server where possible. Currently the data is sent to the StatsD server. Could you please review the code and let me know if this is what you were referring to?

Additionally, I'm planning to add support for Prometheus. The Prometheus client acts as an HTTP server that exposes an endpoint from which statistics are obtained periodically. Is this acceptable to you?

Furthermore, I'm considering a better approach. One option is to inherit from StatsClient, or to create an AbstractStatsClient class that defines the interface. This way both StatsClient and PrometheusStatsClient can inherit from the same class, and my KarapaceMetrics class can use it as an interface. What are your thoughts on this?

@libretto (Contributor, Author) left a comment:

Please check

@libretto (Contributor, Author) commented:

@aiven-anton I kindly ask that you review my previous comment and answer the question about the proposed Prometheus solution. Specifically, I would like to know whether I can use the pull model, where the server runs inside Karapace, or whether I need to use the push model and ask clients to run a Pushgateway between Karapace and Prometheus.

@aiven-anton (Contributor) commented:

Sorry for the delay, I appreciate the update here 👍

Additionally, I'm planning to add support for Prometheus. The Prometheus client acts as an HTTP server that provides an endpoint for obtaining statistics periodically. Is this acceptable to You?

Yes, but I strongly encourage doing this in a follow-up PR to keep the scope manageable and reviewable. And you're exactly right: with the Prometheus model we would not use a client, but expose a dedicated metrics endpoint (probably on a separate port). I definitely think this is the right approach; the Prometheus implementation should be a server that Prometheus "scrapes" from, rather than the Pushgateway you mention as an alternative.
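As an illustration of the scrape model (not code from this PR), the third-party prometheus_client package exposes such an endpoint on a separate port; the port number and metric name below are placeholders:

from prometheus_client import Counter, start_http_server

request_count = Counter("karapace_request_count", "Total number of handled requests")

# Serve /metrics on a separate port; Prometheus scrapes it on its own schedule.
start_http_server(8006)

def on_request() -> None:
    request_count.inc()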

Furthermore, I'm considering a better approach. One option is to inherit from the StatsClient or create an AbstractStatsClient class with interfaces. This way, both the StatsClient and PrometheusStatsClient can inherit from the same class, and my KarapaceMetrics class can utilize it as an interface. What are Your thoughts on this?

Yes, it sounds reasonable to have some common interface and define separate pluggable backends for StatsD and, later, Prometheus. I'd also encourage deferring the introduction of this until you work on the Prometheus implementation, as doing it beforehand is likely to end up with an abstraction that doesn't meet all requirements.
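A rough sketch of the kind of pluggable interface being discussed; the class and method names below are hypothetical, not a final design:

from abc import ABC, abstractmethod

class MetricsBackend(ABC):
    """Common interface for metric backends (StatsD now, Prometheus later)."""

    @abstractmethod
    def increase(self, metric: str, value: int = 1) -> None:
        ...

    @abstractmethod
    def gauge(self, metric: str, value: float) -> None:
        ...

class StatsdBackend(MetricsBackend):
    def __init__(self, stats_client) -> None:
        # stats_client stands in for the existing StatsD client; its method
        # names here are assumptions for the sketch.
        self.stats_client = stats_client

    def increase(self, metric: str, value: int = 1) -> None:
        self.stats_client.increase(metric, value)

    def gauge(self, metric: str, value: float) -> None:
        self.stats_client.gauge(metric, value)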

@libretto (Contributor, Author) commented:

@aiven-anton Can you check the current PR and approve it?

@libretto libretto requested a review from aiven-anton July 3, 2023 21:56
@libretto (Contributor, Author) commented:

@aiven-anton Hi, could you review this code and merge it into the main branch?

@libretto (Contributor, Author) commented:

See prior comments.

Could you review?

@jjaakola-aiven (Contributor) left a comment:

Some comments on the PR.

(Resolved review threads on README.rst, karapace/statsd.py, and karapace/karapacemetrics.py)
            return
        if not isinstance(self.stats_client, StatsClient):
            raise RuntimeError("no StatsClient available")
        self.stats_client.gauge("request-size", size)
Contributor:

Counter, name as request_size_total.

Does the use of a gauge try to emulate the behavior of a Prometheus counter? If StatsD is the backend, I think the correct type would be a counter; it does get reset to 0 on every flush. A Prometheus counter does not, so there is definitely a difference. If Prometheus support is added, I think self.stats_client would be a different implementation suitable for Prometheus. This comment applies to the other counter comments too.

@libretto (Contributor, Author):

I'm unclear about the necessity of using request_size_total when our aim is to create a graph depicting request size over time. As far as I can see, there isn't a metric that requires a counter. As far as I know, both Prometheus and StatsD have a similar gauge metric type.

Contributor:

@libretto I think gauge simply doesn't make sense here?

our aim is to create a graph depicting request size over time

The way to do that with Prometheus is to graph request_size_total / request_count; I'm unsure what makes sense with StatsD.

@libretto (Contributor, Author) commented Sep 2, 2023

In order to calculate request_size_total / request_count, we would need to store request_count and perform the calculation for request_size_total on the Karapace side. However, I noticed your earlier comment where you mentioned that this approach might not be advisable. Consequently, I modified the code this way, because both StatsD and Prometheus can visualize this data without internal calculations. This approach may be less efficient than yours, but at least it avoids any calculations within Karapace.

"""
from __future__ import annotations

from kafka.metrics import Metrics
Contributor:

Remove.

karapace/rapu.py Outdated Show resolved Hide resolved
karapace/karapacemetrics.py Outdated Show resolved Hide resolved
karapace/karapacemetrics.py Outdated Show resolved Hide resolved
Comment on lines 68 to 69
schedule.every(10).seconds.do(self.connections)
self.worker_thread.start()
Contributor:

Is the threading needed only for reading the connection count? Maybe push the logic out from this class and implement a method that would be called from the reader thread. Consider thread safety as well.

def connections(self, connections: int) -> None:
    ...
    self.stats_client.gauge("connections", connections)

@libretto (Contributor, Author):

The thread job runs every 10 seconds, which is a relatively resource-efficient approach.

Contributor:

@libretto Could you shed some further light on this question?

Is the threading needed only for reading the connection count?

@libretto (Contributor, Author):

For instance, if Karapace is handling 100 requests per second, a function that retrieves the connection count on every request would be called 100 times per second. In my solution we acquire this data every 10 seconds, which means significantly fewer calls (about 1000 times fewer). While this approach is less precise, it is effective enough for our purposes.
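For context, a minimal sketch of the polling pattern described here, using the schedule library as the PR does; get_connection_count() is a hypothetical stand-in for however the server tracks active connections:

import threading
import time

import schedule

def start_connection_metrics(stats_client, get_connection_count) -> None:
    def report() -> None:
        # One gauge update every 10 seconds instead of one call per request.
        stats_client.gauge("connections-active", get_connection_count())

    schedule.every(10).seconds.do(report)

    def worker() -> None:
        while True:
            schedule.run_pending()
            time.sleep(1)

    threading.Thread(target=worker, daemon=True, name="metrics-poller").start()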

@libretto (Contributor, Author):

@aiven-anton @jjaakola-aiven could you check?

@libretto libretto requested review from a team as code owners August 30, 2023 20:48
@@ -0,0 +1,119 @@
"""
Contributor:

issue: On the naming of this module: we're already in the karapace namespace, so let's simply name the module metrics (the resulting name would be karapace.metrics instead of repeating karapace.karapacemetrics).

@libretto (Contributor, Author):

Fixed.


@aiven-anton (Contributor) commented:

This is superseded by #711.
