
Karapace metrics #652

Closed
libretto wants to merge 26 commits

Conversation

@libretto (Contributor) commented Jun 9, 2023

This add-on extends the StatsD metrics available for monitoring the Karapace server. The following additional metrics are now included:

connections_active
request_size_avg
request_size_max
response_size_avg
response_size_max
master_slave_role
request_error_rate
response_rate
response_byte_rate
latency_max
latency_avg

The aiohttp library does not provide native support for metrics, so we gather the required data by collecting statistics from the request handler, aiohttp logs, and other parts of the Karapace code. We plan to extend this module with additional metrics.
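For illustration, a minimal sketch of the kind of request-handler instrumentation described above, written as an aiohttp middleware. The statsd_client object and its gauge()/timing() helpers, as well as the metric names, are placeholders for this sketch rather than the exact code in this PR:

import time

from aiohttp import web

def make_metrics_middleware(statsd_client):
    # statsd_client is assumed to expose gauge() and timing() helpers.
    @web.middleware
    async def metrics_middleware(request: web.Request, handler):
        start = time.monotonic()
        response = await handler(request)
        elapsed_ms = (time.monotonic() - start) * 1000
        # Raw per-request observations; aggregation is left to the metrics backend.
        statsd_client.gauge("request-size", request.content_length or 0)
        statsd_client.gauge("response-size", response.content_length or 0)
        statsd_client.timing("latency", elapsed_ms)
        return response

    return metrics_middleware

The middleware would then be registered with web.Application(middlewares=[make_metrics_middleware(client)]).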

My objective is to incorporate metrics that are similar to what can be found in competitor products. To achieve this, I am utilizing the calculation methods from kafka.metrics.Metrics class.

Karapace competitors use sampled metrics and compute avg, min, and max on the application side in order to support pull-based statistics services. This model is commonly used with platforms like Prometheus, and it is also the preferred choice for our client. At present, however, Karapace only supports StatsD. I anticipate that customers who actively rely on pull-based statistics will require support for such services, so integrating them into our metrics class would be necessary to meet their needs.

@libretto libretto requested review from a team as code owners June 9, 2023 07:17
@libretto (Contributor, Author) commented Jun 9, 2023

@tvainika I just fixed more issues and added as many metrics as possible that are similar to those in competing products. I also now use the native kafka metrics classes.

@aiven-anton (Contributor) commented:

Continuing discussion from original PR. If possible, in the future I'd ask that you please avoid closing and re-opening separate PRs like this, as it makes it quite hard to follow progress and review. If you need to "start over", please feel free to force push to your feature branch instead.


Karapace competitors use sampled metrics and perform calculations for avg, min, max on the application side in order to facilitate Pull-based statistics services. This particular model is commonly utilized in platforms like Prometheus [...]

This does not sound accurate to me. In my experience with Prometheus, it is much more common to avoid calculations on the application side and expose metrics in their raw form. Averages and other aggregations are then produced by issuing PromQL queries against the Prometheus service.

It seems reasonable to take that route here as well, and for instance expose sent_response_bytes and received_request_bytes instead. Together with a request count, that should be everything required to reliably calculate aggregations in Prometheus or similar.
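For illustration only (not code from this PR), raw counters of that kind could look as follows, here using the third-party prometheus_client package; the metric names follow the suggestion above:

from prometheus_client import Counter

# Raw, monotonically increasing counters; no averages or rates are computed in the application.
received_request_bytes = Counter("received_request_bytes", "Total bytes received in requests")
sent_response_bytes = Counter("sent_response_bytes", "Total bytes sent in responses")
request_count = Counter("request_count", "Total number of handled requests")

def observe_request(request_size: int, response_size: int) -> None:
    received_request_bytes.inc(request_size)
    sent_response_bytes.inc(response_size)
    request_count.inc()

# Average request size over the last 5 minutes is then a PromQL query, e.g.:
#   rate(received_request_bytes_total[5m]) / rate(request_count_total[5m])
# (prometheus_client appends the _total suffix to counter names on the /metrics endpoint.)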

(5 resolved review threads on karapace/karapacemetrics.py)
@libretto (Contributor, Author) commented Jun 9, 2023

Continuing discussion from original PR. If possible, in the future I'd ask that you please avoid closing and re-opening separate PRs like this [...]

This does not sound accurate to me. In my experience when using Prometheus, it's much more common to avoid calculations on the application side, and expose metrics in their raw forms. [...]

It seems reasonable to take that route here as well, and for instance rather expose sent_response_bytes, received_request_bytes. [...]

My client requires support for metrics that are already provided in competitor projects, and they also anticipate the option to integrate with Prometheus in the future, so I have implemented the required support accordingly. Test results using Graphite, which I used for testing purposes, show the expected data output. For Prometheus integration we have two options: using the statsd-prometheus module or developing our own converter to send aggregated data to Prometheus. I believe Prometheus is capable of handling aggregated data, which would allow us to aggregate the metrics as required.

(3 resolved review threads on karapace/karapacemetrics.py)
@aiven-anton (Contributor) commented Jun 12, 2023

My client requires support for metrics that are already supported in competitor projects. Additionally, they anticipate the option to integrate with Prometheus in the future. [...]

@libretto Aggregated metrics are out of scope for Karapace; we are not going to accept PRs that implement them, since we do not want to carry that maintenance burden. If you need 1-to-1 metrics compatibility, I'd suggest implementing this in a separate application where you can implement the exact aggregations you need.

That said, we would consider a PR that adds to Karapace a Prometheus endpoint for exposing metrics. That should probably be implemented in a separate PR though.

@aiven-anton (Contributor) left a comment:

See prior comments.

@libretto (Contributor, Author) commented Jun 12, 2023

My client requires support for metrics that are already supported in competitor projects. [...]

@libretto Aggregated metrics are out of scope for Karapace, we are not going to accept PRs that implement that since we do not want to carry that maintenance burden. [...]

That said, we would consider a PR that adds to Karapace a Prometheus endpoint for exposing metrics. [...]

@aiven-anton Are you in favor of allowing record calls to the KarapaceMetrics() class, but with the requirement of removing aggregation from the class? Does this mean you are not concerned about the performance impact of the 10-11 stats calls made per Karapace request to StatsD or Prometheus?

@aiven-anton (Contributor) commented:

Are you in favor of allowing record calls to the KarapaceMetrics() class, but with the requirement of removing aggregation from the class? [...]

What I am saying is that we should not calculate averages, maximums, and rates, and instead only expose raw metrics. This means I think the proposed metrics should be changed like this (see the sketch after this list):

  • Replace request_size_avg and request_size_max with something like request_bytes_received.
  • Replace response_byte_rate, response_size_avg and response_size_max with something like response_bytes_sent.
  • Replace request_error_rate with a counter.
  • Replace latency_max and latency_avg with something like latency_total.
  • Introduce a request counter.
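A rough sketch of that mapping, assuming the stats client exposes a StatsD-style counter helper; the increase() method name and the exact metric names are assumptions for illustration, not necessarily the existing Karapace StatsClient API:

def record_request(stats_client, request_bytes: int, response_bytes: int,
                   latency_ms: float, error: bool) -> None:
    # Raw counters only; averages, maxima, and rates are left to the metrics backend.
    stats_client.increase("request_count", 1)
    stats_client.increase("request_bytes_received", request_bytes)
    stats_client.increase("response_bytes_sent", response_bytes)
    stats_client.increase("latency_total", int(latency_ms))
    if error:
        stats_client.increase("request_error_count", 1)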

@libretto (Contributor, Author) commented Jun 16, 2023

@aiven-anton Hello, I've changed the code to send only raw statistics to the stats server where possible. Currently the data is sent to the StatsD server. Could you please review the code and let me know if this is what you were referring to?

Additionally, I'm planning to add support for Prometheus. The Prometheus client acts as an HTTP server that exposes an endpoint from which statistics are obtained periodically. Is this acceptable to you?

Furthermore, I'm considering a better approach. One option is to inherit from StatsClient, or to create an AbstractStatsClient class that defines the interface. This way both StatsClient and PrometheusStatsClient can inherit from the same class, and my KarapaceMetrics class can use it as an interface. What are your thoughts on this?

@libretto (Contributor, Author) left a comment:

Please check

@libretto (Contributor, Author) commented:

@aiven-anton I kindly ask that you review my previous comment and answer the question about the proposed Prometheus solution. Specifically, I would like to know whether I can use the pull model, where the server runs inside Karapace, or whether I need to use the push model and ask clients to run a Pushgateway between Karapace and Prometheus.

@aiven-anton (Contributor) commented:

Sorry for the delay, I appreciate the update here 👍

Additionally, I'm planning to add support for Prometheus. The Prometheus client acts as an HTTP server that provides an endpoint for obtaining statistics periodically. Is this acceptable to You?

Yes, but I strongly encourage doing this in a follow-up PR to keep the scope manageable and reviewable. And you're exactly right: with the Prometheus model we would not use a client, but expose a dedicated metrics endpoint (probably on a separate port). I definitely think this is the right approach; the Prometheus implementation should be a server that Prometheus "scrapes" from, rather than the Pushgateway you mention as an alternative.
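As an illustration of the scrape model (not code from this PR), the third-party prometheus_client package exposes such an endpoint on a separate port; the port number and metric name below are placeholders:

from prometheus_client import Counter, start_http_server

request_count = Counter("karapace_request_count", "Total number of handled requests")

# Serve /metrics on a separate port; Prometheus scrapes it on its own schedule.
start_http_server(8006)

def on_request() -> None:
    request_count.inc()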

Furthermore, I'm considering a better approach. One option is to inherit from the StatsClient or create an AbstractStatsClient class with interfaces. This way, both the StatsClient and PrometheusStatsClient can inherit from the same class, and my KarapaceMetrics class can utilize it as an interface. What are Your thoughts on this?

Yes, it sounds reasonable to have some common interface and define separate pluggable backends for StatsD and, later, Prometheus. I'd also encourage deferring the introduction of this until you work on the Prometheus implementation, as doing it beforehand is likely to end up with an abstraction that doesn't meet all requirements.
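A rough sketch of the kind of pluggable interface being discussed; the class and method names below are hypothetical, not a final design:

from abc import ABC, abstractmethod

class MetricsBackend(ABC):
    """Common interface for metric backends (StatsD now, Prometheus later)."""

    @abstractmethod
    def increase(self, metric: str, value: int = 1) -> None:
        ...

    @abstractmethod
    def gauge(self, metric: str, value: float) -> None:
        ...

class StatsdBackend(MetricsBackend):
    def __init__(self, stats_client) -> None:
        # stats_client stands in for the existing StatsD client; its method
        # names here are assumptions for the sketch.
        self.stats_client = stats_client

    def increase(self, metric: str, value: int = 1) -> None:
        self.stats_client.increase(metric, value)

    def gauge(self, metric: str, value: float) -> None:
        self.stats_client.gauge(metric, value)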

@libretto (Contributor, Author) commented:

@aiven-anton Can you check the current PR and approve it?

@libretto libretto requested a review from aiven-anton July 3, 2023 21:56
@libretto (Contributor, Author) commented:

@aiven-anton Hi, could you review this code and merge it into the main branch?

@libretto (Contributor, Author) commented:

See prior comments.

Could you review?

@jjaakola-aiven (Contributor) left a comment:

Some comments on the PR.

(Resolved review threads on README.rst, karapace/statsd.py, and karapace/karapacemetrics.py)
            return
        if not isinstance(self.stats_client, StatsClient):
            raise RuntimeError("no StatsClient available")
        self.stats_client.gauge("request-size", size)
Contributor:

Counter, name as request_size_total.

Does the use of a gauge try to emulate the behavior of a Prometheus counter? If StatsD is the backend, I think the correct type would be a counter; it does get reset to 0 on every flush. A Prometheus counter does not, so there is definitely a difference. If Prometheus support is added, I think self.stats_client would be a different implementation suitable for Prometheus. This comment applies to the other counter comments too.

@libretto (Contributor, Author):

I'm unclear about the necessity of using request_size_total when our aim is to create a graph depicting request size over time. As far as I can see, there isn't a metric that requires a counter. As far as I know, both Prometheus and StatsD have a similar gauge metric type.

Contributor:

@libretto I think gauge simply doesn't make sense here?

our aim is to create a graph depicting request size over time

The way to do that with Prometheus is to graph request_size_total / request_count; I'm unsure what makes sense with StatsD.

@libretto (Contributor, Author) commented Sep 2, 2023

In order to calculate request_size_total / request_count, we would need to store request_count and perform the calculation for request_size_total on the Karapace side. However, I noticed your earlier comment where you mentioned that this approach might not be advisable. Consequently, I modified the code this way, because both StatsD and Prometheus can visualize this data without internal calculations. This approach may be less efficient than yours, but at least it avoids any calculations within Karapace.

"""
from __future__ import annotations

from kafka.metrics import Metrics
Contributor:

Remove.

karapace/rapu.py Outdated Show resolved Hide resolved
karapace/karapacemetrics.py Outdated Show resolved Hide resolved
karapace/karapacemetrics.py Outdated Show resolved Hide resolved
Comment on lines 68 to 69
schedule.every(10).seconds.do(self.connections)
self.worker_thread.start()
Contributor:

Is the threading needed only for reading the connection count? Maybe push the logic out from this class and implement a method that would be called from the reader thread. Consider thread safety as well.

def connections(self, connections: int) -> None:
    ...
    self.stats_client.gauge("connections", connections)

@libretto (Contributor, Author):

The thread job runs every 10 seconds, which is a relatively resource-efficient approach.

Contributor:

@libretto Could you shed some further light on this question?

Is the threading needed only for reading the connection count?

@libretto (Contributor, Author):

For instance, if Karapace is handling 100 requests per second, a function that retrieves the connection count on every request would be called 100 times per second. In my solution we acquire this data every 10 seconds, which means significantly fewer calls (about 1000 times fewer). While this approach is less precise, it is effective enough for our purposes.
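For context, a minimal sketch of the polling pattern described here, using the schedule library as the PR does; get_connection_count() is a hypothetical stand-in for however the server tracks active connections:

import threading
import time

import schedule

def start_connection_metrics(stats_client, get_connection_count) -> None:
    def report() -> None:
        # One gauge update every 10 seconds instead of one call per request.
        stats_client.gauge("connections-active", get_connection_count())

    schedule.every(10).seconds.do(report)

    def worker() -> None:
        while True:
            schedule.run_pending()
            time.sleep(1)

    threading.Thread(target=worker, daemon=True, name="metrics-poller").start()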

@libretto (Contributor, Author):

@aiven-anton @jjaakola-aiven could you check?

@libretto libretto requested review from a team as code owners August 30, 2023 20:48
@@ -0,0 +1,119 @@
"""
Contributor:

issue: On the naming of this module: we're already in the karapace namespace, so let's simply name the module metrics (the resulting name would be karapace.metrics instead of repeating karapace.karapacemetrics).

@libretto (Contributor, Author):

Fixed.


@aiven-anton (Contributor) commented:

This is superseded by #711.
