Add and expose metrics #84
Comments
As announced on Slack, we (i.e., me and/or some of my team colleagues) would be keen on adding some initial support for Prometheus-based metrics collection. Is there anything in particular that should be included from the project maintainer's / Ark community's view? The baseline for us would probably be somewhere along the lines of collecting info about
That would allow us to set up alerts like "notify me when the last successful backup is older than x time units" and "notify me when less than x objects were considered". Implementation-wise, we usually just include the official Go Prometheus client library. So far, it has served us well. Happy to discuss design considerations and implementation details up front. |
@timoreimann I think all of those are good ideas. I'm also interested in things like
+1 to the official client lib. We're using I think it would be a good idea to see a rough sketch of how you envision integrating metrics gathering into the code base. Thanks! |
Appreciate the prompt feedback, @ncdc. I'll try to dedicate some time next week to dig into the relevant code sections and come up with a metrics master plan to post here. :-) |
I have finally found some time to look into the code and give some thought to how we could integrate Prometheus support into Ark. The first thing I believe we need to decide on is how we should scope the Prometheus library usage. Basically, there are two general approaches you can observe in the wild:
So I'd say that option 1 makes usage more accessible, while option 2 allows for a more scoped (better?) design. I see that various functions in Ark already take a fairly large number of dependencies, so it looks like the project would be willing to add another one. Let me know your thoughts on this front. The other question is how we'd design and structure the concrete metrics. My personal experience with building metrics is that trying to design every detail up front only has so much value; more often than not, I had to make adjustments as I integrated the metrics into some kind of graph or dashboard. Only then would I realize that a design decision made earlier no longer made much sense once used in a real-world scenario for the first time. Therefore, I'd suggest that once we make a decision on the scoping question, I go ahead and implement a basic prototype to see if it fits our needs. This would form the basis for a PR I would then submit, and we could use that to discuss metrics and other implementation details afterwards. What do you think? Does that make sense to you? |
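The list of the two approaches did not survive in this copy of the thread. Going by the later comments about a "global registry", one plausible reading is "package-level metrics anyone can touch" versus "metrics passed in as an explicit dependency". The sketch below illustrates that assumed contrast only. `Counter` is a tiny stdlib stand-in, not the Prometheus client API, and `Backupper` and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Counter is a minimal stand-in for a metric type (NOT the Prometheus API).
type Counter struct {
	mu sync.Mutex
	n  int
}

// Inc adds one to the counter.
func (c *Counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// Value returns the current count.
func (c *Counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// Assumed approach 1: a package-level metric, reachable from anywhere.
var globalBackups = &Counter{}

func runBackupGlobal() { globalBackups.Inc() }

// Assumed approach 2: the metric is an explicit, injected dependency.
type Backupper struct {
	backups *Counter
}

// NewBackupper wires the counter in at construction time.
func NewBackupper(c *Counter) *Backupper { return &Backupper{backups: c} }

// Run pretends to perform a backup and records it.
func (b *Backupper) Run() { b.backups.Inc() }

func main() {
	runBackupGlobal()
	fmt.Println("global:", globalBackups.Value())

	c := &Counter{} // a fresh counter per test or per component
	b := NewBackupper(c)
	b.Run()
	b.Run()
	fmt.Println("injected:", c.Value())
}
```

The trade-off matches the comment above: the global variant needs no plumbing, while the injected variant lets a test hand in its own counter and assert on it without shared state.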
@timoreimann this makes sense, thanks! FYI I did stumble across this branch where someone was already doing a POC: master...chadcatlett:stab_at_metrics. Probably worth getting @chadcatlett involved in the discussion 😄 If we do our own registry, would it be a singleton, or would we ever want or need to have multiple? Another thing to consider is that we're using gRPC-based plugins. We'll need to figure out how to get any metrics generated by plugin instances, either by somehow getting their data into the Ark server process, or by sending them to a metrics gateway or to a Prometheus server directly. Something to think about... |
@ncdc I don't think we'll need more than one registry, so it should basically be a singleton. The design options I outlined above should only affect how (easily) we can test things IMHO. gRPC-based plugins are an interesting point. I'm not too familiar with the plugin architecture in Ark. If the binaries are long-running just like the Ark server, then I'd think that having the server component pull in plugin metrics via a dedicated gRPC function on each Prometheus scrape interval could be an option. OTOH, if the binaries are invoked at certain times only and terminate again while the Ark server keeps on running, letting them submit their metrics through the Prometheus pushgateway makes more sense. AFAIK, the pushgateway doesn't come with a retention period that cleans up metrics automatically. So when in doubt, the regular pull mechanism seems preferable. @chadcatlett: I see you have already produced some code. Is there already a PR somewhere? Happy to collaborate / coordinate to get the code into Ark. :-) |
Currently, some plugins are long-lived, while others are very short. Our plan, however, is to make all plugins short-lived, as it should make it easier for us to handle how we manage plugins. My ideal setup would be to have the Ark server pull metrics from a plugin just before it terminates the plugin. I'm not sure if there's an easy way to do this with Prometheus. I stumbled across which uses a custom I'd prefer to avoid requiring setting up an external component such as the pushgateway as the sole means of gathering plugin metrics. It just doesn't feel right to me. |
Pulling metrics just before the plugin terminates sounds like a viable option if we are able to control the plugin lifecycle. |
Yes, we're 100% in control (assuming the plugin doesn't crash). |
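The "pull metrics just before terminating the plugin" idea could be sketched roughly as follows. Everything here is hypothetical: the `Plugin` interface, its `Metrics` method (standing in for a dedicated gRPC call), and the `Registry` type are illustrations, not Ark's plugin API or the Prometheus client.

```go
package main

import "fmt"

// Plugin is a hypothetical view of a short-lived plugin process.
type Plugin interface {
	Name() string
	Metrics() map[string]float64 // stand-in for a dedicated gRPC method
	Kill()
}

// Registry accumulates plugin samples keyed by "<plugin>/<metric>".
type Registry struct {
	samples map[string]float64
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{samples: map[string]float64{}}
}

// Stop collects the plugin's final metrics and only then terminates it,
// so values from successive short-lived instances add up on the server.
func (r *Registry) Stop(p Plugin) {
	for name, v := range p.Metrics() {
		r.samples[p.Name()+"/"+name] += v
	}
	p.Kill()
}

// fakePlugin simulates a plugin that processed three items.
type fakePlugin struct {
	name   string
	killed bool
}

func (f *fakePlugin) Name() string { return f.name }

func (f *fakePlugin) Metrics() map[string]float64 {
	if f.killed {
		return nil // a terminated plugin can no longer report anything
	}
	return map[string]float64{"items_processed": 3}
}

func (f *fakePlugin) Kill() { f.killed = true }

func main() {
	r := NewRegistry()
	r.Stop(&fakePlugin{name: "aws"})
	r.Stop(&fakePlugin{name: "aws"})
	fmt.Println(r.samples["aws/items_processed"])
}
```

Prefixing each sample with the plugin name is one way to keep two plugins from colliding on a metric name. As noted above, this only works while the server controls the lifecycle; a plugin that crashes takes its unreported metrics with it.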
We just came back to this today too and would be interested in its progress, and potentially in contributing some hands-on-keyboard implementation in my ever-abundant free time :) For what it's worth, I would lean towards simplicity first on this to determine what/how we want to use it, much in line with a few of your observations, @timoreimann. Would it be viable to put in the global registry without plugin consideration to start? I think this would give us the biggest bang for our buck. To your point, I could see a more generalized I haven't researched the plugin model closely enough, but I could see a somewhat simple first pass at those metrics simply being plugin-level RED metrics. This would give us visibility into the duration of plugin execution, broken out by plugin name. If it was determined that we need more granular, trace-like information for the internals of the plugin, then we could create a "proper interface" for the plugins that could allow the exposing of that data. I have some similar concerns about the PushGateway integration due to the complexity of integration and implementation it may involve. Would be happy to chat about this with anyone who is passionate about seeing it happen! |
@jrnt30 as long as we have a high-level vision of how we plan to integrate plugin process metrics into the overall metrics registry, I would be fine starting with just the Ark server. But I would hesitate to use the global registry if we can identify up front that it's impossible to integrate metrics from external sources into it easily. |
As a user I'd love a topline metric for each schedule that just counts backup successes and failures. I just discovered, only by stumbling upon it, that our backups have been failing for weeks. |
I'm not familiar enough with the Ark plugin model or the Prometheus client to weigh in on those topics. That said, I think it's important to differentiate between the consumers of Ark metrics. @ncdc - as an Ark developer, you likely care about visibility and performance metrics of Ark internals based on the environment it's running in (e.g. how many objects to back up, and how long it takes). A plugin developer will likely want their own metrics as well, focused on their development process. These metrics need to be namespaced so that two plugins don't collide on metric names (namespacing could be via a different metric name, or via labels). Ark users, however, probably care more about RED metrics (Rate - how many backups have been taken, Errors - how many backups failed, Duration - how long it took for backups to complete). @SleepyBrett also makes an excellent point that we may want to have metrics labeled by backup name and/or schedule name. Quick pass at metrics & types, feedback welcome
|
Thumbs up for the RED metrics. |
RED metrics are something I'm looking for as well, I want to know if backups are failing most importantly, but all the other things mentioned would be a plus. |
+1 |
Hoping that PR 531 will channel the discussion here and get us the metrics that we want and add more metrics in the future. Keeping the approach simple and as non-intrusive as possible. Feel free to leave feedback on the PR. I will continue to add the RED metrics that seem to be most useful for cluster operators. |
@ncdc You can assign this issue to me as I am working on the PR anyway. I am unable to do it myself. |
Assigned! |
Added another metric to my list:
|
Add various metrics using Prometheus.