[Feature] ☂️ Monitor compaction jobs running on shoot control planes #610

abdasgupta · 2023-06-05T07:45:34Z

Feature (What you would like to be added):
Druid runs in a different namespace than the shoot control plane, but the compaction jobs it triggers run in the shoot control plane, so it is not straightforward to collect the compaction job metrics and build a dashboard from them. Several Prometheus instances are involved, each of which has to collect the metrics and forward them to the next. The compaction metrics need to be routed so that they ultimately reach the Prometheus running in the shoot control plane. Only then can the dashboards running in shoot control planes consume them.

As Druid runs in the Garden namespace, the Cache Prometheus can collect the Druid controller metrics, i.e. the compaction metrics. The control plane Prometheus can then federate those metrics along with the cAdvisor metrics for the compaction job. From the metrics scraped by the control plane Prometheus, we can select the shoot-specific compaction job metrics and show a dashboard for a particular shoot.
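
A minimal Python sketch of the per-shoot selection step, for illustration only. The `Sample` type, the `etcddruid_compaction_` metric name prefix, the `etcd_namespace` label, and the `shoot--foo--bar` namespace are all assumptions introduced here; this issue does not state the actual metric or label names, so they should be checked against the linked Druid changes.

```python
from typing import Dict, Iterable, List, NamedTuple


class Sample(NamedTuple):
    """One scraped metric sample (hypothetical shape, for illustration)."""
    name: str
    labels: Dict[str, str]
    value: float


def shoot_compaction_samples(
    samples: Iterable[Sample],
    shoot_namespace: str,
    prefix: str = "etcddruid_compaction_",   # assumed metric name prefix
    namespace_label: str = "etcd_namespace",  # assumed label carrying the shoot namespace
) -> List[Sample]:
    """Keep only compaction samples that belong to one shoot control plane."""
    return [
        s for s in samples
        if s.name.startswith(prefix)
        and s.labels.get(namespace_label) == shoot_namespace
    ]


if __name__ == "__main__":
    scraped = [
        Sample("etcddruid_compaction_jobs_total", {"etcd_namespace": "shoot--foo--bar"}, 3.0),
        Sample("etcddruid_compaction_jobs_total", {"etcd_namespace": "shoot--foo--baz"}, 7.0),
        Sample("container_cpu_usage_seconds_total", {"etcd_namespace": "shoot--foo--bar"}, 1.5),
    ]
    for sample in shoot_compaction_samples(scraped, "shoot--foo--bar"):
        print(sample.name, sample.labels, sample.value)
```

In a real setup this selection would more likely be expressed as a label matcher in the federation configuration or in the dashboard queries; the function above only makes the filtering logic explicit.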

To further enhance the visualization of compaction metrics, we can also create a dashboard in the seed that shows aggregated compaction job performance.

In my first comment, I attached an image shared by @istvanballok and @rickardsjp to better understand the flow.

Motivation (Why is this needed?):
Druid triggers a compaction job once the number of delta events in the control plane ETCD crosses a certain threshold. A compaction job compacts the delta events that have accumulated in object storage and creates a full snapshot from them. These jobs can be heavy at times, so we need proper monitoring for the jobs running in each shoot control plane.

Approach/Hint to implement the solution (optional):

Collect the metrics for compaction in Druid: Added support for compaction metrics in druid. #569
Expose the metrics through Druid deployment in gardener #8014
Let Cache Prometheus scrape compaction metrics: [Feature] Scrape compaction metrics available from druid controller by Cache Prometheus #622
Let Control plane prometheus federate cache prometheus for compaction metrics: [Feature] Federate compaction metrics to Control Plane prometheus from Cache Prometheus #626
Create grafana dashboard for better visualizing the Compaction metrics: [Feature] Dashboard for ETCD Compaction job #504
Short term improvements for ETCD Druid to make dashboard useful for operators: [Feature] Improvements for ETCD Druid to accommodate compaction job better with the compaction dashboard. #648
Fix a bug for compaction job not working with local storage: [BUG] Compaction job does not work with local storage #709
Create alerts based on some of the compaction metrics: [Feature] Alerts for the compaction job metrics #603
Enhancement: Dashboard for aggregated compaction jobs running in a seed based on Cache metrics.

abdasgupta · 2023-06-05T07:50:05Z

abdasgupta added the kind/enhancement Enhancement, improvement, extension label Jun 5, 2023

shreyas-s-rao assigned abdasgupta Jun 5, 2023

shreyas-s-rao mentioned this issue Oct 17, 2023

Multi-Node/Clustered ETCD #107

Closed

34 tasks

ishan16696 assigned renormalize and unassigned abdasgupta Jun 10, 2024
