[Feature] Improvements for ETCD Druid to accommodate compaction job better with the compaction dashboard. #648
Labels: kind/enhancement, status/closed
Feature (What you would like to be added):
Three short-term improvements have been identified in ETCD Druid so that compaction metrics are more meaningful for the compaction dashboard:
1. Compaction jobs created by ETCD Druid follow an awkward naming convention. The current format is `<ETCD UID>-compact-job`, so the pods created by the job get names like `<ETCD UID>-compact-job-<POD UID>`. Because of this, the regex expression required to scrape pod resource usage needs a `*` both in front of and after the string `compact-job`, taking the form `*-compact-job-*`. If druid instead named compaction jobs `compact-job-<ETCD UID>`, the regex would need only a `*` at the end of the string `compact-job`.
2. Currently, we update the metric `metricsJobCurrent` only when we create a new compaction job in druid and delete the earlier successful/failed compaction job, so the dashboard does not project an accurate completion time for a compaction job. We create a new job only when the difference between the delta snapshot revision and the full snapshot revision crosses a threshold configured by the end user, and that is also the only point at which we update `metricsJobCurrent`, if the last job is completed. To capture `metricsJobCurrent` accurately, we should update it every time we check the difference between the delta snapshot revision and the full snapshot revision, though it cannot be exact because we are not actively monitoring the compaction job. See [BUG] Make the metric metricsjobcurrent capture accurate job end time #685.
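The proposed flow could look roughly like the sketch below. The gauge type, function name, and threshold handling are illustrative assumptions, not druid's actual API; the point is that the metric is refreshed on every revision-difference check, not only when a new job is created:

```go
package main

import "fmt"

// gauge stands in for the Prometheus gauge behind metricsJobCurrent.
type gauge struct{ value float64 }

func (g *gauge) Set(v float64) { g.value = v }

var metricsJobCurrent gauge

// checkRevisionDelta is a hypothetical reconcile step: it refreshes the
// metric on every check, and reports whether a new compaction job should
// be created.
func checkRevisionDelta(deltaRev, fullRev, threshold int64, jobRunning bool) bool {
	// Proposed: always reflect the current job state first.
	if jobRunning {
		metricsJobCurrent.Set(1)
	} else {
		metricsJobCurrent.Set(0)
	}
	// A new compaction job is created only when the revision difference
	// crosses the user-configured threshold and no job is running.
	return deltaRev-fullRev >= threshold && !jobRunning
}

func main() {
	fmt.Println(checkRevisionDelta(2500, 1000, 1000, false)) // threshold crossed, no job running
	fmt.Println(metricsJobCurrent.value)
}
```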

3. The `metricsJobDuration` histogram shows the average duration of compaction jobs, but the duration ranges are captured wrongly in the dashboard graph. As the attached image shows, we do not get a clear idea of the average job duration because all durations fall into the 10s-to-+Inf bucket. The image also shows unnecessary breakups in the millisecond range where no job duration is posted. This happens because we have not set bucket boundaries for the `metricsJobDuration` histogram in ETCD Druid. Set the expected bucket boundaries for the histogram `metricJobDurationSeconds` so that the compaction job dashboard shows a proper and meaningful breakup of compaction job durations on the left Y axis of the graph.
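For illustration, exponential buckets in the minutes range would spread compaction jobs across meaningful boundaries instead of one 10s-to-+Inf bucket. The helper below mirrors the shape of `prometheus.ExponentialBuckets`; the concrete boundaries for `metricJobDurationSeconds` are an assumption, not a decided value:

```go
package main

import "fmt"

// exponentialBuckets returns 'count' histogram upper bounds starting at
// 'start', each 'factor' times the previous (a stand-in for
// prometheus.ExponentialBuckets from client_golang).
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Example boundaries: 60s, 120s, 240s, ... so multi-minute compaction
	// jobs no longer all collapse into the last bucket.
	fmt.Println(exponentialBuckets(60, 2, 8))
}
```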
Additionally, run a `sleep` command for 60 seconds after the compaction job finishes uploading the backup, so that even when the upload finishes at its fastest, Prometheus gets enough time to capture all network activity.

Motivation (Why is this needed?):
To present the compaction job dashboard in a more meaningful way to operators.
Approach/Hint to implement the solution (optional):
Make the necessary changes in code for compaction job creation in ETCD Druid.