[Feature] Improvements for ETCD Druid to accommodate compaction job better with the compaction dashboard. #648
Labels: kind/enhancement, status/closed
Feature (What you would like to be added):
Three short-term improvements have been identified in ETCD Druid so that compaction metrics are more meaningful for the compaction dashboard:
1. Compaction jobs created by ETCD Druid follow an awkward naming convention. The current format is `<ETCD UID>-compact-job`, so the pods created by the job get names like `<ETCD UID>-compact-job-<POD UID>`. Because of this, the regex expression required to scrape pod resource usage needs a `*` both in front of and after the string `compact-job`, taking the form `*-compact-job-*`. If druid instead named compaction jobs `compact-job-<ETCD UID>`, the regex would need only a `*` at the end of the string `compact-job`.
2. Currently, we update the metric `metricsJobCurrent` only when we create a new compaction job in druid and delete the earlier successful/failed compaction job, so the dashboard does not project an accurate completion time for a compaction job. We create a new job only when the difference between the delta snapshot revision and the full snapshot revision crosses a threshold configured by the end user, and that is also the only point at which we update `metricsJobCurrent`, if the last job is completed. To capture `metricsJobCurrent` accurately, we should update it every time we check the difference between the delta snapshot revision and the full snapshot revision, though it cannot be exact because we are not actively monitoring the compaction job. See [BUG] Make the metric metricsjobcurrent capture accurate job end time #685.
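The proposed flow could look roughly like the sketch below. The gauge type, function name, and threshold handling are illustrative assumptions, not druid's actual API; the point is that the metric is refreshed on every revision-difference check, not only when a new job is created:

```go
package main

import "fmt"

// gauge stands in for the Prometheus gauge behind metricsJobCurrent.
type gauge struct{ value float64 }

func (g *gauge) Set(v float64) { g.value = v }

var metricsJobCurrent gauge

// checkRevisionDelta is a hypothetical reconcile step: it refreshes the
// metric on every check, and reports whether a new compaction job should
// be created.
func checkRevisionDelta(deltaRev, fullRev, threshold int64, jobRunning bool) bool {
	// Proposed: always reflect the current job state first.
	if jobRunning {
		metricsJobCurrent.Set(1)
	} else {
		metricsJobCurrent.Set(0)
	}
	// A new compaction job is created only when the revision difference
	// crosses the user-configured threshold and no job is running.
	return deltaRev-fullRev >= threshold && !jobRunning
}

func main() {
	fmt.Println(checkRevisionDelta(2500, 1000, 1000, false)) // threshold crossed, no job running
	fmt.Println(metricsJobCurrent.value)
}
```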

3. The `metricsJobDuration` histogram shows the average duration of compaction jobs, but the duration ranges are captured wrongly in the dashboard graph. As the attached image shows, we do not get a clear idea of the average job duration because all durations fall into the 10s-to-+Inf bucket. The image also shows unnecessary breakups in the millisecond range where no job duration is posted. This happens because we have not set bucket boundaries for the `metricsJobDuration` histogram in ETCD Druid. Set the expected bucket boundaries for the histogram `metricJobDurationSeconds` so that the compaction job dashboard shows a proper and meaningful breakup of compaction job durations on the left Y axis of the graph.
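For illustration, exponential buckets in the minutes range would spread compaction jobs across meaningful boundaries instead of one 10s-to-+Inf bucket. The helper below mirrors the shape of `prometheus.ExponentialBuckets`; the concrete boundaries for `metricJobDurationSeconds` are an assumption, not a decided value:

```go
package main

import "fmt"

// exponentialBuckets returns 'count' histogram upper bounds starting at
// 'start', each 'factor' times the previous (a stand-in for
// prometheus.ExponentialBuckets from client_golang).
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Example boundaries: 60s, 120s, 240s, ... so multi-minute compaction
	// jobs no longer all collapse into the last bucket.
	fmt.Println(exponentialBuckets(60, 2, 8))
}
```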
Additionally, run a `sleep` command for 60 seconds after the compaction job finishes uploading the backup, so that even when the upload finishes at its fastest, Prometheus gets enough time to capture all network activity.

Motivation (Why is this needed?):
To present the compaction job dashboard in a more meaningful way to operators.
Approach/Hint to implement the solution (optional):
Make the necessary changes in code for compaction job creation in ETCD Druid.