[Feature] Improvements for ETCD Druid to accommodate compaction job better with the compaction dashboard. #648

Closed
4 tasks done
abdasgupta opened this issue Jul 25, 2023 · 3 comments · Fixed by gardener/gardener#8607
Labels
kind/enhancement Enhancement, improvement, extension status/closed Issue is closed (either delivered or triaged)

abdasgupta commented Jul 25, 2023

Feature (What you would like to be added):
Four short-term improvements have been identified in ETCD Druid so that compaction metrics are more meaningful for the compaction dashboard.

  • Compaction jobs created by ETCD Druid follow an awkward naming convention. The current format is <ETCD UID>-compact-job, so the pods created by the job have names like <ETCD UID>-compact-job-<POD UID>. As a result, the regular expression needed to scrape pod resource usage must include a wildcard both before and after the string compact-job, taking the form *-compact-job-*. If Druid instead names the compaction job compact-job-<ETCD UID>, the regular expression needs a wildcard only after the string compact-job.

  • Currently, we update the metric metricsJobCurrent only when we create a new compaction job in Druid and delete the earlier successful or failed compaction job. As a result, the dashboard shows an inaccurate completion time for a compaction job. We create a new job only when the difference between the delta snapshot revision and the full snapshot revision crosses a threshold configured by the end user, and that is also the only time we update metricsJobCurrent if the last job has completed. To capture metricsJobCurrent more accurately, we should update it every time we check the difference between the delta snapshot revision and the full snapshot revision. Even then, we can't be exactly accurate with metricsJobCurrent, as we are not actively monitoring the compaction job. [BUG] Make the metric metricsjobcurrent capture accurate job end time #685

  • Currently, the metricsJobDuration histogram shows the average duration of compaction jobs, but the range of average durations is captured wrongly in the dashboard graph. As the attached image shows, we don't get a clear idea of the average job duration because all the jobs fall into the 10s to +Inf bucket. The image also shows unnecessary sub-second buckets in which no job duration is recorded. This happens because we have not set bucket boundaries for the metricsJobDuration histogram in ETCD Druid. So, set the expected bucket boundaries for the histogram metricJobDurationSeconds so that the compaction job dashboard shows a proper and meaningful breakdown of compaction job durations on the left Y axis of the graph.
    image

  • Run a sleep command for 60 seconds after the compaction job finishes uploading the backup, so that even when the upload finishes very quickly, Prometheus has enough time to capture all network activity.
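
The effect of the naming change in the first item can be sketched as follows. This is only an illustration in Python, not Druid code: the helper names, the UID `1234abcd`, and the pod suffix `x7k2p` are all hypothetical, and the shell-style `*` from the description is written as `.*` because Python's `re` module needs that form.

```python
import re


def old_job_name(etcd_uid: str) -> str:
    """Current convention from the issue: <ETCD UID>-compact-job."""
    return f"{etcd_uid}-compact-job"


def new_job_name(etcd_uid: str) -> str:
    """Proposed convention from the issue: compact-job-<ETCD UID>."""
    return f"compact-job-{etcd_uid}"


def pod_name(job_name: str, pod_suffix: str) -> str:
    """Pods created by a job carry the job name plus a suffix."""
    return f"{job_name}-{pod_suffix}"


# Old names need a wildcard on both sides of "compact-job".
OLD_POD_PATTERN = re.compile(r".*-compact-job-.*")
# New names need a wildcard only at the end.
NEW_POD_PATTERN = re.compile(r"compact-job-.*")

etcd_uid = "1234abcd"  # hypothetical ETCD UID
pod_suffix = "x7k2p"   # hypothetical pod suffix

old_pod = pod_name(old_job_name(etcd_uid), pod_suffix)
new_pod = pod_name(new_job_name(etcd_uid), pod_suffix)

print(old_pod, bool(OLD_POD_PATTERN.fullmatch(old_pod)))
print(new_pod, bool(NEW_POD_PATTERN.fullmatch(new_pod)))
```

With the proposed convention, the fixed prefix `compact-job-` anchors the match, so the pattern no longer depends on what precedes it.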

Motivation (Why is this needed?):
To present the compaction job dashboard to operators in a more meaningful way.

Approach/Hint to the implement solution (optional):
Make the necessary changes in the compaction job creation code in ETCD Druid.
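
For the histogram item, the following Python sketch shows why explicit bucket boundaries matter. It assumes that the unset boundaries fall back to the Prometheus client's default buckets (0.005s to 10s); the custom boundaries and the sample durations are hypothetical values chosen for illustration, not ones proposed in this issue. For readability it counts observations per bucket, whereas real Prometheus histogram buckets are cumulative.

```python
import math
from bisect import bisect_left

# Assumption: default bucket upper bounds of the Prometheus client, in seconds.
DEFAULT_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

# Hypothetical boundaries sized for jobs that run for seconds to minutes.
CUSTOM_BUCKETS = [30, 60, 120, 300, 600, 1200]


def bucket_counts(durations, upper_bounds):
    """Count observations per bucket; the last bucket is +Inf."""
    bounds = list(upper_bounds) + [math.inf]
    counts = [0] * len(bounds)
    for d in durations:
        # Pick the first bucket whose upper bound is >= d ("le" semantics).
        counts[bisect_left(bounds, d)] += 1
    return dict(zip(bounds, counts))


# Hypothetical compaction job durations in seconds.
durations = [45, 90, 150, 400, 700]

default_counts = bucket_counts(durations, DEFAULT_BUCKETS)
custom_counts = bucket_counts(durations, CUSTOM_BUCKETS)

print("default:", default_counts)
print("custom: ", custom_counts)
```

With the assumed defaults, every sample lands in the +Inf bucket, which matches the symptom described above; with boundaries in the range of real job durations, the samples spread across distinct buckets.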

@abdasgupta abdasgupta added the kind/enhancement Enhancement, improvement, extension label Jul 25, 2023
@abdasgupta
Contributor Author

cc @istvanballok

shreyas-s-rao commented Oct 10, 2023

@abdasgupta I see that the second task in the issue, "To accurately capture metricsJobCurrent, we should actually update it everytime when we check the difference between delta snapshot revision and full snapshot revision", is still not complete. Can we please keep this issue open until then, so that we can track it and not lose it?

@abdasgupta abdasgupta reopened this Oct 10, 2023
@shreyas-s-rao
Contributor

Resolved as #685 is done.
/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Nov 19, 2023
@shreyas-s-rao shreyas-s-rao added this to the v0.21.0 milestone Nov 19, 2023