feat: More performance metrics #4823
Conversation
Signed-off-by: Simon Behar <simbeh7@gmail.com>
I did something similar to this recently, but I took a slightly different approach. I created a new queue:
metrics.NewWorkQueueWithMetrics(workqueue.NewNamedRateLimitingQueue(&fixedItemIntervalRateLimiter{}, "workflow_queue"), "workflow_queue")
I don't think there's much in it mind you. Just a thought.
@simster7 I've got a few metrics we might want to do at some point:
I did a PoC for these, but 3 and 4 are a bit complex and I don't know if we need both.
Sure, I'll look into this.
K8s API metrics: not sure it should be our responsibility to emit K8s metrics. Shouldn't K8s have an endpoint for these already? K8s emits metrics already: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/

Count of workflow pods by phase (similar impl to count of workflows): we currently have only one metric, for active pods (i.e. phase == Running). I had actually first set the metric for all phases, but that would require changing our Pod informer from looking only at not-completed pods to looking at all pods. If we think this is worth it, I can work to ensure that changing the nature of the informer doesn't have any unintended consequences.

Count of workflows that actually have running pods (by adding a condition): this seems to be tied more to the refactor mentioned in #4824. This should perhaps be changed with that.
I think some tests are probably in order for this PR, please.
workflow/metrics/metrics.go
@@ -205,6 +213,31 @@ func (m *Metrics) CronWorkflowSubmissionError() {
	m.errors[ErrorCauseCronWorkflowSubmissionError].Inc()
}

func (m *Metrics) WorkerBusy(workerType string) {
	m.mutex.Lock()
I think these mutexes may create contention in the VM - doesn't the gauge already have these?
Sorry, I'm not too sure what you mean. These mutexes are needed to prevent concurrency issues between writing/updating metrics and metrics being scraped. While individual metrics may be concurrency-safe, the collection of metrics and the metrics scraper are not without these locks. They are present in all the other functions in this file.
Can you take a look at https://github.com/argoproj/argo/pull/4811/files#diff-2721f0228996fd51c5f9a8db168a353b398d051154c82cebb03f03ddd1ee0574? I think it gives a way to do the workers busy metric very safely.
Interesting. @jessesuen are you aware of this?
I think pods that are pending are useful. I think you can do all incomplete pods.
This is harder and I'm not 100% sure this is a valuable metric. Let's hold off on it.
Some minor comments - maybe when you merge, give a more specific commit message than the PR currently has?
return map[v1.PodPhase]prometheus.Gauge{
	v1.PodPending: prometheus.NewGauge(getOptsByPhase(v1.PodPending)),
	v1.PodRunning: prometheus.NewGauge(getOptsByPhase(v1.PodRunning)),
	//v1.PodSucceeded: prometheus.NewGauge(getOptsByPhase(v1.PodSucceeded)),
delete comments?
Closes: #4800
Signed-off-by: Simon Behar <simbeh7@gmail.com>
Checklist: