Successful Workflow metrics -- misconfigured dashboard #13616
I encountered a phenomenon that doesn't make sense to me. Yesterday I added a new workflow. It is very short-lived (~30s) and is executed 4 times every minute. For about 12h everything seemed to work fine. Then I looked at this: This is the number of successful workflows. The lines on top (at ~60) are the new workflows I added. At some point this number slowly goes down, implying that at least some of the workflow runs didn't happen. However, there are no errors. This can only happen in 1 of 2 scenarios:
If I look at the data, everything is there and complete, which means the workflows were running. So it can only be a problem with the metrics giving me false readings. But why all of a sudden? A problem with the monitoring server? Unlikely, because that would show for all workflows, not just the new ones. Also, I am seeing this in the logs of the workflow controller, a lot:
Some of those warnings are related to the new workflow, some to existing ones (not affected by this phenomenon). So I'm not even sure whether this is related. There don't seem to be any errors in the logs.
These are still fairly small numbers. Plus, I'm not even done yet; I'm roughly halfway through, so the number of workflows is going to double in the end. So I was asking myself:
I'm really not sure what to make of this. Until I can figure that out, I don't really want to put more workload into that system.
Replies: 1 comment 4 replies
You don't tell us what metric you are collecting. I assume you're collecting argo_workflows_count for This changes in 3.6 with argo_workflows_total_count which is a true counter and should do what you want.
I might be ready for a vacation ...
Sorry that I bothered you with this ticket. After some digging (looking at the data, the messages, the logs, the raw metrics data, ... everything), in the end it came down to a badly picked range function for the metrics visualization in Grafana. When we use increase(...) instead of changes(...), everything looks exactly as it is supposed to, and the metrics visualization now matches what's really going on.
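For anyone landing here with the same symptom, here is a rough Python sketch of why the two range functions can disagree. It uses simplified models of PromQL's changes() and increase(), not Prometheus's actual implementation (real increase() also extrapolates to the window edges), and the sample values are made up for illustration. The idea: changes() counts how many scrapes saw a different value, so several completions between two scrapes count as a single change, while increase() adds up how much the value actually grew.

```python
from typing import Sequence


def changes(samples: Sequence[float]) -> int:
    """Simplified model of PromQL changes(): count how often the value
    differs from the previous sample inside the window."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def increase(samples: Sequence[float]) -> float:
    """Simplified model of PromQL increase(): total growth across the
    window, treating any drop as a counter reset. No extrapolation."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value
        # itself is the growth since the reset.
        total += (cur - prev) if cur >= prev else cur
    return total


# Hypothetical scrapes of a "completed workflows" counter.
window = [0, 1, 3, 4, 4, 6]
print(changes(window))   # 4: four scrapes saw a new value
print(increase(window))  # 6.0: six workflows actually completed
```

In this toy window two scrapes each saw two completions at once, so changes() reports 4 while increase() reports 6. That is the kind of undercount that can show up once short-lived workflows finish more often than the metric is scraped.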