Add workqueue prometheus metrics #1266
Conversation
I'm not sure yet how the queue depth will work, because I'm worried that the metrics will be scraped, say, every second, and the depth metric will fluctuate a lot to reflect the queue depth at any moment in time. I have a feeling we might need a high-water mark or something here, so the metric reflects the largest depth recorded in the last 10 seconds or so.
That looks like it'll be in a future PR, so approving this one.
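If a high-water mark does turn out to be needed, a minimal sketch of the idea could look like the following; the type and method names here are hypothetical and not part of this PR.

```go
package metrics

import "sync"

// depthHighWaterMark is a hypothetical helper that remembers the largest
// queue depth observed since the last reset, so a scrape reports the peak
// over the interval rather than an instantaneous value.
type depthHighWaterMark struct {
	mu  sync.Mutex
	max float64
}

// Observe records a depth sample, keeping the maximum seen so far.
func (h *depthHighWaterMark) Observe(depth float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if depth > h.max {
		h.max = depth
	}
}

// TakeAndReset returns the peak since the previous call and starts a new window.
func (h *depthHighWaterMark) TakeAndReset() float64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	m := h.max
	h.max = 0
	return m
}
```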
Looks good! Mostly docs comments/suggestions
I'm not sure yet how the queue depth will work, because I'm worried that the metrics will be scraped, say, every second, and the depth metric will fluctuate a lot to reflect the queue depth at any moment in time. I have a feeling we might need a high-water mark or something here, so the metric reflects the largest depth recorded in the last 10 seconds or so.
That looks like it'll be in a future PR, so approving this one.
It is common for scrape intervals to be much longer than 1 second, e.g. 15 seconds (for example, https://www.robustperception.io/keep-it-simple-scrape_interval-id ).
I think if there is a problem with processing items, the other two metrics will also show it: work_duration_seconds and queue_duration_seconds will increase. Perhaps the depth metric is more useful for real-time troubleshooting? For example, if an IC pod for some reason doesn't process changes in the cluster, admins can take a look at /metrics on the IC pod and see that the queue keeps growing and growing.
At the same time, perhaps a counter of all the added elements could help? That way it would be possible to calculate the processing rate in Prometheus based on that metric and see whether there was an influx of changes during some interval (see the sketch below).
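As a rough sketch of that idea, an adds counter could be registered alongside the other metrics; the metric name here is an assumption for illustration, not something this PR adds.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// workqueueAdds is a hypothetical counter of every item added to the queue;
// the metric name is an assumption, not part of this PR.
var workqueueAdds = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "nginx_ingress_controller_workqueue_adds_total",
	Help: "Total number of items added to the workqueue",
})

func init() {
	prometheus.MustRegister(workqueueAdds)
}

// The ingest rate could then be derived in Prometheus with something like:
//   rate(nginx_ingress_controller_workqueue_adds_total[5m])
```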
Also, right now we add to the queue changes to all endpoints in the cluster (which change frequently), even the ones the IC is not interested in. The IC processes and ignores them very quickly, but perhaps in the future we can filter them out so that those changes don't skew our queue metrics.
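A hedged sketch of what that filtering could look like at enqueue time; onEndpointsUpdate, isRelevant, and the queue wiring are assumed names, not the controller's actual code.

```go
package handlers

import "k8s.io/client-go/util/workqueue"

// onEndpointsUpdate is a hypothetical handler that drops endpoint changes
// the controller is not interested in before they ever reach the queue,
// so irrelevant churn does not skew the workqueue metrics.
func onEndpointsUpdate(obj interface{}, isRelevant func(interface{}) bool, queue workqueue.Interface) {
	if !isRelevant(obj) {
		return // irrelevant endpoints never enter the queue
	}
	queue.Add(obj)
}
```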
Force-pushed from d56737b to e3dc777
@mikestephen Are you still happy for this to be merged? #1266 (comment) seems resolved to me. And it sounds like you're happy to take a look at a high-water mark in a later PR if determined necessary?
Force-pushed from 5352358 to f0b43b0
Proposed changes
Add the following Prometheus metrics:
nginx_ingress_controller_workqueue_depth
nginx_ingress_controller_workqueue_queue_duration_seconds
nginx_ingress_controller_workqueue_work_duration_seconds
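A minimal sketch of how these three metrics could be defined and registered with the Prometheus Go client; the histogram buckets and the exact registration path are assumptions, not necessarily what this PR implements.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Current number of items waiting in the workqueue.
	workqueueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "nginx_ingress_controller_workqueue_depth",
		Help: "Current depth of the workqueue",
	})
	// How long an item waits in the queue before it is picked up.
	workqueueQueueDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "nginx_ingress_controller_workqueue_queue_duration_seconds",
		Help:    "How long an item stays in the workqueue before being processed",
		Buckets: prometheus.ExponentialBuckets(0.001, 10, 5), // assumption: 1ms to 10s
	})
	// How long processing an item takes once it has been dequeued.
	workqueueWorkDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "nginx_ingress_controller_workqueue_work_duration_seconds",
		Help:    "How long processing an item from the workqueue takes",
		Buckets: prometheus.ExponentialBuckets(0.001, 10, 5), // assumption: 1ms to 10s
	})
)

func init() {
	prometheus.MustRegister(workqueueDepth, workqueueQueueDuration, workqueueWorkDuration)
}
```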
Checklist
Before creating a PR, run through this checklist and mark each as complete.