Add workqueue prometheus metrics #1266
Conversation
I'm not sure yet how the queue depth will work, because I'm worried that the metrics will be scraped, say, every second, and the depth metric will fluctuate a lot to reflect the queue depth at any moment in time. I have a feeling we might need a high-water mark or something here, so the metric reflects the largest depth recorded in the last 10 seconds or so.
That looks like it'll be in a future PR, so approving this one.
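If a high-water mark does turn out to be needed, a minimal sketch of the idea could look like the following; the type and method names here are hypothetical and not part of this PR.

```go
package metrics

import "sync"

// depthHighWaterMark is a hypothetical helper that remembers the largest
// queue depth observed since the last reset, so a scrape reports the peak
// over the interval rather than an instantaneous value.
type depthHighWaterMark struct {
	mu  sync.Mutex
	max float64
}

// Observe records a depth sample, keeping the maximum seen so far.
func (h *depthHighWaterMark) Observe(depth float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if depth > h.max {
		h.max = depth
	}
}

// TakeAndReset returns the peak since the previous call and starts a new window.
func (h *depthHighWaterMark) TakeAndReset() float64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	m := h.max
	h.max = 0
	return m
}
```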
Looks good! Mostly docs comments/suggestions
I'm not sure yet how the queue depth will work, because I'm worried that the metrics will be scraped, say, every second, and the depth metric will fluctuate a lot to reflect the queue depth at any moment in time. I have a feeling we might need a high-water mark or something here, so the metric reflects the largest depth recorded in the last 10 seconds or so.
That looks like it'll be in a future PR, so approving this one.
It is common for scrape intervals to be much longer than 1 second, e.g. 15 seconds (for example, https://www.robustperception.io/keep-it-simple-scrape_interval-id ).
I think if there is a problem with processing items, the other two metrics will also show it: work_duration_seconds and queue_duration_seconds will increase. Perhaps the depth metric is more useful for real-time troubleshooting? For example, if an IC pod for some reason doesn't process changes in the cluster, admins can take a look at /metrics on the IC pod and see that the queue keeps growing and growing.
At the same time, perhaps a counter of all the added elements could help? That way it would be possible to calculate the processing rate in Prometheus based on that metric and see whether there was an influx of changes during some interval (see the sketch below).
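As a rough sketch of that idea, an adds counter could be registered alongside the other metrics; the metric name here is an assumption for illustration, not something this PR adds.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// workqueueAdds is a hypothetical counter of every item added to the queue;
// the metric name is an assumption, not part of this PR.
var workqueueAdds = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "nginx_ingress_controller_workqueue_adds_total",
	Help: "Total number of items added to the workqueue",
})

func init() {
	prometheus.MustRegister(workqueueAdds)
}

// The ingest rate could then be derived in Prometheus with something like:
//   rate(nginx_ingress_controller_workqueue_adds_total[5m])
```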
Also, right now we add to the queue changes to all endpoints in the cluster (which change frequently), even the ones the IC is not interested in. The IC processes and ignores them very quickly, but perhaps in the future we can filter them out so that those changes don't skew our queue metrics.
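A hedged sketch of what that filtering could look like at enqueue time; onEndpointsUpdate, isRelevant, and the queue wiring are assumed names, not the controller's actual code.

```go
package handlers

import "k8s.io/client-go/util/workqueue"

// onEndpointsUpdate is a hypothetical handler that drops endpoint changes
// the controller is not interested in before they ever reach the queue,
// so irrelevant churn does not skew the workqueue metrics.
func onEndpointsUpdate(obj interface{}, isRelevant func(interface{}) bool, queue workqueue.Interface) {
	if !isRelevant(obj) {
		return // irrelevant endpoints never enter the queue
	}
	queue.Add(obj)
}
```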
Force-pushed from d56737b to e3dc777
@mikestephen Are you still happy for this to be merged? #1266 (comment) seems resolved to me. And it sounds like you're happy to take a look at a high-water mark in a later PR if determined necessary?
Force-pushed from 5352358 to f0b43b0
Proposed changes
Add the following Prometheus metrics:
nginx_ingress_controller_workqueue_depth
nginx_ingress_controller_workqueue_queue_duration_seconds
nginx_ingress_controller_workqueue_work_duration_seconds
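A minimal sketch of how these three metrics could be defined and registered with the Prometheus Go client; the histogram buckets and the exact registration path are assumptions, not necessarily what this PR implements.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Current number of items waiting in the workqueue.
	workqueueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "nginx_ingress_controller_workqueue_depth",
		Help: "Current depth of the workqueue",
	})
	// How long an item waits in the queue before it is picked up.
	workqueueQueueDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "nginx_ingress_controller_workqueue_queue_duration_seconds",
		Help:    "How long an item stays in the workqueue before being processed",
		Buckets: prometheus.ExponentialBuckets(0.001, 10, 5), // assumption: 1ms to 10s
	})
	// How long processing an item takes once it has been dequeued.
	workqueueWorkDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "nginx_ingress_controller_workqueue_work_duration_seconds",
		Help:    "How long processing an item from the workqueue takes",
		Buckets: prometheus.ExponentialBuckets(0.001, 10, 5), // assumption: 1ms to 10s
	})
)

func init() {
	prometheus.MustRegister(workqueueDepth, workqueueQueueDuration, workqueueWorkDuration)
}
```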
Checklist
Before creating a PR, run through this checklist and mark each as complete.