
High-cardinality Prometheus metric ingress_controller_ssl_expire_time_seconds #2773

Closed
bboreham opened this issue Jul 12, 2018 · 9 comments · Fixed by #3885

Comments

@bboreham
Contributor

Hi, we don't run the nginx ingress ourselves, but we do run a hosted Prometheus service, which brings us into contact.

We observe that a user with hundreds of sites and dozens of ingress controllers will have tens of thousands of ingress_controller_ssl_expire_time_seconds metrics, because all the ingress controllers report all the sites. Prometheus guidance is to avoid high-cardinality metrics.

Reading the code, it appears the intention of this metric is to alert humans that a certificate renewal is necessary, so duplicating the same information from every controller is unnecessary.

I see various related work ongoing in issues and PRs, so I posted this to see what you think.
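
To make the arithmetic concrete: 40 controller replicas each exporting a series for 500 hosts gives 20,000 series for this one metric name. Below is a minimal, hypothetical sketch (not the ingress-nginx source; the label set is assumed) of a per-host gauge that every replica registers and populates, which is the pattern that produces hosts × replicas series.

```go
// Hypothetical sketch, not the ingress-nginx implementation: a gauge keyed by
// "host" that every controller replica populates for every configured site.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var sslExpireTime = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "ingress_controller_ssl_expire_time_seconds",
		Help: "Unix timestamp at which the host's certificate expires (sketch).",
	},
	[]string{"host"}, // one child series per host, per scraped replica
)

func main() {
	prometheus.MustRegister(sslExpireTime)

	// Every replica walks the full host list, so with C replicas Prometheus
	// scrapes the same host series C times, differing only by pod/target labels.
	for _, host := range []string{"a.example.com", "b.example.com", "c.example.com"} {
		sslExpireTime.WithLabelValues(host).Set(1735689600) // example expiry timestamp
	}

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```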

@JordanP
Contributor

JordanP commented Jul 12, 2018

We have a similar issue with the upstream_ip label. Our upstreams have ~10 servers each but we run ~40 Nginx nodes...

@aledbf
Member

aledbf commented Jul 12, 2018

@bboreham this is already fixed in #2726. Once we merge that PR, we will release a new version with the fix.
In particular, we changed ingress_controller_ssl_expire_time_seconds to report a value only for hostnames that have an SSL certificate configured.

Please also check #2771
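
As a rough illustration of the change described above (the types and field names here are assumptions for the sketch, not the actual #2726 diff), hosts without a configured certificate would simply produce no series:

```go
// Hypothetical sketch of "report only hosts with an SSL certificate
// configured"; types and field names are assumed, not the #2726 diff.
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// server is a simplified, assumed stand-in for a server block the controller
// knows about; it is not a type from the ingress-nginx code base.
type server struct {
	Hostname   string
	CertExpiry time.Time // zero value: no SSL certificate configured
}

var sslExpireTime = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "ingress_controller_ssl_expire_time_seconds",
		Help: "Unix timestamp at which the host's certificate expires (sketch).",
	},
	[]string{"host"},
)

// setSSLExpireMetrics exports the expiry gauge only for hosts that actually
// have a certificate; hosts without one produce no series at all.
func setSSLExpireMetrics(servers []server) {
	for _, s := range servers {
		if s.CertExpiry.IsZero() {
			continue // no certificate configured for this host: skip it
		}
		sslExpireTime.WithLabelValues(s.Hostname).Set(float64(s.CertExpiry.Unix()))
	}
}

func main() {
	prometheus.MustRegister(sslExpireTime)
	setSSLExpireMetrics([]server{
		{Hostname: "with-cert.example.com", CertExpiry: time.Now().AddDate(0, 3, 0)},
		{Hostname: "plain-http.example.com"}, // skipped: no series exported
	})
}
```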

@aledbf
Member

aledbf commented Jul 12, 2018

We have a similar issue with the upstream_ip label

@JordanP we removed that label in #2726; it is really the source of the high cardinality.

@bboreham
Contributor Author

Thanks; I did read through #2726, but I did not see anything that avoids reporting the same metric across all ingress controllers.

I can confirm that there exists an installation with >50,000 ingress_controller_ssl_expire_time_seconds time series; I don't have access to the details to know how many of them have SSL certificates, but wouldn't that be the default in modern web hosting?

@aledbf
Member

aledbf commented Jul 12, 2018

I did not see anything to avoid reporting the same metric across all ingress controllers.

I am sorry, I didn't address this in my comment.

You are right, this is not fixed in #2726, because it is impossible in the current deployment model. Right now we have multiple components in the same k8s deployment: the k8s controller and nginx, launched from Go. The metric collector lives in the Go binary and gathers information about Ingress objects, the nginx process, nginx status, and the Go runtime. This means that when you scale the deployment to, say, 5 replicas, you will see the same metrics five times over, with one difference: the controller_pod label.

I plan to refactor the way the controller works, splitting the Go controller and nginx into separate deployments.

This provides several improvements:

  • removes the time windows during which replicas running on different nodes serve different configurations
  • reduces the number of connections to the API server
    (the async nature of the informers could introduce this issue if the node where the pod is running has latency issues)
  • only nginx needs to be scaled, not the ingress controller
  • the nginx deployment only updates the configuration
  • centralizes the metric gathering (this will fix the issue you mentioned; see the sketch below)

Edit: all that said, you will see a drastic reduction in the metric count after #2726.
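
A hypothetical sketch of the "centralize the metric gathering" idea from the list above (not a design taken from any of the linked PRs): a single control-plane process implements prometheus.Collector and is the only scrape target for certificate expiry, so the series count stays at one per host regardless of how many nginx replicas run.

```go
// Hypothetical sketch of a centralized certificate-expiry collector; only one
// replica of this process would be scraped for this metric, so scaling nginx
// does not multiply the series. Names and data source are assumptions.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type sslExpiryCollector struct {
	desc *prometheus.Desc
	// expiries stands in for the controller's certificate store.
	expiries func() map[string]time.Time // host -> certificate expiry
}

func (c *sslExpiryCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

func (c *sslExpiryCollector) Collect(ch chan<- prometheus.Metric) {
	// One series per host, produced in exactly one place.
	for host, exp := range c.expiries() {
		ch <- prometheus.MustNewConstMetric(
			c.desc, prometheus.GaugeValue, float64(exp.Unix()), host)
	}
}

func main() {
	c := &sslExpiryCollector{
		desc: prometheus.NewDesc(
			"ingress_controller_ssl_expire_time_seconds",
			"Unix timestamp at which the host's certificate expires (sketch).",
			[]string{"host"}, nil),
		expiries: func() map[string]time.Time {
			// Placeholder data; the real controller would read its certificate store.
			return map[string]time.Time{"a.example.com": time.Now().AddDate(0, 2, 0)}
		},
	}
	prometheus.MustRegister(c)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":10254", nil))
}
```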

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 10, 2018
@aledbf removed the lifecycle/stale label on Oct 10, 2018
@hairyhenderson

hairyhenderson commented Dec 5, 2018

I'm experiencing a similar issue to this in my multi-tenant environment, where each nginx-ingress reports many different host label values. We're actually not interested in the host label at that level, so I'd like to simply remove it.

I understand that host is probably useful for some users, though, so is it possible to configure which labels are used? I think it would be a relatively straightforward case of allowing the requestTags variable (https://github.com/kubernetes/ingress-nginx/tree/master/internal/ingress/metric/collectors/socket.go#L82..L92) to be configured externally, maybe through an annotation or a command-line flag?
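
A rough sketch of the flag idea, assuming a made-up flag name and an illustrative default label list (the real list is the requestTags variable linked above): labels named on the command line are dropped before the metric vectors are built.

```go
// Hypothetical sketch of a configurable socket-collector label set; the flag
// name, default label list, and metric name are illustrative only.
package main

import (
	"flag"
	"fmt"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

// defaultRequestTags stands in for the requestTags list in socket.go.
var defaultRequestTags = []string{"host", "status", "method", "path", "namespace", "ingress", "service"}

// excludeLabels is a made-up flag; ingress-nginx does not currently expose it.
var excludeLabels = flag.String("exclude-socket-labels", "",
	"comma-separated labels to drop from socket metrics, e.g. \"host,path\"")

// activeRequestTags returns the default label set minus anything excluded.
func activeRequestTags() []string {
	drop := map[string]bool{}
	for _, l := range strings.Split(*excludeLabels, ",") {
		if l = strings.TrimSpace(l); l != "" {
			drop[l] = true
		}
	}
	tags := make([]string, 0, len(defaultRequestTags))
	for _, t := range defaultRequestTags {
		if !drop[t] {
			tags = append(tags, t)
		}
	}
	return tags
}

func main() {
	flag.Parse()
	tags := activeRequestTags()

	// The metric vectors are then built with the reduced label set.
	requestDuration := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "ingress_controller_request_duration_seconds", // illustrative name
			Help:    "Request duration, partitioned by the configured labels (sketch).",
			Buckets: prometheus.DefBuckets,
		},
		tags,
	)
	prometheus.MustRegister(requestDuration)
	fmt.Println("collecting socket metrics with labels:", tags)
}
```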

@bboreham
Contributor Author

If I understand the code correctly, you'd make a matching change to the labels passed to GetMetricWith(), e.g. here:
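
The code location linked above is not reproduced here; the following is a hedged sketch with assumed surrounding code (GetMetricWith itself is the real client_golang call). The point is that the prometheus.Labels map handed to GetMetricWith must contain exactly the label names the vector was created with, so a configurable label set has to be applied on both sides.

```go
// Sketch only: GetMetricWith is the real client_golang call, but the
// surrounding code is assumed. The labels map must contain exactly the label
// names the vector was built with, so a configurable label set has to be
// applied both when creating the vector and when recording observations.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	tags := []string{"host", "status"} // the (possibly reduced) configured labels

	vec := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "example_request_duration_seconds", // illustrative name
			Buckets: prometheus.DefBuckets,
		},
		tags,
	)
	prometheus.MustRegister(vec)

	// All labels gathered for a request; only the configured ones are passed on.
	all := prometheus.Labels{"host": "a.example.com", "status": "200", "path": "/"}
	labels := prometheus.Labels{}
	for _, t := range tags {
		labels[t] = all[t]
	}

	obs, err := vec.GetMetricWith(labels) // must match the vector's label names exactly
	if err != nil {
		log.Fatal(err) // returned when label names/count do not match the definition
	}
	obs.Observe(0.042)
}
```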

@bboreham
Contributor Author

@hairyhenderson you should probably file your request as a separate issue because it isn't really the same as my original point here. I just coded up your suggestion; PR coming RSN.
