
High-cardinality Prometheus metric ingress_controller_ssl_expire_time_seconds #2773

Closed
bboreham opened this issue Jul 12, 2018 · 9 comments · Fixed by #3885

Comments

@bboreham
Contributor

Hi, we don't run the nginx ingress ourselves, but we do run a hosted Prometheus service, which brings us into contact.

We observe that a user with hundreds of sites and dozens of ingress controllers will have tens of thousands of ingress_controller_ssl_expire_time_seconds metrics, because all the ingress controllers report all the sites. Prometheus guidance is to avoid high-cardinality metrics.

Reading the code, it appears the intention of this metric is to alert humans that a certificate renewal is necessary, so duplicating the same information from every controller is unnecessary.

I see various related work ongoing in issues and PRs, so I posted this to see what you think.
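
To make the arithmetic concrete: 40 controller replicas each exporting a series for 500 hosts gives 20,000 series for this one metric name. Below is a minimal, hypothetical sketch (not the ingress-nginx source; the label set is assumed) of a per-host gauge that every replica registers and populates, which is the pattern that produces hosts × replicas series.

```go
// Hypothetical sketch, not the ingress-nginx implementation: a gauge keyed by
// "host" that every controller replica populates for every configured site.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var sslExpireTime = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "ingress_controller_ssl_expire_time_seconds",
		Help: "Unix timestamp at which the host's certificate expires (sketch).",
	},
	[]string{"host"}, // one child series per host, per scraped replica
)

func main() {
	prometheus.MustRegister(sslExpireTime)

	// Every replica walks the full host list, so with C replicas Prometheus
	// scrapes the same host series C times, differing only by pod/target labels.
	for _, host := range []string{"a.example.com", "b.example.com", "c.example.com"} {
		sslExpireTime.WithLabelValues(host).Set(1735689600) // example expiry timestamp
	}

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```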

@JordanP
Contributor

JordanP commented Jul 12, 2018

We have a similar issue with the upstream_ip label. Our upstreams have ~10 servers each but we run ~40 Nginx nodes...

@aledbf
Member

aledbf commented Jul 12, 2018

@bboreham this is already fixed in #2726. Once we merge that PR, we will release a new version with the fix.
In particular, we changed ingress_controller_ssl_expire_time_seconds to report a value only for hostnames that have an SSL certificate configured.

Please also check #2771
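
As a rough illustration of the change described above (the types and field names here are assumptions for the sketch, not the actual #2726 diff), hosts without a configured certificate would simply produce no series:

```go
// Hypothetical sketch of "report only hosts with an SSL certificate
// configured"; types and field names are assumed, not the #2726 diff.
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// server is a simplified, assumed stand-in for a server block the controller
// knows about; it is not a type from the ingress-nginx code base.
type server struct {
	Hostname   string
	CertExpiry time.Time // zero value: no SSL certificate configured
}

var sslExpireTime = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "ingress_controller_ssl_expire_time_seconds",
		Help: "Unix timestamp at which the host's certificate expires (sketch).",
	},
	[]string{"host"},
)

// setSSLExpireMetrics exports the expiry gauge only for hosts that actually
// have a certificate; hosts without one produce no series at all.
func setSSLExpireMetrics(servers []server) {
	for _, s := range servers {
		if s.CertExpiry.IsZero() {
			continue // no certificate configured for this host: skip it
		}
		sslExpireTime.WithLabelValues(s.Hostname).Set(float64(s.CertExpiry.Unix()))
	}
}

func main() {
	prometheus.MustRegister(sslExpireTime)
	setSSLExpireMetrics([]server{
		{Hostname: "with-cert.example.com", CertExpiry: time.Now().AddDate(0, 3, 0)},
		{Hostname: "plain-http.example.com"}, // skipped: no series exported
	})
}
```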

@aledbf
Member

aledbf commented Jul 12, 2018

We have a similar issue with the upstream_ip label

@JordanP we removed that label in #2726; it is really the source of the high cardinality.

@bboreham
Contributor Author

Thanks; I did read through #2726, but I did not see anything that avoids reporting the same metric across all ingress controllers.

I can confirm that there exists an installation with >50,000 ingress_controller_ssl_expire_time_seconds time series; I don't have access to the details to know how many of them have SSL certificates, but wouldn't that be the default in modern web hosting?

@aledbf
Member

aledbf commented Jul 12, 2018

I did not see anything to avoid reporting the same metric across all ingress controllers.

I am sorry, I didn't address this in my comment.

You are right, this is not fixed in #2726, because it is impossible in the current deployment model. Right now we have multiple components in the same k8s deployment: the k8s controller and nginx, launched from Go. The metric collector lives in the Go binary and gathers information about Ingress objects, the nginx process, nginx status, and the Go runtime. This means that when you scale the deployment to, say, 5 replicas, you will see the same metrics five times over, with one difference: the controller_pod label.

I plan to refactor the way the controller works, splitting the Go controller and nginx into separate deployments.

This provides several improvements:

  • removes the time windows during which replicas running on different nodes serve different configurations
  • reduces the number of connections to the API server
    (the async nature of the informers could introduce this issue if the node where the pod is running has latency issues)
  • only nginx needs to be scaled, not the ingress controller
  • the nginx deployment only updates the configuration
  • centralizes the metric gathering (this will fix the issue you mentioned; see the sketch below)

Edit: all that said, you will see a drastic reduction in the metric count after #2726.
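
A hypothetical sketch of the "centralize the metric gathering" idea from the list above (not a design taken from any of the linked PRs): a single control-plane process implements prometheus.Collector and is the only scrape target for certificate expiry, so the series count stays at one per host regardless of how many nginx replicas run.

```go
// Hypothetical sketch of a centralized certificate-expiry collector; only one
// replica of this process would be scraped for this metric, so scaling nginx
// does not multiply the series. Names and data source are assumptions.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

type sslExpiryCollector struct {
	desc *prometheus.Desc
	// expiries stands in for the controller's certificate store.
	expiries func() map[string]time.Time // host -> certificate expiry
}

func (c *sslExpiryCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

func (c *sslExpiryCollector) Collect(ch chan<- prometheus.Metric) {
	// One series per host, produced in exactly one place.
	for host, exp := range c.expiries() {
		ch <- prometheus.MustNewConstMetric(
			c.desc, prometheus.GaugeValue, float64(exp.Unix()), host)
	}
}

func main() {
	c := &sslExpiryCollector{
		desc: prometheus.NewDesc(
			"ingress_controller_ssl_expire_time_seconds",
			"Unix timestamp at which the host's certificate expires (sketch).",
			[]string{"host"}, nil),
		expiries: func() map[string]time.Time {
			// Placeholder data; the real controller would read its certificate store.
			return map[string]time.Time{"a.example.com": time.Now().AddDate(0, 2, 0)}
		},
	}
	prometheus.MustRegister(c)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":10254", nil))
}
```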

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 10, 2018
@aledbf removed the lifecycle/stale label on Oct 10, 2018
@hairyhenderson

hairyhenderson commented Dec 5, 2018

I'm experiencing a similar issue to this in my multi-tenant environment, where each nginx-ingress reports many different host label values. We're actually not interested in the host label at that level, so I'd like to simply remove it.

I understand that host is probably useful for some users, though, so is it possible to configure which labels are used? I think it would be a relatively straightforward case of allowing the requestTags variable (https://github.com/kubernetes/ingress-nginx/tree/master/internal/ingress/metric/collectors/socket.go#L82..L92) to be configured externally, maybe through an annotation or a command-line flag?
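
A rough sketch of the flag idea, assuming a made-up flag name and an illustrative default label list (the real list is the requestTags variable linked above): labels named on the command line are dropped before the metric vectors are built.

```go
// Hypothetical sketch of a configurable socket-collector label set; the flag
// name, default label list, and metric name are illustrative only.
package main

import (
	"flag"
	"fmt"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

// defaultRequestTags stands in for the requestTags list in socket.go.
var defaultRequestTags = []string{"host", "status", "method", "path", "namespace", "ingress", "service"}

// excludeLabels is a made-up flag; ingress-nginx does not currently expose it.
var excludeLabels = flag.String("exclude-socket-labels", "",
	"comma-separated labels to drop from socket metrics, e.g. \"host,path\"")

// activeRequestTags returns the default label set minus anything excluded.
func activeRequestTags() []string {
	drop := map[string]bool{}
	for _, l := range strings.Split(*excludeLabels, ",") {
		if l = strings.TrimSpace(l); l != "" {
			drop[l] = true
		}
	}
	tags := make([]string, 0, len(defaultRequestTags))
	for _, t := range defaultRequestTags {
		if !drop[t] {
			tags = append(tags, t)
		}
	}
	return tags
}

func main() {
	flag.Parse()
	tags := activeRequestTags()

	// The metric vectors are then built with the reduced label set.
	requestDuration := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "ingress_controller_request_duration_seconds", // illustrative name
			Help:    "Request duration, partitioned by the configured labels (sketch).",
			Buckets: prometheus.DefBuckets,
		},
		tags,
	)
	prometheus.MustRegister(requestDuration)
	fmt.Println("collecting socket metrics with labels:", tags)
}
```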

@bboreham
Contributor Author

If I understand the code correctly, you'd make a matching change to the labels passed to GetMetricWith(), e.g. here:
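
The code location linked above is not reproduced here; the following is a hedged sketch with assumed surrounding code (GetMetricWith itself is the real client_golang call). The point is that the prometheus.Labels map handed to GetMetricWith must contain exactly the label names the vector was created with, so a configurable label set has to be applied on both sides.

```go
// Sketch only: GetMetricWith is the real client_golang call, but the
// surrounding code is assumed. The labels map must contain exactly the label
// names the vector was built with, so a configurable label set has to be
// applied both when creating the vector and when recording observations.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	tags := []string{"host", "status"} // the (possibly reduced) configured labels

	vec := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "example_request_duration_seconds", // illustrative name
			Buckets: prometheus.DefBuckets,
		},
		tags,
	)
	prometheus.MustRegister(vec)

	// All labels gathered for a request; only the configured ones are passed on.
	all := prometheus.Labels{"host": "a.example.com", "status": "200", "path": "/"}
	labels := prometheus.Labels{}
	for _, t := range tags {
		labels[t] = all[t]
	}

	obs, err := vec.GetMetricWith(labels) // must match the vector's label names exactly
	if err != nil {
		log.Fatal(err) // returned when label names/count do not match the definition
	}
	obs.Observe(0.042)
}
```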

@bboreham
Contributor Author

@hairyhenderson you should probably file your request as a separate issue because it isn't really the same as my original point here. I just coded up your suggestion; PR coming RSN.
