Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Webhook metrics can have enormous cardinality #2872

Closed
tragiclifestories opened this issue Jun 29, 2020 · 9 comments · Fixed by #3319
Closed

Webhook metrics can have enormous cardinality #2872

tragiclifestories opened this issue Jun 29, 2020 · 9 comments · Fixed by #3319
Labels
kind/question Issues or PRs that are questions around the project or a particular feature

Comments

@tragiclifestories
Copy link

The metrics exported by the tekton-pipelines-webhook service have a label for resource_name. This interacts poorly with the generateName part of the k8s metadata API, since you will get a new time series for every interaction. This will cause well known problems with Prometheus, such as drastically degraded query performance and increased storage needs.

Expected Behavior

A new time series should not be generated for every ephemeral Tekton resource created, updated etc.

Actual Behavior

A new time series is generated for every ephemeral Tekton resource created, updated etc.

Steps to Reproduce the Problem

  1. Enable prometheus metrics
  2. Create many Tekton taskruns (or whatever) with spec.metadata.generateName set.
  3. Observe the mutliplication of time series on the metrics endpoint.

Additional Info

  • Kubernetes version:

    Output of kubectl version:

    Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.4", GitCommit:"c96aede7b5205121079932896c4ad89bb93260af", GitTreeState:"clean", BuildDate:"2020-06-18T02:59:13Z", GoVersion:"go1.14.3", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-gke.15", GitCommit:"9cabee15e0922c3b36724de4866a98f6c2da5e6a", GitTreeState:"clean", BuildDate:"2020-05-01T21:47:04Z", GoVersion:"go1.13.8b4", Compiler:"gc", Platform:"linux/amd64"}
    
    
  • Tekton Pipeline version:

    0.12.0

@imjasonh
Copy link
Member

@mattmoor AIUI webhook metrics are defined for us by knative/pkg infra. Is this something Knative has hit as well? Or is Tekton different due to its recommendation to use generateName?

@mattmoor
Copy link
Member

It certainly exacerbates it, but I think that's probably a poor choice of dimension in general since generally metrics systems frown upon the use of any dimension with unbounded cardinality. Want to raise in #observability on slack.knative.dev?

@vdemeester
Copy link
Member

/kind question

@tekton-robot tekton-robot added the kind/question Issues or PRs that are questions around the project or a particular feature label Jun 30, 2020
@tekton-robot
Copy link
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Copy link
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Copy link
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 15, 2020
@vdemeester
Copy link
Member

/remove-lifecycle rotten
/remove-lifecycle stale
/reopen

@tekton-robot
Copy link
Collaborator

@vdemeester: Reopened this issue.

In response to this:

/remove-lifecycle rotten
/remove-lifecycle stale
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot reopened this Aug 17, 2020
@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 17, 2020
@tragiclifestories
Copy link
Author

knative/pkg#1464 is merged now so I guess it's a matter of waiting for a new release to be cut and then bumping the dep ... Not sure if this would count as a breaking change, since it's hypothetically possible that somebody is relying on the current behaviour even though they really really shouldn't? 🤔

eddie4941 added a commit to eddie4941/pipeline that referenced this issue Oct 1, 2020
The newest knative/pkg release has released some change to how metrics
are generated. In particular, the new code stops generating a metrics
for every resource that is reconciled. This dramatically reduces the
metrics cardinality for the webhook/controller. This ultimately
reduces amount of memory required by the controller/webhook daemons and
greatly improves the usability of the /metrics endpoint as it will not
longer return a metrics for every resource that is reconciled.

fixes: tektoncd#2872
eddie4941 added a commit to eddie4941/pipeline that referenced this issue Oct 1, 2020
The newest knative/pkg release has released some change to how metrics
are generated. In particular, the new code stops generating a metrics
for every resource that is reconciled. This dramatically reduces the
metrics cardinality for the webhook. This ultimately
reduces amount of memory required by the webhook daemons and
greatly improves the usability of the /metrics endpoint as it will not
longer return a metrics for every resource that is reconciled.

fixes: tektoncd#2872
@eddie4941 eddie4941 mentioned this issue Oct 1, 2020
4 tasks
eddie4941 added a commit to eddie4941/pipeline that referenced this issue Oct 2, 2020
The newest knative/pkg release has released some change to how metrics
are generated. In particular, the new code stops generating a metrics
for every resource that is reconciled. This dramatically reduces the
metrics cardinality for the webhook. This ultimately
reduces amount of memory required by the webhook daemons and
greatly improves the usability of the /metrics endpoint as it will not
longer return a metrics for every resource that is reconciled.

fixes: tektoncd#2872
eddie4941 added a commit to eddie4941/pipeline that referenced this issue Oct 6, 2020
The newest knative/pkg release has released some change to how metrics
are generated. In particular, the new code stops generating a metrics
for every resource that is reconciled. This dramatically reduces the
metrics cardinality for the webhook. This ultimately
reduces amount of memory required by the webhook daemons and
greatly improves the usability of the /metrics endpoint as it will not
longer return a metrics for every resource that is reconciled.

fixes: tektoncd#2872
tekton-robot pushed a commit that referenced this issue Oct 6, 2020
The newest knative/pkg release has released some change to how metrics
are generated. In particular, the new code stops generating a metrics
for every resource that is reconciled. This dramatically reduces the
metrics cardinality for the webhook. This ultimately
reduces amount of memory required by the webhook daemons and
greatly improves the usability of the /metrics endpoint as it will not
longer return a metrics for every resource that is reconciled.

fixes: #2872
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/question Issues or PRs that are questions around the project or a particular feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants