
webhook_request_latencies_bucket metric keeps adding new data series and became unusable #3171

Open
r0bj opened this issue Sep 6, 2020 · 34 comments
Labels: area/metrics, kind/bug, lifecycle/stale

@r0bj

r0bj commented Sep 6, 2020

Expected Behavior

The Prometheus metric webhook_request_latencies_bucket is usable in a real environment and does not keep adding new data series forever. Prometheus is able to query the metric.

Actual Behavior

The Prometheus metric webhook_request_latencies_bucket creates so many data series that it is practically impossible to query in Prometheus (too much data). It keeps adding new series while the webhook is running, so the number of series grows forever. Restarting the tekton-pipelines-webhook pod resets the number of series and temporarily fixes the issue.

Steps to Reproduce the Problem

Run tekton-pipelines-webhook.

Additional Info

  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.4", GitCommit:"c96aede7b5205121079932896c4ad89bb93260af", GitTreeState:"clean", BuildDate:"2020-06-17T11:41:22Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.4", GitCommit:"c96aede7b5205121079932896c4ad89bb93260af", GitTreeState:"clean", BuildDate:"2020-06-17T11:33:59Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
Client version: 0.12.0
Pipeline version: v0.15.0
Triggers version: v0.7.0
r0bj added the kind/bug label Sep 6, 2020
@ywluogg
Contributor

ywluogg commented Sep 10, 2020

I can take a look at it if no one else is. It may take me a while, though.

@ywluogg
Contributor

ywluogg commented Sep 15, 2020

@imjasonh will this be suitable as a good first issue?

@imjasonh
Member

/assign ywluogg

cc @NavidZ since this relates to metrics

@eddie4941
Contributor

eddie4941 commented Sep 16, 2020

Dropping this here for context. The webhook_request_latencies_bucket metric (and others) is heavily influenced by the labels in question here: https://github.com/knative/pkg/pull/1464/files

Removing the labels in that pull request might help reduce the number of unique webhook_request_latencies_bucket series the webhook has to manage.

Aside from this, I don't know if there's a way to configure the metrics code to purge metrics from the in-memory store after a period of time. That would help too. Most of the time, the in-memory data is sent to a backend like Prometheus, Stackdriver, etc. anyway.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot added the lifecycle/stale label Dec 15, 2020
@r0bj
Author

r0bj commented Dec 15, 2020

/remove-lifecycle stale

tekton-robot removed the lifecycle/stale label Dec 15, 2020
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot added the lifecycle/stale label May 4, 2021
@r0bj
Author

r0bj commented May 4, 2021

/remove-lifecycle stale

tekton-robot removed the lifecycle/stale label May 4, 2021
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot added the lifecycle/stale label Oct 15, 2021
@r0bj
Author

r0bj commented Oct 15, 2021

/remove-lifecycle stale

tekton-robot removed the lifecycle/stale label Oct 15, 2021
@jerop
Member

jerop commented Nov 15, 2021

@ywluogg are you still looking into this?

@vdemeester looks like this issue would be addressed by TEP-0073: Simplify metrics, right?

@ywluogg
Contributor

ywluogg commented Nov 15, 2021

Hi @jerop, I'm not looking into this anymore. Please unassign me. Thanks!

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot added the lifecycle/stale label Feb 13, 2022
@r0bj
Author

r0bj commented Feb 13, 2022

/remove-lifecycle stale

tekton-robot removed the lifecycle/stale label Feb 13, 2022
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot added the lifecycle/stale label May 23, 2022
dibyom removed the lifecycle/stale label Jun 15, 2022
@QuanZhang-William
Member

/assign @QuanZhang-William

@khrm
Contributor

khrm commented Jul 17, 2023

/assign @khrm

@khrm
Contributor

khrm commented Jul 31, 2023

@pritidesai This was fixed by knative/pkg#1464

So we can close this.

/close

@tekton-robot
Collaborator

@khrm: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

@pritidesai This was fixed by knative/pkg#1464

So we can close this.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@syurevich

@khrm not only the resource_name label but also the resource_namespace label can contribute to this "high cardinality" issue. To fix it for every use case, one would need to purge metrics from the in-memory store after a period of time.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot added the lifecycle/stale label Oct 30, 2023
@syurevich

syurevich commented Nov 8, 2023

This issue is still relevant. See this comment as well as this.

zhouhaibing089 added a commit to zhouhaibing089/pkg that referenced this issue Jan 10, 2024
To add some context, historically, `resource_name` was removed from this
tag list due to its high potential of causing high metrics cardinality.
See [knative#1464][1] for more information.

While that's great, it might not be sufficient for large-scale use cases where namespaces can be super dynamic (with generateName, too) or grow fast enough. There is an issue report from [tektoncd/pipeline#3171][2] which talks about this.

This proposal makes it possible to disable the `resource_namespace` tag via an option function. The default behavior is not changed, so there is no user impact if any existing users rely on this tag. There is no API contract change either, thanks to the beauty of variadic functions.

Now downstream projects can consume this by overriding `StatsReporter` in the webhook context options with their own preference. As a caveat here, if a downstream project does choose to override `StatsReporter`, the default `ReportMetrics` function shouldn't be called by default, as they may now have a different set of tag keys to report. As such, this function is only called if the default `StatsReporter` is used.

[1]: knative#1464
[2]: tektoncd/pipeline#3171
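For readers unfamiliar with the pattern the commit message describes, here is a minimal, self-contained sketch of how a variadic option function can drop a tag such as resource_namespace without changing the constructor's signature. The type, function, and tag names below are illustrative stand-ins, not the actual knative/pkg API.

```go
package main

import "fmt"

// reporter stands in for the webhook StatsReporter; it records which tag
// keys get attached to every series it reports.
type reporter struct {
	tagKeys []string
}

// Option is a functional option applied while constructing the reporter.
// Because the constructor takes ...Option, adding new options later does
// not change the API contract for existing callers.
type Option func(*reporter)

// WithoutNamespaceTag removes the high-cardinality resource_namespace tag.
func WithoutNamespaceTag() Option {
	return func(r *reporter) {
		kept := make([]string, 0, len(r.tagKeys))
		for _, k := range r.tagKeys {
			if k != "resource_namespace" {
				kept = append(kept, k)
			}
		}
		r.tagKeys = kept
	}
}

// newReporter builds the default tag set, then applies any options.
func newReporter(opts ...Option) *reporter {
	r := &reporter{tagKeys: []string{"operation", "kind", "resource_namespace", "admission_allowed"}}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

func main() {
	fmt.Println(newReporter().tagKeys)                      // default behavior unchanged
	fmt.Println(newReporter(WithoutNamespaceTag()).tagKeys) // namespace tag dropped
}
```

With this shape, a downstream project only has to pass an extra option when the reporter is constructed; callers that pass nothing keep the existing tag set.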
@zhouhaibing089
Contributor

We have the same issue, too.

I have a proposal for knative/pkg at knative/pkg#2931.

knative-prow bot pushed a commit to knative/pkg that referenced this issue Apr 1, 2024

* webhook: add options to disable resource_namespace tag in metrics

* webhook: add StatsReporterOptions in webhook.Options

There are two ways to customize StatsReporter:

1. Use a whole new StatsReporter implementation.
2. Or pass Option funcs to customize the default StatsReporter.

Option 1 is less practical at this time due to the metrics registration conflict: `webhook.RegisterMetrics()` is called regardless of which StatsReporter implementation is used (which is a problem by itself). The second option is more practical since it works without dealing with metrics conflicts.

The `webhook.Option` in particular allows people to discard certain
metrics tags.
@zhouhaibing089
Contributor

knative/pkg now gives the option to exclude arbitrary tags. I assume the next action item is to bump knative/pkg and customize the webhook options.
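To make the "customize the webhook options" step concrete, here is a rough sketch of what the downstream wiring could look like once knative/pkg is bumped. The StatsReporterOptions field follows the commit above, but everything here (including the withoutTag helper) is an illustrative stand-in rather than the real knative.dev/pkg/webhook API, whose option names may differ.

```go
package main

import "fmt"

// StatsReporterOption mimics an option func that filters the tag keys
// the default StatsReporter records. Illustrative only.
type StatsReporterOption func(tags []string) []string

// Options mirrors the idea of webhook.Options gaining a StatsReporterOptions
// field, as described in the merged knative/pkg commit above.
type Options struct {
	ServiceName          string
	StatsReporterOptions []StatsReporterOption
}

// withoutTag drops a single tag key; the real option constructor in
// knative/pkg may have a different name and signature.
func withoutTag(name string) StatsReporterOption {
	return func(tags []string) []string {
		kept := make([]string, 0, len(tags))
		for _, t := range tags {
			if t != name {
				kept = append(kept, t)
			}
		}
		return kept
	}
}

func main() {
	// A downstream project (e.g. the Tekton webhook) would pass the option
	// through its webhook options rather than replacing the whole reporter.
	opts := Options{
		ServiceName:          "tekton-pipelines-webhook",
		StatsReporterOptions: []StatsReporterOption{withoutTag("resource_namespace")},
	}
	tags := []string{"operation", "kind", "resource_namespace"}
	for _, o := range opts.StatsReporterOptions {
		tags = o(tags)
	}
	fmt.Println(tags) // [operation kind]
}
```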

@afrittoli
Member

@khrm are you still working on this issue? We marked it as "blocking" for a v1 release, and we would like to make a v1 release in July. If you are working on it, will you be able to solve it by then? Thank you!

@khrm
Contributor

khrm commented Apr 30, 2024

@afrittoli This seems to be resolved. It will be fixed by the knative update.

@Parsavali

Do you know when this fix is expected to be part of Tekton?

@zhouhaibing089
Contributor

FWIW, besides bumping knative.dev/pkg, it is also necessary to override the default StatsReporter options.

@zhouhaibing089
Contributor

zhouhaibing089 commented Jun 7, 2024

Since #7989 is in, I am sending #8033 as a follow-up which actually addresses this issue.

I anticipate there might be asks to make this a configuration option, and maybe off by default, so any pointers would be greatly appreciated.

zhouhaibing089 added a commit to zhouhaibing089/pipeline that referenced this issue Jun 21, 2024
This addresses tektoncd#3171.

As `resource_namespace` has the potential to contribute to high metrics cardinality while not adding much value from an observability perspective, it would be ideal to disable it by default.