
Tekton metrics are reporting wrong values #2844

Closed
RafaeLeal opened this issue Jun 22, 2020 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@RafaeLeal
Contributor

Expected Behavior

Tekton metrics should be coherent with the data gathered with kubectl.

Actual Behavior

I have two examples. For the tekton_pipelinerun_count metric, it's currently reporting 22 failed and 58 successful, but the cluster only has 20 PipelineRuns:

$ kubectl get pipelinerun --all-namespaces -o json | jq '.items | length'
20

The second example: the tekton_pipelinerun_duration_seconds_sum metric seems to increase indefinitely. I expected it to rise until the PipelineRun is completed, then stabilize at the same value as the duration shown by tkn pr describe <pipelinerun_name>.

I also noticed that:

  1. It doesn't increase at a constant rate.
  2. It always increases by a multiple of the actual duration of the pipeline.

I suspect that every time the controller reconciles, it adds the duration of the pipeline again.

Steps to Reproduce the Problem

  1. Enable tekton metrics
  2. Make Prometheus scrape tekton
  3. Run some pipelineruns

Additional Info

  • Kubernetes version:

    Output of kubectl version:

    $ kubectl version
    Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-26T06:17:09Z", GoVersion:"go1.14", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-af3caf", GitCommit:"af3caf6136cd355f467083651cc1010a499f59b1", GitTreeState:"clean", BuildDate:"2020-03-27T21:51:36Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
    
  • Tekton Pipeline version: v0.13.2

    Output of tkn version

    $ tkn version
    Client version: 0.8.0
    Pipeline version: unknown
    

    Output of kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

    $ kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'
    v0.13.2
    
@vincent-pli
Member

The tekton_pipelinerun_count is incorrect; it's caused by the reconciler entering this block more than once:

@vincent-pli
Member

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 23, 2020
@RafaeLeal
Contributor Author

RafaeLeal commented Jun 23, 2020

Do you think that could be the case for the duration metrics as well? Since this block runs more than once, it also calls metrics.DurationAndCount repeatedly:

err := metrics.DurationAndCount(pr)

@vincent-pli
Member

@RafaeLeal
I think so; you could apply my fix and give it a try.

@pritidesai
Member

@vincent-pli @RafaeLeal we have a similar issue reported for tekton_taskrun_count; see #3029.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2020
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 26, 2020
@pritidesai
Member

/remove-lifecycle rotten
@RafaeLeal are you still experiencing this issue?

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 30, 2020
@ibexmonj

ibexmonj commented Feb 3, 2021

Hey folks, I am curious whether what I reported in #3739 about some of the pipeline metrics is relevant to this discussion.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 4, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 3, 2021
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
