Tekton Taskrun Metrics are not accurate #3739

Closed
ibexmonj opened this issue Feb 1, 2021 · 15 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@ibexmonj

ibexmonj commented Feb 1, 2021

Expected Behavior

Tekton metrics should match the data provided by kubectl.

Actual Behavior

I have been trying to gather the taskrun duration using tekton_taskrun_duration_seconds_sum and tekton_taskrun_duration_seconds_count by running:

sum(rate(tekton_taskrun_duration_seconds_sum{cluster_name="dev",namespace="tekton",taskrun="task-run-tekton5g444"}[1h])) / sum(rate(tekton_taskrun_duration_seconds_count{cluster_name="dev",namespace="tekton",taskrun="task-run-tekton5g444"}[1h]))

I am seeing the value being reported twice in Prometheus. Sometimes the value is reported 2-3 times, resulting in duplicate values for the taskrun duration in seconds.
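For readers unfamiliar with the sum/count pattern used in the query above, here is a minimal Python sketch of what it computes. It is a simplification: real Prometheus `rate()` also extrapolates to the window edges and divides by the window length, both of which cancel out in the ratio when the two series share sample timestamps. The sample values are made up for illustration, and `increase` and `mean_duration` are hypothetical helper names, not part of Tekton or Prometheus.

```python
from typing import List, Optional, Tuple

# (unix timestamp in seconds, cumulative counter value); illustrative only
Sample = Tuple[float, float]


def increase(samples: List[Sample]) -> float:
    """Total counter increase across the samples.

    A drop in value is treated as a counter reset to zero, which mirrors
    how Prometheus handles counters (simplified, no extrapolation).
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def mean_duration(sum_samples: List[Sample],
                  count_samples: List[Sample]) -> Optional[float]:
    """Average observed duration: increase of _sum over increase of _count."""
    observed = increase(count_samples)
    if observed == 0:
        return None  # nothing was observed in the window
    return increase(sum_samples) / observed


# Hypothetical data: one 5-second taskrun recorded once.
once_sum = [(0, 0.0), (60, 5.0), (120, 5.0)]
once_count = [(0, 0.0), (60, 1.0), (120, 1.0)]

# Hypothetical data: the same 5-second taskrun recorded twice.
twice_sum = [(0, 0.0), (60, 5.0), (120, 10.0)]
twice_count = [(0, 0.0), (60, 1.0), (120, 2.0)]

print(mean_duration(once_sum, once_count))    # average per observation
print(mean_duration(twice_sum, twice_count))  # same average
print(increase(twice_sum))                    # but the raw sum grew twice
```

Under these assumptions, a duplicated observation leaves the sum/count average unchanged but shows up as a second step in the raw `_sum` series, which would be consistent with the repeated values described here. Whether that is what the controller actually does is not established in this report.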

image

Here is another example where the duration_seconds value is being reported multiple times. It looks as if the controller is resetting and then incrementing the value again?

The query used here is sum (rate(tekton_taskrun_duration_seconds_sum{namespace="tekton",cluster_name="dev",taskrun="task-run-tektonhfvhq"}[5m]))

image

Another issue is the timestamp being reported.

Here is the TaskRun status:

    status:
      completionTime: "2021-02-01T15:00:32Z"
      conditions:
      - lastTransitionTime: "2021-02-01T15:00:32Z"
        message: All Steps have completed executing
        reason: Succeeded
        status: "True"
        type: Succeeded
      podName: task-run-tekton5g444-pod-cwbvx
      startTime: "2021-02-01T15:00:27Z"

As per the above, the job ran at 15:00 GMT, but based on the Prometheus screenshot above the timestamp being reported is 16:36, which does not quite match the time reported by kubectl get taskrun taskrun_name -o yaml.
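As a cross-check, the duration implied by the TaskRun status can be computed directly from its startTime and completionTime. This is a small sketch using only the Python standard library; `parse_k8s_time` is a hypothetical helper name. It assumes the metric is meant to reflect completionTime minus startTime, which this report does not confirm.

```python
from datetime import datetime, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a UTC timestamp in the 'YYYY-MM-DDTHH:MM:SSZ' form used in the status."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Values copied from the TaskRun status in this report.
start = parse_k8s_time("2021-02-01T15:00:27Z")
completion = parse_k8s_time("2021-02-01T15:00:32Z")

duration_seconds = (completion - start).total_seconds()
print(duration_seconds)  # 5.0
```

So kubectl reports a run of about five seconds ending at 15:00:32 UTC, which is the value to compare against what Prometheus shows.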

Steps to Reproduce the Problem

  1. Enable Tekton metrics.
  2. Enable prom scrape.
  3. Use the above sum/count method to track taskrun duration.

Additional Info

  • Kubernetes version:

    Output of kubectl version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-14T05:13:35Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.12-gke.20", GitCommit:"0ac5f81eecab42bff5ef74f18b99d8896ba7b89b", GitTreeState:"clean", BuildDate:"2020-09-09T00:48:20Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}

  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

$ tkn version
Client version: 0.15.0
Pipeline version: v0.10.2
Triggers version: v0.3.1

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 2, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 1, 2021
@ghost

ghost commented Jun 15, 2021

/assign sbwsg

@tekton-robot tekton-robot assigned ghost Jun 15, 2021
@ghost ghost removed their assignment Jun 21, 2021
@bobcatfish
Collaborator

Seems like a legit bug - we should at least investigate if we can reproduce before closing.

/remove-lifecycle rotten

@bobcatfish bobcatfish removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 10, 2021
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 8, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 8, 2021
@lbernick
Member

/priority important-soon
/remove-lifecycle rotten
There's a TEP for improving our metrics; this should be addressed there.

@tekton-robot tekton-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Dec 13, 2021
@lbernick lbernick removed their assignment Feb 7, 2022
@wlynch
Member

wlynch commented Feb 22, 2022

/assign @khrm

Should be fixed with #4468

@tekton-robot
Collaborator

@wlynch: GitHub didn't allow me to assign the following users: khrm.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @khrm

Should be fixed with #4468

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wlynch
Member

wlynch commented Feb 22, 2022

/unassign

@khrm
Contributor

khrm commented Feb 23, 2022

/assign khrm

@khrm
Contributor

khrm commented Feb 23, 2022

This should be fixed with #4468 or #4469

@ibexmonj
Author

Thank you for working on this.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2022
@pritidesai
Member

Projects
Status: Done
Development

No branches or pull requests

7 participants