Metrics: pipelinerun_count is not correct #2848

vincent-pli
Member

Partial fix for issue #2844.
When PipelineResult is not null, the pipelinerun_count metric is incorrect.
The root cause is that:

if pr.IsDone() {
	// We may be reading a version of the object that was stored at an older version
	// and may not have had all of the assumed default specified.
	pr.SetDefaults(contexts.WithUpgradeViaDefaulting(ctx))
	c.updatePipelineResults(ctx, pr)

updatePipelineResults(ctx, pr) updates pr.Status.PipelineResults if needed. That change makes knative/pkg update the PipelineRun status, which triggers another reconcile.

As a result, pipelinerun_count is incremented again.
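
The double count described above can be reproduced in a minimal, self-contained sketch. The `run` and `reconciler` types and the `reconcileUntilStable` helper are hypothetical stand-ins, not the Tekton or knative/pkg types; the only assumption is that a status change during one pass leads to one more pass.

```go
package main

import "fmt"

// run is a hypothetical stand-in for a PipelineRun; it is not the Tekton type.
type run struct {
	done    bool
	results []string // stands in for pr.Status.PipelineResults
}

// reconciler is a hypothetical stand-in for the controller.
type reconciler struct {
	count int // stands in for the pipelinerun_count metric
}

// reconcile mimics the flow above: for a finished run it records the metric
// and then fills in the results. It returns true when the status changed,
// which is assumed to cause another reconcile.
func (c *reconciler) reconcile(r *run) bool {
	if !r.done {
		return false
	}
	c.count++
	if len(r.results) == 0 {
		r.results = []string{"result-a"}
		return true
	}
	return false
}

// reconcileUntilStable re-runs reconcile while the status keeps changing and
// returns the number of passes it took.
func reconcileUntilStable(c *reconciler, r *run) int {
	passes := 0
	for {
		passes++
		if !c.reconcile(r) {
			return passes
		}
	}
}

func main() {
	c := &reconciler{}
	r := &run{done: true}
	passes := reconcileUntilStable(c, r)
	fmt.Printf("passes=%d count=%d\n", passes, c.count)
}
```

In this sketch a single finished run takes two passes, and the counter ends at 2 instead of 1.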

Changes

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

When `PipelineResult` is not null, the pipelinerun_count metric is incorrect.
@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 23, 2020
@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign dibyom
You can assign the PR to them by writing /assign @dibyom in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneous one.

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
pkg/reconciler/pipelinerun/pipelinerun.go | 84.2% | 84.3% | 0.1

@vincent-pli
Member Author

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 23, 2020
@vdemeester
Member

/cc @afrittoli @mattmoor

@vincent-pli
Member Author

/test pull-tekton-pipeline-integration-tests

@vdemeester
Member

/cc @hrishin

@hrishin
Member

hrishin commented Jun 24, 2020

@vincent-pli thank you for the fix!
Let's move this logic elsewhere and consolidate all metrics reporting in one place. I wonder whether it would be better to decouple this logic from the reconciler? 🤔

In that case, though, count and duration may need additional care, for example not repeating the count and consolidating the duration.

}
}(c.metrics)

if equality.Semantic.DeepEqual(original.Status, pr.Status) {
Member

I don't think this is enough to prevent false metrics from being reported.
Not knowing much about opencensus, my expectation reading this code is that DurationAndCount was idempotent, and it would take care of not submitting twice.
Now, if the controller is restarted for instance, all metrics will be submitted again.

A way to solve this could perhaps be to write metrics where the completion time is set?
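
To make the restart concern concrete, here is a rough sketch of a guard that records only when a pass leaves the status unchanged. It uses `reflect.DeepEqual` from the standard library in place of `equality.Semantic.DeepEqual`, and the `status` and `recorder` types are hypothetical. A fresh `recorder` stands in for a restarted controller whose in-memory counters start from zero; whether the real guard wraps the metric call exactly this way is an assumption.

```go
package main

import (
	"fmt"
	"reflect"
)

// status is a hypothetical, simplified stand-in for the PipelineRun status.
type status struct {
	Done   bool
	Result string
}

// recorder is a hypothetical in-memory counter; a new value models a restart.
type recorder struct{ count int }

// reconcile fills in the result if it is missing, then records the run only
// when this pass did not change the status.
func reconcile(rec *recorder, st *status) {
	original := *st // copy taken before any mutation
	if st.Done && st.Result == "" {
		st.Result = "result-a"
	}
	if st.Done && reflect.DeepEqual(original, *st) {
		rec.count++
	}
}

func main() {
	st := &status{Done: true}

	before := &recorder{}
	reconcile(before, st) // status changes in this pass: nothing recorded
	reconcile(before, st) // status is stable: recorded once
	fmt.Println("before restart:", before.count)

	after := &recorder{} // models the restarted controller
	reconcile(after, st) // the same finished run is recorded again
	fmt.Println("after restart:", after.count)
}
```

The guard removes the double count within one process, but the same finished run is still recorded once more by the restarted process.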

@vincent-pli
Member Author

vincent-pli commented Jun 29, 2020

@hrishin thanks for the comments. It would be great if we could prevent repeats inside DurationAndCount, but I guess that's difficult. As @afrittoli mentioned, DurationAndCount was expected to be idempotent; since it is not, the caller has to take care of:

  • when to call it (in this case it records the duration of the pipelinerun, so it should be invoked when the pipelinerun completes)
  • preventing repeats
  • possibly other things
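
One way a caller might handle the "prevent repeats" point is to remember which runs were already counted. This is only a sketch: `onceRecorder` and its `Record` method are hypothetical and are not part of Tekton or OpenCensus. The in-memory set is lost on restart and would grow unless entries are removed when runs are deleted.

```go
package main

import (
	"fmt"
	"sync"
)

// onceRecorder is a hypothetical wrapper that counts each run at most once
// per process, keyed by an identifier such as the run's UID.
type onceRecorder struct {
	mu    sync.Mutex
	seen  map[string]bool
	count int
}

func newOnceRecorder() *onceRecorder {
	return &onceRecorder{seen: map[string]bool{}}
}

// Record counts the run identified by uid if it has not been counted yet and
// reports whether this call incremented the counter.
func (r *onceRecorder) Record(uid string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.seen[uid] {
		return false
	}
	r.seen[uid] = true
	r.count++
	return true
}

func main() {
	rec := newOnceRecorder()
	fmt.Println(rec.Record("run-1")) // first reconcile of a finished run
	fmt.Println(rec.Record("run-1")) // repeated reconcile of the same run
	fmt.Println(rec.Record("run-2")) // a different run
	fmt.Println("count:", rec.count)
}
```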

@afrittoli
Checking the completion time does not work: it is already set every time we enter
if pr.IsDone() {

As for a controller restart, I think that's OK: all the metrics will be reported again, but as a different time series.

@hrishin
Member

hrishin commented Jun 30, 2020

Checking the completion time does not work: it is already set every time we enter
if pr.IsDone() {
As for a controller restart, I think that's OK: all the metrics will be reported again, but as a different time series.

@vincent-pli @afrittoli, yes, this fix looks promising. I just wonder how metrics aggregators behave when the deployment restarts. Should we test this?

@vincent-pli
Member Author

@hrishin
When the controller restarts, all metrics are loaded again, unless a pipelinerun or another resource is deleted at the moment of the restart.
For example, for pipelinerun we have these metrics:

  • pipelinerun_duration_seconds
  • pipelinerun_count
  • running_pipelineruns_count

All the metrics listed above will be recovered, with one difference: pipelinerun_count becomes correct, since the if pr.IsDone() { branch is entered only once after the controller restarts : )

@afrittoli
Member

I meant to submit the metric in a different place in the code, when the completion time is set, which happens only once, regardless of further reconciles or restarts.

Could we add test coverage for this?

@vincent-pli
Member Author

@afrittoli
It's true that the completion time is set only once. But at that point the reconcile is not finished, because the final "status: success" update has not happened yet. The pipelinerun is not really complete at that point, so I don't think we can emit a metric that marks it as complete.

case corev1.ConditionTrue:
pr.Status.MarkSucceeded(after.Reason, after.Message)

@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 15, 2020
@tekton-robot
Collaborator

@tekton-robot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
5 participants