Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0121: Modify metrics recorder for TaskRun Retries #5853

Merged
merged 1 commit into from
Dec 15, 2022

Conversation

XinruZhang
Copy link
Member

@XinruZhang XinruZhang commented Dec 8, 2022

Changes

Openning this PR originally because a DATA RACE (Dirty Read) error
occurs when I was implementing #5844, see error msg.

Basically, the reason is that durationAndCountMetrics() is reading
the taskRun object in one goroutine, while retryTaskRun() (which is
called in finishReconcileUpdateEmitEvents) is writing the taskRun object.

After digging into it, there's one more concern which looks like a bug:
durationAndCountMetrics() in reconcile() only counts the taskrun numbers
where the taskruns that have no preparation error / not cancelled right after
scheduled / not timed out before calling reconcile() (for example,
be pending for a long time.).

To address the problem mentioned above, this PR moves
durationAndCountMetrics() outside of reconcile() to ReconcileKind()
to make sure it is reading the taskRun object that has no further update.


Related Background Info

The recording metrics logic was originally in the ReconcileKind() function,
but the commit 1375219 addressing the recount issue moves that logic
to reconcile()
[1] (probably for better code style? 🤔).

This is a blocker for #5844.

[1] 1375219#diff-6e67e9c647bbe2a08807ff5ccbdd7dc9036df373e56b9774d3996f92ab7ceabaL138-L145

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

@tekton-robot
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@tekton-robot tekton-robot added release-note-none Denotes a PR that doesnt merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Dec 8, 2022
@tekton-robot tekton-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Dec 8, 2022
@jerop
Copy link
Member

jerop commented Dec 8, 2022

/test all

@jerop jerop added the kind/bug Categorizes issue or PR as related to a bug. label Dec 8, 2022
@XinruZhang XinruZhang marked this pull request as ready for review December 9, 2022 13:21
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 9, 2022
@XinruZhang
Copy link
Member Author

/assign @jerop

cc @khrm
cc @tektoncd/core-collaborators

@JeromeJu
Copy link
Member

JeromeJu commented Dec 9, 2022

After digging into it, there's one more concern:
durationAndCountMetrics() in reconcile() only counts the taskrun numbers
where the taskrun has no preparation error / not cancelled right after
scheduled / not timed out before calling reconcile() (for example,
be pending for a long time.).

Wondering if adding the status check here in DurationAndCount will help only counts the taskRuns that are not failed from https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/taskrun/taskrun.go#L110-L180?

If this route is valid, shall we also change the similar logics for reconciler/pipelinerun.go

@XinruZhang
Copy link
Member Author

Thanks @JeromeJu for taking time to looking into the issue and commenting on the issue :)

If I'm not mistaken, are you suggesting that we should not count those failed taskruns for metrics? If so, why would we want to do that?

If this route is valid, shall we also change the similar logics for reconciler/pipelinerun.go

That's true, I do think we also need to change the logic for reconciler/pipelinerun.go

@JeromeJu
Copy link
Member

If I'm not mistaken, are you suggesting that we should not count those failed taskruns for metrics? If so, why would we want to do that?

That was referring to the previous logics of putting durationAndCountMetrics() in reconcile function. But I totally agree that we should have it at ReconcileKind level which we can keep count of failed taskRuns(taking the current route).

@khrm
Copy link
Contributor

khrm commented Dec 12, 2022

We need to check whether the count in this logical route is correct or not. I will check why this is so. We want to avoid recount issues.

@XinruZhang
Copy link
Member Author

cc @afrittoli this is the PR I opened for the issue we discussed today :P

@XinruZhang
Copy link
Member Author

XinruZhang commented Dec 13, 2022

@khrm I believe this commit 1375219 is addressing the recount issue by comparing beforeCondition and afterCondition, only record when the two conditions are different).

The recording metrics logic was originally in the ReconcileKind() function, but the commit 1375219 addressing the recount issue moves that logic to reconcile() [1] (probably for better code style? 🤔).

cc @afrittoli @jerop @dibyom

[1] 1375219#diff-6e67e9c647bbe2a08807ff5ccbdd7dc9036df373e56b9774d3996f92ab7ceabaL138-L145

@afrittoli afrittoli added this to the Pipelines v0.43 milestone Dec 13, 2022
@XinruZhang XinruZhang force-pushed the TEP-0121-pre-fix branch 2 times, most recently from caad78b to 49224af Compare December 13, 2022 21:27
@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 13, 2022
Prior to this commit, metrics recording durationAndCountMetrics()
is in reconcile() (instead of ReconcileKind()), which only counts
the taskrun numbers where the taskruns that have no preparation
error / not cancelled right after scheduled / not timed out before
calling reconcile() (for example, be pending for a long time.).

This commit moves durationAndCountMetrics() into ReconcileKind()
to address the issue above.

It's worth mentioning that the recording metrics logic was
originally in the ReconcileKind() function, but the commit tektoncd@1375219,
aming at addressing the recount issue, **moves that logic to `reconcile()`**
[1] (probably for better code style? 🤔).

[1] tektoncd@1375219#diff-6e67e9c647bbe2a08807ff5ccbdd7dc9036df373e56b9774d3996f92ab7ceabaL138-L145
@tekton-robot tekton-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 13, 2022
@dibyom
Copy link
Member

dibyom commented Dec 14, 2022

Makes sense, thanks for digging in @XinruZhang

/approve

We need to check whether the count in this logical route is correct or not. I will check why this is so. We want to avoid recount issues.

@khrm - Agreed as well. Will the current set of tests count this?

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 14, 2022
@JeromeJu
Copy link
Member

/test check-pr-has-kind-label

@tekton-robot
Copy link
Collaborator

@JeromeJu: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-tekton-pipeline-alpha-integration-tests
  • /test pull-tekton-pipeline-build-tests
  • /test pull-tekton-pipeline-integration-tests
  • /test tekton-pipeline-unit-tests

The following commands are available to trigger optional jobs:

  • /test pull-tekton-pipeline-go-coverage

Use /test all to run all jobs.

In response to this:

/test check-pr-has-kind-label

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@afrittoli
Copy link
Member

NIT Submitter checklist is not filled in

Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @XinruZhang for addressing this!

It looks like we don't have any reconcile unit test that checks metrics, which is problematic because we must rely on our understanding of the code alone to know whether the counting of TaskRuns is ok or duplicates will exist or cases will be missed.

The current logic for counting metrics relies on the assumption that we shall only count a TaskRun when isDone() is true at the end of the reconciliation cycle and some change happened in the ConditionSucceeded condition. These two conditions identify a single reconciliation cycle as long as no change is done to the ConditionSucceeded after a TaskRun is marked as done.

These two conditions do not change when moving the logic from reconcile to ReconcileKind, and in fact, the new location of the code captures some cases that were missing before, as @XinruZhang noted already.

There are a couple of improvements that could be made, but we could do them as a follow-up:

  • the code under the isDone check should never touch the conditions of theTaskRun. We invoke setDefaults and later updateLabelsAndAnnotations neither of which will touch the condition, so it's ok now, but it would be nice to enforce that somehow to avoid inadvertently breaking this logic here in the future. We could start adding a comment to it. Reconcile metrics tests would ensure that this is not broken in future
  • The code in the durantionAndCountMetrics uses a lister to get the before condition. We should not do that, listers are expensive for the API server, besides we already store the before condition, so we should pass that to durantionAndCountMetrics, similar to what we do for events.

/approve

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, chitrangpatel, dibyom, jerop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [afrittoli,dibyom,jerop]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesnt merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants