Adjust PipelineRun's StartTime based on TaskRun state. #3461

Merged: 3 commits merged into tektoncd:master from adjust-start-time on Oct 27, 2020

Conversation

mattmoor

Changes

Occasionally, it is possible for us to be reconciling a PipelineRun and have the
status we intend to report reflect an inaccurate StartTime (see issue for
details). This corrects for those circumstances by ensuring that the StartTime
we report for a PipelineRun is never later than the smallest CreationTimestamp
of a child TaskRun.
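
The clamping rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `adjustStartTime` and the `at` helper are hypothetical names, and plain `time.Time` values stand in for the PipelineRun's StartTime and the child TaskRuns' CreationTimestamps.

```go
package main

import (
	"fmt"
	"time"
)

// at returns a fixed reference instant offset by sec seconds.
// It is a hypothetical helper used only to build readable examples.
func at(sec int) time.Time {
	base := time.Date(2020, time.October, 26, 12, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(sec) * time.Second)
}

// adjustStartTime returns start, moved back to the earliest child creation
// time when any child was created before start. It never moves start forward.
// (Hypothetical signature; the real AdjustStartTime works on PipelineRunState.)
func adjustStartTime(start time.Time, childCreated ...time.Time) time.Time {
	adjusted := start
	for _, c := range childCreated {
		if c.Before(adjusted) {
			adjusted = c
		}
	}
	return adjusted
}

func main() {
	start := at(10)

	// A child created 3s before the recorded start pulls the start back.
	fmt.Println(adjustStartTime(start, at(7), at(12)).Equal(at(7)))

	// Children created after the recorded start leave it unchanged.
	fmt.Println(adjustStartTime(start, at(11), at(12)).Equal(start))
}
```

With no children, the sketch returns the start time unchanged, so it is safe to call before the first TaskRun exists.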

Fixes: #3460

/kind bug

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Fixes a bug where PipelineRun may report as Failed when it really timed out.

@tekton-robot tekton-robot added the release-note, kind/bug, and size/S labels Oct 26, 2020
@tekton-robot tekton-robot requested review from dlorenc and a user October 26, 2020 19:45
@imjasonh imjasonh left a comment

Does this need a test?

@@ -71,6 +72,23 @@ func (state PipelineRunState) IsBeforeFirstTaskRun() bool {
	return true
}

// AdjustStartTime adjusts potential drift in the PipelineRun's start time.

In addition to stating why we do this, can you add a sentence describing how we adjust, to make this easier to read?

tweaked wording, PTAL 🙏

@pritidesai

Does this need a test?

yes please, a test would be great. Does the same bug apply to taskRun start time?

@tekton-robot tekton-robot added the size/L label and removed the size/S label Oct 26, 2020
@mattmoor

I pushed unit tests for AdjustStartTime.

Does the same bug apply to taskRun start time

@pritidesai I haven't looked at the TaskRun StartTime logic. In principle it could similarly benefit from rationalizing its StartTime with the metadata.creationTimestamp of its child resources (e.g. Pods), but AFAIK it shouldn't manifest in failures the way it does in the linked bug for PipelineRun -> TaskRun because it is a "leaf" of the timeout evaluation chain. So it's at least more tolerant of minor drift.
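
One way the drift could surface in the PipelineRun -> TaskRun chain is sketched below. The numbers are assumed for illustration (a 60s timeout shared by parent and child, and a parent StartTime recorded 2s after the child was created), and `timedOut` and `at` are hypothetical helpers, not Tekton functions.

```go
package main

import (
	"fmt"
	"time"
)

// timeout is an assumed value shared by the parent and the child in this sketch.
const timeout = 60 * time.Second

// at returns a fixed reference instant offset by sec seconds (hypothetical helper).
func at(sec int) time.Time {
	base := time.Date(2020, time.October, 26, 12, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(sec) * time.Second)
}

// timedOut reports whether a run that began at start has used up limit by now.
func timedOut(start time.Time, limit time.Duration, now time.Time) bool {
	return now.Sub(start) >= limit
}

func main() {
	childCreated := at(0) // child TaskRun's creation time
	driftedStart := at(2) // parent StartTime recorded slightly late
	now := at(61)

	// The child is already past its timeout...
	fmt.Println("child timed out:", timedOut(childCreated, timeout, now))
	// ...but the drifted parent still believes it has time left, so it sees a
	// failed child without being timed out itself.
	fmt.Println("drifted parent timed out:", timedOut(driftedStart, timeout, now))
	// Once the parent's start is pulled back to the child's creation time,
	// both agree.
	fmt.Println("adjusted parent timed out:", timedOut(childCreated, timeout, now))
}
```

A leaf has no children whose timeouts are evaluated against its own start, which is the sense in which it would be more tolerant of minor drift.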

@pritidesai

pritidesai commented Oct 26, 2020

wow that was quick @mattmoor 🙏

So TaskRun calls InitializeConditions() if it hasn't started, and then updates StartTime in case of any such gap:

func (trs *TaskRunStatus) InitializeConditions() {
	started := false
	if trs.StartTime.IsZero() {
		trs.StartTime = &metav1.Time{Time: time.Now()}
		started = true
	}
	conditionManager := taskRunCondSet.Manage(trs)
	conditionManager.InitializeConditions()
	// Ensure the started reason is set for the "Succeeded" condition
	if started {
		initialCondition := conditionManager.GetCondition(apis.ConditionSucceeded)
		initialCondition.Reason = TaskRunReasonStarted.String()
		conditionManager.SetCondition(*initialCondition)
	}
}

if !tr.HasStarted() {
	tr.Status.InitializeConditions()
	// In case node time was not synchronized, when controller has been scheduled to other nodes.
	if tr.Status.StartTime.Sub(tr.CreationTimestamp.Time) < 0 {
		logger.Warnf("TaskRun %s createTimestamp %s is after the taskRun started %s", tr.GetNamespacedName().String(), tr.CreationTimestamp, tr.Status.StartTime)
		tr.Status.StartTime = &tr.CreationTimestamp
	}
	// Emit events. During the first reconcile the status of the TaskRun may change twice

which means, we might not see this bug in taskRun 🤔

@mattmoor

@pritidesai that same logic is in PipelineRun (the initialization and clamping to its own creation timestamp), but the question is what outside resources we'd want to clamp the StartTime on. Maybe the Pod?
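
The two bounds under discussion can be sketched together as follows. This is a hypothetical illustration only: `clampStart` and `at` are made-up names, and using the Pod's creation time as the outside resource is just the idea floated here, not something this PR implements.

```go
package main

import (
	"fmt"
	"time"
)

// at returns a fixed reference instant offset by sec seconds (hypothetical helper).
func at(sec int) time.Time {
	base := time.Date(2020, time.October, 26, 12, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(sec) * time.Second)
}

// clampStart applies two bounds to a run's start time:
//  1. it is raised to the run's own creation time if it was recorded earlier
//     (the existing clamp against unsynchronized node clocks), and
//  2. it is lowered to the earliest child creation time, if any child (for
//     example a Pod, as an assumption) was created before it.
func clampStart(start, created time.Time, childCreated ...time.Time) time.Time {
	clamped := start
	if clamped.Before(created) {
		clamped = created
	}
	for _, c := range childCreated {
		if c.Before(clamped) {
			clamped = c
		}
	}
	return clamped
}

func main() {
	// Start recorded before the run's own creation: raised to creation time.
	fmt.Println(clampStart(at(-3), at(0)).Equal(at(0)))

	// Start recorded after the child was created: lowered to the child's creation time.
	fmt.Println(clampStart(at(10), at(0), at(4)).Equal(at(4)))

	// Start already between both bounds: left unchanged.
	fmt.Println(clampStart(at(5), at(0), at(8)).Equal(at(5)))
}
```

The sketch applies the child bound last, so a child that appears to predate its parent because of clock skew would win; whether that ordering is desirable is an open design question.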

@pritidesai

@pritidesai that same logic is in PipelineRun (the initialization and clamping to its own creation timestamp), but the question is what outside resources we'd want to clamp the StartTime on. Maybe the Pod?

yup I noticed pipelineRun having the same logic 😜 yup podCreation time sounds right.

@pritidesai pritidesai left a comment

thanks @mattmoor

I was thinking of moving the pr.Status updates into a separate function for readability, since many different status fields are updated here. I will create a refactoring PR.

@tekton-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pritidesai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved label Oct 26, 2020
@vdemeester vdemeester left a comment

/lgtm

@tekton-robot tekton-robot added the lgtm label Oct 27, 2020
@tekton-robot tekton-robot merged commit 64dc043 into tektoncd:master Oct 27, 2020
@mattmoor mattmoor deleted the adjust-start-time branch October 27, 2020 13:38
@pritidesai

@pritidesai that same logic is in PipelineRun (the initialization and clamping to its own creation timestamp), but the question is what outside resources we'd want to clamp the StartTime on. Maybe the Pod?

yup I noticed pipelineRun having the same logic 😜 yup podCreation time sounds right.

Rather, step start time 🤔

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestPipelineRunTimeout is flaky
5 participants