Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate the PipelineRun controller to genreconciler. #2760

Merged
merged 3 commits into from
Jun 8, 2020

Conversation

mattmoor
Copy link
Member

@mattmoor mattmoor commented Jun 5, 2020

This change migrates the PipelineRun controller to use // +genreconciler,
which codifies many of the best practices we have adopted across Knative.

There are two key pieces to this change:

  1. Change the RunningPipelineRun (and the TaskRun equiv) metric reporting to
    happen on a simple 30s period (see fixed issue below).

  2. Change the way we update labels back to an Update (vs. Patch).

The bulk of the diff (that's not generated) is due to 2.. I had introduced the
Patch in a previous commit in order to avoid optimistic concurrency failures
and to avoid needing to update the lister cache for it to properly reflect the
preceding status update. With the genreconciler transition the status update now
happens second, so I had to bite the bullet.

Now this was not as simple as just updating the informer cache due to two gaps in
the fake client:

  1. Status sub-resource updates modify the entire resource (mismatches K8s).
  2. Fake clients don't update or check ResourceVersion for optimistic concurrency.

The other major difference here is that genreconciler uses its original copy of
the resource when it updates status, so the intervening label update isn't even
refetched from the informer cache before the update is attempted. However, because
of the gaps above the update was allowed to happen despite the fast that it should
have failed optimistic concurrency, and should only have updated status.

To address this, I actually taught the fakes to simulate what will happen in reality:

  1. The label update will go through, and bump the resource version,
  2. The status update will attempt to happen with the stale resource version and FAIL, but
  3. The genreconciler's built-in retry logic will notice the conflict, refetch the resource
    thru the client, and retry the status update (this time succeeding).

This is done through new reactors in test/controller.go, and should now happen
for all of the resources in the Pipeline client (when setup through the seed function).

Fixes: #2729

Submitter Checklist

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

PipelineRun controller now uses genreconciler

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 5, 2020
@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

@mattmoor
Copy link
Member Author

mattmoor commented Jun 5, 2020

/kind feature

@tekton-robot tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 5, 2020
@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/controller.go 100.0% 94.1% -5.9
pkg/reconciler/pipelinerun/pipelinerun.go 80.2% 80.3% 0.1
pkg/reconciler/taskrun/controller.go 96.2% 90.6% -5.5
pkg/reconciler/taskrun/taskrun.go 76.0% 75.4% -0.6

@mattmoor
Copy link
Member Author

mattmoor commented Jun 5, 2020

It is notable that the ResourceVersionReactor here is based on voodoo that @dprotaso used to make a GenerateNameReactor in Knative. So credit (or blame 😉 ) where it's due! 🙏

cc @afrittoli @n3wscott @vdemeester @imjasonh


// Code generated by injection-gen. DO NOT EDIT.

package pipelinerun
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our controller.go is not in an API specific folder, but it uses this genreconciler which is API specific.
We only store one version of the API, so I suppose this simply means that in the controller will use the reconciler for the API version that is stored?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So controllers use the version of the "primary resource" that they set up informers for, which is orthogonal from the stored version. This is cheapest when it matches the storage version because the informer will go through the conversion webhooks.

We used to (briefly) version our controller directories, but moved away from this when we realized that we'd only have a controller for the single canonical version and never multiple versions of the controller for a particular resource.

This is versioned because you could generate it for each version of a resource, and have multiple controllers[*] of a resource that consumer different versions of the generated code (likely during a version transition).

[*] - We do this in a few cases: extensibility (class-based reconciliation for plug-points), background processes that have been split off the main reconciliation loop (only one writes status).

Hopefully that clarifies, but if not I'd be happy to try to explain further.

Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this!
A have a few minor comments, but nothing blocking.
Testing to the test controller might be good, but that can be a different PR.

In future we may want to be able to use further functionality from the gencontroller, like returning events and possibly finalizers, but it's great to see that this can fit our controllers already.

Since this is a rather significant change, I was wondering if we should leave this change for v0.14, so that we have some time to play with it before introducing it in a major release?

pkg/reconciler/pipelinerun/pipelinerun.go Show resolved Hide resolved
pkg/reconciler/taskrun/controller.go Outdated Show resolved Hide resolved
test/controller.go Show resolved Hide resolved
@mattmoor
Copy link
Member Author

mattmoor commented Jun 5, 2020

@afrittoli Yeah, I don't want to land this right before a release, which I assumed was imminent.

/hold

For good measure 😅

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/controller.go 100.0% 94.3% -5.7
pkg/reconciler/pipelinerun/metrics.go 86.8% 84.7% -2.0
pkg/reconciler/pipelinerun/pipelinerun.go 80.2% 80.3% 0.1
pkg/reconciler/taskrun/controller.go 96.2% 96.3% 0.1
pkg/reconciler/taskrun/metrics.go 86.7% 85.4% -1.2
pkg/reconciler/taskrun/taskrun.go 76.0% 75.4% -0.6

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/metrics.go 86.8% 84.7% -2.0
pkg/reconciler/pipelinerun/pipelinerun.go 80.2% 80.3% 0.1
pkg/reconciler/taskrun/controller.go 96.2% 96.3% 0.1
pkg/reconciler/taskrun/metrics.go 86.7% 85.4% -1.2
pkg/reconciler/taskrun/taskrun.go 76.0% 75.4% -0.6

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/metrics.go 86.8% 84.7% -2.0
pkg/reconciler/pipelinerun/pipelinerun.go 80.2% 80.3% 0.1
pkg/reconciler/taskrun/controller.go 96.2% 96.3% 0.1
pkg/reconciler/taskrun/metrics.go 86.7% 85.4% -1.2
pkg/reconciler/taskrun/taskrun.go 76.0% 75.4% -0.6

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/metrics.go 86.8% 84.7% -2.0
pkg/reconciler/pipelinerun/pipelinerun.go 80.2% 80.3% 0.1
pkg/reconciler/taskrun/controller.go 96.2% 96.3% 0.1
pkg/reconciler/taskrun/metrics.go 86.7% 85.4% -1.2
pkg/reconciler/taskrun/taskrun.go 76.0% 75.4% -0.6
test/controller.go 0.0% 28.9% 28.9

@mattmoor
Copy link
Member Author

mattmoor commented Jun 5, 2020

/hold cancel

0.13 is cut according to Slack :D

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2020
@vdemeester
Copy link
Member

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 8, 2020
@vdemeester
Copy link
Member

(It depends on #2762 right?)

@mattmoor
Copy link
Member Author

mattmoor commented Jun 8, 2020

Yes, rebasing these now.

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 8, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/metrics.go 86.8% 84.7% -2.0
pkg/reconciler/pipelinerun/pipelinerun.go 81.5% 81.7% 0.2
pkg/reconciler/taskrun/controller.go 96.4% 96.6% 0.1
pkg/reconciler/taskrun/metrics.go 86.7% 85.4% -1.2
pkg/reconciler/taskrun/taskrun.go 76.1% 75.5% -0.6
test/controller.go 0.0% 27.3% 27.3

@mattmoor
Copy link
Member Author

mattmoor commented Jun 8, 2020

/hold

Until I can rebase on merged #2762

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 8, 2020
This change migrates the PipelineRun controller to use `// +genreconciler`,
which codifies many of the best practices we have adopted across Knative.

There are two key pieces to this change:

1. Change the RunningPipelineRun (and the TaskRun equiv) metric reporting to
   happen on a simple 30s period (see fixed issue below).

2. Change the way we update labels *back* to an Update (vs. Patch).

The bulk of the diff (that's not generated) is due to `2.`.  I had introduced the
Patch in a previous commit in order to avoid optimistic concurrency failures
and to avoid needing to update the lister cache for it to properly reflect the
preceding status update.  With the genreconciler transition the status update now
happens second, so I had to bite the bullet.

Now this was not as simple as just updating the informer cache due to two gaps in
the fake client:
1. Status sub-resource updates modify the entire resource (mismatches K8s).
2. Fake clients don't update or check ResourceVersion for optimistic concurrency.

The other major difference here is that genreconciler uses its original copy of
the resource when it updates status, so the intervening label update isn't even
refetched from the informer cache before the update is attempted.  However, because
of the gaps above the update was allowed to happen despite the fast that it should
have failed optimistic concurrency, and should only have updated status.

To address this, I actually taught the fakes to simulate what will happen in reality:
1. The label update will go through, and bump the resource version,
2. The status update will attempt to happen with the stale resource version and FAIL, but
3. The genreconciler's built-in retry logic will notice the conflict, refetch the resource
   thru the client, and retry the status update (this time succeeding).

This is done through new reactors in `test/controller.go`, and should now happen
for all of the resources in the Pipeline client (when setup through the seed function).

Fixes: tektoncd#2729
@mattmoor
Copy link
Member Author

mattmoor commented Jun 8, 2020

/hold cancel

This is rebased and RFAL

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 8, 2020
Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 8, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 81.1% 81.7% 0.6
test/controller.go 0.0% 27.3% 27.3

Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
/approve

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 8, 2020
@tekton-robot tekton-robot merged commit a2a9bc5 into tektoncd:master Jun 8, 2020
@mattmoor mattmoor deleted the genreconciler branch June 8, 2020 15:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RunningPipelineRun metric is racy
4 participants