The pipeline resource is fetched via client vs. lister in resolver #2740
Oh, nice catch, thank you!
Reconciler now uses informer cache to get pipelines. This is to have no traffic to the API server if no changes are needed because of the reconciliation, e.g. reconciliation on completed pipelineruns. Fixes: tektoncd#2740 Signed-off-by: Arash Deshmeh <adeshmeh@ca.ibm.com>
Reconciler now uses lister instead of client to get pipelines. This is to have no traffic to the API server if no changes are needed because of the reconciliation, e.g. reconciliation on completed pipelineruns. Fixes: tektoncd#2740 Signed-off-by: Arash Deshmeh <adeshmeh@ca.ibm.com>
Is there some kind of unit test we could write to try to detect and prevent usage of the client to fetch pipelines?
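A rough sketch of what such a guard test could look like, assuming Tekton's generated fake clientset (`pkg/client/clientset/versioned/fake`); the reconciler wiring is only described in a comment and the test name is hypothetical:

```go
package pipelinerun_test

import (
	"errors"
	"testing"

	"k8s.io/apimachinery/pkg/runtime"
	k8stesting "k8s.io/client-go/testing"

	fakeclientset "github.com/tektoncd/pipeline/pkg/client/clientset/versioned/fake"
)

// TestReconcileDoesNotGetPipelinesFromAPIServer makes any client-side "get"
// on pipelines fail loudly, so a reconciler that bypasses the lister breaks
// the test immediately.
func TestReconcileDoesNotGetPipelinesFromAPIServer(t *testing.T) {
	clients := fakeclientset.NewSimpleClientset()

	// Any "get" on pipelines through the client is treated as a bug.
	clients.PrependReactor("get", "pipelines",
		func(action k8stesting.Action) (bool, runtime.Object, error) {
			return true, nil, errors.New("pipelines must be read from the lister, not the client")
		})

	// Hypothetical wiring: build the reconciler with `clients` plus listers
	// seeded with the Pipeline, reconcile a completed PipelineRun, and
	// assert no error was returned.
}
```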
I'm under the impression that we are deeply suspicious that relying on the informer instead of doing a get could result in errors due to stale data. I'd love to get to the bottom of this and learn that we actually can rely on the listers and informers. In this PR that was closed you can see us debating and not really knowing: #1825
I don't think we really ever got to the bottom of whether or not relying on listers could cause us problems, but it does seem like it conceivably could. With PipelineRuns and TaskRuns, if the resources they depend on do not exist, we want them to fail, unlike other k8s types which can tolerate those types not existing and remain pending until they exist. If we rely on caches to retrieve resources needed by PipelineRuns and TaskRuns, then you could imagine if we create a Pipeline and a PipelineRun that references it simultaneously, the Pipeline might not be in the cache and the PipelineRun could fail. Or at least conceptually this seems like it makes sense!! Extremely happy to learn this is not the case :D
@bobcatfish You make a very good point, thank you! Since we have no mechanism to cache fetched resources, today we fetch them on every reconcile cycle - which is quite expensive (see #2690). I think there are a couple of possible solutions to this issue: … Option (b) feels more natural to me, but it might degrade the start time when Task and TaskRun are applied together from the same file. For instance (@skaegi knows better here), at IBM Cloud we use ephemeral namespaces to run pipelineruns, and all resources are provisioned there right before they are executed.
I think we were also using this because we were using
Yep, (b) also feels more natural to me.
Yes, you can look for
Yes, using the informer cache can result in stale reads. Yes, it may not yet have observed a resource when reconciling another resource created in the same ….

I think a worse case is when it does exist, but a mutation may or may not be picked up, e.g.

```yaml
kind: Task
... # change something
---
kind: TaskRun
...
```

With Get, the above works reliably (the TaskRun must be processed after the Task is written to etcd), but what about this:

```yaml
kind: TaskRun
...
---
kind: Task
... # change something
```

Even without the use of an informer cache, the above will sometimes get the new Task, and sometimes get the old Task (or no Task). "But nobody would write that!" Well, as-is... maybe. However, the above is the moral equivalent of:

```yaml
kind: Task
...
kind: TaskRun
... # use kaniko
```

When used with …

I'm of two minds on this:
(brace for huge digression) The way we solve this in Knative is the Configuration / Revision duality. Configurations float forward over time, and stamp out Revisions: immutable snapshots of that Configuration (I like the parallel to Git branches and commits). In our Route resource, we let users point to the Configuration (YOLO, …). I think the analog in Tekton would be … (different terminology used for clarity, not trying to paint a shed):
Then in a … A nuance here is: how do I reference a revision when it hasn't been created? Without that I can't update a …
Not quite! The point with Patch was that it isn't a read-modify-write, so you avoid the stale read. The Patch also doesn't include the …
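To make the Patch point concrete, here is a minimal sketch: a merge patch sends only the fields being changed, so there is no prior read that could be stale. The helper is hypothetical, and it assumes a Tekton clientset generated against a client-go version whose Patch takes a context:

```go
package pipelinerun

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"

	clientset "github.com/tektoncd/pipeline/pkg/client/clientset/versioned"
)

// markExample is a hypothetical helper: it applies a label with a merge
// patch, so there is no read-modify-write and nothing else on the object is
// overwritten; only the fields named in the patch travel to the server.
func markExample(ctx context.Context, c clientset.Interface, ns, name string) error {
	patch := []byte(`{"metadata":{"labels":{"example.tekton.dev/reconciled":"true"}}}`)
	_, err := c.TektonV1beta1().PipelineRuns(ns).Patch(
		ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```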
@afrittoli @vdemeester We raced a little bit (Oh the irony 🤣), but see my longer response above. My bigger concern here is NOT "wasn't created yet", but updates that aren't reflected yet. It's right before my huge digression 😅
This would beg the question of how many times we retry before failing: in the current design, as soon as this fails, the PipelineRun fails. If you start retrying, then when users specify the name of a resource that doesn't exist, the Run will wait indefinitely. That sounds okay since it's how a lot of other Kubernetes resources work BUT I would say there is one big downside: if the resource springs into existence later, the Run will suddenly execute. Imagine I create a PipelineRun "foo-1" for Pipeline "foo" but do not create "foo". Two days later I create "foo" - do I want "foo-1" to run at that point? Probably not.
That's a good point. I'm not sure we need to optimize for the case where Tasks are mutated, especially as we are moving toward versioned Tasks, so I would avoid adding an entire new object to handle this case. That actually brings up a good point though: maybe this concern is less important given we are moving to a model where we will likely prefer to reference Tasks and Pipelines in an OCI registry? #1839
Can we put some numbers to this? I want to avoid optimizing until we know what the current state is and whether it's worth the cost.
I think it would be fair to retry a fixed number of times, one or two, whatever is needed to ensure the cache is not stale. I don't think we need to support the case of …
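Something in the spirit of that bounded retry can also be expressed as a cache read with a single fallback to the API server. This is only a sketch under that assumption; the helper name is hypothetical, and it uses Tekton's generated lister and clientset:

```go
package resources

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	clientset "github.com/tektoncd/pipeline/pkg/client/clientset/versioned"
	listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1beta1"
)

// getPipelineWithFallback reads from the informer cache first, and only on a
// cache miss falls back to one direct Get against the API server. Steady-state
// reconciles of completed runs resolve from the cache and cost no API traffic.
func getPipelineWithFallback(ctx context.Context, l listers.PipelineLister, c clientset.Interface, ns, name string) (*v1beta1.Pipeline, error) {
	p, err := l.Pipelines(ns).Get(name)
	if err == nil {
		return p, nil
	}
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
	// The cache may simply not have observed a just-created Pipeline yet.
	return c.TektonV1beta1().Pipelines(ns).Get(ctx, name, metav1.GetOptions{})
}
```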
Heh, nice. In fact, one issue we have today is to avoid trying to reconcile a mutating Task/Pipeline. Ideally the reconcile should work on the same version of a Task/Pipeline/Condition/PipelineResource throughout the lifecycle of the matching Run resource.
I'm not sure how far along we've gotten with this but afaik we're now storing the Pipeline/Task definition in the status of the Runs, so it would make sense to operate from that in the future and only retrieve the Pipeline/Task on the first reconcile. Which actually might do a lot to address the efficiency concern as well?
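A sketch of that "resolve once, then read from status" idea, assuming the PipelineRun status carries a PipelineSpec field as described above; the helper and its getPipeline parameter are hypothetical:

```go
package resources

import (
	"context"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// resolvePipelineSpec reuses the spec recorded in the run's status when
// present, and only resolves (and records) it once, so later resyncs never
// fetch the Pipeline again. Assumes pr.Spec.PipelineRef is set, i.e. the run
// does not use an embedded spec.
func resolvePipelineSpec(ctx context.Context, pr *v1beta1.PipelineRun,
	getPipeline func(context.Context, string) (*v1beta1.Pipeline, error)) (*v1beta1.PipelineSpec, error) {
	if pr.Status.PipelineSpec != nil {
		return pr.Status.PipelineSpec, nil
	}
	p, err := getPipeline(ctx, pr.Spec.PipelineRef.Name)
	if err != nil {
		return nil, err
	}
	pr.Status.PipelineSpec = &p.Spec
	return pr.Status.PipelineSpec, nil
}
```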
We'll have to:
I'd like to be certain that this extra complication is worth saving the expense of using Get vs lister, esp. if we can reduce the number of Get calls required by only Getting on the first reconcile.
Yeah, that would also fix the issue of a Task/Pipeline changing during reconcile.
@afrittoli @bobcatfish @mattmoor so, what is the status of this issue/bug? 🙃
If it is a one-off, it can be fine. However, in principle making any API calls on steady-state reconciliations is actually a huge problem for scaling controllers, which is the main point of the goal around always using the lister. Generally: if you can do a global resync without any API calls, then 👍; if not, then you have problems 😬
Expected Behavior
Generally best practice dictates that "gets" in the Reconcile loop are done from the informer cache so that when the reconciliation is a nop (e.g. a resync on a done task) there is zero traffic to the API server.
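To illustrate the difference (illustrative only, not the resolver's actual code; function names are made up):

```go
package resources

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	clientset "github.com/tektoncd/pipeline/pkg/client/clientset/versioned"
	listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1beta1"
)

// getViaClient hits the API server on every call, even when the reconcile is
// a nop such as a resync of an already-completed PipelineRun.
func getViaClient(ctx context.Context, c clientset.Interface, ns, name string) (*v1beta1.Pipeline, error) {
	return c.TektonV1beta1().Pipelines(ns).Get(ctx, name, metav1.GetOptions{})
}

// getViaLister answers from the shared informer cache, so a steady-state
// resync costs zero API-server traffic.
func getViaLister(l listers.PipelineLister, ns, name string) (*v1beta1.Pipeline, error) {
	return l.Pipelines(ns).Get(name)
}
```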
Actual Behavior
I found the following in the pipeline resolver:
This explicit read through the client happens on every pipeline run reconciliation, even for done pipeline runs, which should be as lightweight as possible.
This logic is in
github.com/tektoncd/pipeline/pkg/reconciler/pipelinerun/resources/pipelineref.go
I found it by adding the following check when playing with a test for idempotency:
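The original snippet is not reproduced above; what follows is only a rough approximation of that kind of idempotency check, assuming a fake Tekton clientset whose recorded actions can be inspected (the setup helper is hypothetical):

```go
// TestCompletedRunReconcileIsIdempotent: after reconciling an already-completed
// PipelineRun, the fake clientset should have recorded no API requests at all.
func TestCompletedRunReconcileIsIdempotent(t *testing.T) {
	// setupCompletedRun is a hypothetical helper that wires a reconciler to a
	// fake Tekton clientset and listers seeded with a completed PipelineRun
	// and its Pipeline.
	ctx, clients, r := setupCompletedRun(t)

	clients.ClearActions()
	if err := r.Reconcile(ctx, "ns/completed-run"); err != nil {
		t.Fatal(err)
	}
	// The fake clientset records every request that goes through the client;
	// a nop reconcile served entirely from the cache should record none.
	if actions := clients.Actions(); len(actions) != 0 {
		t.Errorf("expected no API calls on a nop reconcile, got: %v", actions)
	}
}
```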
/kind bug