RunningPipelineRun metric is racy #2729
Comments
/kind feature
I'd probably say …
@mattmoor in my defense, I was on PTO, my brain is not completely wired back correctly it seems 😹
@vdemeester: Those labels are not set on the issue.
I am seeing the informer's copy being modified in both the PipelineRun and TaskRun reconcilers.
This modifies a resource fetched from a lister (informer cache): [embedded snippet not captured]

You can watch things explode by adding a line like the one at Line 108 in be9ac3b: [embedded snippet not captured]
In each of the `{Pipeline,Task}Run` reconcilers, the functions that update status and labels/annotations refetch the resource from the informer cache, check the field they want to update, and if an update is needed they set the field on the informer's copy and call the appropriate update method. In pseudo-code:

```go
func update(fr *FooRun) {
	newFr := lister.Get(fr.Name)
	if !reflect.DeepEqual(newFr.Field, fr.Field) {
		newFr.Field = fr.Field // This modifies the informer's copy!
		return client.Update(newFr)
	}
}
```

I have worked around this in two different ways:

1. For the status updates I added a line like `newFr = newFr.DeepCopy()` immediately above the mutation to avoid writing to the informer's copy.
2. For the label/annotation updates, I changed the `Update` call to a `Patch` that bypasses optimistic concurrency checks. This last bit is important because otherwise the update above will lead to the first reconciliation *always* failing due to `resourceVersion` skew caused by the status update. This also works around some fun interactions with the test code (see fixed issue).

There are two other notable aspects to this change:

1. Test bugs! There were a good number of places that were assuming that the object stored in the informer was altered. I changed most of these to refetch through the client.
2. D-Fence! I added some logic to some of the common test setup code to `DeepCopy()` resources before feeding them to the fake clients, to try and avoid assumptions about "same object" creeping back in.

It is also worth calling out that this change will very likely destabilize the metric that I identified [here](tektoncd#2729) as racy, which is likely masked by the mutation of the informer copies.

Fixes: tektoncd#2734
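For concreteness, here is a minimal sketch of the two workarounds, using hypothetical `FooRun`, `Lister`, and `Client` stand-ins; the real changes target the generated `{Pipeline,Task}Run` listers and typed clients:

```go
package sketch

import (
	"encoding/json"
	"reflect"

	"k8s.io/apimachinery/pkg/types"
)

// FooRun, Lister, and Client are hypothetical stand-ins for the generated
// {Pipeline,Task}Run types, listers, and typed clients.
type FooRun struct {
	Name   string
	Labels map[string]string
	Status string
}

// DeepCopy returns a fresh copy so callers never mutate a shared object.
func (fr *FooRun) DeepCopy() *FooRun {
	out := *fr
	out.Labels = make(map[string]string, len(fr.Labels))
	for k, v := range fr.Labels {
		out.Labels[k] = v
	}
	return &out
}

type Lister interface {
	Get(name string) (*FooRun, error)
}

type Client interface {
	UpdateStatus(*FooRun) (*FooRun, error)
	Patch(name string, pt types.PatchType, data []byte) (*FooRun, error)
}

// Fix 1: deep-copy before mutating, so the informer's shared copy is
// never written to.
func updateStatus(lister Lister, client Client, fr *FooRun) (*FooRun, error) {
	newFr, err := lister.Get(fr.Name)
	if err != nil {
		return nil, err
	}
	if reflect.DeepEqual(newFr.Status, fr.Status) {
		return newFr, nil // nothing to update
	}
	newFr = newFr.DeepCopy() // never mutate the informer's copy
	newFr.Status = fr.Status
	return client.UpdateStatus(newFr)
}

// Fix 2: send labels as a merge patch, which carries no resourceVersion
// and therefore bypasses optimistic concurrency checks.
func updateLabels(client Client, fr *FooRun) (*FooRun, error) {
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{"labels": fr.Labels},
	})
	if err != nil {
		return nil, err
	}
	return client.Patch(fr.Name, types.MergePatchType, patch)
}
```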
Alright, so #2736 should address the modifications of the resources in the informer cache, but I didn't touch this code because no tests blew up (yet 🤞) and I feel like fixing this race needs a deeper design conversation. Not sure how folks want to proceed, but that PR will almost certainly make this metric flakier until this issue is fixed.
One potential solution would be to replace the current logic with a simple periodic scan. You'd lose sub-period resolution, but the gauge would converge on its own. If we launch a goroutine with a periodic timer, it can recompute the metric from the lister on each tick, independent of reconciliation (see the sketch below).
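A minimal sketch of what such a periodic scan could look like, assuming `wait.Until` from k8s.io/apimachinery and the generated v1beta1 `PipelineRunLister`; the 30s period, the `IsDone` check, and the `record` callback are illustrative choices, not a committed design:

```go
package sketch

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/util/wait"

	listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1beta1"
)

// reportRunningPipelineRuns recomputes the running-PipelineRun gauge from
// the lister cache on a fixed period, independent of reconciliation.
// It blocks until stopCh is closed.
func reportRunningPipelineRuns(l listers.PipelineRunLister, record func(int), stopCh <-chan struct{}) {
	wait.Until(func() {
		prs, err := l.List(labels.Everything())
		if err != nil {
			log.Printf("failed to list PipelineRuns: %v", err)
			return
		}
		var running int
		for _, pr := range prs {
			if !pr.IsDone() { // still executing
				running++
			}
		}
		record(running)
	}, 30*time.Second, stopCh)
}
```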
This change migrates the PipelineRun controller to use `// +genreconciler`, which codifies many of the best practices we have adopted across Knative. There are two key pieces to this change:

1. Change the RunningPipelineRun (and the TaskRun equivalent) metric reporting to happen on a simple 30s period (see fixed issue below).
2. Change the way we update labels *back* to an Update (vs. Patch).

The bulk of the diff (that's not generated) is due to `2.`. I had introduced the Patch in a previous commit in order to avoid optimistic concurrency failures and to avoid needing to update the lister cache for it to properly reflect the preceding status update. With the genreconciler transition the status update now happens second, so I had to bite the bullet.

Now this was not as simple as just updating the informer cache, due to two gaps in the fake client:

1. Status sub-resource updates modify the entire resource (mismatches K8s).
2. Fake clients don't update or check ResourceVersion for optimistic concurrency.

The other major difference here is that genreconciler uses its original copy of the resource when it updates status, so the intervening label update isn't even refetched from the informer cache before the update is attempted. However, because of the gaps above, the update was allowed to happen despite the fact that it should have failed optimistic concurrency and should only have updated status.

To address this, I actually taught the fakes to simulate what will happen in reality:

1. The label update will go through and bump the resource version,
2. The status update will attempt to happen with the stale resource version and FAIL, but
3. The genreconciler's built-in retry logic will notice the conflict, refetch the resource through the client, and retry the status update (this time succeeding).

This is done through new reactors in `test/controller.go` (roughly the shape sketched below), and should now happen for all of the resources in the Pipeline client (when set up through the seed function).

Fixes: tektoncd#2729
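The reactor behavior in steps 1–3 can be approximated as follows with `PrependReactor` on a fake clientset from k8s.io/client-go/testing; this is only an illustration of the resourceVersion check-and-bump, not the actual code in `test/controller.go`:

```go
package test

import (
	"fmt"
	"strconv"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	k8stesting "k8s.io/client-go/testing"
)

// versionedFake matches any generated fake clientset, which embeds
// k8stesting.Fake and exposes its ObjectTracker.
type versionedFake interface {
	PrependReactor(verb, resource string, reaction k8stesting.ReactionFunc)
	Tracker() k8stesting.ObjectTracker
}

// AddVersionReactor teaches a fake clientset to check and bump
// metadata.resourceVersion on updates, approximating the API server's
// optimistic concurrency behavior.
func AddVersionReactor(fake versionedFake) {
	fake.PrependReactor("update", "*", func(action k8stesting.Action) (bool, runtime.Object, error) {
		ua := action.(k8stesting.UpdateAction)
		obj := ua.GetObject().(metav1.Object)
		gvr := action.GetResource()

		// Compare against the currently stored copy.
		cur, err := fake.Tracker().Get(gvr, ua.GetNamespace(), obj.GetName())
		if err != nil {
			return false, nil, nil // let the default reactor surface the error
		}
		if obj.GetResourceVersion() != cur.(metav1.Object).GetResourceVersion() {
			return true, nil, errors.NewConflict(gvr.GroupResource(), obj.GetName(),
				fmt.Errorf("resourceVersion mismatch"))
		}

		// Bump the version, then fall through to the default update reactor,
		// which stores the object in the tracker.
		rv, _ := strconv.Atoi(obj.GetResourceVersion()) // empty version parses as 0
		obj.SetResourceVersion(strconv.Itoa(rv + 1))
		return false, nil, nil
	})
}
```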
This changes the metric reporting to happen periodically vs. being triggered by reconciliation. The previous method was prone to stale data because it was driven by the informer cache immediately following writes through the client, and might not fix itself since the status may not change (even on resync every 10h). With this change it should exhibit the correct value within 30s + {informer delay}, where the 30s is configurable on the Recorders. Fixes: tektoncd#2729
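Making the period configurable on the Recorder could look roughly like this; the `Recorder` field and `Start` method names are assumptions sketched against the loop above, not the actual Tekton metrics API:

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"

	listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1beta1"
)

// Recorder carries a configurable reporting period so that tests can
// shorten the 30s default. Field and method names here are assumptions.
type Recorder struct {
	ReportingPeriod time.Duration
}

// Start runs the periodic report until stopCh is closed.
func (r *Recorder) Start(l listers.PipelineRunLister, record func(int), stopCh <-chan struct{}) {
	period := r.ReportingPeriod
	if period == 0 {
		period = 30 * time.Second // the default described above
	}
	wait.Until(func() {
		// Recompute the gauge from the lister, as in the earlier sketch.
	}, period, stopCh)
}
```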
I was looking at the PipelineRun controller, and noticed that there is a metric for the number of running PipelineRuns (pipeline/pkg/reconciler/pipelinerun/pipelinerun.go, lines 254 to 261 at baf3c72; embedded snippet not captured).
This metric is updated when one of the reconciler's trailing updates modifies the resource (status or labels/annotations), but it bases its value on the informer's lister cache. Now, informers are fast, but in all likelihood the lister being used to drive this metric hasn't yet received the updated version of the resource from the API Server, so unless the informer cache's copy is being modified in place (a no-no), the cache won't reflect what the metric (given the `if updated` check) is clearly trying to capture.

Perhaps this metric could even be completely decoupled from this reconciler? It seems like fairly separable logic that's needlessly coupled to reconciliation. 🤷
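In rough pseudo-Go, the racy shape being described looks like the following; the types are stand-ins and the exact call at the permalinked lines may differ:

```go
package sketch

import "log"

// pipelineRun and prLister are stand-ins for the real v1beta1 types.
type pipelineRun struct{ done bool }

type prLister interface {
	List() ([]*pipelineRun, error)
}

// reportIfUpdated paraphrases the permalinked shape: the gauge is only
// re-derived when this reconcile pass wrote the resource, and it is
// derived from the informer's lister cache, which almost certainly has
// not yet observed the write that set `updated` in the first place.
func reportIfUpdated(updated bool, l prLister, record func(int)) {
	if !updated {
		return
	}
	go func() {
		prs, err := l.List()
		if err != nil {
			log.Printf("failed to list PipelineRuns: %v", err)
			return
		}
		var running int
		for _, pr := range prs {
			if !pr.done {
				running++
			}
		}
		// Stale unless someone mutated the cache's copy in place (a no-no).
		record(running)
	}()
}
```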