RunningPipelineRun metric is racy #2729
Comments
/kind feature
I'd probably say …
@mattmoor in my defense, I was on PTO, my brain is not completely wired back correctly it seems 😹
@vdemeester: Those labels are not set on the issue.
I am seeing the informer's copy being modified in both the PipelineRun and TaskRun reconcilers.
This modifies a resource fetched from a lister (informer cache): [embedded snippet not captured]

You can watch things explode by adding a line like the one at Line 108 in be9ac3b: [embedded snippet not captured]
In each of the `{Pipeline,Task}Run` reconcilers, the functions that update status and labels/annotations refetch the resource from the informer cache, check the field they want to update, and if an update is needed they set the field on the informer's copy and call the appropriate update method. In pseudo-code:

```go
func update(fr *FooRun) {
	newFr := lister.Get(fr.Name)
	if !reflect.DeepEqual(newFr.Field, fr.Field) {
		newFr.Field = fr.Field // This modifies the informer's copy!
		return client.Update(newFr)
	}
}
```

I have worked around this in two different ways:

1. For the status updates I added a line like `newFr = newFr.DeepCopy()` immediately above the mutation to avoid writing to the informer's copy.
2. For the label/annotation updates, I changed the `Update` call to a `Patch` that bypasses optimistic concurrency checks. This last bit is important because otherwise the update above will lead to the first reconciliation *always* failing due to `resourceVersion` skew caused by the status update. This also works around some fun interactions with the test code (see fixed issue).

There are two other notable aspects to this change:

1. Test bugs! There were a good number of places that were assuming that the object stored in the informer was altered. I changed most of these to refetch through the client.
2. D-Fence! I added some logic to some of the common test setup code to `DeepCopy()` resources before feeding them to the fake clients, to try and avoid assumptions about "same object" creeping back in.

It is also worth calling out that this change will very likely destabilize the metric that I identified [here](tektoncd#2729) as racy, which is likely masked by the mutation of the informer copies.

Fixes: tektoncd#2734
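For concreteness, here is a minimal sketch of the two workarounds, using hypothetical `FooRun`, `Lister`, and `Client` stand-ins; the real changes target the generated `{Pipeline,Task}Run` listers and typed clients:

```go
package sketch

import (
	"encoding/json"
	"reflect"

	"k8s.io/apimachinery/pkg/types"
)

// FooRun, Lister, and Client are hypothetical stand-ins for the generated
// {Pipeline,Task}Run types, listers, and typed clients.
type FooRun struct {
	Name   string
	Labels map[string]string
	Status string
}

// DeepCopy returns a fresh copy so callers never mutate a shared object.
func (fr *FooRun) DeepCopy() *FooRun {
	out := *fr
	out.Labels = make(map[string]string, len(fr.Labels))
	for k, v := range fr.Labels {
		out.Labels[k] = v
	}
	return &out
}

type Lister interface {
	Get(name string) (*FooRun, error)
}

type Client interface {
	UpdateStatus(*FooRun) (*FooRun, error)
	Patch(name string, pt types.PatchType, data []byte) (*FooRun, error)
}

// Fix 1: deep-copy before mutating, so the informer's shared copy is
// never written to.
func updateStatus(lister Lister, client Client, fr *FooRun) (*FooRun, error) {
	newFr, err := lister.Get(fr.Name)
	if err != nil {
		return nil, err
	}
	if reflect.DeepEqual(newFr.Status, fr.Status) {
		return newFr, nil // nothing to update
	}
	newFr = newFr.DeepCopy() // never mutate the informer's copy
	newFr.Status = fr.Status
	return client.UpdateStatus(newFr)
}

// Fix 2: send labels as a merge patch, which carries no resourceVersion
// and therefore bypasses optimistic concurrency checks.
func updateLabels(client Client, fr *FooRun) (*FooRun, error) {
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{"labels": fr.Labels},
	})
	if err != nil {
		return nil, err
	}
	return client.Patch(fr.Name, types.MergePatchType, patch)
}
```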
Alright, so #2736 should address the modifications of the resources in the informer cache, but I didn't touch this code because no tests blew up (yet 🤞) and I feel like fixing this race needs a deeper design conversation. Not sure how folks want to proceed, but that PR will almost certainly make this metric flakier until this issue is fixed.
One potential solution would be to replace the current logic with a simple periodic scan. You'd lose sub-period resolution, but the gauge would converge on its own. If we launch a goroutine with a periodic timer, it can recompute the metric from the lister on each tick, independent of reconciliation (see the sketch below).
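A minimal sketch of what such a periodic scan could look like, assuming `wait.Until` from k8s.io/apimachinery and the generated v1beta1 `PipelineRunLister`; the 30s period, the `IsDone` check, and the `record` callback are illustrative choices, not a committed design:

```go
package sketch

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/util/wait"

	listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1beta1"
)

// reportRunningPipelineRuns recomputes the running-PipelineRun gauge from
// the lister cache on a fixed period, independent of reconciliation.
// It blocks until stopCh is closed.
func reportRunningPipelineRuns(l listers.PipelineRunLister, record func(int), stopCh <-chan struct{}) {
	wait.Until(func() {
		prs, err := l.List(labels.Everything())
		if err != nil {
			log.Printf("failed to list PipelineRuns: %v", err)
			return
		}
		var running int
		for _, pr := range prs {
			if !pr.IsDone() { // still executing
				running++
			}
		}
		record(running)
	}, 30*time.Second, stopCh)
}
```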
This change migrates the PipelineRun controller to use `// +genreconciler`, which codifies many of the best practices we have adopted across Knative. There are two key pieces to this change:

1. Change the RunningPipelineRun (and the TaskRun equivalent) metric reporting to happen on a simple 30s period (see fixed issue below).
2. Change the way we update labels *back* to an Update (vs. Patch).

The bulk of the diff (that's not generated) is due to `2.`. I had introduced the Patch in a previous commit in order to avoid optimistic concurrency failures and to avoid needing to update the lister cache for it to properly reflect the preceding status update. With the genreconciler transition the status update now happens second, so I had to bite the bullet.

Now this was not as simple as just updating the informer cache, due to two gaps in the fake client:

1. Status sub-resource updates modify the entire resource (mismatches K8s).
2. Fake clients don't update or check ResourceVersion for optimistic concurrency.

The other major difference here is that genreconciler uses its original copy of the resource when it updates status, so the intervening label update isn't even refetched from the informer cache before the update is attempted. However, because of the gaps above, the update was allowed to happen despite the fact that it should have failed optimistic concurrency and should only have updated status.

To address this, I actually taught the fakes to simulate what will happen in reality:

1. The label update will go through and bump the resource version,
2. The status update will attempt to happen with the stale resource version and FAIL, but
3. The genreconciler's built-in retry logic will notice the conflict, refetch the resource through the client, and retry the status update (this time succeeding).

This is done through new reactors in `test/controller.go` (roughly the shape sketched below), and should now happen for all of the resources in the Pipeline client (when set up through the seed function).

Fixes: tektoncd#2729
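The reactor behavior in steps 1–3 can be approximated as follows with `PrependReactor` on a fake clientset from k8s.io/client-go/testing; this is only an illustration of the resourceVersion check-and-bump, not the actual code in `test/controller.go`:

```go
package test

import (
	"fmt"
	"strconv"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	k8stesting "k8s.io/client-go/testing"
)

// versionedFake matches any generated fake clientset, which embeds
// k8stesting.Fake and exposes its ObjectTracker.
type versionedFake interface {
	PrependReactor(verb, resource string, reaction k8stesting.ReactionFunc)
	Tracker() k8stesting.ObjectTracker
}

// AddVersionReactor teaches a fake clientset to check and bump
// metadata.resourceVersion on updates, approximating the API server's
// optimistic concurrency behavior.
func AddVersionReactor(fake versionedFake) {
	fake.PrependReactor("update", "*", func(action k8stesting.Action) (bool, runtime.Object, error) {
		ua := action.(k8stesting.UpdateAction)
		obj := ua.GetObject().(metav1.Object)
		gvr := action.GetResource()

		// Compare against the currently stored copy.
		cur, err := fake.Tracker().Get(gvr, ua.GetNamespace(), obj.GetName())
		if err != nil {
			return false, nil, nil // let the default reactor surface the error
		}
		if obj.GetResourceVersion() != cur.(metav1.Object).GetResourceVersion() {
			return true, nil, errors.NewConflict(gvr.GroupResource(), obj.GetName(),
				fmt.Errorf("resourceVersion mismatch"))
		}

		// Bump the version, then fall through to the default update reactor,
		// which stores the object in the tracker.
		rv, _ := strconv.Atoi(obj.GetResourceVersion()) // empty version parses as 0
		obj.SetResourceVersion(strconv.Itoa(rv + 1))
		return false, nil, nil
	})
}
```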
This changes the metric reporting to happen periodically vs. being triggered by reconciliation. The previous method was prone to stale data because it was driven by the informer cache immediately following writes through the client, and might not fix itself since the status may not change (even on resync every 10h). With this change it should exhibit the correct value within 30s + {informer delay}, where the 30s is configurable on the Recorders. Fixes: tektoncd#2729
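Making the period configurable on the Recorder could look roughly like this; the `Recorder` field and `Start` method names are assumptions sketched against the loop above, not the actual Tekton metrics API:

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"

	listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1beta1"
)

// Recorder carries a configurable reporting period so that tests can
// shorten the 30s default. Field and method names here are assumptions.
type Recorder struct {
	ReportingPeriod time.Duration
}

// Start runs the periodic report until stopCh is closed.
func (r *Recorder) Start(l listers.PipelineRunLister, record func(int), stopCh <-chan struct{}) {
	period := r.ReportingPeriod
	if period == 0 {
		period = 30 * time.Second // the default described above
	}
	wait.Until(func() {
		// Recompute the gauge from the lister, as in the earlier sketch.
	}, period, stopCh)
}
```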
I was looking at the PipelineRun controller, and noticed that there is a metric for the number of running PipelineRuns (pipeline/pkg/reconciler/pipelinerun/pipelinerun.go, lines 254 to 261 at baf3c72; embedded snippet not captured).
This metric is updated when one of the reconciler's trailing updates modifies the resource (status or labels/annotations), but it bases its value on the informer's lister cache. Now, informers are fast, but in all likelihood the lister being used to drive this metric hasn't yet received the updated version of the resource from the API Server, so unless the informer cache's copy is being modified in place (a no-no), the cache won't reflect what the metric (given the `if updated` check) is clearly trying to capture.

Perhaps this metric could even be completely decoupled from this reconciler? It seems like fairly separable logic that's needlessly coupled to reconciliation. 🤷
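In rough pseudo-Go, the racy shape being described looks like the following; the types are stand-ins and the exact call at the permalinked lines may differ:

```go
package sketch

import "log"

// pipelineRun and prLister are stand-ins for the real v1beta1 types.
type pipelineRun struct{ done bool }

type prLister interface {
	List() ([]*pipelineRun, error)
}

// reportIfUpdated paraphrases the permalinked shape: the gauge is only
// re-derived when this reconcile pass wrote the resource, and it is
// derived from the informer's lister cache, which almost certainly has
// not yet observed the write that set `updated` in the first place.
func reportIfUpdated(updated bool, l prLister, record func(int)) {
	if !updated {
		return
	}
	go func() {
		prs, err := l.List()
		if err != nil {
			log.Printf("failed to list PipelineRuns: %v", err)
			return
		}
		var running int
		for _, pr := range prs {
			if !pr.done {
				running++
			}
		}
		// Stale unless someone mutated the cache's copy in place (a no-no).
		record(running)
	}()
}
```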