🐛 clustercatalog_controller: hacky, but more correct status updating in reconciler #424
Conversation
Codecov Report

@@ Coverage Diff @@
##             main     #424      +/-   ##
==========================================
+ Coverage   35.28%   38.55%   +3.26%
==========================================
  Files          14       14
  Lines         802      843      +41
==========================================
+ Hits          283      325      +42
- Misses        472      474       +2
+ Partials       47       44       -3
	Unpacker   source.Unpacker
	Storage    storage.Instance

	finalizers crfinalizer.Finalizers
I needed to refactor the finalizers so that they have access to the reconciler struct. That way it is possible to call delete(r.storedCatalogs, catalog.Name) in the finalizer logic.
	case r.needsPoll(storedCatalog.unpackResult.ResolvedSource.Image.LastSuccessfulPollAttempt.Time, catalog):
		l.Info("unpack required: poll duration has elapsed")
		needsUnpack = true
	}
I intentionally added lots of logging here to help us deduce why we might be unpacking more frequently than expected.
	clearUnknownConditions(expectedStatus)
	if hasStoredCatalog && r.Storage.ContentExists(catalog.Name) {
		updateStatusServing(expectedStatus, storedCatalog.unpackResult, r.Storage.ContentURL(catalog.Name), storedCatalog.observedGeneration)
		updateStatusProgressing(expectedStatus, storedCatalog.observedGeneration, nil)
I needed to change the updateStatusProgressing signature so that I could call it with a deep-copied status and a different observed generation.
@@ -396,7 +396,7 @@ func TestImageRegistry(t *testing.T) {
			// If the digest should already exist check that we actually hit it
			if tt.digestAlreadyExists {
				assert.Contains(t, buf.String(), "image already unpacked")
-				assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime())
+				assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime().Truncate(time.Second))
I think even asserting on the second could be brittle if we happen to run right at a second boundary. How about just asserting that the duration is within a reasonable range?
-				assert.Equal(t, rs.UnpackTime, unpackDirStat.ModTime().Truncate(time.Second))
+				assert.WithinDuration(t, rs.UnpackTime, unpackDirStat.ModTime(), time.Second)
This one should be exact: rs.UnpackTime is always truncated to the second in the same way. I'll add a comment in the code about why we need that, though.
@joelanford when I point current operator-controller main at this, I get:

# github.com/operator-framework/operator-controller/internal/catalogmetadata/cache
internal/catalogmetadata/cache/cache.go:86:42: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/catalogmetadata/cache/cache.go:122:62: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/catalogmetadata/cache/cache.go:164:52: catalog.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
# github.com/operator-framework/operator-controller/internal/controllers
internal/controllers/clusterextension_controller.go:419:53: oldObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/controllers/clusterextension_controller.go:419:106: newObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
# github.com/operator-framework/operator-controller/internal/controllers [github.com/operator-framework/operator-controller/internal/controllers.test]
internal/controllers/clusterextension_controller.go:419:53: oldObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
internal/controllers/clusterextension_controller.go:419:106: newObject.Status.ResolvedSource.Image.ResolvedRef undefined (type *"github.com/operator-framework/catalogd/api/core/v1alpha1".ResolvedImageSource has no field or method ResolvedRef)
make: *** [vet] Error 1
@bentito, that's expected. See the accompanying changes in operator-framework/operator-controller#1329. No matter how you slice it, there is a breaking change from 0.29.0 to 0.30.0 (and from 0.29.0 to this PR).
Ah, okay, I understand now; I was just worried it was supposed to be fixed from the catalogd side alone.
Looks good! Left a small comment about the status comparison.
	clearUnknownConditions(expectedStatus)
	if hasStoredCatalog && r.Storage.ContentExists(catalog.Name) {
		updateStatusServing(expectedStatus, storedCatalog.unpackResult, r.Storage.ContentURL(catalog.Name), storedCatalog.observedGeneration)
		updateStatusProgressing(expectedStatus, storedCatalog.observedGeneration, nil)
Is this to avoid waiting until the next poll before reconciling in case of unpack errors? It looks like the equivalence check for the Progressing condition will always fail if we had an error on the last unpack; do we back off between reconciling errored unpacks?
Oh, this is a good question. Here's a scenario:

1. We are successfully serving content for generation 1.
2. We reconcile generation 2, which has a bad image ref, let's say the tag doesn't exist. So we:
   - Do NOT update the storedCatalog map
   - DO update the status with Progressing=True, Retrying
3. On the next reconcile (which is requeued basically immediately), we build an "expectedStatus" that says Installed=True and Progressing=False based on this code. We later compare this to the real status from the object and are told that there are differences, which forces an unpack.

Step 3 is a little odd though. Maybe if we return an error from Reconcile, we should delete(r.storedCatalogs, catalog.Name) as a better signal for "we need to attempt to unpack again"?
		updateStatusProgressing(&catalog.Status, catalog.GetGeneration(), storageErr)
		return ctrl.Result{}, storageErr
Considering this isn't a terminal error, should we requeue here?
By returning the storageErr, we get a requeue based on exponential backoff. We only want the poll-interval requeue when there are no errors during reconcile.
For a while in catalogd, we've had a very loose "needsUnpack" behavior, where we assumed that if we didn't need to unpack, then we also didn't need to update the status of the object.
This means that some objects may never have their status updated after an upgrade. That's problematic because upgrades may add new helpful fields or fix bugs in the way that status is reported.
This PR is a somewhat hacky fix for the situation. It solves the problem by keeping an in-memory map of the observedGeneration and resolved source, using that map as the source of truth to build an expected status, and then using that expected status as part of the determination of whether an unpack is required.
The new "needsUnpack" criterion is: does the expected status differ from the status that is on the object? If so, unpack again.
Fixes #422
Fixes #420