-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 wrap service account not found error #1477
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1477 +/- ##
==========================================
- Coverage 74.73% 74.63% -0.10%
==========================================
Files 42 42
Lines 3241 3244 +3
==========================================
- Hits 2422 2421 -1
- Misses 646 649 +3
- Partials 173 174 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
fixes #1297 |
While PR creates a predictable error string, the message is still pretty much the same and does not yet fix the issue:
@rashmi43 Would you please create your own error type[1] and catch this error before it is written to the Condition message to provide your own custom error message? |
@rashmi43 @thetechnick I have a feeling that prefix to the error message of operator-controller/internal/action/restconfig.go Lines 15 to 32 in e5820ae
I imagine somewhere, either in the client-go or Helm code, there is an assumption that if these round trips fail it is likely due to the cluster being "unreachable" for some reason. I'm not sure that implementing a custom error type will remove that information from the error message. Would improving the message to be more reflective of the context in which it is being returned be more helpful? Maybe we can have something like: - lastTransitionTime: "2024-11-18T14:41:05Z"
message: 'Unable to apply bundle content to cluster: Kubernetes cluster unreachable: Get "https://10.96.0.1:443/version": Unable to authenticate with Kubernetes cluster using ServiceAccount "foo": ServiceAccount "foo" not found'
observedGeneration: 1
reason: Failed
status: Unknown
type: Installed Would that make it a bit more clear that in this case the Kubernetes cluster is unreachable when using a client relying on authenticating as the provided ServiceAccount since that ServiceAccount does not exist? |
Alternatively, with the custom error type we could attempt to parse the error string/type wherever it is returned and attempt to strip out what we think is the unnecessary prefix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HI @rashmi43
Could you please rebase the commits?
Otherwise, when it gets merged, and we run git log
, it will result in a very lousy commit message.
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
@everettraven By creating a custom error into the TokenGetter we can keep the error message short and precise. To pick up your example, I would be very happy if the condition message just reads: |
I think this is reasonable. I mostly wanted to call out that I think it will take more of an effort to restrict the error string that we place into the status than simply adding a custom error type that is returned in the TokenGetter. The custom error may help us to identify when we've hit that error type, but there is most certainly additional error wrapping happening beyond the error message that we are returning from the TokenGetter. |
Update go version checker (operator-framework#1474) * Handle new files (old version is empty) * Handle the case where .0 patch is added/removed Signed-off-by: Todd Short <tshort@redhat.com> sa not found Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> 📖 Document OLMv1 permission model (operator-framework#1380) * changes to derice minimum service account Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> * remove headers Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> * add details about registry+v1 support * render yml correctly Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> * create doc for olmv1 permission model Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> * Delete docs/drafts directory * Update permission-model.md * update permission models with link Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> * add space Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> * add more structure Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> * incorporate review comments Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> * incorporate review comments Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> * pers review comments-s * example as header-s * update the example Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> --------- Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com> Delete docs/drafts/derive-serviceaccount.md add custom sa not found Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
can we get someone to review @camilamacedo86 perhaps @thetechnick can review my changes. we can hold the merge till v1 is shipped |
@@ -86,7 +95,8 @@ func (t *TokenGetter) getToken(ctx context.Context, key types.NamespacedName) (* | |||
Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: ptr.To(int64(t.expirationDuration / time.Second))}, | |||
}, metav1.CreateOptions{}) | |||
if err != nil { | |||
return nil, err | |||
saErr := &SANotFoundError{key.Name} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it guaranteed that the only kind of error that is returned here is that the ServiceAccount we are requesting a token for is not found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thetechnick is this a fair assumption? Like discussed above since we are doing a GET for a serviceaccount that would be the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left another comment here with my thoughts on this (since GitHub marked this comment as outdated): #1477 (comment)
var saerr *tokengetter.SANotFoundError | ||
if errors.As(err, &saerr) { | ||
l.Info("is a SAError") | ||
err = saerr | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are a lot of other places where this error could be returned, likely anywhere we are doing interactions with the cluster since we are using the provided ServiceAccount for those interactions. Off the top of my head, that would be:
- Getting the installed bundle (here)
- Applying contents to the cluster
- Setting up caches and informers for managed content
Is there a way we could more effectively capture this error type when setting the status conditions we are wanting a more specific message for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check with unwrap the errors and if its the same service account not found error, then thats what it will return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand how this is working in this particular error check, but I think there are other error checks in which it is possible for this error type to be returned as part of the wrapped error chain. Some off the top of my head is:
- When we attempt to fetch the currently installed bundle (which you already cover with this check here)
- When we attempt to apply content to the cluster (not covered by your changes)
- When we attempt to create informers to manage the content owned by a ClusterExtension (not covered by your changes)
The second piece of my comment was an attempt to get at the fact that instead of repeating this across all the error scenarios, and since we really only care about the message that is set in some of the status conditions when receiving this error, we could probably put this error check in the logic that is used to set the conditions we want to "customize" the errors for and reduce repetition of the same code in all our error checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will work on
- unit testcases
- encompassing more scenarios:
When we attempt to apply content to the cluster (not covered by your changes)
When we attempt to create informers to manage the content owned by a ClusterExtension (not covered by your changes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple changes requested.
Additionally, I'd love to see some unit test cases added to ensure this behavior is working as expected.
Co-authored-by: Bryce Palmer <everettraven@gmail.com>
saErr := &ServiceAccountNotFoundError{key.Name} | ||
return nil, saErr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to:
- Wrap this in a NotFound error check
- If the returned error is not a NotFound error, return the original error
This makes it so we are only returning our custom ServiceAccountNotFoundError whenever we know it is a NotFound error returned from our interaction with the Kubernetes API server. Not doing this means we assume that an error in this case always means the ServiceAccount is not found, which I think is a bad assumption to be making and will result in obfuscating any other errors that may be returned during this interaction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when i do this, it doesnt work for me:
if errors.IsNotFound(err) {
+ saErr := &ServiceAccountNotFoundError{key.Name}
+ return nil, saErr
+ } else {
+ return nil, err
+ }
@thetechnick fyi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rebuild and it seems to be working as expected:
status:
conditions:
- lastTransitionTime: "2024-11-26T16:08:21Z"
message: 'Unable to authenticate with Kubernetes cluster using ServiceAccount
"nginx-install": ServiceAccount "nginx-install" not found.'
observedGeneration: 1
reason: Failed
status: Unknown
type: Installed
- lastTransitionTime: "2024-11-26T16:08:21Z"
message: retrying to get installed bundle
observedGeneration: 1
reason: Retrying
status: "True"
type: Progressing
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Hi @rashmi43, Could you please squash the commits? Since this is a small change, it should ideally have just one commit. I’m concerned that if we leave it as is (just forget), all these commits will show up in the logs (e.g., run git log to see what I mean). A clean and meaningful git history is important because, when debugging or identifying breaking changes, it’s common to rely on the history to pinpoint the cause. If the history isn’t well-organized, it can make life much harder. 😊 Additionally, if you need to rebase and encounter a conflict, resolving it might become more challenging. You could end up dealing with the same conflict multiple times (potentially 26 times!). Squashing the commits can make the process much smoother. 😉 |
var saerr *tokengetter.ServiceAccountNotFoundError | ||
if errors.As(err, &saerr) { | ||
l.Info("is a SAError") | ||
err = saerr | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to log something here?
Also, we are checking the error and replacing it
Why do we need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are checking the type of the error and then returning our error message
require.NotNil(t, progressingCond) | ||
t.Log("Progressing condition status", progressingCond.Status) | ||
t.Log("Progressing condition reason", progressingCond.Reason) | ||
//require.Equal(t, metav1.ConditionTrue, progressingCond.Status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this line commented on?
Do we need to do this assert?
We need uncomment and fix or remove, IHMO we cannot leave commented lines
@@ -677,6 +677,97 @@ func TestClusterExtensionInstallationSucceeds(t *testing.T) { | |||
require.NoError(t, cl.DeleteAllOf(ctx, &ocv1.ClusterExtension{})) | |||
} | |||
|
|||
func TestClusterExtensionInstallationFailsWithNoServiceAccount(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rashmi43,
Without diving into the code itself, I think we should improve how t.Log is being used. The messages should describe actions like "Validating the status to ensure it is X" or "Checking the progressing condition." This makes the logs more meaningful and helps anyone reading them understand what’s happening in the test.
Also, we can make things cleaner by using defer for resource cleanup. I would like to suggest something like:
// TestClusterExtensionInstallationFailsWithNoServiceAccount
// Validates that the reconciliation process fails gracefully when the specified
// ServiceAccount does not exist. Ensures that appropriate conditions are set
// on the ClusterExtension status, and the resource is cleaned up properly after the test.
func TestClusterExtensionInstallationFailsWithNoServiceAccount(t *testing.T) {
cl, reconciler := newClientAndReconciler(t)
reconciler.Unpacker = &MockUnpacker{
result: &source.Result{
State: source.StateUnpacked,
Bundle: fstest.MapFS{},
},
}
ctx := context.Background()
extKey := types.NamespacedName{Name: fmt.Sprintf("cluster-extension-test-%s", rand.String(8))}
defer func() {
t.Log("Cleaning up all cluster extension resources")
cl.DeleteAllOf(ctx, &ocv1.ClusterExtension{})
}()
t.Log("Initializing cluster extension with a channel, version, and non-existent service account")
pkgName := "prometheus"
pkgVer := "1.0.0"
pkgChan := "beta"
namespace := fmt.Sprintf("test-ns-%s", rand.String(8))
serviceAccount := fmt.Sprintf("test-does-not-exist-%s", rand.String(8))
clusterExtension := &ocv1.ClusterExtension{
ObjectMeta: metav1.ObjectMeta{Name: extKey.Name},
Spec: ocv1.ClusterExtensionSpec{
Source: ocv1.SourceConfig{
SourceType: "Catalog",
Catalog: &ocv1.CatalogSource{
PackageName: pkgName,
Version: pkgVer,
Channels: []string{pkgChan},
},
},
Namespace: namespace,
ServiceAccount: ocv1.ServiceAccountReference{
Name: serviceAccount,
},
},
}
require.NoError(t, cl.Create(ctx, clusterExtension), "Failed to create the cluster extension resource")
t.Log("Setting up mock resolver and applier for reconciliation")
reconciler.Resolver = resolve.Func(func(_ context.Context, _ *ocv1.ClusterExtension, _ *ocv1.BundleMetadata) (*declcfg.Bundle, *bsemver.Version, *declcfg.Deprecation, error) {
v := bsemver.MustParse("1.0.0")
return &declcfg.Bundle{
Name: "prometheus.v1.0.0",
Package: "prometheus",
Image: "quay.io/operatorhubio/prometheus@fake1.0.0",
}, &v, nil, nil
})
reconciler.Applier = &MockApplier{
objs: []client.Object{},
}
reconciler.Manager = &MockManagedContentCacheManager{
cache: &MockManagedContentCache{},
}
t.Log("Running reconcile to process the cluster extension")
res, err := reconciler.Reconcile(ctx, ctrl.Request{NamespacedName: extKey})
require.Equal(t, ctrl.Result{}, res, "Unexpected reconcile result")
require.NoError(t, err, "Reconcile failed with an error")
t.Log("Fetching updated cluster extension after reconciliation")
require.NoError(t, cl.Get(ctx, extKey, clusterExtension), "Failed to retrieve the updated cluster extension resource")
t.Log("Verifying the status fields of the cluster extension")
require.Equal(t, ocv1.BundleMetadata{Name: "prometheus.v1.0.0", Version: "1.0.0"}, clusterExtension.Status.Install.Bundle, "Bundle metadata does not match expected values")
t.Log("Verifying the installed condition of the cluster extension")
installedCond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1.TypeInstalled)
require.NotNil(t, installedCond, "Installed condition not found")
require.Equal(t, metav1.ConditionTrue, installedCond.Status, "Installed condition status should be True")
require.Equal(t, ocv1.ReasonSucceeded, installedCond.Reason, "Installed condition reason should indicate success")
t.Log("Verifying the progressing condition of the cluster extension")
progressingCond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1.TypeProgressing)
require.NotNil(t, progressingCond, "Progressing condition not found")
require.Equal(t, ocv1.ReasonFailed, progressingCond.Reason, "Progressing condition reason should indicate failure")
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to stash these changes
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
|
||
// Error implements the error interface for ServiceAccountNotFoundError. | ||
func (e *ServiceAccountNotFoundError) Error() string { | ||
return fmt.Sprintf("ServiceAccount \"%s\" not found: Unable to authenticate with the Kubernetes cluster.", e.ServiceAccountName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some style suggestions to make this error message follow standard conventions:
return fmt.Sprintf("ServiceAccount \"%s\" not found: Unable to authenticate with the Kubernetes cluster.", e.ServiceAccountName) | |
return fmt.Sprintf("service account %q not found: unable to authenticate with the Kubernetes cluster", e.ServiceAccountName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should we include the expected namespace in the error message as well?
type ServiceAccountNotFoundError struct { | ||
ServiceAccountName string // The name of the missing ServiceAccount. | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not strictly necessary, but it may prove useful to:
- include the underlying error value in the struct
- implement the Unwrap function for this struct so that the underlying error can be recovered if necessary.
I recognize that we don't want to show up in the error string, so I'm not suggesting that we use fmt.Errorf("blah blah: %w", err)
But we can do this:
type ServiceAccountNotFoundError struct {
ServiceAccountName string // The name of the missing ServiceAccount.
Err error // The underlying error
}
func (e *ServiceAccountNotFoundError) Unwrap() error {
return e.Err
}
var saerr *authentication.ServiceAccountNotFoundError | ||
if errors.As(err, &saerr) { | ||
l.Info("ServiceAccount defined in the ClusterExtension is not found") | ||
err = saerr | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By returning this error in line 217, aren't we getting logging by controller-runtime already? If so, this seems extraneous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing the log
Description
Reviewer Checklist