Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 wrap service account not found error #1477

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

rashmi43
Copy link
Contributor

Description

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@rashmi43 rashmi43 requested a review from a team as a code owner November 18, 2024 12:04
Copy link

netlify bot commented Nov 18, 2024

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 5df5a1a
🔍 Latest deploy log https://app.netlify.com/sites/olmv1/deploys/676195e1d0a30b000817b7bc
😎 Deploy Preview https://deploy-preview-1477--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link

codecov bot commented Nov 18, 2024

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.63%. Comparing base (e5820ae) to head (6062677).

Files with missing lines Patch % Lines
internal/authentication/tokengetter.go 0.00% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1477      +/-   ##
==========================================
- Coverage   74.73%   74.63%   -0.10%     
==========================================
  Files          42       42              
  Lines        3241     3244       +3     
==========================================
- Hits         2422     2421       -1     
- Misses        646      649       +3     
- Partials      173      174       +1     
Flag Coverage Δ
e2e 52.28% <0.00%> (-0.15%) ⬇️
unit 57.12% <0.00%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rashmi43
Copy link
Contributor Author

fixes #1297

@rashmi43 rashmi43 changed the title Service account not found error 🐛 wrap service account not found error Nov 18, 2024
@thetechnick
Copy link

While PR creates a predictable error string, the message is still pretty much the same and does not yet fix the issue:

  - lastTransitionTime: "2024-11-18T14:41:05Z"
    message: 'Kubernetes cluster unreachable: Get "https://10.96.0.1:443/version":
      service account not found'
    observedGeneration: 1
    reason: Failed
    status: Unknown
    type: Installed

@rashmi43 Would you please create your own error type[1] and catch this error before it is written to the Condition message to provide your own custom error message?
[1]: https://go.dev/tour/methods/19

@everettraven
Copy link
Contributor

@rashmi43 @thetechnick I have a feeling that prefix to the error message of Kubernetes cluster unreachable: Get .... : has to do with us injecting a custom http.RoundTripper for fetching service account tokens:

func ServiceAccountRestConfigMapper(tokenGetter *authentication.TokenGetter) func(ctx context.Context, o client.Object, c *rest.Config) (*rest.Config, error) {
return func(ctx context.Context, o client.Object, c *rest.Config) (*rest.Config, error) {
cExt := o.(*ocv1.ClusterExtension)
saKey := types.NamespacedName{
Name: cExt.Spec.ServiceAccount.Name,
Namespace: cExt.Spec.Namespace,
}
saConfig := rest.AnonymousClientConfig(c)
saConfig.Wrap(func(rt http.RoundTripper) http.RoundTripper {
return &authentication.TokenInjectingRoundTripper{
Tripper: rt,
TokenGetter: tokenGetter,
Key: saKey,
}
})
return saConfig, nil
}
}

I imagine somewhere, either in the client-go or Helm code, there is an assumption that if these round trips fail it is likely due to the cluster being "unreachable" for some reason.

I'm not sure that implementing a custom error type will remove that information from the error message. Would improving the message to be more reflective of the context in which it is being returned be more helpful?

Maybe we can have something like:

  - lastTransitionTime: "2024-11-18T14:41:05Z"
    message: 'Unable to apply bundle content to cluster: Kubernetes cluster unreachable: Get "https://10.96.0.1:443/version": Unable to authenticate with Kubernetes cluster using ServiceAccount "foo": ServiceAccount "foo" not found'
    observedGeneration: 1
    reason: Failed
    status: Unknown
    type: Installed

Would that make it a bit more clear that in this case the Kubernetes cluster is unreachable when using a client relying on authenticating as the provided ServiceAccount since that ServiceAccount does not exist?

@everettraven
Copy link
Contributor

Alternatively, with the custom error type we could attempt to parse the error string/type wherever it is returned and attempt to strip out what we think is the unnecessary prefix.

Copy link
Contributor

@camilamacedo86 camilamacedo86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HI @rashmi43

Could you please rebase the commits?
Otherwise, when it gets merged, and we run git log, it will result in a very lousy commit message.

rashmi43 and others added 3 commits November 18, 2024 23:33
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
@thetechnick
Copy link

@everettraven By creating a custom error into the TokenGetter we can keep the error message short and precise.
I don't see a reason to expose the whole error wrap-chain to the user, when we know exactly that the Service Account is missing in this case and the additional information is not useful or actually missleading.

To pick up your example, I would be very happy if the condition message just reads:
Unable to authenticate with Kubernetes cluster using ServiceAccount "foo": ServiceAccount "foo" not found.

@everettraven
Copy link
Contributor

@everettraven By creating a custom error into the TokenGetter we can keep the error message short and precise. I don't see a reason to expose the whole error wrap-chain to the user, when we know exactly that the Service Account is missing in this case and the additional information is not useful or actually missleading.

To pick up your example, I would be very happy if the condition message just reads: Unable to authenticate with Kubernetes cluster using ServiceAccount "foo": ServiceAccount "foo" not found.

I think this is reasonable. I mostly wanted to call out that I think it will take more of an effort to restrict the error string that we place into the status than simply adding a custom error type that is returned in the TokenGetter. The custom error may help us to identify when we've hit that error type, but there is most certainly additional error wrapping happening beyond the error message that we are returning from the TokenGetter.

rashmi43 and others added 5 commits November 19, 2024 14:22
Update go version checker (operator-framework#1474)

* Handle new files (old version is empty)
* Handle the case where .0 patch is added/removed

Signed-off-by: Todd Short <tshort@redhat.com>

sa not found

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

📖 Document OLMv1 permission model (operator-framework#1380)

* changes to derice minimum service account

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

* remove headers

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

* add details about registry+v1 support

* render yml correctly

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

* create doc for olmv1 permission model

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

* Delete docs/drafts directory

* Update permission-model.md

* update permission models with link

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

* add space

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

* add more structure

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

* incorporate review comments

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

* incorporate review comments

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

* pers review comments-s

* example as header-s

* update the example

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

---------

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

Delete docs/drafts/derive-serviceaccount.md

add custom sa not found

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
@rashmi43
Copy link
Contributor Author

can we get someone to review @camilamacedo86 perhaps @thetechnick can review my changes. we can hold the merge till v1 is shipped

@@ -86,7 +95,8 @@ func (t *TokenGetter) getToken(ctx context.Context, key types.NamespacedName) (*
Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: ptr.To(int64(t.expirationDuration / time.Second))},
}, metav1.CreateOptions{})
if err != nil {
return nil, err
saErr := &SANotFoundError{key.Name}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it guaranteed that the only kind of error that is returned here is that the ServiceAccount we are requesting a token for is not found?

Copy link
Contributor Author

@rashmi43 rashmi43 Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thetechnick is this a fair assumption? Like discussed above since we are doing a GET for a serviceaccount that would be the case.

Copy link
Contributor

@everettraven everettraven Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left another comment here with my thoughts on this (since GitHub marked this comment as outdated): #1477 (comment)

Comment on lines 210 to 214
var saerr *tokengetter.SANotFoundError
if errors.As(err, &saerr) {
l.Info("is a SAError")
err = saerr
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are a lot of other places where this error could be returned, likely anywhere we are doing interactions with the cluster since we are using the provided ServiceAccount for those interactions. Off the top of my head, that would be:

  • Getting the installed bundle (here)
  • Applying contents to the cluster
  • Setting up caches and informers for managed content

Is there a way we could more effectively capture this error type when setting the status conditions we are wanting a more specific message for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check with unwrap the errors and if its the same service account not found error, then thats what it will return.

Copy link
Contributor

@everettraven everettraven Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand how this is working in this particular error check, but I think there are other error checks in which it is possible for this error type to be returned as part of the wrapped error chain. Some off the top of my head is:

  • When we attempt to fetch the currently installed bundle (which you already cover with this check here)
  • When we attempt to apply content to the cluster (not covered by your changes)
  • When we attempt to create informers to manage the content owned by a ClusterExtension (not covered by your changes)

The second piece of my comment was an attempt to get at the fact that instead of repeating this across all the error scenarios, and since we really only care about the message that is set in some of the status conditions when receiving this error, we could probably put this error check in the logic that is used to set the conditions we want to "customize" the errors for and reduce repetition of the same code in all our error checks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will work on

  1. unit testcases
  2. encompassing more scenarios:
When we attempt to apply content to the cluster (not covered by your changes)
When we attempt to create informers to manage the content owned by a ClusterExtension (not covered by your changes)

Copy link
Contributor

@everettraven everettraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple changes requested.

Additionally, I'd love to see some unit test cases added to ensure this behavior is working as expected.

Comment on lines 98 to 99
saErr := &ServiceAccountNotFoundError{key.Name}
return nil, saErr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to:

  • Wrap this in a NotFound error check
  • If the returned error is not a NotFound error, return the original error

This makes it so we are only returning our custom ServiceAccountNotFoundError whenever we know it is a NotFound error returned from our interaction with the Kubernetes API server. Not doing this means we assume that an error in this case always means the ServiceAccount is not found, which I think is a bad assumption to be making and will result in obfuscating any other errors that may be returned during this interaction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when i do this, it doesnt work for me:

if errors.IsNotFound(err) {
+                       saErr := &ServiceAccountNotFoundError{key.Name}
+                       return nil, saErr
+               } else {
+                       return nil, err
+               }

@thetechnick fyi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rebuild and it seems to be working as expected:

status:
  conditions:
  - lastTransitionTime: "2024-11-26T16:08:21Z"
    message: 'Unable to authenticate with Kubernetes cluster using ServiceAccount
      "nginx-install": ServiceAccount "nginx-install" not found.'
    observedGeneration: 1
    reason: Failed
    status: Unknown
    type: Installed
  - lastTransitionTime: "2024-11-26T16:08:21Z"
    message: retrying to get installed bundle
    observedGeneration: 1
    reason: Retrying
    status: "True"
    type: Progressing

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
@camilamacedo86
Copy link
Contributor

camilamacedo86 commented Dec 13, 2024

Hi @rashmi43,

Could you please squash the commits? Since this is a small change, it should ideally have just one commit.

I’m concerned that if we leave it as is (just forget), all these commits will show up in the logs (e.g., run git log to see what I mean). A clean and meaningful git history is important because, when debugging or identifying breaking changes, it’s common to rely on the history to pinpoint the cause. If the history isn’t well-organized, it can make life much harder. 😊

Additionally, if you need to rebase and encounter a conflict, resolving it might become more challenging. You could end up dealing with the same conflict multiple times (potentially 26 times!). Squashing the commits can make the process much smoother. 😉

Comment on lines 210 to 214
var saerr *tokengetter.ServiceAccountNotFoundError
if errors.As(err, &saerr) {
l.Info("is a SAError")
err = saerr
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to log something here?
Also, we are checking the error and replacing it
Why do we need it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are checking the type of the error and then returning our error message

require.NotNil(t, progressingCond)
t.Log("Progressing condition status", progressingCond.Status)
t.Log("Progressing condition reason", progressingCond.Reason)
//require.Equal(t, metav1.ConditionTrue, progressingCond.Status)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this line commented on?
Do we need to do this assert?
We need uncomment and fix or remove, IHMO we cannot leave commented lines

@@ -677,6 +677,97 @@ func TestClusterExtensionInstallationSucceeds(t *testing.T) {
require.NoError(t, cl.DeleteAllOf(ctx, &ocv1.ClusterExtension{}))
}

func TestClusterExtensionInstallationFailsWithNoServiceAccount(t *testing.T) {
Copy link
Contributor

@camilamacedo86 camilamacedo86 Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rashmi43,

Without diving into the code itself, I think we should improve how t.Log is being used. The messages should describe actions like "Validating the status to ensure it is X" or "Checking the progressing condition." This makes the logs more meaningful and helps anyone reading them understand what’s happening in the test.

Also, we can make things cleaner by using defer for resource cleanup. I would like to suggest something like:

// TestClusterExtensionInstallationFailsWithNoServiceAccount
// Validates that the reconciliation process fails gracefully when the specified 
// ServiceAccount does not exist. Ensures that appropriate conditions are set 
// on the ClusterExtension status, and the resource is cleaned up properly after the test.
func TestClusterExtensionInstallationFailsWithNoServiceAccount(t *testing.T) {
	cl, reconciler := newClientAndReconciler(t)

	reconciler.Unpacker = &MockUnpacker{
		result: &source.Result{
			State:  source.StateUnpacked,
			Bundle: fstest.MapFS{},
		},
	}

	ctx := context.Background()
	extKey := types.NamespacedName{Name: fmt.Sprintf("cluster-extension-test-%s", rand.String(8))}
	defer func() {
		t.Log("Cleaning up all cluster extension resources")
		cl.DeleteAllOf(ctx, &ocv1.ClusterExtension{})
	}()

	t.Log("Initializing cluster extension with a channel, version, and non-existent service account")
	pkgName := "prometheus"
	pkgVer := "1.0.0"
	pkgChan := "beta"
	namespace := fmt.Sprintf("test-ns-%s", rand.String(8))
	serviceAccount := fmt.Sprintf("test-does-not-exist-%s", rand.String(8))

	clusterExtension := &ocv1.ClusterExtension{
		ObjectMeta: metav1.ObjectMeta{Name: extKey.Name},
		Spec: ocv1.ClusterExtensionSpec{
			Source: ocv1.SourceConfig{
				SourceType: "Catalog",
				Catalog: &ocv1.CatalogSource{
					PackageName: pkgName,
					Version:     pkgVer,
					Channels:    []string{pkgChan},
				},
			},
			Namespace: namespace,
			ServiceAccount: ocv1.ServiceAccountReference{
				Name: serviceAccount,
			},
		},
	}
	require.NoError(t, cl.Create(ctx, clusterExtension), "Failed to create the cluster extension resource")

	t.Log("Setting up mock resolver and applier for reconciliation")
	reconciler.Resolver = resolve.Func(func(_ context.Context, _ *ocv1.ClusterExtension, _ *ocv1.BundleMetadata) (*declcfg.Bundle, *bsemver.Version, *declcfg.Deprecation, error) {
		v := bsemver.MustParse("1.0.0")
		return &declcfg.Bundle{
			Name:    "prometheus.v1.0.0",
			Package: "prometheus",
			Image:   "quay.io/operatorhubio/prometheus@fake1.0.0",
		}, &v, nil, nil
	})
	reconciler.Applier = &MockApplier{
		objs: []client.Object{},
	}
	reconciler.Manager = &MockManagedContentCacheManager{
		cache: &MockManagedContentCache{},
	}

	t.Log("Running reconcile to process the cluster extension")
	res, err := reconciler.Reconcile(ctx, ctrl.Request{NamespacedName: extKey})
	require.Equal(t, ctrl.Result{}, res, "Unexpected reconcile result")
	require.NoError(t, err, "Reconcile failed with an error")

	t.Log("Fetching updated cluster extension after reconciliation")
	require.NoError(t, cl.Get(ctx, extKey, clusterExtension), "Failed to retrieve the updated cluster extension resource")

	t.Log("Verifying the status fields of the cluster extension")
	require.Equal(t, ocv1.BundleMetadata{Name: "prometheus.v1.0.0", Version: "1.0.0"}, clusterExtension.Status.Install.Bundle, "Bundle metadata does not match expected values")

	t.Log("Verifying the installed condition of the cluster extension")
	installedCond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1.TypeInstalled)
	require.NotNil(t, installedCond, "Installed condition not found")
	require.Equal(t, metav1.ConditionTrue, installedCond.Status, "Installed condition status should be True")
	require.Equal(t, ocv1.ReasonSucceeded, installedCond.Reason, "Installed condition reason should indicate success")

	t.Log("Verifying the progressing condition of the cluster extension")
	progressingCond := apimeta.FindStatusCondition(clusterExtension.Status.Conditions, ocv1.TypeProgressing)
	require.NotNil(t, progressingCond, "Progressing condition not found")
	require.Equal(t, ocv1.ReasonFailed, progressingCond.Reason, "Progressing condition reason should indicate failure")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going to stash these changes

Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>
Signed-off-by: rashmi_kh <rashmi.khanna@in.ibm.com>

// Error implements the error interface for ServiceAccountNotFoundError.
func (e *ServiceAccountNotFoundError) Error() string {
return fmt.Sprintf("ServiceAccount \"%s\" not found: Unable to authenticate with the Kubernetes cluster.", e.ServiceAccountName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some style suggestions to make this error message follow standard conventions:

Suggested change
return fmt.Sprintf("ServiceAccount \"%s\" not found: Unable to authenticate with the Kubernetes cluster.", e.ServiceAccountName)
return fmt.Sprintf("service account %q not found: unable to authenticate with the Kubernetes cluster", e.ServiceAccountName)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, should we include the expected namespace in the error message as well?

Comment on lines +24 to +26
type ServiceAccountNotFoundError struct {
ServiceAccountName string // The name of the missing ServiceAccount.
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not strictly necessary, but it may prove useful to:

  1. include the underlying error value in the struct
  2. implement the Unwrap function for this struct so that the underlying error can be recovered if necessary.

I recognize that we don't want to show up in the error string, so I'm not suggesting that we use fmt.Errorf("blah blah: %w", err)

But we can do this:

type ServiceAccountNotFoundError struct {
	ServiceAccountName string // The name of the missing ServiceAccount.
	Err error // The underlying error
 }

func (e *ServiceAccountNotFoundError) Unwrap() error {
	return e.Err
}

Comment on lines +210 to +214
var saerr *authentication.ServiceAccountNotFoundError
if errors.As(err, &saerr) {
l.Info("ServiceAccount defined in the ClusterExtension is not found")
err = saerr
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By returning this error in line 217, aren't we getting logging by controller-runtime already? If so, this seems extraneous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing the log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants