Make generic controller for cluster-based resources #259

Merged: 2 commits into main on Oct 4, 2024

Conversation

andrewstucki (Contributor)

This tries to extract the core of the Users controller into a generic controller implementation that wraps a ResourceReconciler interface. The interface covers the guts of the three operations we pretty much need for every cluster-based resource:

  1. Patching finalizers (this could probably be removed if we want to do it in an even more generic way)
  2. Syncing (upserting) a resource
  3. Deleting a resource

Hence the interface looks like this:

	FinalizerPatch(request ResourceRequest[T]) client.Patch
	SyncResource(ctx context.Context, request ResourceRequest[T]) (client.Patch, error)
	DeleteResource(ctx context.Context, request ResourceRequest[T]) error

Most of the sync/delete logic can then either be inlined into the controllers or, ideally, wrapped in a "high-level" client/synchronizer (see the ACL synchronization code) in the clients package.
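
To make that concrete, here is a rough sketch (not the actual code in this PR) of how a generic controller could drive the interface. The ResourceController wrapper, the reconcile flow, and the field layout of ResourceRequest are assumptions for illustration; only the three interface methods above come from the PR.

	// Hypothetical sketch: only the three ResourceReconciler methods come from
	// this PR; everything else here is assumed for illustration.
	package example

	import (
		"context"

		ctrl "sigs.k8s.io/controller-runtime"
		"sigs.k8s.io/controller-runtime/pkg/client"
	)

	// ResourceRequest carries the object being reconciled plus whatever cluster
	// connection details a reconciler needs (field layout assumed).
	type ResourceRequest[T client.Object] struct {
		Object T
	}

	// ResourceReconciler is the interface described above.
	type ResourceReconciler[T client.Object] interface {
		FinalizerPatch(request ResourceRequest[T]) client.Patch
		SyncResource(ctx context.Context, request ResourceRequest[T]) (client.Patch, error)
		DeleteResource(ctx context.Context, request ResourceRequest[T]) error
	}

	// ResourceController is a hypothetical generic wrapper that owns the
	// boilerplate shared by every cluster-based resource.
	type ResourceController[T client.Object] struct {
		client     client.Client
		reconciler ResourceReconciler[T]
	}

	func (c *ResourceController[T]) reconcile(ctx context.Context, obj T) (ctrl.Result, error) {
		request := ResourceRequest[T]{Object: obj}

		// Deletion: clean up the cluster-side resource, then let the
		// finalizer be removed so the object can be garbage-collected.
		if !obj.GetDeletionTimestamp().IsZero() {
			return ctrl.Result{}, c.reconciler.DeleteResource(ctx, request)
		}

		// Ensure our finalizer is present via the patch the reconciler builds.
		if err := c.client.Patch(ctx, obj, c.reconciler.FinalizerPatch(request)); err != nil {
			return ctrl.Result{}, err
		}

		// Sync (upsert) the resource and apply whatever patch comes back,
		// e.g. a server-side apply of status conditions.
		patch, err := c.reconciler.SyncResource(ctx, request)
		if err != nil {
			return ctrl.Result{}, err
		}
		if patch != nil {
			if err := c.client.Status().Patch(ctx, obj, patch); err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}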

I'm intending to use this for CRD-defined schemas, and if y'all like the structure, I'd want to take a pass at refactoring TopicReconciler to use this as well for consistency.

This allows us to follow a really simple pattern for any additional CRDs. Basically anything that we're controlling via CRD and putting in a cluster should:

  1. Add a ClusterSource field (a spec sketch follows this list).
  2. Generate some applyconfiguration structures.
  3. If it's complex, implement a high-level synchronizer that handles the heavy lifting of the Redpanda operations.
  4. Add a reconciler that implements the above interface.
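
As a sketch of step 1, the spec of a hypothetical new CRD (a Schema, purely as an example) might embed the cluster reference like this; the Schema type, the field names, and the JSON tags are assumptions, not code from this PR:

	// Hypothetical example of step 1: the spec embeds a ClusterSource so the
	// shared machinery knows which Redpanda cluster to talk to. The real
	// ClusterSource type lives in the operator's API package.
	type SchemaSpec struct {
		// ClusterSource identifies the Redpanda cluster this resource targets.
		ClusterSource *ClusterSource `json:"cluster,omitempty"`

		// ...resource-specific fields, e.g. the schema definition itself...
	}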

Doing so would give you:

  1. Consistent status conditions
  2. Consistent ways of connecting to a referenced cluster
  3. Server-side apply pretty much everywhere
  4. Similarly structured code patterns

@RafalKorepta (Contributor) left a comment

I didn't finish the review.

ResourceConditionReasonSynced = "Synced"
ResourceConditionReasonClusterRefInvalid = "ClusterRefInvalid"
ResourceConditionReasonConfigurationInvalid = "ConfigurationInvalid"
ResourceConditionReasonTerminalClientError = "TerminalClientError"
Contributor

Would a terminal client error prevent further reconciliation? What action should the user of the controller/custom resource take? Could the error constants have a brief description or explanation of when they should be used?

Maybe I'm overreacting to the word Terminal.

Contributor Author

Maybe. I currently use the notion of a "terminal" error in two places, which is how the UserReconciler works:

  1. ignoreAllConnectionErrors:

	func ignoreAllConnectionErrors(logger logr.Logger, err error) error {
		// If we have known errors where we're unable to actually establish
		// a connection to the cluster due to say, invalid connection parameters
		// we're going to just skip the cleanup phase since we likely won't be
		// able to clean ourselves up anyway.
		if internalclient.IsTerminalClientError(err) ||
			internalclient.IsConfigurationError(err) ||
			internalclient.IsInvalidClusterError(err) {
			// We use Info rather than Error here because we don't want
			// to ignore the verbosity settings. This is really only for
			// debugging purposes.
			logger.V(2).Info("Ignoring non-retryable client error", "error", err)
			return nil
		}
		return err
	}

  2. handleResourceSyncErrors:

	func handleResourceSyncErrors(err error) (metav1.Condition, error) {
		// If we have a known terminal error, just set the sync condition and don't re-run reconciliation.
		if internalclient.IsInvalidClusterError(err) {
			return redpandav1alpha2.ResourceNotSyncedCondition(redpandav1alpha2.ResourceConditionReasonClusterRefInvalid, err), nil
		}
		if internalclient.IsConfigurationError(err) {
			return redpandav1alpha2.ResourceNotSyncedCondition(redpandav1alpha2.ResourceConditionReasonConfigurationInvalid, err), nil
		}
		if internalclient.IsTerminalClientError(err) {
			return redpandav1alpha2.ResourceNotSyncedCondition(redpandav1alpha2.ResourceConditionReasonTerminalClientError, err), nil
		}
		// otherwise, set a generic unexpected error and return an error so we can re-reconcile.
		return redpandav1alpha2.ResourceNotSyncedCondition(redpandav1alpha2.ResourceConditionReasonUnexpectedError, err), err
	}

The first is basically for ignoring errors that shouldn't ever be retried (e.g. the cluster has been blown away and you're trying to delete yourself from it, or the cluster configuration now has SASL auth disabled and you're trying to delete a user). It's used during the cleanup routine to do "best-effort" cleanup: if we hit one of these "terminal" errors we basically say "oh well" and let the resource be GC'd.

The second is for when you're trying to sync a resource. In that case we set the status as having encountered a terminal error, with the error message in the status condition message, and then stop reconciling, because this isn't an "ephemeral"-type error.
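
A hedged sketch (not this PR's code) of how handleResourceSyncErrors above is consumed: the condition is always recorded on the object's in-memory status (persisted elsewhere), and only the returned error decides whether controller-runtime requeues the request. The helper name, the use of apimeta (k8s.io/apimachinery/pkg/api/meta), and the shape of User.Status.Conditions are assumptions.

	// Hypothetical helper illustrating the retry contract described above.
	func updateSyncStatus(user *redpandav1alpha2.User, syncErr error) (ctrl.Result, error) {
		condition, retryErr := handleResourceSyncErrors(syncErr)
		apimeta.SetStatusCondition(&user.Status.Conditions, condition)
		// retryErr is nil for the "terminal" reasons (ClusterRefInvalid,
		// ConfigurationInvalid, TerminalClientError), so the request is not
		// requeued; an UnexpectedError propagates and is retried with backoff.
		return ctrl.Result{}, retryErr
	}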

Contributor Author

Also 👍 on documenting this stuff more.

}

func registerUserClusterIndex(ctx context.Context, mgr ctrl.Manager) error {
return mgr.GetFieldIndexer().IndexField(ctx, &redpandav1alpha2.User{}, userClusterIndex, indexUserCluster)
func registerClusterSourceIndex[T client.Object, U clientList[T]](ctx context.Context, mgr ctrl.Manager, name string, o T, l U) (handler.EventHandler, error) {
Contributor

NIT: For me, the o and l are hard to read. Maybe the following suggestion could make it more readable 👇

Suggested change
func registerClusterSourceIndex[T client.Object, U clientList[T]](ctx context.Context, mgr ctrl.Manager, name string, o T, l U) (handler.EventHandler, error) {
func registerClusterSourceIndex[T client.Object, U clientList[T]](ctx context.Context, mgr ctrl.Manager, name string, obj T, list U) (handler.EventHandler, error) {

return nil
}
return requests
})
}

func clusterForHelmManagedObject(o client.Object) (types.NamespacedName, bool) {
Contributor

Not related to this PR

	role := labels["app.kubernetes.io/name"]
	if !slices.Contains([]string{"redpanda", "console"}, role) {
		return types.NamespacedName{}, false
	}

Should include connectors too.
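
Sketched, that suggestion might look like the following; the exact label value for connectors is an assumption:

	role := labels["app.kubernetes.io/name"]
	if !slices.Contains([]string{"redpanda", "console", "connectors"}, role) {
		return types.NamespacedName{}, false
	}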

// Every 5 minutes try and check to make sure no manual modifications
// happened on the resource synced to the cluster and attempt to correct
// any drift.
Complete(controller.PeriodicallyReconcile(5 * time.Minute))
Contributor

NIT: Maybe make the period configurable.
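
For example, the period could come from a flag on the operator binary and be passed to controller.PeriodicallyReconcile in place of the constant. The flag name and default below are purely hypothetical and not part of this PR (uses the standard "flag" and "time" packages):

	// Hypothetical flag; name and default are assumptions.
	var driftCheckPeriod = flag.Duration("resource-drift-check-period", 5*time.Minute,
		"how often cluster-based resources are re-reconciled to correct manual drift")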

@RafalKorepta (Contributor) left a comment

LGTM

_ = testEnv.Stop()
})

container, err := redpanda.Run(ctx, "docker.redpanda.com/redpandadata/redpanda:v23.2.8",
Contributor

NIT: As a follow-up, we should bump this version everywhere in our test suite.

@andrewstucki (Contributor Author)

Going to go ahead and merge and then I'll address some of the documentation in a follow-up.

andrewstucki merged commit e897bcf into main on Oct 4, 2024
5 checks passed
andrewstucki deleted the generalized-controller branch on October 4, 2024 at 14:01