
Lossy map functions #1996

Closed
shaneutt opened this issue Sep 8, 2022 · 32 comments

Labels: lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

@shaneutt
Member

shaneutt commented Sep 8, 2022

Preface

This relates to a conversation I brought up at the community meeting on 2022-09-08; @camilamacedo86 and @jmrodri suggested I create an issue so we can discuss it further.

Background

Currently in controller-runtime we have the following tools:

type MapFunc func(client.Object) []reconcile.Request

func EnqueueRequestsFromMapFunc(fn MapFunc) EventHandler {
	return &enqueueRequestsFromMapFunc{
		toRequests: fn,
	}
}

With these it's possible to express a Watch() on one source.Kind which enqueues reconcile requests for objects of another kind based on some arbitrary relation:

ctrl.NewControllerManagedBy(mgr).Watches(
	&source.Kind{Type: &customapi.Tree{}},
	handler.EnqueueRequestsFromMapFunc(r.mapFruitsFromTree),
)

Ideally you would use metav1.ObjectReference when dealing with relationships between objects; however, it is sometimes the case that you need to use the API client or some other utility that may produce an error while mapping the related objects, e.g.:

func (r *myReconciler) mapFruitsFromTree(tree client.Object) []reconcile.Request {
	var fruits customapi.FruitList
	if err := r.client.List(context.Background(), &fruits); err != nil {
		r.log.WithError(err).Error("error listing fruits")
		return nil
	}

	var req []reconcile.Request
	for _, fruit := range fruits.Items {
		if isRelated(tree, fruit) {
			req = append(req, reconcile.Request{
				NamespacedName: types.NamespacedName{
					Namespace: fruit.Namespace,
					Name:      fruit.Name,
				},
			})
		}
	}

	return req
}

In these cases there appear to be effectively no provisions for handling error conditions.

This kind of implementation and use case is not uncommon. I was able to find several examples of notable open source projects which are using it this way:

Problem

The EnqueueRequestsFromMapFunc generator for EventHandlers has no concept of handling errors that may occur while performing the mapping. As such, in the example above and in all the linked projects doing this, the process is currently "lossy": the machinery can effectively end up dropping an enqueued object from the queue if a failure occurs during the mapping process.

For a specific example using the Tree and Fruit sample above: if the object listing produces an error (e.g. it runs at a time when the API server is having trouble and the cache can't be used), the Tree object is consumed from the queue but the related objects never get enqueued. This causes the resource state to become "stuck", and it won't heal without some other mechanism to re-queue it or some unrelated action to re-trigger it.

Exploring Solutions

The purpose of this issue is to ask for community guidance on an existing or new provision to support this kind of mapping in a manner that is fault tolerant and can heal from an error.

Some of the workarounds that I have conceived of are:

  • using object references as mentioned above (won't work in all cases)
  • some fallback process which would actively re-enqueue all relevant objects (messy, costly)
  • watching a custom backup source.Channel where objects are re-enqueued in an implementation-specific manner (clunky)

So far I think the third option there might be the most effective strategy with what we have today, but ideally what I would like to see is the ability to cleanly trigger a re-queue from within the mapping function as part of the function return. I would very much appreciate community feedback on this.
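
To make the third option concrete, here is a rough sketch of the source.Channel fallback, assuming the customapi types above and a closure-returning variant of mapFruitsFromTree; the retryCh channel and the fixed five-second delay are illustrative choices, not controller-runtime APIs:

// Sketch only: a backup source.Channel that the mapping function feeds when it
// fails, so the triggering Tree is re-delivered later instead of being dropped.
func (r *myReconciler) setupWithRetryChannel(mgr ctrl.Manager) error {
	retryCh := make(chan event.GenericEvent, 1024)

	return ctrl.NewControllerManagedBy(mgr).
		For(&customapi.Fruit{}).
		// Primary source: Tree events mapped to Fruit requests.
		Watches(
			&source.Kind{Type: &customapi.Tree{}},
			handler.EnqueueRequestsFromMapFunc(r.mapFruitsFromTree(retryCh)),
		).
		// Backup source: Trees whose mapping failed are re-delivered here.
		Watches(
			&source.Channel{Source: retryCh},
			handler.EnqueueRequestsFromMapFunc(r.mapFruitsFromTree(retryCh)),
		).
		Complete(r)
}

// mapFruitsFromTree returns a MapFunc that pushes the Tree back onto the retry
// channel when listing fails, rather than silently returning nil.
func (r *myReconciler) mapFruitsFromTree(retryCh chan event.GenericEvent) handler.MapFunc {
	return func(tree client.Object) []reconcile.Request {
		var fruits customapi.FruitList
		if err := r.client.List(context.Background(), &fruits); err != nil {
			r.log.WithError(err).Error("error listing fruits, scheduling re-delivery")
			go func() {
				time.Sleep(5 * time.Second) // crude fixed delay in place of real backoff
				retryCh <- event.GenericEvent{Object: tree}
			}()
			return nil
		}

		var reqs []reconcile.Request
		for _, fruit := range fruits.Items {
			if isRelated(tree, fruit) {
				reqs = append(reqs, reconcile.Request{NamespacedName: types.NamespacedName{
					Namespace: fruit.Namespace,
					Name:      fruit.Name,
				}})
			}
		}
		return reqs
	}
}

The clunkiness shows: each failure spawns a goroutine, and a fixed delay stands in for real backoff, which is part of why a first-class mechanism would be preferable.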

@alvaroaleman
Member

alvaroaleman commented Sep 8, 2022

Yeah, this is a common-ish problem and something that controller-runtime doesn't help with. On a conceptual level, you likely want the handler itself to be a controller (i.e. have a workqueue and a reconciler that can fail, resulting in retries with exponential backoff).

Some time ago I built something like this that uses a source.Channel to pass reconcile.Requests around: https://github.com/openshift/ci-tools/blob/946582d9de6c25b241fcaa7ffe9f1fe9ff2921c8/pkg/controller/promotionreconciler/reconciler.go#L173

Maybe the easiest would be to build an EnqueueRequestsFromFallibleMapFunc(func(client.Object) ([]reconcile.Request, error)) that internally constructs a controller whose Reconciler just calls the map func until it succeeds?
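
As a rough illustration of that shape (hypothetical, not an existing controller-runtime API: FallibleMapFunc, fallibleMapHandler, and the worker wiring are all made up for this sketch), the handler could keep its own rate-limited workqueue and only forward requests to the controller's queue once the mapping succeeds:

// Hypothetical sketch: an EventHandler that is internally a tiny controller.
type FallibleMapFunc func(client.Object) ([]reconcile.Request, error)

type fallibleMapHandler struct {
	mapFn FallibleMapFunc
	retry workqueue.RateLimitingInterface // internal queue of objects to (re)map
}

// retryItem pairs the triggering object with the controller queue that the
// resulting requests should eventually land on.
type retryItem struct {
	obj client.Object
	dst workqueue.RateLimitingInterface
}

func (h *fallibleMapHandler) enqueue(obj client.Object, dst workqueue.RateLimitingInterface) {
	h.retry.Add(retryItem{obj: obj, dst: dst})
}

// worker would be started alongside the manager (e.g. via mgr.Add) and drains
// the internal queue, retrying failed mappings with the queue's backoff.
func (h *fallibleMapHandler) worker() {
	for {
		item, shutdown := h.retry.Get()
		if shutdown {
			return
		}
		ri := item.(retryItem)
		if reqs, err := h.mapFn(ri.obj); err != nil {
			h.retry.AddRateLimited(item) // keep retrying instead of dropping the event
		} else {
			h.retry.Forget(item)
			for _, req := range reqs {
				ri.dst.Add(req)
			}
		}
		h.retry.Done(item)
	}
}

// The four EventHandler methods (pre-v0.15 signatures) all funnel into the
// internal queue; only Update is shown here for brevity.
func (h *fallibleMapHandler) Update(e event.UpdateEvent, q workqueue.RateLimitingInterface) {
	h.enqueue(e.ObjectNew, q)
}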

@shaneutt
Member Author

shaneutt commented Sep 9, 2022

Thanks for your reply @alvaroaleman.

Maybe the easiest would be to build an EnqueueRequestsFromFallibleMapFunc(func(client.Object) ([]reconcile.Request, error)) that internally constructs a controller whose Reconciler just calls the map func until it succeeds?

My understanding of what you're suggesting is that EnqueueRequestsFromFallibleMapFunc would be an alternative map creator in handler, and under the hood it would simply call AddRateLimited(obj) on the rate-limited work queue any time an error occurs, to keep re-enqueuing that object until the errors go away. This is pretty much the kind of thing I had in mind, so I'm in favor, but is that accurate, or was there more to it than that? If that ends up being accurate and there's consensus amongst the maintainers that this is something that would be accepted, I or one of my colleagues would like to start work on it.

@pmalek

pmalek commented Sep 27, 2022

Hi 👋

As I think about this problem more and more, I believe one more option would be to either introduce an additional API that accounts for errors and allows for additional retry logic, something like

ctrl.NewControllerManagedBy(mgr).WatchesWithRetry(
	&source.Kind{Type: &customapi.Tree{}},
	handler.EnqueueRequestsFromMapFuncWithError(r.mapFruitsFromTree),
	ctrl.Retry{
		Max:       3,
		RetryTime: 3 * time.Second,
		// Or even more complex logic
	},
)

(not ideal: an additional method to implement, document, test, and propagate to users) OR make an adapter which encapsulates the retry logic itself, wrapping the calls to EnqueueRequestsFromMapFunc (basically EnqueueRequestsFromFallibleMapFunc with logic to stop if particular conditions are met).
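
For example, such an adapter could be built today purely on top of EnqueueRequestsFromMapFunc. A minimal sketch, assuming a hypothetical EnqueueRequestsFromFallibleMapFunc name and an inline bounded backoff (retry.OnError from client-go) as the "stop condition":

// Hypothetical adapter: retries a fallible mapping a bounded number of times
// before giving up, then hands the result to the ordinary MapFunc machinery.
func EnqueueRequestsFromFallibleMapFunc(fn func(client.Object) ([]reconcile.Request, error)) handler.EventHandler {
	return handler.EnqueueRequestsFromMapFunc(func(obj client.Object) []reconcile.Request {
		var reqs []reconcile.Request
		backoff := wait.Backoff{Duration: 100 * time.Millisecond, Factor: 2, Steps: 5}
		err := retry.OnError(backoff, func(error) bool { return true }, func() error {
			var innerErr error
			reqs, innerErr = fn(obj)
			return innerErr
		})
		if err != nil {
			// After the backoff is exhausted the event is still dropped, which is
			// the residual gap a first-class API would need to address.
			return nil
		}
		return reqs
	})
}

Note that the retries here block event delivery, which is another reason the internal-workqueue approach sketched earlier scales better.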

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 25, 2023
@shaneutt
Member Author

/remove-lifecycle rotten

@alvaroaleman are you still on board with the solution you and I discussed some months prior? Is anyone opposed to us going ahead and fixing this as we described above?

@k8s-ci-robot removed the lifecycle/rotten label on Jan 25, 2023
@vincepri
Member

vincepri commented Feb 2, 2023

@shaneutt Open to learning a bit more about what you have in mind. We've recently added a ctx (only on the main branch) as the first parameter in the MapFunc, although error handling and retries can be a good solution. If we go with a handler.RetriableMapFunc or similar, how would the logic handle errors?

@shaneutt
Member Author

shaneutt commented Feb 3, 2023

@shaneutt Open to learning a bit more about what you have in mind. We've recently added a ctx (only on the main branch) as the first parameter in the MapFunc

I can see the ctx being helpful in some situations for sure 👍

although error handling and retries can be a good solution. If we go with a handler.RetriableMapFunc or similar, how would the logic handle errors?

Errors that occur this way are expected to be transient, but the reality is that if we give people an infinite error retry mechanism there will be some cases where it's going to be a footgun.

After thinking more about how this could go wrong, and considering that a ctx will be available going forward, I'm actually starting to think that what we need is a re-queuing mechanism that makes the downstream developer responsible for handling errors instead of controller-runtime, while still giving them a way to retry or re-queue the object when problems arise during mapping. That would be more explicit, and whether or not the reason is an error would be their prerogative. Perhaps something where:

func (r *myReconciler) mapFruitsFromTree(tree client.Object) (recs []reconcile.Request) {
	var fruits customapi.FruitList
	if err := r.client.List(context.Background(), &fruits); err != nil {
		r.log.WithError(err).Error("error listing fruits")
		return
	}

	for _, fruit := range fruits.Items {
		if isRelated(tree, fruit) {
			recs = append(recs, reconcile.Request{
				NamespacedName: types.NamespacedName{
					Namespace: fruit.Namespace,
					Name:      fruit.Name,
				},
			})
		}
	}

	return
}

becomes:

func (r *myReconciler) mapFruitsFromTree(tree client.Object) (requeue bool, recs []reconcile.Request) {
	var fruits customapi.FruitList
	if err := r.client.List(context.Background(), &fruits); err != nil {
		r.log.WithError(err).Error("error listing fruits")
		requeue = true
		return
	}

	for _, fruit := range fruits.Items {
		if isRelated(tree, fruit) {
			recs = append(recs, reconcile.Request{
				NamespacedName: types.NamespacedName{
					Namespace: fruit.Namespace,
					Name:      fruit.Name,
				},
			})
		}
	}

	return
}

Curious as to your thoughts on that 🤔

@nathanperkins

I like a combination of adding the ctx and the requeue bool.

We are also running into this issue where we need to list objects in our mapping function. If we wait on context.Background() forever, that's going to block the controller (somewhere? not sure). If we add a timeout and log the error, our system is going to end up in an inconsistent state.

@nathanperkins

nathanperkins commented Apr 25, 2023

@shaneutt

Errors that occur this way are expected to be transient, but the reality is that if we give people an infinite error retry mechanism there will be some cases where it's going to be a footgun. If the client is returning errors consistently, then it's not like the controller would be able to make progress otherwise.

Can you explain the footgun more? If these mapper functions need to work correctly to keep the system consistent, then infinite retry seems necessary.

It's worth mentioning that the current implementation is a footgun right now because the common usage results in dropped events and there's not really a way to use it correctly.

@shaneutt
Member Author

Been a while since I wrote that but I think I was just a little cautious of weird ways people might use this. Ultimately however I agree with you that we should move forward with this.

@nathanperkins

@shaneutt what about ([]reconcile.Request, error) for the return values, where err != nil will be logged and requeued? That would make the behavior similar to Reconcile. It's a bit more ergonomic than having to create a logger, log your error, and return a bool.

@shaneutt
Member Author

That seems entirely reasonable 👍

@pmalek

pmalek commented Apr 26, 2023

@shaneutt what about ([]reconcile.Request, error) for the return values, where err != nil will be logged and requeued? That would make the behavior similar to Reconcile. It's a bit more ergonomic than having to create a logger, log your error, and return a bool.

I like this idea as well. One note though: I'd consider a separate type just for the sake of separation of concerns and not coupling this with reconciliation. WDYT?

@nathanperkins

I put []reconcile.Request because that's the current type. I don't have a strong opinion on changing it other than we should probably keep it the same unless there is a good reason to change.

@alvaroaleman
Member

@shaneutt what about ([]reconcile.Request, error) for the return values, where err != nil will be logged and requeued?

That isn't possible. We don't know what to requeue when the map function fails, as its purpose is to tell us that :) The only thing we could do is have a handler implementation that is internally a controller and thus has its own workqueue, as suggested in #1996 (comment)

@nathanperkins

That isn't possible. We don't know what to requeue when the map function failed, as its purpose is to tell us that :)

Sorry for the imprecise language. I mean requeuing so that the map function itself is called again. That sounds like what #1996 (comment) is describing.

@nathanperkins

nathanperkins commented Jul 24, 2023

One thing I noticed today: the newer function signature in v0.15.0 accepts a context.Context:

type MapFunc func(context.Context, client.Object) []reconcile.Request

Any function which accepts a context is inherently fallible or lossy. It doesn't make sense to have a context without returning an error, so an error should probably be added to the existing MapFunc definition, rather than creating a separate fallible MapFunc type.

If we want to have an infallible MapFunc which cannot error, it should not have context.Context.
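
Concretely, the signature being argued for would look something like this (hypothetical; today's MapFunc returns no error):

// Today (v0.15.0):
//	type MapFunc func(context.Context, client.Object) []reconcile.Request
// As argued above, a context-accepting MapFunc would also surface errors:
type MapFunc func(ctx context.Context, obj client.Object) ([]reconcile.Request, error)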

@seh

seh commented Jul 24, 2023

The caller can still check (Context).Err, though it's possible that the return value could change in between the called function returning and when the caller checks whether (Context).Err now returns non-nil.

I take it you're assuming that the called function needs a way to indicate that it had to give up due to the Context before it could complete all of its intended work.

@nathanperkins

You could do that, but it goes against general Go practices. A function that can cause an error should handle it locally or return it up the stack so it can be handled or logged higher up. Expecting the calling function to check (Context).Err rather than an error is pretty unusual.

It's possible that the return value could change in between the called function returning and when the caller checks whether (Context).Err now returns non-nil.

I suppose even with this race condition, it would be valid for the caller to ignore whatever results came from MapFunc because the caller will fail due to the ctx error. You could get away with this from a technical perspective.

It's still odd. Pretty much any function which the MapFunc could pass that context into will return its own error, which should be passed up the stack. The common usage of that context is for client functions, which can fail.

@seh

seh commented Jul 25, 2023

I'm familiar with Context's motivations and behavior, and I don't see any relationship between it and the fallibility of functions that use it, other than that it introduces another reason why such functions can "fail"—namely cooperative cancellation.

Functions accept a Context either as a means of them agreeing to be interrupted—because they do something that would block long enough to warrant respecting a request for interruption—or, less pertinently, because they need to read or bind values with dynamic extent.

@vincepri
Member

+1 from my side to add a returned error; this would be a breaking change. Before doing so, though, we need to nail down what the behavior is going to look like and how configurable it should be.

@nathanperkins

Functions accept a Context either as a means of them agreeing to be interrupted—because they do something that would block long enough to warrant respecting a request for interruption—or, less pertinently, because they need to read or bind values with dynamic extent.

Even when you need to read or bind values with dynamic extent, that is something that can fail and probably needs error handling (like with logger.FromContext(ctx)).

It may not be true in every case: you could handle the missing values by providing a default or it may be reasonable to panic if your code is structured in a way that it should never happen. But, this map func API can't make such assumptions about how it is going to be used; it needs to support standard error handling.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 28, 2024
@pmalek

pmalek commented Mar 27, 2024

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label on Mar 27, 2024
@camilamacedo86
Member

By reading this one, I understand that the change that is missing / conveyed here is just to allow the API to return an error: #1996 (comment)

Then, by looking at the problem:

Ideally you would use metav1.ObjectReference when dealing with relationships between objects; however, it is sometimes the case that you need to use the API client or some other utility that may produce an error while mapping the related objects,

What it seems you are trying to do is (unless I misunderstand it):
I want to watch and trigger the reconcile for a CRD B in the controller for my CRD A.
That means my CRD A, in certain conditions, requires / depends on my CRD B.

I have an old sample project that does that. You can see here that in a controller I am watching my other Kind in the same way that I watch the Services etc. created and owned by it.

Then, it seems to me that it should be done as you do with any core API, using the refs, so this problem would not be faced. Alternatively, you can trigger any other reconciliation directly as suggested in #1996 (comment).
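
A minimal sketch of that ownership-based pattern, assuming CRD A creates and owns the CRD B objects (the crdAv1/crdBv1 type names are illustrative):

// With an owner reference set (e.g. via controllerutil.SetControllerReference),
// no fallible mapping is needed: events for the owned kind enqueue the owner
// through EnqueueRequestForOwner, which Owns() wires up under the hood.
func SetupCRDAController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&crdAv1.CRDA{}).  // stands in for "CRD A"
		Owns(&crdBv1.CRDB{}). // "CRD B" objects owned by CRD A
		Complete(r)
}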

@pmalek

pmalek commented Jun 6, 2024

The referenced controller does rely on the owner relationship (https://github.com/dev4devs-com/postgresql-operator/blob/c71c742743270ded079188ada7ced2132b7c820f/pkg/service/watches.go#L13), which is not always available.

For instance: we implement Gateway API's Gateway controller and in that controller (as one example) we want to check for changes in GatewayClass so that the Gateways using that class can be enqueued. Something like this:

func (r *Reconciler) Setup() {

...
		Watches(
			&gatewayv1.GatewayClass{},
			handler.EnqueueRequestsFromMapFunc(r.listGatewaysForGatewayClass),
			builder.WithPredicates(predicate.NewPredicateFuncs(GatewayClassMatchesController))).

...

}

....

func GatewayClassMatchesController(obj client.Object) bool {
	gatewayClass, ok := obj.(*gatewayv1.GatewayClass)
	if !ok {
		return false
	}

	return string(gatewayClass.Spec.ControllerName) == vars.ControllerName()
}

...

func (r *Reconciler) listGatewaysForGatewayClass(ctx context.Context, obj client.Object) (recs []reconcile.Request) {
	gatewayClass, ok := obj.(*gatewayv1.GatewayClass)
	if !ok {
		return
	}

	gateways := new(gatewayv1.GatewayList)
	if err := r.Client.List(ctx, gateways); err != nil {
		return
	}

	for _, gateway := range gateways.Items {
		if gateway.Spec.GatewayClassName == gatewayv1.ObjectName(gatewayClass.Name) {
			recs = append(recs, reconcile.Request{
				NamespacedName: types.NamespacedName{
					Namespace: gateway.Namespace,
					Name:      gateway.Name,
				},
			})
		}
	}

	return
}

There is no owner relationship between the two, so we cannot use handler.TypedEnqueueRequestForOwner or similar. Neither the predicate type nor the handler's enqueue map func allows handling requeues; as a matter of fact, https://github.com/kubernetes-sigs/controller-runtime/blob/f9d51ba63e205155a0d25efff911673ea201426c/pkg/handler/enqueue_owner.go also does not allow requeuing.
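
For comparison, here is how the same mapping might look under the error-returning signature discussed above (a sketch only; today's MapFunc cannot return an error, so controller-runtime would not accept this as-is):

func (r *Reconciler) listGatewaysForGatewayClass(ctx context.Context, obj client.Object) ([]reconcile.Request, error) {
	gatewayClass, ok := obj.(*gatewayv1.GatewayClass)
	if !ok {
		return nil, nil
	}

	gateways := new(gatewayv1.GatewayList)
	if err := r.Client.List(ctx, gateways); err != nil {
		// Surfacing the error would let the machinery retry instead of silently
		// dropping the event, which is the lossy behaviour described in this issue.
		return nil, err
	}

	var recs []reconcile.Request
	for _, gateway := range gateways.Items {
		if gateway.Spec.GatewayClassName == gatewayv1.ObjectName(gatewayClass.Name) {
			recs = append(recs, reconcile.Request{NamespacedName: types.NamespacedName{
				Namespace: gateway.Namespace,
				Name:      gateway.Name,
			}})
		}
	}
	return recs, nil
}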

Hope that clears up what is (yet another) angle on this, from the user PoV.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 4, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Oct 4, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot closed this as not planned (won't fix / can't repro / duplicate / stale) on Nov 3, 2024