Lossy map functions #1996
Comments
Yeah, this is a common-ish problem and something that controller-runtime doesn't help with. On a conceptual level, you likely want the handler itself to be a controller (i.e. have a workqueue and a reconciler that can fail, resulting in retries with exponential backoff). I've built something like this that uses a Maybe the easiest would be to build a |
Thanks for your reply @alvaroaleman.
My understanding of what you're suggesting is that |
Hi 👋 The more I think about this problem, the more I believe that one more option would be to either introduce an additional API that would account for the errors and allow additional logic for retries, something like
(not ideal: additional method to implement, document, test and propagate to users) OR make an adapter(s) which would encapsulate the retry logic in itself, wrapping the calls to |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle rotten @alvaroaleman are you still on board with the solution you and I discussed some months prior? Is anyone opposed to us going ahead and fixing this as we described above? |
@shaneutt Open to learn a bit more what you have in mind, we've recently added a ctx (only on the main branch) as first parameter in the MapFunc, although error handling and retries can be a good solution. If we go with a handler.RetriableMapFunc or similar, how would the logic handle errors? |
I can see the ctx being helpful in some situations for sure 👍
Errors that occur this way are expected to be transient, but the reality is that if we give people an infinite error retry mechanism there will be some cases where it's going to be a footgun. After thinking more about how this could go wrong and considering that going forward a ctx will be available: I'm actually starting to think that what we need is a mechanism to enable re-queuing which would make the downstream developer responsible for handling the errors instead of controller-runtime, but still give them a way to retry or re-queue the object when problems arise during mapping, but that be something that's more explicit and whether or not the reason is an

```go
func (r *myReconciler) mapFruitsFromTree(tree client.Object) (recs []reconcile.Request) {
	var fruits customapi.FruitList
	if err := r.client.List(context.Background(), &fruits); err != nil {
		r.log.WithError(err).Error("error listing fruits")
		return
	}
	for _, fruit := range fruits.Items {
		if isRelated(tree, fruit) {
			recs = append(recs, reconcile.Request{
				NamespacedName: types.NamespacedName{
					Namespace: fruit.Namespace,
					Name:      fruit.Name,
				},
			})
		}
	}
	return
}
```

becomes:

```go
func (r *myReconciler) mapFruitsFromTree(tree client.Object) (requeue bool, recs []reconcile.Request) {
	var fruits customapi.FruitList
	if err := r.client.List(context.Background(), &fruits); err != nil {
		r.log.WithError(err).Error("error listing fruits")
		requeue = true
		return
	}
	for _, fruit := range fruits.Items {
		if isRelated(tree, fruit) {
			recs = append(recs, reconcile.Request{
				NamespacedName: types.NamespacedName{
					Namespace: fruit.Namespace,
					Name:      fruit.Name,
				},
			})
		}
	}
	return
}
```

Curious as to your thoughts on that 🤔 |
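To show what a caller might do with the proposed `(requeue bool, recs []reconcile.Request)` return, here is a small standard-library-only model. It is a sketch under assumptions, not controller-runtime code: `Request`, `RequeueMapFunc`, and `drain` are hypothetical names, source objects are modelled as plain strings, and "requeue" here means putting the source object back so the map function is called again.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical stand-in for reconcile.Request.
type Request struct{ Namespace, Name string }

// RequeueMapFunc mirrors the shape of the proposed signature.
type RequeueMapFunc func(obj string) (requeue bool, recs []Request)

// drain processes pending source objects in order. When the map function
// asks for a requeue, the source object goes back to the end of the queue
// instead of being dropped. maxCalls bounds the loop so the sketch always
// terminates, even if the map function never succeeds.
func drain(pending []string, fn RequeueMapFunc, maxCalls int) (enqueued []Request, calls int) {
	for len(pending) > 0 && calls < maxCalls {
		obj := pending[0]
		pending = pending[1:]
		calls++
		requeue, recs := fn(obj)
		if requeue {
			pending = append(pending, obj)
			continue
		}
		enqueued = append(enqueued, recs...)
	}
	return enqueued, calls
}

func main() {
	listErr := errors.New("API server unavailable")
	failuresLeft := 1
	mapper := func(obj string) (bool, []Request) {
		if failuresLeft > 0 {
			failuresLeft--
			fmt.Println("error listing fruits:", listErr)
			return true, nil // ask the caller to try this object again
		}
		return false, []Request{{Namespace: "default", Name: obj + "-fruit"}}
	}
	enqueued, calls := drain([]string{"tree"}, mapper, 10)
	fmt.Println(enqueued, calls)
}
```

A real implementation would presumably add rate limiting between retries rather than re-running the mapping immediately.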
I like a combination of adding the ctx and the requeue bool. We are also running into this issue where we need to list objects in our mapping function. If we wait on context.Background() forever, that's going to block the controller (somewhere? not sure). If we add a timeout and log the error, our system is going to end up in an inconsistent state. |
Can you explain the footgun more? If these mapper functions need to work correctly to keep the system consistent, then infinite retry seems necessary. It's worth mentioning that the current implementation is a footgun right now because the common usage results in dropped events and there's not really a way to use it correctly. |
Been a while since I wrote that but I think I was just a little cautious of weird ways people might use this. Ultimately however I agree with you that we should move forward with this. |
@shaneutt what about |
That seems entirely reasonable 👍 |
I like this idea as well. One note though: I'd consider a separate type just for the sake of separation of concerns and not coupling this with reconciliation. WDYT? |
I put |
That isn't possible. We don't know what to requeue when the map function failed, as its purpose is to tell us that :) The only thing we could do is have a handler implementation that internally is a controller and thus has its own workqueue as suggested in #1996 (comment) |
Sorry for imprecise language. I mean requeuing to call the map function itself again. That sounds like what #1996 (comment) is describing. |
One thing I noticed today. The newer function signature in v1.15.0 accepts a context.Context:
Any function which accepts a context is inherently fallible or lossy. It doesn't make sense to have a context without returning an error, so an error should probably be added to the existing MapFunc definition, rather than creating a separate fallible MapFunc type. If we want to have an infallible MapFunc which cannot error, it should not have context.Context. |
The caller can still check I take it you're assuming that the called function needs a way to indicate that it had to give up due to the |
You could do that, but it goes against general Go practices. A function that can cause an error should handle it locally or return it up the stack so it can be handled or logged higher up. Expecting the higher function to check
I suppose even with this race condition, it would be valid for the caller to ignore whatever results came from MapFunc because the caller will fail due to the ctx error. You could get away with this from a technical perspective. It's still odd. Pretty much any function which the MapFunc could pass that context into will return its own error, which should be passed up the stack. The common usage of that context is for client functions, which can fail. |
I'm familiar with Functions accept a |
+1 from my side to add a return error, this would be a breaking change; although before doing so, we need to nail down what the behavior is going to look like, and how configurable it should be |
Even when you need to read or bind values with dynamic extent, that is something that can fail and probably needs error handling (like with It may not be true in every case: you could handle the missing values by providing a default or it may be reasonable to panic if your code is structured in a way that it should never happen. But, this map func API can't make such assumptions about how it is going to be used; it needs to support standard error handling. |
/remove-lifecycle rotten |
From reading this, I understand that the missing change conveyed here is just to allow the API to return an error: #1996 (comment) Then, looking at the problem:
What it seems you are trying to do is (unless I misunderstand it): I have an old sample project that does that. You can see here that in a controller I am watching my other Kind in the same way that I watch the Services etc. created and owned by it. So it seems to me that it should be done as you would with any core API, using the refs, so this problem would not be faced. Alternatively, you can trigger any other reconciliation directly, as suggested in #1996 (comment). |
The referenced controller does rely on the owner relationship https://github.com/dev4devs-com/postgresql-operator/blob/c71c742743270ded079188ada7ced2132b7c820f/pkg/service/watches.go#L13 which is not always the case. For instance: we implement Gateway API's
There is no owner relationship between the 2 so we cannot use Hope that clears up what is (yet another) angle on this, from a user's PoV. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Preface
This relates to a conversation I brought up at the community meeting on 2022-09-08, @camilamacedo86 and @jmrodri suggested I create an issue so we can discuss it further.
Background
Currently in controller-runtime we have the following tools:
With these it's possible to express a Watch() on one source.Kind which enqueues objects of another kind by some arbitrary relation:
Now ideally you use metav1.ObjectReference when dealing with relationships between objects, however it can sometimes be the case that you need to use the API client or some other utility that may produce an error to map the related objects, e.g.:
In these cases it would appear there are effectively no provisions to handle error conditions.
This kind of implementation and use case is not uncommon. I was able to find several examples of notable open source projects which are using it this way:
Problem
The EnqueueRequestsFromMapFunc generator for EventHandlers has no concept of handling errors that may occur when making the mapping. As such, in the example above and all the links provided for projects which are doing this, that process is currently "lossy": that is to say, it appears the machinery could effectively end up dropping an enqueued object from the queue if a failure occurred during the mapping process.
For a specific example using the Tree and Fruit sample above: if the object listing produces an error (e.g. performed at a time when the API server is having trouble and cache can't be used) this failure means that the Tree object is consumed from the queue but the related objects don't get enqueued. This would cause the resource state to become "stuck" and this won't heal without adding some other mechanism to re-queue it, or some unrelated action to re-trigger it.
Exploring Solutions
The purpose of this issue is to ask for community guidance on an existing or new provision to support this kind of mapping in a manner that is fault tolerant and can heal from an error.
Some of the workarounds that I have conceived of are:
source.Channel where objects are re-enqueued in an implementation-specific manner (clunky)
So far I think the third option there might be the most effective strategy with what we have today, but ideally what I would like to see is the ability to cleanly trigger a re-queue from within the mapping function as part of the function return. I would very much appreciate community feedback on this.