Unexpected trigger and missing object during on.update() and on.field() handlers with callback filters #784
Comments
Hello. Thanks for reporting this. It might be an issue with the documentation and phrasing. Can you please point to the exact locations that made you think that on-field is executed only on updates?

The on-field handler is executed any time the specified field is involved. The creation of an object with that field having a value is treated as a change from `None` to that value. As a result, the on-field handler is called for creations too. To only react to the field changes when the object is changed but not created, use an on-update handler with the field/value filters (e.g. `old=kopf.PRESENT`). Roughly, `on.field(field=f)` is the same as `on.update(field=f)`.

On a side note: after the introduction of the field/value filters for all handlers, on-field handlers became useless. And worse: they always were confusing, but they are kept for backward compatibility (there is no direct replacement for them). Maybe it is a good idea to avoid them at all for clarity, at least in the new operators.

As for the random execution on unrelated changes: that might be a bug. Can you please create a new issue with a reproducible snippet (if possible) or logs? Before that, one thing to look at (my personal favourite mistake) is whether there is another instance of the operator running on any machine or pod that also changes the same annotations. Two operators with the same identity usually create weird ping-pong effects, unless peering is used to freeze them properly.
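For illustration, the two forms side by side (the resource and field names here are hypothetical):

```python
import kopf

# The legacy on-field handler:
@kopf.on.field('kopfexamples', field='spec.field')
def react_to_field(old, new, **_):
    ...

# Roughly the same behaviour via the generic update handler:
@kopf.on.update('kopfexamples', field='spec.field')
def react_to_update(old, new, **_):
    ...
```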
Thanks for the quick reply, @nolar
Bottom line: what I really want is something that fires only on an actual change to the annotation's value (roughly, from one value to a different value). Thanks again!
That is the intention behind the field change filters:

```python
@kopf.on.update(..., field='...', new='x')  # absent "old" is implied to be "anything but 'x'"
def fn(**_): ...
```

If that does not work, it should be fixed to work. But a repro is needed on how exactly it does not work.
on-update also supports old/new values of the field.
Almost. You cannot use the string notation then. You can specify any field as a preparsed tuple/list:
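For example (the annotation key here is illustrative); annotation keys contain dots, so the dotted-string notation would split them at the wrong places:

```python
import kopf

@kopf.on.update('services',
                field=('metadata', 'annotations', 'example.com/state'),
                new='update')
def on_state_requested(old, new, **_):
    ...
```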
Such changes should be caught and treated as regular changes. If you want to catch ONLY the changes but not additions of the annotation, I would try this (not sure if it works, but it should):

```python
@kopf.on.update(..., field='...', old=kopf.PRESENT, new='x')  # implied: old is anything but absent or "x"; new is exactly "x"
def fn(**_): ...
```

Or that way for anything-to-anything change except the additions/deletions of the field:

```python
@kopf.on.update(..., field='...', old=kopf.PRESENT, new=kopf.PRESENT)  # implied: old != new && neither is None
def fn(**_): ...
```
PS: Also mind that when the annotation is removed, it will be a change from its value to `None` (the field becomes absent).
Once again, appreciate all of your input on this. I haven't seen issues when removing the field (…). Thanks for the tip on using the tuple/list for setting the field value. I guess part of my confusion comes from the fact that the content of `old`/`new` in the callback filters is not quite what I expected. Thanks again for the tips, will play with it more this week.
I've now got my own doubt on the old/new values. I remember the exact place where their values are reduced down to the field's values for the handler itself (if it has a field defined). But I'm not sure if the same is done for the filtering callbacks; I need to check that.

Regarding the test/repro: maybe the exact code snippet is not necessary. But the logs (with `--verbose`) can be helpful to understand what happens and why. I'll also try to play with this example and annotations as fields; maybe I can catch the case. Sometimes, new bugs come with new versions of K8s via wrong initial assumptions in Kopf.

Thank you for providing the details.
Sorry about that. :) I will definitely be doing more testing on my end as well.
OK, well, I am trying various cases with the field/value filters…
Ok... so I am encountering a missed trigger again. Here is the definition of the handler that is not firing.
The change of the annotation (to the value the filter expects) does not fire the handler. Any advice on this? This is why I originally started looking at the callback filters…
@cjbaar Can you please tell me the scale? I.e. the number of ConfigMaps & Services involved. I can try to reproduce this locally (unlike pods, these two do not require actual CPU/RAM, so I can have many of them). |
@cjbaar By the way, did you consider solving this problem with in-memory indexing? I.e., having an index of Services, grouped by a key that defines how they belong to ConfigMaps (e.g. a namespace). Every time a ConfigMap X changes, iterate over all Services under the key of X (the same namespace) and do whatever you want to do with them. The in-memory indexing is a way to keep an up-to-the-moment "mirror" of the cluster's state for faster access to the objects' data.

For stability, the processing of Services can be organised as subhandlers, one per Service, with their id equal to e.g. the Service's name, so that if patching one of them fails, the whole ConfigMap change would be retriggered, the same index rescanned, and the same subhandlers generated, but those that have already succeeded would be skipped. With this approach, there would be no need to do extra listing & kick-patching of Services in their own handlers. (Which does not mitigate the original issue, of course.)
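A sketch of that layout (the resource kinds are real; the names and the per-Service work are illustrative):

```python
import kopf

# Index all Services, grouped by namespace (one key, many values):
@kopf.index('services')
def services_by_ns(namespace, name, **_):
    return {namespace: name}

@kopf.on.update('configmaps')
async def on_configmap_change(namespace, services_by_ns: kopf.Index, **_):
    # One subhandler per Service: if patching one fails, the whole
    # change is retried, but already-succeeded subhandlers are skipped.
    for svc_name in services_by_ns.get(namespace, []):

        @kopf.subhandler(id=svc_name)
        async def patch_one_service(svc_name=svc_name, **_):
            ...  # patch the Service here
```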
Can you also please hint at what changes the annotation to something different from 'ok'? I.e., Kubernetes only reports events on resources when something changes, not when something patches the resources (from a value to the same value). If patched from 'ok' to 'ok', no event arrives. I need to know this so that I can construct a mini-operator with this issue reproduced.
Thanks, as always, for the quick reply. When the reconcile process for a given Service completes, it patches the annotation value back to 'ok'. So the starting value should always be 'ok' until another handler tells it not to be. Here is the rest of the function called by the handlers above.
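Roughly, assuming the official `kubernetes` Python client (the annotation key here is illustrative), the patch is shaped like this:

```python
from kubernetes import client

def mark_service_ok(namespace: str, name: str) -> None:
    # Patch only our own annotation key; a targeted patch like this
    # leaves Kopf's bookkeeping annotations untouched.
    body = {'metadata': {'annotations': {'example.com/state': 'ok'}}}
    client.CoreV1Api().patch_namespaced_service(name, namespace, body)
```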
I am just not sure why I get inconsistent results with this. When I test locally with a few services, it works as I expect. I am wondering if there is something in the timing of double-patching, because I patch the annotation, but then Kopf is going to apply its own patch to set its last-handled-configuration annotation…
There is only one configmap currently involved; the dev cluster has about 35 services. I will take a look at the indexing. I guess I was generally trying to avoid the case of a single failed service update affecting the rest, or the whole configmap handler. I was imagining the goal of writing operators was to be as independent for each action as possible.
Also, not to confuse the issue, but I turned on debugging for some of my logging to try to follow this more closely, and now I see evidence that an on-create event is being triggered for existing objects.
So... basically, for my patch action, I am expecting an update event, but it seems to be treated as a creation…
Hm. On-creation is triggered when there is no annotation for "last-handled-configuration" (or whatever it is named). Is the issue happening only to the new services? Or to the existing ones too? Is it possible to extract logs for one problematic Service (randomly selected) here? Just to see what and why is happening. |
There should be no "new" services at this point. We aren't creating additional resources while the operator is running. So, based on the debug output, the on-create handlers appear to be firing for existing services.
I can probably give you a hint to think about. Kopf accumulates events within a short time window (0.1s) and handles only the last one in the batch. It currently ignores the expected version of the resource; this is a known issue with a drafted fix, not yet fully verified and merged (#718). Such cases will be more noticeable on high-scale clusters than on low-scale clusters.

The trigger for the issue can be this chain of actions: if you originally react to the object's version X, your own patch produces version X+1, and Kopf's annotations patch produces version X+2.

Under normal circumstances (low load), Kopf will wait for the last known version X+2 to arrive within max 0.1s from X+1. It then starts the processing cycle for the change "update->ok", filters out the handlers (because the actual annotation does not match the criterion), and will store a new version X+3 with the updated last-handled state.

However, if X+2 is delayed by longer than 0.1s from X+1 (when Kubernetes or the operator are under high load), Kopf will go processing X+1 as the last seen in the batch. This is an intermediate state that was not supposed to be handled. It might miss the annotations, so Kopf will believe it is processing a new object (again!).

This is only a hypothesis. It needs to be verified by an experiment. Currently, it can be remediated by increasing the batch-window setting to 0.2-0.5 seconds (at the expense of some slowdown of the operator's reactions).

The proper fix (which I mentioned as drafted but not merged) is to remember the resource version of Kopf's own patch (X+2) and to skip all versions before that expected version arrives (e.g. X+1), regardless of how slowly they arrive. There are nuances, of course: e.g. we do not want to wait forever if the version is missed, and there still has to be a timeout, just bigger. It is never easy.
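For example, via a startup handler (0.5 is just a sample value):

```python
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Default is 0.1s; a wider window gives delayed intermediate
    # events more time to be superseded within the same batch.
    settings.batching.batch_window = 0.5
```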
Okay, this is unusual. May I ask you to show how exactly you patch the service and its annotations? Is it a merge-patch? Can it be so that you accidentally remove Kopf's annotations?
Patching is done using the k8s client library…
FYI, other annotations that the operator does not interfere with appear to be intact.
Thanks again. I will try tweaking the batch-window and see if that helps. It seems very likely this is some kind of timing issue with the two patch events. As for the indexes, I have read through that page a couple of times, and my brain is not quite getting there yet. As I understand it now, the…
NM on the last point. I see the probe listed in the health checks page. The search just doesn't seem to work very well :)
FYI, I tried increasing the batch window setting…
@nolar I have started playing with the indexes more. While it does seem it would solve some of the extra API reads I am currently doing, I am not sure it totally solves my use cases. Also, I am able to easily duplicate similar problems with the index being out-of-sync, because it relies on the same event streams which are getting missed or misinterpreted in the same way as for the other handlers. When I start the operator, it creates the index, which appears correct. Then it goes on to process any necessary resume or stale-create operations, and if I change the indexed resource during this time, the change does not get picked up. This is just a test cluster with a few dozen resources being handled by the operator. I am starting to get concerned about using this framework in production, where we have clusters with over 500 services across various namespaces.
Which actually brings up another point. If I start the operator with debugging enabled, I still see all these messages that indicate the resources are being treated as new…
That is not expected. However, it is difficult for me to help without at least the operator's logs (with "-v" enabled), not to mention a reproducible snippet (with all the handlers, settings, and API operations inside). I am just blindly guessing what could be happening hypothetically, and that does not help.
Sure. Is there somewhere I can send them to you?
Thanks; I will put something together. I have been looking at the index option, but it doesn't negate my need to still patch the initial status (to make sure the item gets into the index) and also potentially run some code on object creation (in addition to being added to the index). That seems like I am just making more handlers, not less. Also not sure about the subhandler approach, where the main trigger is going to wait on 50+ items and fail if any single one fails. I understand that they don't get re-handled, but it still seems far less efficient to me than having each resource maintain its own state. Am I not following that correctly?
Update for the morning. I am making more progress trying to move to indexing; I found a solution that might work for us. I wanted to add that, with further testing, I am also seeing the log message from my…
OK, so I'm running into several other issues when trying to use the index/loop method. One is that I'd like to be able to reference an index anywhere in the code, without putting it in the handler definition, and I don't see a way to do that. The reason for this is I need to reference it 3-4 levels deep in other functions, and I don't want to pass it around everywhere it isn't needed. The bigger issue is owner references. Since the handler is now on the namespace with a loop over services, instead of individual handlers for each service, the objects using `kopf.adopt()` no longer get the owner I want.
There was an idea on this in #730 (comment).

For the owner references, that should be easy: it is just a little tiny dict with 6 fields in the object's metadata. You can form it yourself.
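A sketch (the owner object here is whatever resource should be the parent; fetch its live body first):

```python
def make_owner_reference(owner: dict) -> dict:
    # The six fields of an ownerReferences entry. The uid must come
    # from the live owner object, so it has to be fetched or known.
    return {
        'apiVersion': owner['apiVersion'],
        'kind': owner['kind'],
        'name': owner['metadata']['name'],
        'uid': owner['metadata']['uid'],
        'controller': True,
        'blockOwnerDeletion': True,
    }
```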
@nolar Yeah, I tried providing the owner reference manually. The API won't accept it unless it includes the uid…
So, apologies if this thread is spiraling into too many other topics. One thing I noticed when using the list/subhandler approach is that things seem to go a lot slower, because each subhandler is executed sequentially and waits for each to finish. There is no multi-threading or simultaneous processing, which I felt was in effect when using individual handlers. I am defining the subhandler as this. Do I need to use the…
Update on the last post: I was monitoring the wrong handler. The actual problem I am having with this one is that only the last element in the list gets run via a subhandler. So if I use this code:
Then I get this in the logs:
The first log output shows me the list is looping properly, but only one subhandler is triggered, where I expect two.
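For reference, the subhandler-in-a-loop pattern from the Kopf documentation gives every subhandler a distinct id and pins the loop variable with a default argument; if the ids collide (e.g. every iteration registers a function with the same name), only one subhandler survives:

```python
import kopf

@kopf.on.update('configmaps')  # the resource is illustrative
async def on_change(**_):
    for item in ('svc-a', 'svc-b'):  # illustrative items

        @kopf.subhandler(id=item)               # distinct id per item
        async def handle_item(item=item, **_):  # default arg pins the value
            ...  # per-item work goes here
```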
Question
I am using custom callback filters to try to limit when a handler executes. In some cases of new object creation, I am getting an empty (null) value for the `old` object.

It is my understanding, from reading the documentation, that `on.update()` and `on.field()` handlers should only execute when a change is made to a given resource. Also, if I read correctly, the objects `old` and `new` should be present for any update/field handler, and the callback filter should "receive the same keyword arguments as the respective handlers". In the case of the `on.field()` below, the field handler not only executes on the creation of a resource, but also seems to execute before the `on.create()` does.

Also, I really want the `on.field()` handler to only execute when the value of the annotation changes, but I have seen it execute randomly on edits of the other sections of the configMap (separate issue?). To get around that, I tried creating a custom filter callback for this as well, but that's when I run into the null `old` issue.

So, in the above code, is it intended that `on.field()` should be firing when the object is created? My interpretation of the docs says no, but it does. Similarly, I have another `on.update()` trigger that fires when I don't expect it to, and its `old` object is also empty.

Checklist
Keywords
handlers
filters
on.update
on.field