Unexpected trigger and missing object during on.update() and on.field() handlers with callback filters #784

Open
cjbaar opened this issue Jun 11, 2021 · 37 comments
Labels
bug Something isn't working question Further information is requested

Comments

@cjbaar (Contributor) commented Jun 11, 2021

Question

I am using custom callback filters to try to limit when a handler executes. In some cases of new object creation, I am getting an empty (null) value for the old object.

It is my understanding, from reading the documentation, that on.update() and on.field() handlers should only execute when a change is made to a given resource. Also, if I read correctly, the old and new objects should be present for any update/field handler, and the callback filter should "receive the same keyword arguments as the respective handlers". In the case of the on.field() below, the field handler not only executes on the creation of a resource, but also seems to execute before the on.create() does.

def is_ccp_config(name, **_):
    """Filter on the ConfigMap name."""
    return name == 'xxx'

@k8s.kopf.on.create(
    'configmap',
    when=is_ccp_config,
    annotations={'ccp/xxx':'true'},
    param='create'
)
@k8s.kopf.on.field(
    'configmap',
    when=is_ccp_config,
    field='metadata.annotations.ccp/xxx',
    new='true',
    param='update-annotations'
)

Also, I really want the on.field() handler to only execute when the value of the annotation changes, but I have seen it execute randomly on edits to other sections of the ConfigMap (separate issue?). To get around that, I tried creating a custom filter callback for this as well, but that's when I ran into the null old issue.

def is_annotation_changed(old, new, **_):
    """Helper for the change-handler filter."""
    old_val = old['metadata'].get('annotations', {}).get('ccp/xxx')
    new_val = new['metadata'].get('annotations', {}).get('ccp/xxx')
    return old_val != new_val

@k8s.kopf.on.field(
    'configmap',
    when=k8s.kopf.all_([is_ccp_config, is_annotation_changed]),
    field='metadata.annotations.ccp/xxx',
    new='true',
    param='update-annotations'
)

So, in the above code, is it intended that on.field() should fire when the object is created? My interpretation of the docs says no, but it does. Similarly, I have another on.update() trigger that fires when I don't expect it to, and the old object is also empty.


Keywords

handlers
filters
on.update
on.field

@cjbaar cjbaar added the question Further information is requested label Jun 11, 2021
@nolar (Owner) commented Jun 11, 2021

Hello. Thanks for reporting this.

It might be an issue with the documentation and phrasing. Can you please point to the exact locations that made you think that on-field is executed only on updates?

The on-field handler is executed any time the specified field is involved. The creation of an object with that field having a value is treated as a change from null to the value — the same way as when the field is added to the existing object when it was not there before.

As a result, the on-field handler is called for creations too.

To only react to the field changes when the object is changed but not created, use @kopf.on.update(..., field="...") — that is an on-update handler for the object (not the field!), but additionally filtered by the criteria of the field existence & value change. See more: https://kopf.readthedocs.io/en/stable/filters/#field-filters

Roughly, on.field(field=...) == on.create(field=...) + on.update(field=...) (but really roughly).


On a side note: after the introduction of the field/value filters for all handlers, on-field handlers became redundant. And worse: they were always confusing — but they are kept for backward compatibility (there is no direct replacement for them). Maybe it is a good idea to avoid them altogether for clarity — at least in new operators.


As for the random execution on unrelated changes — that might be a bug. Can you please create a new issue with a reproducible snippet (if possible) or logs?

Before that, one thing to look at — my personal favourite mistake — is whether there is another instance of the operator running on any machine or pod that also changes the same annotations. Two operators with the same identity usually create weird ping-pong effects — unless peering is used to freeze them properly.

@cjbaar (Contributor, Author) commented Jun 11, 2021

Thanks for the quick reply, @nolar
Sounds like there are a few things to unpack.

  1. My impression from the docs comes from this line in https://kopf.readthedocs.io/en/stable/filters/, when using a new value.
    Match when the field reaches a specific value either by being edited/patched to it or by adding it to the resource (i.e. regardless of the old value):
    That, to me, implied a change, not a creation.

  2. I will try moving back to an on.update(), but I started using on.field() in the first place because I was not getting consistent results when trying to check for a change in the annotations. From what I can tell, when keying on annotations, I can only check the new spec, and not the old. I can only specify new/old on field, and even then can only give specific values.
    2b. Side note: I can't use field to check for annotations with . in them, because an annotation with a name like me.k8s.io/object-name gets written as meta.annotations.me.k8s.io/object-name, and so the parts of the domain name get interpreted as further dictionary paths.

  3. As noted, I have experienced similar behavior with on.update(), where the update fails because the old object is missing. I will try to reproduce that again. Is there a case where old would not be present for an update handler?

  4. Yes I have confirmed only one operator is running currently; testing locally with Docker desktop cluster right now.

Bottom line: what I really want is something that fires only on an actual change to an annotation (roughly new=x, old!=x). Is the only way to get that via a custom callback?

Thanks again!

@nolar (Owner) commented Jun 12, 2021

new=x, old!=x

That is the intention behind the field change filters:

@kopf.on.update(..., field='...', new='x')  # absent "old" is implied to be "anything but 'x'"
def fn(**_): ...

If that does not work, it should be fixed to work. But a repro is needed on how exactly it does not work.

I can only specify new/old on field,

on-update also supports old/new values of the field.

and even then can only give specific values.

value=, old=, new= can also be callbacks rather than specific values. With one extra positional arg (not a kwarg) — the value to check.
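As a plain-Python illustration of such a filtering callback (the function name and the checked value here are made up for the example, not taken from the issue):

```python
# A value-filter callback: it receives the checked value as one extra
# positional argument, plus the usual handler kwargs.
def annotation_became_true(value, **_):
    """Return True when the checked field value equals 'true'."""
    return value == 'true'
```

Presumably it would then be passed in place of a literal, e.g. `new=annotation_became_true` in the handler decorator.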

I can't use field to check for annotations with . in them, because an annotation with a name like me.k8s.io/object-name gets written as meta.annotations.me.k8s.io/object-name

Almost. You cannot use the string notation then. You can specify any field as a preparsed tuple/list: field=['metadata', 'annotations', 'me.k8s.io/object-name']. See: https://github.com/nolar/kopf/blob/1.32.1/kopf/_cogs/structs/dicts.py#L23-L43 — it is part of the public API; though, I'm not sure if it is documented (if not, it should be).
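A quick sketch of why the string notation fails for such keys (plain Python, just showing the splitting behaviour):

```python
# The dotted string is split on every '.', so a domain-like annotation key
# fragments into separate path components:
dotted = 'metadata.annotations.me.k8s.io/object-name'
print(dotted.split('.'))  # ['metadata', 'annotations', 'me', 'k8s', 'io/object-name']

# The pre-parsed form keeps the annotation key intact:
field_path = ['metadata', 'annotations', 'me.k8s.io/object-name']
```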

Is there a case where old would not be present for an update handler?

The old kwarg will be present always. It can be None in the update handlers if the field (or the annotation) was added to the existing object, where it was absent before.

Such changes should be caught and treated as regular changes.

If you want to catch ONLY the changes but not additions of the annotation, I would try this (not sure if it works, but it should):

@kopf.on.update(..., field='...', old=kopf.PRESENT, new='x')  # implied: from anything but absent and "x" — to "x" specifically
def fn(**_): ...

Or that way for anything-to-anything change except the additions/deletions of the field:

@kopf.on.update(..., field='...', old=kopf.PRESENT, new=kopf.PRESENT)  # implied: old!=new && neither is None
def fn(**_): ...

@nolar (Owner) commented Jun 12, 2021

PS: Also mind that when the annotation is removed, it will be a change from old="x" to new=None. It can also be filtered with the kopf.PRESENT/kopf.ABSENT markers accordingly (e.g. @kopf.on.update(..., old=kopf.PRESENT, new=kopf.ABSENT)).

@cjbaar (Contributor, Author) commented Jun 12, 2021

Once again, appreciate all of your input on this.

I haven't seen issues when removing the field (old=x), just when adding it. The whole reason I went down the custom callback route is that I have seen instances where I use field with new=x, and the handler still fires even when the value is not changing. Of course, it's not consistent, so I'm not sure if I can come up with a test case, but I will try.

Thanks for the tip on using the tuple/list for setting the field value.

I guess part of my confusion comes from the fact that the contents of old and new seem to change depending on the nature of the handler definition. I was expecting their values would always be the entire contents of the resource, but you are saying that if I use field as a filter, the value of old should be just that particular field? In the case where my old is empty using on.update(), the value of new is still the entire resource definition, so it doesn't make sense to me that old would be empty. I would expect old to be the entire resource definition as it was before the requested change.

Thanks again for the tips, will play with it more this week.

@nolar (Owner) commented Jun 12, 2021

I’ve now got my own doubt on the old/new values. I remember the exact place where their values are reduced down to the field’s values for the handler itself (if it has a field defined). But I’m not sure if the same is done for the filtering callbacks — I need to check that.

Regarding the test/repro — maybe the exact code snippet is not necessary. But the logs (with --verbose) can be helpful to understand what happens and why.

I’ll also try to play with this example and annotations as fields — maybe I can catch the case. Sometimes, new bugs come with new versions of K8s via wrong initial assumptions in Kopf. Thank you for providing the details.

@cjbaar (Contributor, Author) commented Jun 13, 2021

I’ve now got my own doubt on the old/new values.

Sorry about that. :) I will definitely be doing more testing on my end as well.

@cjbaar (Contributor, Author) commented Jun 14, 2021

OK, well, I am trying various cases with the on.update() with field handler, and so far I can't get it to misfire again. So before we dig too deeply into the code, let's chalk that up to some kind of user error. I will try to create a reproducible case for the on.update() with old empty, at least.

@cjbaar (Contributor, Author) commented Jun 30, 2021

Ok... so I am encountering a missed trigger again. Here is the definition of the handler that is not firing.

@k8s.kopf.on.resume(
    'service',
    annotations={'service-mesh.operator.ccp/component-sync': 'update'},
    param='resume'
)
@k8s.kopf.on.create(
    'service',
    annotations={'service-mesh.operator.ccp/component-sync': 'update'},
    param='create'
)
@k8s.kopf.on.update(
    'service',
    annotations={'service-mesh.operator.ccp/component-sync': 'update'},
    param='update'
)
def sync_service_components(namespace, name, logger, param, **kwargs):

The change of the annotation to update is done by another handler. Basically, I am watching a specific ConfigMap, and if the data in that cm changes, then I want to re-check all of the Services in that namespace. This is done by patching the annotation of each relevant Service (with some filters) to update... and then I expect the on.update() handler above to fire to reconcile each resource. When I tested locally with a few Services, it works fine. But when I put it into an active cluster, it does not fire. If I kill the operator pod and let a new one spawn, the on.resume() does fire and finish the job.

Any advice on this? This is why I originally started looking at the on.field() and custom callbacks, because I didn't feel the on.update() was being triggered reliably. I am not sure how to replicate this on a small scale; there are currently about a dozen handler functions in the operator.

@nolar (Owner) commented Jun 30, 2021

@cjbaar Can you please tell me the scale? I.e. the number of ConfigMaps & Services involved. I can try to reproduce this locally (unlike pods, these two do not require actual CPU/RAM, so I can have many of them).

@nolar (Owner) commented Jun 30, 2021

@cjbaar By the way, did you consider solving this problem with in-memory indexing? I.e., having an index of Services, grouped by a key that defines how they belong to ConfigMaps (e.g. a namespace). Every time a ConfigMap X changes, iterate over all Services under the key of X (the same namespace) and do whatever you want to do with them. The in-memory indexing is a way to keep an up-to-the-moment "mirror" of the cluster's state for faster access to the objects' data.

For stability, the processing of Services can be organised as subhandlers, one per Service, with their id equal to e.g. the Service's name — so that if one patching fails, the whole ConfigMap change is retriggered, the same index rescanned, and the same subhandlers generated, while those that have already succeeded are skipped.

With this approach, there would be no need to do extra listing & kick-patching of Services in their own handlers. (Which does not mitigate the original issue, of course.)

@nolar (Owner) commented Jun 30, 2021

This is done by patching the annotation of each relevant Service (with some filters) to update

Can you also please hint at what changes the annotation to something different from update after the reconciliation is done?

I.e., Kubernetes only reports events on resources when something changes, not when something patches a resource (from a value to the same value). If patched from update to update, Kubernetes will not report a change. Besides, even if it did, Kopf would remain blind to an event with no actual change (for its own reasons). So, something should be changing the annotation from update to e.g. null (or anything that is not update).
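The no-change point can be illustrated with a toy merge in plain Python (this is only an illustration, not Kopf's or Kubernetes' actual merge code):

```python
def merge(base, overlay):
    """Recursively merge `overlay` into a copy of `base` (toy merge-patch)."""
    out = dict(base)
    for key, val in overlay.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], val)
        else:
            out[key] = val
    return out

current = {'metadata': {'annotations': {'sync': 'update'}}}
patch = {'metadata': {'annotations': {'sync': 'update'}}}  # same value again
print(merge(current, patch) == current)  # True: no effective change, so no event
```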

I need to know this so that I can construct a mini-operator with this issue reproduced.

@nolar nolar added the bug Something isn't working label Jun 30, 2021
@cjbaar (Contributor, Author) commented Jun 30, 2021

Thanks, as always, for the quick reply. When the reconcile process for a given Service completes, it patches the annotation value back to 'ok'. So the starting value should always be 'ok' until another handler tells it not to be. Here is the rest of the function called by the handlers above.

def sync_service_components(namespace, name, logger, param, **kwargs):
    """handler to add Istio Components for new Service"""
    logger.info(f"Processing {param} method for sync_service_components")

    act_destination_rule.sync_resource(namespace, name, logger)
    act_virtual_service.sync_resource(namespace, name, logger)
    act_service.set_sync_status(namespace, name, 'ok', logger)

I am just not sure why I get inconsistent results with this. When I test locally with a few services, it works as I expected. I am wondering if there is something in the timing of double-patching, because I patch the annotation, but then Kopf is going to apply its own patch to set kopf.zalando.org/last-handled-configuration, right?

@cjbaar (Contributor, Author) commented Jun 30, 2021

There is only one ConfigMap currently involved; the dev cluster has about 35 services. I will take a look at the indexing. I guess I was generally trying to avoid the case of a single failed Service update affecting the rest, or the ConfigMap handler. I was imagining the goal of writing operators was to make each action as independent as possible.

@cjbaar (Contributor, Author) commented Jun 30, 2021

Also, not to confuse the issue, but I turned on debugging for some of my logging to follow this more closely, and now I see evidence that an on.create() handler is attempting to fire when the Service is patched. Is this expected? The callback filter is preventing the action, but I would not expect it to even be trying.

@k8s.kopf.on.create(
    'service',
    when=filter_service_with_ns_injection,
    param='create'
)

@cjbaar (Contributor, Author) commented Jun 30, 2021

So... basically, for my patch action, I am expecting the on.update() to fire, but instead an on.create() is attempted, and denied by a filter. But this happens only part of the time (not consistently).

@nolar (Owner) commented Jun 30, 2021

Hm. On-creation is triggered when there is no annotation for "last-handled-configuration" (or whatever it is named).

Is the issue happening only to the new services? Or to the existing ones too?

Is it possible to extract logs for one problematic Service (randomly selected) here? Just to see what is happening and why.

@cjbaar (Contributor, Author) commented Jun 30, 2021

There should be no "new" services at this point. We aren't creating additional resources while the operator is running. So, based on the debug output, the on.create() handler is attempting to run when the service annotation is patched to update (by another handler), and then again when it is patched back to 'ok'. So every patch seems to trigger on.create().

@nolar (Owner) commented Jun 30, 2021

I can probably give you a hint to think about.

Kopf accumulates events within a short time window (0.1s) and handles only the last one in the batch. It currently ignores the expected version of the resource — this is a known issue with a drafted fix, not yet fully verified and merged (#718). Such cases will be more noticeable on high-scale clusters than on low-scale clusters.

The trigger for the issue can be this chain of actions:

If you originally react to the object's version X with annotation==update, your own patch creates version X+1: annotation==ok, and Kopf's follow-up patch creates version X+2: annotation==ok + last-handled-configuration.

Under normal circumstances (low-load), Kopf will wait for the last known version X+2 to arrive within max 0.1s from X+1. It then starts the processing cycle for the change "update->ok", filters out the handlers (because the actual annotation does not match the criterion), and will store a new version X+3 with the updated last-handled-configuration. Silently (almost).

However, if X+2 is delayed by more than 0.1s after X+1 (when Kubernetes or the operator is under high load), Kopf will go on processing X+1 as the last seen in the batch. This is an intermediate state that was not supposed to be handled. It might miss the annotations, so Kopf will believe it is processing a new object (again!).

This is only a hypothesis. It needs to be verified by an experiment.

Currently, it can be remediated by increasing the batch-window setting to 0.2-0.5 seconds (at the expense of some slowdown of the operator's reactions).

The proper fix — which I mentioned as implemented but not merged — is to remember the resource version of Kopf's own patch (X+2) and to skip all versions before that expected version arrives (e.g. X+1) — regardless of how slowly they arrive. There are nuances, of course: e.g. we do not want to wait forever if the version is missed, and there still has to be a timeout, just bigger. It is never easy.

@nolar (Owner) commented Jun 30, 2021

So every patch seems to trigger on.create().

Okay, this is unusual. May I ask you to show how exactly you patch the service and its annotations? Is it a merge-patch? Could it be that you accidentally remove Kopf's annotations?

@cjbaar (Contributor, Author) commented Jun 30, 2021

Patching is done using the k8s client's patch_namespaced_service. It is my understanding that this only supports a "strategic merge patch", and since annotations are a map (dict), I thought other elements should be preserved, but maybe not.
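A sketch of what such a patch body presumably looks like (the exact body is an assumption; with a merge patch, a map like metadata.annotations is merged key by key, so unrelated keys, including Kopf's kopf.zalando.org/last-handled-configuration, should indeed survive):

```python
# Hypothetical patch body: only the listed annotation key should be
# merged in; other annotations on the Service should be left untouched.
patch_body = {
    'metadata': {
        'annotations': {
            'service-mesh.operator.ccp/component-sync': 'ok',
        },
    },
}
# e.g. api.patch_namespaced_service(name, namespace, patch_body)
```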

@cjbaar (Contributor, Author) commented Jun 30, 2021

FYI, other annotations that the operator does not interfere with appear to be intact.

@cjbaar (Contributor, Author) commented Jun 30, 2021

Thanks again. I will try tweaking the batch-window and see if that helps. It seems very likely this is some kind of timing issue with the two patch events.

As far as the indexes go, I have read through this page a couple of times, and my brain is not quite getting there yet. As I understand it now, the index() function allows me to build the index, and then I can pass my_index: kopf.Index as another arg to any handler? Are the probe() functions required? I don't see probe mentioned anywhere else in the docs.

@cjbaar (Contributor, Author) commented Jun 30, 2021

NM on the last point. I see the probe listed in the health checks page. The search just doesn't seem to work very well :)

@cjbaar (Contributor, Author) commented Jun 30, 2021

FYI, I tried increasing settings.batching.batch_window = 0.5, but I still see the attempted create handlers.
UPDATE: I am not setting something properly. I added this, but when I look at the setting later, it still shows 0.1:

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.batching.batch_window = 0.5

@cjbaar (Contributor, Author) commented Jul 1, 2021

@nolar I have started playing with the indexes more. While it does seem it would solve some of the extra API reads I am currently doing, I am not sure it totally solves my use cases. Also, I am able to easily duplicate similar problems with the index being out of sync, because it relies on the same event streams, which get missed or misinterpreted in the same way as for the other handlers. When I start the operator, it creates the index, which appears correct. Then it goes on to process any necessary resume or stale create operations, and if I change the indexed resource during this time, the change does not get picked up.

This is just a test cluster with a few dozen resources being handled by the operator. I am starting to get concerned about using this framework in production, where we have clusters with over 500 services across various namespaces.

@cjbaar (Contributor, Author) commented Jul 1, 2021

Which actually brings up another point. When I start the operator with debugging enabled, I still see all these messages indicating that on.create() is attempting to fire for all of my resources that were definitely previously seen and annotated by Kopf. The actual execution is blocked by a callback filter, but I am not expecting that handler to even be attempted. Please help me understand what I am missing here. :) on.create() fires at startup, and any time a resource is touched. This is not expected, correct?

@nolar (Owner) commented Jul 1, 2021

That is not expected.

However, it is difficult for me to help without at least the operator’s logs (with “-v” enabled), not to mention a reproducible snippet (with all the handlers, settings, and API operations inside). I am just blindly guessing what could be happening hypothetically — and that does not help.

@cjbaar (Contributor, Author) commented Jul 1, 2021

Sure. Is there somewhere I can send them to you?

@nolar (Owner) commented Jul 1, 2021

nolar@nolar.info

@cjbaar (Contributor, Author) commented Jul 1, 2021

Thanks; I will put something together. I've been looking at the index option, but it doesn't negate my need to still patch the initial status (to make sure the item gets into the index) and also potentially run some code on object creation (in addition to being added to the index). That seems like I am just making more handlers, not fewer. Also, I'm not sure about the subhandler approach, where the main trigger is going to wait on 50+ items and fail if any single one fails. I understand that they don't get re-handled, but it still seems far less efficient to me than having each resource maintain its own state. Am I not following that correctly?

@cjbaar (Contributor, Author) commented Jul 2, 2021

Update for the morning: I am making more progress trying to move to indexing; I found a solution that might work for us. I wanted to add that, with further testing, I am also seeing the log message from my on.create() callback function when deleting a service as well. Is there something about the use of a custom callback filter that overrides the create/update/etc. functionality of the default decorators?

@cjbaar (Contributor, Author) commented Jul 2, 2021

OK, so I am running into several other issues when trying to use the index/loop method. One is that I'd like to be able to reference an index anywhere in code, without putting it in the handler definition, and I don't see a way to do that. The reason is that I need to reference it 3-4 levels deep in other functions, and I don't want to pass it around everywhere it isn't needed.

The bigger issue is owner references. Since the handler is now on the namespace with a loop over services, instead of individual handlers for each service, the objects using append_owner_reference get the wrong owner (the namespace instead of the service). I have tried putting this in a subhandler, but unless I do an API call every time to read the service definition, I don't have that spec. So it seems like using indexes doesn't actually reduce my dependence on multiple API calls.

@nolar (Owner) commented Jul 4, 2021

I need to reference it 3-4 levels deep in other functions, and don't want to pass it around everywhere

There was an idea on this in #730 (comment) — the one with @kopf.index(…, to=custom_idx_obj) — let me see if I can implement that quickly.

For the owner references, that should be easy: it is just a little tiny dict with 6 fields in the object's metadata. You can form it yourself.
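A sketch of such a hand-formed entry for metadata.ownerReferences (every value here is a placeholder; note that uid must be the owning object's real UID):

```python
# Hypothetical owner reference pointing at a Service. The 'uid' field is
# required by the Kubernetes API and must match the owner's actual UID.
owner_ref = {
    'apiVersion': 'v1',
    'kind': 'Service',
    'name': 'my-service',
    'uid': '00000000-0000-0000-0000-000000000000',
    'controller': True,
    'blockOwnerDeletion': True,
}
# This dict would be appended to the dependent object's
# metadata.ownerReferences list.
```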

@cjbaar (Contributor, Author) commented Jul 6, 2021

@nolar Yeah, I tried providing the owner reference manually. The API won't accept it unless it includes the uid field, which I don't know without another API call. I guess I could try storing that in the index, but I want to avoid too much in-memory caching. I am already indexing way more than I really want to, because in our use case I need the index to be updated when things other than the actual resource are changed.

@cjbaar (Contributor, Author) commented Jul 6, 2021

So, apologies if this thread is spiraling into too many other topics. One thing I noticed when using the list/subhandler approach is that things seem to go a lot slower, because the subhandlers are executed sequentially, each waiting for the previous one to finish. There is no multi-threading or simultaneous processing, which I felt was in effect when using individual handlers. I am defining the subhandler like this. Do I need to use the async method to run them in parallel?

for name in list(idx_svc_by_ns.get(namespace)):
    @k8s.kopf.subhandler(id=name)
    def handle_service(spec, **_):
        # ... do something

@cjbaar (Contributor, Author) commented Jul 6, 2021

Update on the last post: I was monitoring the wrong handler. The actual problem I am having with this one is that only the last element in the list gets run via a subhandler. So if I use this code:

    for svc in list(idx_svc_by_ns.get(namespace)):
        name, uid = svc['name'], svc['uid']
        logger.info(f"Subhandle resources for {name} ({uid})")
        @k8s.kopf.subhandler(id=uid)
        def handle_service(**_):
            logger.info(f"Syncing resources for {name} ({uid})")

Then I get this in the logs:

[test-label] Subhandle resources for test-svc (97132a09-a5ca-4815-b571-d3622b5c2a7e)
[test-label] Subhandle resources for test-svc2 (7e6d190b-260d-4057-8a92-e55764170223)
[test-label] Syncing resources for test-label (4ca6b9e0-5430-485c-832d-3b0a81368298)

The first log output shows the list is looping properly, but only one subhandler is triggered, where I expect two.
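One thing worth noting in that log: the "Syncing" line prints values from neither loop iteration, which looks like Python's late-binding closures, where every function defined in a loop sees the loop variables' final values. Whether this also explains the missing subhandler is unclear, but the capture itself can be avoided by binding the loop variables as default arguments. A plain-Python demonstration (no Kopf involved):

```python
def make_handlers(names):
    """Build closures two ways: late-bound vs. bound at definition time."""
    late, bound = [], []
    for name in names:
        late.append(lambda: name)             # late-bound: all see the last name
        bound.append(lambda name=name: name)  # default arg binds the current name
    return late, bound

late, bound = make_handlers(['test-svc', 'test-svc2'])
print([f() for f in late])   # ['test-svc2', 'test-svc2']
print([f() for f in bound])  # ['test-svc', 'test-svc2']
```

In the snippet above, that would mean something like `def handle_service(name=name, uid=uid, **_):`.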
