
some 404s from split client soon after startup, even after cache sync #267

Closed

joel-bluedata opened this issue Feb 11, 2020 · 4 comments

Labels: Priority: Low · Project: Cluster Reconcile beyond simple xlate of model to K8s spec · Type: Bug

Comments

joel-bluedata (Member) commented Feb 11, 2020

If you restart the operator while there are existing kdclusters, the manager will immediately fire off reconciler handlers to process those existing resources.

If this is a mildly "busy" K8s cluster with a non-trivial number of resources, I've noticed that this first reconciler pass can fail in odd ways because of 404 results when it tries to fetch various resources (both CRs and native K8s objects). Subsequent handler passes are then OK.

My first thought is that this is related to the "caching client" that the manager sets up for us to use. I do see our code logging that mgr.GetCache().WaitForCacheSync has completed before any of the manager's controllers and workers even start. I believed that this meant the client was now aware of the current set of resources in the cluster, but maybe not.
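
For reference, roughly the shape of that startup check, as a minimal sketch (assumed wiring, not the actual KubeDirector startup code; in controller-runtime releases of this era WaitForCacheSync takes a stop channel, newer releases take a context.Context):

```go
package startup

import (
	"log"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// waitForCache blocks until the manager's informer cache reports that it has
// synced, and logs the outcome. Illustrative wiring only.
func waitForCache(mgr manager.Manager, stop <-chan struct{}) bool {
	if !mgr.GetCache().WaitForCacheSync(stop) {
		log.Print("cache never reported synced")
		return false
	}
	log.Print("WaitForCacheSync complete; starting controllers and workers")
	return true
}
```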

This isn't a functional showstopper since each handler should be robust to the client telling it that something doesn't exist when it actually does. Worst case = we try to re-create an object, fail, and the handler exits to try again later. Subsequent handler invocations always seem to be fine.

However, it makes the logs look ugly after a restart, potentially hiding actual bugs.

I've found that I can prevent almost all occurrences of this in my test clusters by having the reconciler handler first do a test read of the KD deployment resource, just to make sure that the K8s client is live. Even with this in place, though, I sometimes still see a bogus "not found" result on an initial handler pass, which makes me think the cache is still being gradually populated (even after waiting a second or two) rather than the client just being generally not live yet.
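
Roughly what that test read looks like, as a minimal sketch (hypothetical helper and deployment name, not the actual KubeDirector code):

```go
package startup

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clientIsLive does a throwaway GET of the KD deployment through the same
// client the handler uses; if even that read fails, the handler can bail out
// and requeue instead of trusting later 404s. The deployment name here is an
// assumption for illustration.
func clientIsLive(ctx context.Context, c client.Client, namespace string) bool {
	probe := &appsv1.Deployment{}
	key := client.ObjectKey{Namespace: namespace, Name: "kubedirector"}
	return c.Get(ctx, key, probe) == nil
}
```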

That being the case, "GET the KD deployment" may not be the most reliable sanity check since who knows when it might arrive in the cache relative to other objects. Maybe I should just do a general wait of several seconds on the first handler pass.

The real/best solution is to figure out what is really up with the client and/or cache, but the 0.4.0 release doesn't need to be blocked on that.

joel-bluedata (Member Author) commented:

It can be more problematic than I thought. If reconcile fails to GET a statefulset that should be backing a role, it puts those members into delete-pending state so it can send notifies about them to all other ready nodes. The way the member reconcile logic works, this means that if a future handler pass then DOES find the statefulset, it will shrink it to 0 to reconcile it with the current "desired" state of the members. After those shrinks are done, we go back and look at the spec to see the next desired state, which resizes the statefulset back out to its original size.

If PVs are being used, this is not a harmless operation, because the intentional shrink will also cause the PVCs (and therefore the PVs) to be deleted. So the statefulset returns to its original size but has lost important on-disk state.

The handlers need to be robust against the cached reads lagging behind reality, but in this case the cached read is showing us a discontinuity. We had previously created the statefulset (and stored that fact), but now the read says it has gone away. That's alarming.

joel-bluedata added a commit to joel-bluedata/kubedirector that referenced this issue Feb 11, 2020
Cf. bluek8s#267

To ensure this, remove the direct export of the client from the shared pkg, and instead require calling shared.Get, shared.Create, etc.

For the most part those functions will call the split client, but Get will have the desired special behavior.
joel-bluedata (Member Author) commented:

Waiting for some amount of time does not help. The first handler pass after a restart is always susceptible to bogus 404s.

For now I'm going to rearrange our GET calls so that they fall back to a non-cached client on a 404 from the split client.
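
A minimal sketch of that fallback (hypothetical names; the real shared.Get may differ): GET through the split (cache-backed) client first, and only on a NotFound error retry against a client that reads straight from the API server.

```go
package shared

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var (
	splitClient  client.Client // manager-provided client that reads from the informer cache
	directClient client.Client // built with client.New(); always hits the API server
)

// Get reads through the cache, but retries with a live read when the cache
// claims the object does not exist.
func Get(ctx context.Context, key client.ObjectKey, obj runtime.Object) error {
	err := splitClient.Get(ctx, key, obj)
	if apierrors.IsNotFound(err) {
		return directClient.Get(ctx, key, obj)
	}
	return err
}
```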

joel-bluedata (Member Author) commented:

Removing the milestone on this one and lowering priority since we have a workaround.

@joel-bluedata joel-bluedata changed the title on operator restart, some GETs fail with 404 some 404s from split client soon after startup, even after cache sync Feb 12, 2020
@joel-bluedata joel-bluedata removed this from the 0.4.0 milestone Feb 12, 2020
joel-bluedata (Member Author) commented:

There doesn't seem to be an imminent resolution for this in the split-client code, and we have a workaround, so closing this.
