
some 404s from split client soon after startup, even after cache sync #267

Closed

joel-bluedata opened this issue Feb 11, 2020 · 4 comments

Labels: Priority: Low · Project: Cluster Reconcile beyond simple xlate of model to K8s spec · Type: Bug

Comments

joel-bluedata (Member) commented Feb 11, 2020

If you restart the operator while there are existing kdclusters, the manager will immediately fire off reconciler handlers to process those existing resources.

If this is a mildly "busy" K8s cluster with a non-trivial number of resources, I've noticed that this first reconciler pass can fail in odd ways because of 404 results when it tries to fetch various resources (both CRs and native K8s objects). Subsequent handler passes are then OK.

My first thought is that this is related to the "caching client" that the manager sets up for us to use. I do see our code logging that mgr.GetCache().WaitForCacheSync has completed before any of the manager's controllers and workers even start. I believed that this meant the client was now aware of the current set of resources in the cluster, but maybe not.
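
For reference, roughly the shape of that startup check, as a minimal sketch (assumed wiring, not the actual KubeDirector startup code; in controller-runtime releases of this era WaitForCacheSync takes a stop channel, newer releases take a context.Context):

```go
package startup

import (
	"log"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// waitForCache blocks until the manager's informer cache reports that it has
// synced, and logs the outcome. Illustrative wiring only.
func waitForCache(mgr manager.Manager, stop <-chan struct{}) bool {
	if !mgr.GetCache().WaitForCacheSync(stop) {
		log.Print("cache never reported synced")
		return false
	}
	log.Print("WaitForCacheSync complete; starting controllers and workers")
	return true
}
```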

This isn't a functional showstopper since each handler should be robust to the client telling it that something doesn't exist when it actually does. Worst case = we try to re-create an object, fail, and the handler exits to try again later. Subsequent handler invocations always seem to be fine.

However, it makes the logs look ugly after a restart, potentially hiding actual bugs.

I've found that I can prevent almost all occurrences of this in my test clusters by having the reconciler handler first do a test read of the KD deployment resource, just to make sure that the K8s client is live. Even with this in place, though, I sometimes still see a bogus "not found" result on an initial handler pass, which makes me think the cache is still being gradually populated (even after waiting a second or two) rather than the client just being generally not live yet.
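
Roughly what that test read looks like, as a minimal sketch (hypothetical helper and deployment name, not the actual KubeDirector code):

```go
package startup

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clientIsLive does a throwaway GET of the KD deployment through the same
// client the handler uses; if even that read fails, the handler can bail out
// and requeue instead of trusting later 404s. The deployment name here is an
// assumption for illustration.
func clientIsLive(ctx context.Context, c client.Client, namespace string) bool {
	probe := &appsv1.Deployment{}
	key := client.ObjectKey{Namespace: namespace, Name: "kubedirector"}
	return c.Get(ctx, key, probe) == nil
}
```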

That being the case, "GET the KD deployment" may not be the most reliable sanity check since who knows when it might arrive in the cache relative to other objects. Maybe I should just do a general wait of several seconds on the first handler pass.

The real/best solution is to figure out what is really up with the client and/or cache, but the 0.4.0 release doesn't need to be blocked on that.

joel-bluedata (Member Author) commented:

It can be more problematic than I thought. If reconcile fails to GET a statefulset that should be backing a role, it puts those members into delete-pending state so it can send notifies about them to all other ready nodes. The way the member reconcile logic works, this means that if a future handler pass then DOES find the statefulset, it will shrink it to 0 to reconcile it with the current "desired" state of the members. After those shrinks are done, we go back and look at the spec to see the next desired state, which resizes the statefulset back out to its original size.

If PVs are being used, this is not a harmless operation, because the intentional shrink will also cause the PVCs (and therefore the PVs) to be deleted. So the statefulset returns to its original size but has lost important on-disk state.

The handlers need to be robust against the cached reads lagging behind reality, but in this case the cached read is showing us a discontinuity. We had previously created the statefulset (and stored that fact), but now the read says it has gone away. That's alarming.

joel-bluedata added a commit to joel-bluedata/kubedirector that referenced this issue Feb 11, 2020
Cf. bluek8s#267

To ensure this, remove the direct export of the client from the shared pkg, and instead require calling shared.Get, shared.Create, etc.

For the most part those functions will call the split client, but Get will have the desired special behavior.
joel-bluedata (Member Author) commented:

Waiting for some amount of time does not help. The first handler pass after a restart is always susceptible to bogus 404s.

For now I'm going to rearrange our GET calls so that they fall back to a non-cached client on a 404 from the split client.
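
A minimal sketch of that fallback (hypothetical names; the real shared.Get may differ): GET through the split (cache-backed) client first, and only on a NotFound error retry against a client that reads straight from the API server.

```go
package shared

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var (
	splitClient  client.Client // manager-provided client that reads from the informer cache
	directClient client.Client // built with client.New(); always hits the API server
)

// Get reads through the cache, but retries with a live read when the cache
// claims the object does not exist.
func Get(ctx context.Context, key client.ObjectKey, obj runtime.Object) error {
	err := splitClient.Get(ctx, key, obj)
	if apierrors.IsNotFound(err) {
		return directClient.Get(ctx, key, obj)
	}
	return err
}
```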

joel-bluedata (Member Author) commented:

Removing the milestone on this one and lowering priority since we have a workaround.

@joel-bluedata joel-bluedata changed the title on operator restart, some GETs fail with 404 some 404s from split client soon after startup, even after cache sync Feb 12, 2020
@joel-bluedata joel-bluedata removed this from the 0.4.0 milestone Feb 12, 2020
joel-bluedata (Member Author) commented:

There doesn't seem to be an imminent resolution for this in the split-client code, and we have a workaround, so closing this.
