some 404s from split client soon after startup, even after cache sync #267
It can be more problematic than I thought. If reconcile fails to GET a statefulset that should be backing a role, it puts those members into delete-pending state so that it can send notifies about them to all other ready nodes. Given the way the member reconcile logic works, this means that if a future handler pass then DOES find the statefulset, it will shrink it to 0 to reconcile it with the current "desired" state of the members. After those shrinks are done, we go back and look at the spec to see the next desired state, which resizes the statefulset back out to its original size.

If PVs are being used, this is not a harmless operation, because the intentional shrink will also cause the PVCs (and therefore the PVs) to be deleted. So the statefulset returns to its original size but has lost important on-disk state.

The handlers need to be robust against cached reads lagging behind reality, but in this case the cached read is showing us a discontinuity: we had previously created the statefulset (and stored that fact), but now the read says it has gone away. That's alarming.
Cf. bluek8s#267 To ensure this, remove the direct export of the client from the shared pkg, and instead require calling shared.Get, shared.Create, etc. For the most part those functions will call the split client, but Get will have the desired special behavior.
Waiting for some amount of time does not help. The first handler after restart is always susceptible to bogus 404s. For now I'm going to rearrange our GET calls so that they fall back to a non-cached client on a 404 from the split client.
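For illustration, a minimal sketch of what that fallback could look like, assuming a `shared` package that wraps the manager's split client alongside a direct client built with controller-runtime's `client.New`, and a controller-runtime version from the era of this issue where `Get` takes a `runtime.Object`. The names `cachedClient` and `uncachedClient` are placeholders, not the actual KubeDirector code:

```go
package shared

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var (
	cachedClient   client.Client // the manager's split client; reads served from cache
	uncachedClient client.Client // built once at startup via client.New; reads go to the API server
)

// Get reads through the cache-backed client first. On a 404, it retries
// against the API server directly, so a cache that has not finished
// populating cannot make an existing object look deleted.
func Get(ctx context.Context, key client.ObjectKey, obj runtime.Object) error {
	err := cachedClient.Get(ctx, key, obj)
	if err != nil && apierrors.IsNotFound(err) {
		// Cache says "not found"; double-check the real state.
		return uncachedClient.Get(ctx, key, obj)
	}
	return err
}
```

The uncached client here would be created at startup with something like `client.New(mgr.GetConfig(), client.Options{Scheme: mgr.GetScheme()})`, which bypasses the informer cache entirely.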
Removing the milestone on this one and lowering priority since we have a workaround.
A resolution for this doesn't seem imminent in the split-client code, and we have a workaround, so closing this.
If you restart the operator while there are existing kdclusters, the manager will immediately fire off reconciler handlers to process those existing resources.
If this is a mildly "busy" K8s cluster with a nontrivial number of resources, I've noticed that this first reconciler pass can fail in odd ways, with 404 results when it tries to fetch various resources (both CRs and native K8s objects). Subsequent handlers are then OK.
My first thought is that this is related to the "caching client" that the manager sets up for us to use. I do see our code logging that mgr.GetCache().WaitForCacheSync has completed before any of the manager's controllers and workers even start. I believed this meant the client was now aware of the current set of resources in the cluster, but maybe not.
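For context, the startup ordering in question looks roughly like the following sketch, assuming a controller-runtime release from around this era (v0.2-v0.5) where Manager.Start and WaitForCacheSync take a stop channel rather than a context; the details are illustrative, not the actual KubeDirector startup code:

```go
package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Scheme and controller registration omitted for brevity.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatal(err)
	}

	// The manager starts the shared informer cache, so it must be
	// running before WaitForCacheSync can ever return true.
	stop := ctrl.SetupSignalHandler()
	go func() {
		if err := mgr.Start(stop); err != nil {
			log.Fatal(err)
		}
	}()

	// This is the check whose completion gets logged; the surprise in
	// this issue is that cached GETs can still return 404 for existing
	// objects immediately afterward.
	if mgr.GetCache().WaitForCacheSync(stop) {
		log.Println("controller cache has synced")
	}

	select {} // block forever; real code would wait on mgr.Start instead
}
```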
This isn't a functional showstopper, since each handler should be robust to the client telling it that something doesn't exist when it actually does. Worst case, we try to re-create an object, fail, and the handler exits to try again later. Subsequent handler invocations always seem to be fine.
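That worst case could look something like this hypothetical helper inside a reconcile handler; the function name and types are illustrative, assuming a statefulset-backed resource:

```go
package reconciler

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// ensureStatefulSet sketches the failure mode: a stale cached read
// claimed the statefulset was gone, so we re-create it; the API server
// rejects that with AlreadyExists, and we requeue so a later pass sees
// the synced cache.
func ensureStatefulSet(ctx context.Context, c client.Client, sset *appsv1.StatefulSet) (reconcile.Result, error) {
	if err := c.Create(ctx, sset); err != nil {
		if apierrors.IsAlreadyExists(err) {
			return reconcile.Result{Requeue: true}, nil
		}
		return reconcile.Result{}, err
	}
	return reconcile.Result{}, nil
}
```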
However, it makes the logs look ugly after restart, which could hide actual bugs.
I've found that I can prevent almost all occurrences of this in my test clusters by having the reconciler handler first do a test read of the KD deployment resource, just to make sure that the K8s client is live. Even with this in place, though, I sometimes still see a bogus "not found" result on an initial handler pass. That makes me think it's a matter of the cache still being gradually populated (even after waiting a second or two), rather than the client being generally not live yet.
That being the case, "GET the KD deployment" may not be the most reliable sanity check, since who knows when that object might arrive in the cache relative to other objects. Maybe I should just do a general wait of several seconds on the first handler pass.
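The sanity check described above could be sketched like this; the clientIsLive name and the deployment's namespace/name are placeholders, not the real KubeDirector values:

```go
package reconciler

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// clientIsLive does a test read of the operator's own deployment through
// the same (cache-backed) client before trusting other reads. A 404 here
// suggests the cache isn't trustworthy yet, so the handler should bail
// out and retry later rather than act on possibly-stale reads.
func clientIsLive(ctx context.Context, c client.Client) bool {
	deployment := &appsv1.Deployment{}
	key := client.ObjectKey{Namespace: "kubedirector", Name: "kubedirector"}
	return c.Get(ctx, key, deployment) == nil
}
```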
The real/best solution is to figure out what is really up with the client and/or cache, but the 0.4.0 release doesn't need to be blocked on that.