Sometimes an already created record is missing during GET #391
Comments
What client are you using? What is your backend database? Reads always query the database directly, there's no cache at all - so if your object is not returned by the read, then the insert must not be complete yet. |
Thanks @brandond for the quick reply. The database is sqlite, and my client is not talking to kine directly. It is calling Kubernetes API server, which has I don't understand how my insert is not complete at this time. Kine called Original code: Line 31 in edd8f35
My patch:
    rev, err := l.backend.Create(ctx, string(put.Key), put.Value, put.Lease)
    if err == ErrKeyExists {
        return &etcdserverpb.TxnResponse{
            Header:    txnHeader(rev),
            Succeeded: false,
        }, nil
    } else if err != nil {
        return nil, err
    }
    createdCache.Set(string(put.Key), Event{
        Create: true,
        KV: &KeyValue{
            Key:            string(put.Key),
            CreateRevision: rev,
            ModRevision:    rev,
            Value:          put.Value,
            Lease:          put.Lease,
        },
    })
So Kine executed Later the get request calls Line 15 in 19618f9
But the cache contains it, which is possible only, if |
The etcd (kine) datastore itself MUST not have a cache, as kine can run in a multi-node environment where cache coherency would be an issue. The database itself must be read to ensure a consistent view.
I would probably investigate that side then. What is the specific sequence of create/get calls that you're making against the apiserver? The apiserver DOES have caches that may or may not be used depending on the requested resource version. See: https://kubernetes.io/blog/2024/08/15/consistent-read-from-cache-beta/ |
@brandond the cache is just a workaround to prove that there are cases where get returns nil after a successful create. This test is an official Kubernetes e2e test: it first creates a namespace and then waits for service account creation. The service account is created by the Kube controller manager. The test covers a common use case, which Kine has to support properly. I am trying to run the Kubernetes built-in e2e tests, and this one fails on Kine. In this case the Kubernetes cache doesn't play much of a role, because this happens during the initial sync, so there are no items in the cache. The test works perfectly with etcd but occasionally fails on Kine. That's why I think the problem is on the Kine side and not in Kubernetes or the e2e test. |
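The diagnostic idea described in this comment (remember every successful create, then flag any later get that misses a remembered key) could be sketched roughly as below. This is only an illustration under stated assumptions: `memBackend`, `auditBackend`, `newAuditBackend`, and the `lagging` switch are hypothetical names invented here, and none of this is Kine's actual interface or the patch from this thread.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errKeyExists = errors.New("key exists")

// memBackend is a toy datastore (hypothetical, not Kine's backend). When
// lagging is true, Get pretends committed rows are not visible yet, which
// imitates the symptom reported in this issue.
type memBackend struct {
	rev     int64
	data    map[string][]byte
	lagging bool
}

func (m *memBackend) Create(key string, value []byte) (int64, error) {
	if _, ok := m.data[key]; ok {
		return m.rev, errKeyExists
	}
	m.rev++
	m.data[key] = value
	return m.rev, nil
}

func (m *memBackend) Get(key string) ([]byte, bool) {
	if m.lagging {
		return nil, false
	}
	v, ok := m.data[key]
	return v, ok
}

// auditBackend records every successful create so that a later Get miss on
// the same key can be reported. It never serves reads from its own map.
type auditBackend struct {
	inner   *memBackend
	mu      sync.Mutex
	created map[string]int64 // key -> create revision
}

func newAuditBackend(inner *memBackend) *auditBackend {
	return &auditBackend{inner: inner, created: map[string]int64{}}
}

func (a *auditBackend) Create(key string, value []byte) (int64, error) {
	rev, err := a.inner.Create(key, value)
	if err != nil {
		return rev, err
	}
	a.mu.Lock()
	a.created[key] = rev
	a.mu.Unlock()
	return rev, nil
}

// Get reads from the inner backend only. suspicious is true when the read
// missed a key that was previously created through this wrapper.
func (a *auditBackend) Get(key string) (value []byte, found, suspicious bool) {
	value, found = a.inner.Get(key)
	if found {
		return value, true, false
	}
	a.mu.Lock()
	_, wasCreated := a.created[key]
	a.mu.Unlock()
	return nil, false, wasCreated
}

func main() {
	b := newAuditBackend(&memBackend{data: map[string][]byte{}})
	if _, err := b.Create("/registry/serviceaccounts/demo/default", []byte("sa")); err != nil {
		panic(err)
	}
	b.inner.lagging = true // simulate a read that cannot see the insert
	_, found, suspicious := b.Get("/registry/serviceaccounts/demo/default")
	fmt.Println("found:", found, "suspicious:", suspicious)
}
```

The wrapper only observes; it does not answer reads from its own map, so it would not introduce the cache-coherency concern raised earlier in the thread. A real version would also need to forget keys on delete, otherwise legitimate misses would be flagged.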
Which specific E2E test is this? We run the upstream Kubernetes conformance tests on every Kine PR (ref), and on both kine and etcd on every K3s PR. I suspect something else is going on in your environment. How are you starting k3s/kine? What version are you using? |
@brandond I'm not near my computer, but the ginkgo focus 'should patch a Namespace' matches only one test. I'm executing the following tests: 'sig-api-machinery|sig-apps|sig-auth|sig-instrumentation|sig-scheduling'. Every test that tries to create a namespace and waits for service account creation fails. I saw the problem during Kubernetes startup as well (hack/local-up-cluster.sh, because I'm on latest). As I wrote, this doesn't happen all the time. If there is an error in my environment, why does it work when I use the built-in etcd? |
@brandond I forgot to write: thanks for your effort, I will do more investigation. |
Please share more information on how you're running kine and your Kubernetes nodes. Versions, storage, available resources, and so on. |
My Kubernetes version is: I found what causes the issue. I disabled Here is the test name you were asking for: First it creates the namespace: https://github.com/kubernetes/kubernetes/blob/cabf04828e7f2b33cea7cb23e7fe3dc158d990eb/test/e2e/framework/framework.go#L260 The reason is I hope this gives a better picture of what is happening. I'm still investigating and trying to reproduce it on other versions. 🙏 |
Why are you running an alpha? Kubernetes 1.32 has been out since December. Can you reproduce this on non-alpha versions of Kubernetes?
Why? At some point the feature-gate will go GA and be hardcoded on, so you should start adapting to it.
Is this an issue with your client, or with Kubernetes itself? What resource version parameters are you using for your List (that does not find the resource) vs the Watch (that starts after the resource has been created)? If you start watching at the revision that did not find the resource when listing, the watch itself should return it even if it has already been created - as the Watch itself does a List internally when starting. |
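The list-then-watch reasoning above can be illustrated with a toy model. This is a minimal sketch, assuming an append-only log where each create bumps a revision; `store`, `event`, `list`, and `watchFrom` are hypothetical names for illustration and are not the Kubernetes client or the Kine API. The point is only that a watch started at the revision returned by the list replays anything created after that list.

```go
package main

import "fmt"

// event is a toy change record; rev is the store revision that produced it.
type event struct {
	rev int64
	key string
}

// store is a hypothetical in-memory, append-only log standing in for the
// datastore. Revisions start at 1 and increase by one per create.
type store struct {
	log []event
}

// create appends a key and returns the revision assigned to it.
func (s *store) create(key string) int64 {
	rev := int64(len(s.log) + 1)
	s.log = append(s.log, event{rev: rev, key: key})
	return rev
}

// list returns every key currently in the log together with the current
// revision, which a client would use as its watch starting point.
func (s *store) list() ([]string, int64) {
	keys := make([]string, 0, len(s.log))
	for _, e := range s.log {
		keys = append(keys, e.key)
	}
	return keys, int64(len(s.log))
}

// watchFrom replays every event with a revision greater than rev.
func (s *store) watchFrom(rev int64) []event {
	var out []event
	for _, e := range s.log {
		if e.rev > rev {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	s := &store{}
	s.create("/namespaces/demo")

	// The list does not see the service account yet.
	keys, rev := s.list()

	// The service account is created between the list and the watch.
	s.create("/serviceaccounts/demo/default")

	// Watching from the listed revision still delivers the create event.
	for _, e := range s.watchFrom(rev) {
		fmt.Printf("listed %d key(s) at rev %d, watch delivered %s at rev %d\n",
			len(keys), rev, e.key, e.rev)
	}
}
```

If the client instead started the watch at a later revision than the one its list observed, the create event would fall into the gap, which is why the resource version parameters on both calls matter here.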
I'm working on some Kubernetes development, and that was when I created my branch. I rebased it to
As I understand it, this is not a feature gate; it is a config option for etcd storage. I can't find any deprecation notice for it. Please correct me if I'm wrong. If this flag is removed at some point, then I will need an alternative driver, because on some high-load systems building the objects on each request is cheaper than holding everything in memory.
First of all, the list is a get because of the field selector
Yes.
I created a test to isolate Kine At this point this problem is a catch-22, because each component individually works as designed, but the overall result is not what we expect. Let me explain by listing the changes that fix the problem:
Thank you for helping me investigate this problem, and if you have any more ideas, please share them with me. 🙏 |
I was referring to the ConsistentListFromCache feature gate. It sounds like you're not disabling that though - how are you turning off the watch cache? By telling Kine to use an emulated etcd version that will cause the apiserver to disable it internally?
Can you run kine with --debug while running the Kubernetes test, and capture the sql query logs? It'll probably be a lot, but it would probably help us understand the sequence of events. I probably won't have time this week, but I might be able to do some investigation myself next week. |
@brandond Kubernetes api server has a CLI flag to disable it: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/#options |
Oh interesting, I've honestly never seen anyone use that, I thought you were tinkering with the newer consistent cache read stuff from https://kubernetes.io/blog/2024/08/15/consistent-read-from-cache-beta/ I don't know that we've ever tested Kine with that flag set on the apiserver. It is very possible you're doing something that we've never tried. |
Then no one else has tried to push the API server to its limit ;)
Nope, but this is another pain: I have to test everything with the combinations of WatchList, WatchListClient, ResilientWatchCacheInitialization, ConsistentListFromCache :D but that is a different story
Yes, that is my life mission: using things for what they are not capable of, then figuring out how they could be made capable. |
As the title mentions, sometimes I can't fetch an already created object. It is hard to reproduce, but here is how I did it.
First I added a cache to remember created objects:
Then I updated the get request:
To reproduce, I executed the following Kubernetes E2E test in a loop:
Investigated on 59c88f9 but tested on latest master: a4169b9, so multiple 'versions' are affected.
What I see in the logs:
Please let me know if I'm wrong, but this looks like an issue. Thank you for any action.