
Sometimes an already created record is missing during GET #391

Open
mhmxs opened this issue Jan 6, 2025 · 15 comments

Comments

@mhmxs

mhmxs commented Jan 6, 2025

As the title mentions, sometimes I can't fetch an already created object. It is hard to reproduce, but here is how I did it.

First I added a cache to remember created objects:

diff --git a/pkg/server/create.go b/pkg/server/create.go
index 00ce3c434..08366a02f 100644
--- a/pkg/server/create.go
+++ b/pkg/server/create.go
@@ -2,10 +2,14 @@ package server
 
 import (
        "context"
+       "time"
 
+       "github.com/scalalang2/golang-fifo/sieve"
        "go.etcd.io/etcd/api/v3/etcdserverpb"
 )
 
+var createdCache = sieve.New[string, Event](1e3, time.Minute)
+
 func isCreate(txn *etcdserverpb.TxnRequest) *etcdserverpb.PutRequest {
        if len(txn.Compare) == 1 &&
                txn.Compare[0].Target == etcdserverpb.Compare_MOD &&
@@ -38,6 +42,17 @@ func (l *LimitedServer) create(ctx context.Context, put *etcdserverpb.PutRequest
                return nil, err
        }
 
+       createdCache.Set(string(put.Key), Event{
+               Create: true,
+               KV: &KeyValue{
+                       Key:            string(put.Key),
+                       CreateRevision: rev,
+                       ModRevision:    rev,
+                       Value:          put.Value,
+                       Lease:          put.Lease,
+               },
+       })
+
        return &etcdserverpb.TxnResponse{
                Header: txnHeader(rev),
                Responses: []*etcdserverpb.ResponseOp{

Then I updated the get request:

diff --git a/pkg/server/get.go b/pkg/server/get.go
index b86153acd..010533d22 100644
--- a/pkg/server/get.go
+++ b/pkg/server/get.go
@@ -19,6 +19,19 @@ func (l *LimitedServer) get(ctx context.Context, r *etcdserverpb.RangeRequest) (
        if kv != nil {
                resp.Kvs = []*KeyValue{kv}
                resp.Count = 1
+       } else if e, ok := createdCache.Get(string(r.Key)); ok {
+               if r.Revision == 0 || e.KV.ModRevision == r.Revision {
+                       println(string(r.Key), "found only in cache", string(r.RangeEnd), r.Revision)
+                       resp.Kvs = []*KeyValue{{
+                               Key:            e.KV.Key,
+                               CreateRevision: e.KV.CreateRevision,
+                               ModRevision:    e.KV.ModRevision,
+                               Value:          e.KV.Value,
+                               Lease:          e.KV.Lease,
+                       }}
+                       resp.Count = 1
+               }
        }
+
        return resp, err
 }

To reproduce, I executed the following Kubernetes E2E test in a loop:

./build/run.sh make WHAT=test/e2e/e2e.test
alias test='kubetest2-noop --up && kubetest2-tester-ginkgo --parallel=1 --timeout=5m --use-binaries-from-path=true --focus-regex="should patch a Namespace"'
 while test; do sleep 0; done

I investigated on 59c88f9 but also tested on the latest master (a4169b9), so multiple 'versions' are affected.

What I see in the logs:

/registry/serviceaccounts/namespaces-1081/default found only in cache  0
/registry/serviceaccounts/nspatchtest-8b40c23c-efc4-4a9d-a40f-1f5b40c0e425-5669/default found only in cache  0
/registry/serviceaccounts/namespaces-1060/default found only in cache  0
/registry/serviceaccounts/namespaces-1060/default found only in cache  0
/registry/serviceaccounts/nspatchtest-9506bf3a-ec89-42d5-998a-3cbcc3c3f1f0-6025/default found only in cache  0
/registry/serviceaccounts/namespaces-2257/default found only in cache  0
/registry/serviceaccounts/namespaces-2678/default found only in cache  0

Please let me know if I'm wrong, but this looks like an issue. Thank you for any action.

@brandond
Member

brandond commented Jan 6, 2025

sometimes I can't fetch an already created object.

What client are you using? What is your backend database?

Reads always query the database directly, there's no cache at all - so if your object is not returned by the read, then the insert must not be complete yet.

@mhmxs
Author

mhmxs commented Jan 6, 2025

Thanks @brandond for the quick reply. The database is SQLite, and my client is not talking to kine directly. It is calling the Kubernetes API server, which has go.etcd.io/etcd/api/v3 v3.5.16 inside.

I don't understand how my insert could be incomplete at this point. Kine called rev, err := l.backend.Create(ctx, string(put.Key), put.Value, put.Lease) successfully before my patch added the Event to the cache. This cache is a workaround to prove the problem.

Original code:

rev, err := l.backend.Create(ctx, string(put.Key), put.Value, put.Lease)

My patch:

rev, err := l.backend.Create(ctx, string(put.Key), put.Value, put.Lease)
if err == ErrKeyExists {
	return &etcdserverpb.TxnResponse{
		Header:    txnHeader(rev),
		Succeeded: false,
	}, nil
} else if err != nil {
	return nil, err
}

createdCache.Set(string(put.Key), Event{
	Create: true,
	KV: &KeyValue{
		Key:            string(put.Key),
		CreateRevision: rev,
		ModRevision:    rev,
		Value:          put.Value,
		Lease:          put.Lease,
	},
})

So Kine executed Create, and rev contains the last insert ID. My code pushes the Event into the cache under the key string(put.Key).

Later the get request calls rev, kv, err := l.backend.Get(ctx, string(r.Key), string(r.RangeEnd), r.Limit, r.Revision), and kv is nil, so it doesn't find the record in the database.

But the cache contains it, which is only possible if backend.Create() succeeded.
As far as I know, Kine is not eventually consistent, so I don't understand how this is possible at all. It seems there is some random latency during the insert, and I don't understand where it comes from.

@brandond
Member

brandond commented Jan 6, 2025

The etcd (kine) datastore itself MUST not have a cache, as kine can run in a multi-node environment where cache coherency would be an issue. The database itself must be read to ensure a consistent view.

It is calling Kubernetes API server

I would probably investigate that side then. What is the specific sequence of create/get calls that you're making against the apiserver? The apiserver DOES have caches that may or may not be used depending on the requested resource version. See: https://kubernetes.io/blog/2024/08/15/consistent-read-from-cache-beta/

@mhmxs
Author

mhmxs commented Jan 6, 2025

@brandond the cache is just a workaround to prove there are cases when get returns nil after a successful create. This is an official Kubernetes e2e test: it first creates a namespace, then waits for service account creation. The service account is created by the kube-controller-manager. The test is a common use case, which Kine has to support properly. I run Kubernetes' built-in e2e tests, and they fail on Kine. In this case the Kubernetes cache doesn't play much of a role, because this happens during the initial sync, so there are no items in the cache. The test works perfectly with etcd but occasionally fails on Kine. That's why I think the problem is on the Kine side and not in Kubernetes or the e2e test.

@brandond
Copy link
Member

brandond commented Jan 6, 2025

I try to run Kubernetes built in e2e test, and it fails on Kine

Which specific E2E test is this? We run the upstream Kubernetes conformance tests on every Kine PR (ref), and on both kine and etcd on every K3s PR. I suspect something else is going on in your environment. How are you starting k3s/kine? What version are you using?

@mhmxs
Copy link
Author

mhmxs commented Jan 6, 2025

@brandond I'm not near my computer, but the ginkgo focus 'should patch a Namespace' matches only one test. I'm executing the following test suites: 'sig-api-machinery|sig-apps|sig-auth|sig-instrumentation|sig-scheduling'. Every test that creates a namespace and waits for service account creation fails. I saw the problem during Kubernetes startup as well (hack/local-up-cluster.sh, because I'm on latest). As I wrote, this doesn't happen all the time. If there were an error in my environment, how would everything work when I use the built-in etcd?

@mhmxs
Author

mhmxs commented Jan 6, 2025

@brandond I forgot to write: thanks for your effort, I will do more investigation.

@brandond
Member

brandond commented Jan 6, 2025

Please share more information on how you're running kine and your Kubernetes nodes. Versions, storage, available resources, and so on.

@mhmxs
Author

mhmxs commented Jan 7, 2025

Please share more information on how you're running kine and your Kubernetes nodes. Versions, storage, available resources, and so on.

@brandond

My Kubernetes version is v1.32.0-alpha.1, it is a single-node instance, and I'm running it from source code by executing hack/local-up-cluster.sh. No storage; the load on this machine is 0.7 with 10 cores, and RAM usage is at most 50%.

I found what causes the issue. I disabled the watch cache at the Kubernetes API server. If I enable the cache, it works well: I executed the test in a loop for more than an hour without any problem. But once I disable the watch cache, 1 in 20 of the tests failed (each test has 3 namespace creations, so 1 in 60 is the real ratio). (Enabling the watch cache is not an option for me.)

Here is the test name you were asking for: [sig-api-machinery] Namespaces [Serial] should patch a Namespace [Conformance] [sig-api-machinery, Serial, Conformance]. This is the fastest way to reproduce (every test that tries to create a namespace is affected).

First it creates the namespace: https://github.com/kubernetes/kubernetes/blob/cabf04828e7f2b33cea7cb23e7fe3dc158d990eb/test/e2e/framework/framework.go#L260
Then it waits for the service account: https://github.com/kubernetes/kubernetes/blob/cabf04828e7f2b33cea7cb23e7fe3dc158d990eb/test/e2e/framework/framework.go#L269
And finally the watcher never receives the create notification here: https://github.com/kubernetes/kubernetes/blob/cabf04828e7f2b33cea7cb23e7fe3dc158d990eb/test/e2e/framework/util.go#L289

The reason is that the watcher's ListFunc is using a field selector: https://github.com/kubernetes/kubernetes/blob/cabf04828e7f2b33cea7cb23e7fe3dc158d990eb/test/e2e/framework/util.go#L291, so the list request becomes a get request to Kine. By the time it is executed, the create endpoint has already been called, but Kine's get endpoint finds nil. Kine then returns an empty response, but the revision number has already been increased, so WatchFunc starts watching the service account 1 or 2 revisions after the create revision, and Kine never sends any watch event.
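For context on why a metadata.name field selector turns the list into a get: in the etcd v3 API, a Range request with an empty RangeEnd addresses exactly one key. A minimal sketch, where a simplified local struct stands in for etcdserverpb.RangeRequest (illustrative, not kine code):

```go
package main

import "fmt"

// rangeRequest is a simplified stand-in for etcdserverpb.RangeRequest.
// In the etcd v3 API, leaving RangeEnd empty means "look up exactly this
// key", which is why the selector-driven list reaches kine's get path.
type rangeRequest struct {
	Key      []byte
	RangeEnd []byte
}

// isPointGet reports whether the request addresses a single key.
func isPointGet(r rangeRequest) bool {
	return len(r.RangeEnd) == 0
}

func main() {
	r := rangeRequest{Key: []byte("/registry/serviceaccounts/ns/default")}
	fmt.Println("point get:", isPointGet(r))
}
```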

I hope this gives a better picture of what is happening. I'm still investigating and trying to reproduce it on other versions. 🙏

@brandond
Member

brandond commented Jan 7, 2025

v1.32.0-alpha.1

Why are you running an alpha? Kubernetes 1.32 has been out since December. Can you reproduce this on non-alpha versions of Kubernetes?

enabling watch cache is not an option for me

Why? At some point the feature-gate will go GA and be hardcoded on, so you should start adapting to it.

By the time it is executed, the create endpoint has already been called, but Kine's get endpoint finds nil. Kine then returns an empty response, but the revision number has already been increased, so WatchFunc starts watching the service account 1 or 2 revisions after the create revision, and Kine never sends any watch event.

Is this an issue with your client, or with Kubernetes itself? What resource version parameters are you using for your List (that does not find the resource) vs the Watch (that starts after the resource has been created)? If you start watching at the revision that did not find the resource when listing, the watch itself should return it even if it has already been created - as the Watch itself does a List internally when starting.

@mhmxs
Author

mhmxs commented Jan 9, 2025

v1.32.0-alpha.1

Why are you running an alpha? Kubernetes 1.32 has been out since December. Can you reproduce this on non-alpha versions of Kubernetes?

I'm working on some Kubernetes development, and that was the current version when I created my branch. I rebased it to v1.33.0-alpha.0 and the problem is still there.

enabling watch cache is not an option for me

Why? At some point the feature-gate will go GA and be hardcoded on, so you should start adapting to it.

As I understand it, this is not a feature gate; it is a config option for etcd storage. I don't find any deprecation notice around it. Please correct me if I'm wrong. If this flag is ever removed, I will need an alternative driver, because on some high-load systems building the objects in each request is cheaper than holding everything in memory.

By the time it is executed, the create endpoint has already been called, but Kine's get endpoint finds nil. Kine then returns an empty response, but the revision number has already been increased, so WatchFunc starts watching the service account 1 or 2 revisions after the create revision, and Kine never sends any watch event.

Is this an issue with your client, or with Kubernetes itself? What resource version parameters are you using for your List (that does not find the resource)

First of all, the list is a get because of the field selector metadata.name. The resource version is 0.

vs the Watch (that starts after the resource has been created)?

Yes.

  • After namespace creation, the test starts watching service accounts with metadata.name=X.
  • At the same time, kube-controller-manager creates the service account. Let's say the creation revision is 3, so the create request returns 3.
  • Next, the reflector inside the cache tries to fill the initial data, so it executes a get request via ListFunc with resource version 0, because lastSyncResourceVersion is "".
  • In the worst case, Kine doesn't find the service account and returns nil, but meanwhile the revision has been bumped, so get returns the actual revision, let's say 4, in the header.
  • After the call, the reflector sets the last known revision number to 4 (expecting there is no object before revision 4) and starts watching via WatchFunc.
  • Then the etcd watcher bumps the revision number with clientv3.WithRev(lastKnownRevision + 1) and starts watching service accounts from revision 5.
  • Kine's watch endpoint never sends the create event (revision 3) to the client, and the test times out.
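The sequence in the bullets above can be condensed into a tiny simulation (all revisions hardcoded to match the example; this is illustrative, not kine or client-go code):

```go
package main

import "fmt"

// event is a minimal stand-in for a kine change-log row.
type event struct {
	rev int64
	key string
}

// simulateRace replays the race described above: a create at revision 3,
// a get that races the insert and returns an empty body with header
// revision 4, and a watch started at lastKnownRevision + 1 = 5.
func simulateRace() (listed int, watchStart int64, delivered int) {
	log := []event{{rev: 3, key: "/registry/serviceaccounts/ns/default"}}

	headerRev := int64(4) // get missed the row but reported revision 4
	listed = 0            // empty list result
	watchStart = headerRev + 1

	// The watch only replays events at or after its start revision,
	// so the create at revision 3 is never delivered.
	for _, e := range log {
		if e.rev >= watchStart {
			delivered++
		}
	}
	return listed, watchStart, delivered
}

func main() {
	listed, start, delivered := simulateRace()
	fmt.Println("listed:", listed, "watch start:", start, "delivered:", delivered)
}
```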

If you start watching at the revision that did not find the resource when listing, the watch itself should return it even if it has already been created - as the Watch itself does a List internally when starting.

I created a test to isolate Kine create and get, and the result is that every get found the previously created resource. So something is wrong around the watch function: with revision 5 it doesn't send the create event at revision 3. This looks like a timing issue. I think in the watch-cache case, the get request hits the cache, which hides the problem.

At this point this problem is a catch-22, because each component individually works as designed, but the overall result is not what we expect. Let me explain by listing the changes that would fix the problem:

  • Enable the watch cache
    • not an option, because I don't have enough memory to hold everything in the cache
  • Swap to etcd
    • not an option, because etcd is not capable of storing my data
  • Introduce a time.Sleep(time.Millisecond * 15) into the official Kube e2e test between namespace creation and the service account watch
    • 15ms is the minimum delay that works in my env
    • not an option, because the test is a valid workflow; nobody wants to implement an extra wait ahead of a watch or put sleeps in the code
  • Use some cache mechanism
    • you didn't suggest this, and I agree
  • Fix watch to detect this issue, fetch the object at the create revision (hopefully it hasn't been compacted), and send an initial create event before the regular watch
    • I'm not sure sending an initial create event is how every watch has to work, please shed some light in my head
    • hard (or impossible) to detect this scenario properly
    • this solution would never be merged into Kine
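For what it's worth, the last bullet could look roughly like this. Everything here is hypothetical: KeyValue and Event only loosely mirror kine's types, and store and watchWithBackfill are illustrative stand-ins, not kine APIs:

```go
package main

import "fmt"

// KeyValue and Event loosely mirror kine's shapes for illustration only.
type KeyValue struct {
	Key            string
	CreateRevision int64
	ModRevision    int64
}

type Event struct {
	Create bool
	KV     *KeyValue
}

// store is a toy stand-in for the backend's current-row lookup.
var store = map[string]*KeyValue{
	"/registry/serviceaccounts/ns/default": {
		Key:            "/registry/serviceaccounts/ns/default",
		CreateRevision: 3,
		ModRevision:    3,
	},
}

// watchWithBackfill sketches the "synthetic initial create event" idea:
// if the key was created before the requested start revision, the regular
// watch loop would never replay that create, so emit it up front.
func watchWithBackfill(key string, startRev int64, send func(Event)) {
	if kv, ok := store[key]; ok && kv.CreateRevision < startRev {
		send(Event{Create: true, KV: kv}) // synthetic initial event
	}
	// ...the regular watch from startRev would follow here.
}

func main() {
	watchWithBackfill("/registry/serviceaccounts/ns/default", 5, func(e Event) {
		fmt.Println("backfilled create at rev", e.KV.CreateRevision)
	})
}
```

As the bullet notes, the hard part is deciding when a backfill is actually warranted, which this sketch sidesteps entirely.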

Thank you for helping me investigate this problem, and if you have any more ideas please share them with me. 🙏

@brandond
Member

brandond commented Jan 9, 2025

As i understand this is not a a feature gate, it is a config option for etcd storage.

I was referring to the ConsistentListFromCache feature gate. It sounds like you're not disabling that though - how are you turning off the watch cache? By telling Kine to use an emulated etcd version that will cause the apiserver to disable it internally?

I created a test to isolate Kine create and get, and the result is, every get found the resource previously created. So something wrong around the watch function, with revision 5 it doesn't sends create event at revision 3. This looks a timing issue. I think on case of watch-cache, the get request hits the cache, which hides the problem.

Can you run kine with --debug while running the Kubernetes test, and capture the sql query logs? It'll probably be a lot, but it would probably help us understand the sequence of events.

I probably won't have time this week, but I might be able to do some investigation myself next week.

@mhmxs
Author

mhmxs commented Jan 9, 2025

I was referring to the ConsistentListFromCache feature gate. It sounds like you're not disabling that though - how are you turning off the watch cache? By telling Kine to use an emulated etcd version that will cause the apiserver to disable it internally?

@brandond Kubernetes api server has a CLI flag to disable it: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/#options

@brandond
Member

brandond commented Jan 9, 2025

Oh interesting, I've honestly never seen anyone use that. I thought you were tinkering with the newer consistent-cache-read stuff from https://kubernetes.io/blog/2024/08/15/consistent-read-from-cache-beta/

I don't know that we've ever tested Kine with that flag set on the apiserver. It is very possible you're doing something that we've never tried.

@mhmxs
Author

mhmxs commented Jan 9, 2025

Oh interesting, I've honestly never seen anyone use that,

Then no one else has tried to push the API server to its limit ;)

I thought you were tinkering with the newer consistent cache read stuff from https://kubernetes.io/blog/2024/08/15/consistent-read-from-cache-beta/

Nope, but that is another pain: I have to test everything with the combinations of WatchList, WatchListClient, ResilientWatchCacheInitialization, and ConsistentListFromCache :D But it is a different story.

I don't know that we've ever tested Kine with that flag set on the apiserver. It is very possible you're doing something that we've never tried.

Yes, that is my life mission: using things for what they were not built for, then figuring out how they could be made capable.
