During rolling update in Kubernetes data in KV store is lost in some nodes #4351
Comments
When doing restarts, do you wait for /healthz to return ok before moving to the next server?
It would also be good to make sure we can reproduce from the top of main, the 2.9.21 release candidate.
The readiness check waits for /healthz to return ok, and before that we don't route any traffic to that NATS server.
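For illustration, the same gate can also be scripted outside of Kubernetes. The following minimal Go sketch (not from the thread) polls the monitoring endpoint before moving on to the next server; the URL, the default monitoring port 8222, and the waitHealthy helper name are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitHealthy polls the NATS monitoring endpoint until /healthz returns 200,
// mirroring what the Kubernetes readiness probe gates on.
// monitorURL is an assumption, e.g. "http://nats-0.nats:8222".
func waitHealthy(monitorURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(monitorURL + "/healthz")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("server at %s not healthy within %s", monitorURL, timeout)
}

func main() {
	// Wait for one server to report healthy before restarting the next one.
	if err := waitHealthy("http://nats-0.nats:8222", 5*time.Minute); err != nil {
		panic(err)
	}
	fmt.Println("server healthy, safe to move on to the next one")
}
```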
For the direct GETs, in which all stream peers can participate: yes, once they think they are caught up they can participate.
One thing that you could also try is to bump the initialDelaySeconds.
Will try the change to the initialDelaySeconds for the next restart next week; we currently want to restart as few times as possible, as only the production system currently has the load where we can monitor this with good data. We are trying to set up another test instance where we can reproduce this behavior, but until it is finished we have to wait.
Just tested the initialDelaySeconds change.
I've tried to reproduce the behavior with a small Go client, but didn't succeed. We can reproduce the behavior with our full Java application in a dev environment, but couldn't yet isolate the root cause. We have now seen multiple times that the NATS cluster synchronized itself after our application was restarted. Is there anything (open connections, special commands) which can prevent the cluster from synchronizing, especially in the Java library?
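The small Go client mentioned above is not included in the thread. A minimal sketch of that kind of reproduction attempt could look like the following; the server URL, the "test-bucket" name, and the key count are assumptions.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// The URL and bucket name are assumptions for illustration.
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Create (or reuse) a replicated KV bucket.
	kv, err := js.CreateKeyValue(&nats.KeyValueConfig{
		Bucket:   "test-bucket",
		Replicas: 3,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Write a batch of entries before the rolling restart...
	for i := 0; i < 100; i++ {
		if _, err := kv.Put(fmt.Sprintf("key-%d", i), []byte("value")); err != nil {
			log.Fatal(err)
		}
	}

	// ...and list them afterwards; the key count should match what was written.
	keys, err := kv.Keys()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("bucket contains %d keys\n", len(keys))
}
```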
I've seen that the service in the helm chart is configured so that pods are reachable before they are ready. Could this be the reason for this behavior? Since we didn't create an additional service, our client probably connected to a not-yet-ready NATS pod through this service, which would also render the readiness check ineffective. We'll try to upgrade to the latest chart, but this will take some time as we need to thoroughly test the changes before going live.
Keep us posted; you should not be losing any messages, of course.
We appear to have come across a very similar situation, in which an upgrade to NATS 2.10.9 has resulted in the loss of the complete history of messages on two R3 streams. Initial investigations suggest this is due to the server nodes being killed during recovery by a failing health check, but we have yet to determine a minimal reproduction (minus Nomad and CSI plugins for persistent disk storage in GCP) which demonstrates the deletion of the message block files in this case. Any thoughts on additional avenues to explore, or on whether this sounds like a plausible explanation, would be greatly appreciated.
@mpwalkerdine were you using interest streams?
Hi @wallyqs - no, permanent streams with durable consumers. A couple of potentially interesting things appeared in our logs around this time too:
A quick update for any future readers - we've narrowed the issue down somewhat and it appears to be related to the NATS disks being pulled at inappropriate times by the CSI plugin, leading to corruption and the following log line during the next startup:
@wallyqs FYI it looks like one or more of the fixes in 2.10.10 has resolved our issue - I can no longer reproduce the deletion-on-startup issue, and I see no orphaned stream logs and no propagation of that empty state to other nodes in the cluster 🤞
@mpwalkerdine thanks for the update!
Looks like I spoke too soon - more testing shows that one of the nodes has lost its state on disk, and if it becomes a cluster leader that stream reports 0 messages. The files on disk are indeed empty:

```
-rw-r----- 1 root root 0 Feb 7 09:33 1.blk
-rw------- 1 root root 34 Feb 7 11:53 index.db
```
For me, the issue seems to be solved with the latest version, 2.10.11. We are not able to reproduce it anymore, so it could be closed.
@mpwalkerdine Could you let us know if 2.10.11 fixed it for you too?
I'll retest against 2.10.11 and report back. Our issue seems to be caused by some pretty unreasonable behaviour in our CSI disk management, so it is perhaps not unexpected for NATS to take issue with that. I did expect the faulty node to recover from the rest of the cluster, though.
@Jarema the behaviour has changed, although there still seems to be a discrepancy between nodes when they come back up.
Healthy node as leader:
Faulty node as leader:
So they're no longer losing all their state, just some of it. Pertinent logs appear to be:
Another quick update - we've been running 2.10.11 in production for a week or so now, and although it appears the state is no longer lost completely, it does still break the ability to publish messages when they're sent to a node with bad state. Our workaround for this is to scale down to a healthy node and up again, and to recreate all the consumers (we've got separate idempotency protection).
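One way to see this kind of discrepancy from a client is to compare the stream info reported while different nodes hold leadership, as done above with the healthy and the faulty node as leader. The sketch below is illustrative only; the server URL and the "ORDERS" stream name are assumptions.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// URL and stream name are assumptions for illustration.
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// In a clustered setup the stream leader typically answers StreamInfo, so
	// comparing this output while different nodes hold leadership shows
	// whether the replicas disagree about the message count.
	info, err := js.StreamInfo("ORDERS")
	if err != nil {
		log.Fatal(err)
	}
	leader := "unknown"
	if info.Cluster != nil {
		leader = info.Cluster.Leader
	}
	fmt.Printf("leader=%s msgs=%d first_seq=%d last_seq=%d\n",
		leader, info.State.Msgs, info.State.FirstSeq, info.State.LastSeq)
}
```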
Fixed via #5821. Thanks a lot @yoadey for the reproduction at https://github.com/yoadey/nats-test, that was very useful.
We are having a problem with our NATS cluster in a Kubernetes cluster, where the entries in KV stores differ between the NATS nodes after server restarts, leaving us with data loss in some cases. As the Kubernetes cluster gets updated every week, this issue has now been seen several times, although we cannot reproduce it reliably.
Our current workaround is that if we don't get the data on the first try, we repeat the same request up to 5 times; due to the server-side request routing we then normally get the wanted data.
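A hedged sketch of that retry workaround, not taken from the original report: the server URL, the "test-bucket" name, and treating an empty key list as a miss are assumptions.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

// keysWithRetry lists the keys of a bucket up to maxAttempts times and returns
// the first non-empty result; because the server decides which replica answers,
// a later attempt may be served by a node that still has the data.
func keysWithRetry(js nats.JetStreamContext, bucket string, maxAttempts int) ([]string, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		kv, err := js.KeyValue(bucket)
		if err != nil {
			lastErr = err
			continue
		}
		keys, err := kv.Keys()
		if err == nil && len(keys) > 0 {
			return keys, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("no keys after %d attempts, last error: %v", maxAttempts, lastErr)
}

func main() {
	nc, err := nats.Connect("nats://nats:4222") // URL is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	keys, err := keysWithRetry(js, "test-bucket", 5)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("got %d keys\n", len(keys))
}
```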
Defect
nats-server -DV output:
Versions of nats-server and affected client libraries used:
(We used a custom build of 2.9.20, where we included this specific fix: #4320. The problem also occurred with the official 2.9.19 nats-server container.)
All client libraries are affected.
OS/Container environment:
Kubernetes containerd
Steps or code to reproduce the issue:
Have the following cluster setup:
nats kv history <keyvaluestore> --trace
(the trace option shows the server)
Expected result:
Requesting all entries in the same KV store multiple times should always return the same list.
Actual result:
Sometimes after restarts the result differs between the nodes and stays different.
Additional details
This does not occur after every restart and also seems to depend a little on the node's downtime; for some KVs it seems to heal itself after some time (sometimes minutes, sometimes hours, sometimes days).
Go code for test script:
The following script lists the entries in the KV store multiple times and prints the result.
As the NATS server routes the traffic by itself, we don't need to switch servers.
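The original script from the report is not reproduced above; the following is a minimal sketch of that kind of check. The server URL, the "test-bucket" name, and the number of iterations are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222") // URL is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	kv, err := js.KeyValue("test-bucket") // bucket name is an assumption
	if err != nil {
		log.Fatal(err)
	}

	// List the keys repeatedly; a healthy cluster should return the same set
	// every time, regardless of which server answers the request.
	for i := 1; i <= 10; i++ {
		keys, err := kv.Keys()
		if err != nil {
			log.Printf("attempt %d: %v", i, err)
		} else {
			fmt.Printf("attempt %d: %d keys\n", i, len(keys))
		}
		time.Sleep(time.Second)
	}
}
```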