NATS Deleting Recovered Stream as Orphaned #5382

Closed
thorntonmc opened this issue May 2, 2024 · 7 comments
Labels: defect (Suspected defect such as a bug or regression), stale (This issue has had no activity in a while)

Comments

@thorntonmc

Observed behavior

NATS recovered messages from a stream, but then deleted the stream and its messages afterwards

[1] 2024/05/02 14:25:12.795323 [INF]   Max Storage:     1000.00 GB
[1] 2024/05/02 14:25:12.795329 [INF]   Store Directory: "/nats/jetstream"
[1] 2024/05/02 14:25:12.795333 [INF] -------------------------------------------
[1] 2024/05/02 14:25:12.796142 [INF]   Starting restore for stream '$G > BR'
[1] 2024/05/02 14:25:12.822254 [INF]   Restored 771 messages for stream '$G > BR'
[1] 2024/05/02 14:25:12.822472 [INF]   Starting restore for stream '$G > LR'
[1] 2024/05/02 14:25:12.822902 [INF]   Restored 0 messages for stream '$G > LR'
[1] 2024/05/02 14:25:12.823062 [INF]   Starting restore for stream '$G > OEN'
[1] 2024/05/02 14:25:12.823418 [INF]   Restored 0 messages for stream '$G > OEN'
[1] 2024/05/02 14:25:12.823531 [INF]   Starting restore for stream '$G > OR'
[1] 2024/05/02 14:25:29.868233 [INF]   Restored 447,917,984 messages for stream '$G > OR'
[1] 2024/05/02 14:25:29.868300 [INF]   Recovering 3 consumers for stream - '$G > OEN'
[1] 2024/05/02 14:25:29.870547 [INF]   Recovering 852 consumers for stream - '$G > OR'
[1] 2024/05/02 14:25:30.201230 [INF] Starting JetStream cluster
[1] 2024/05/02 14:25:30.201246 [INF] Creating JetStream metadata controller
[1] 2024/05/02 14:25:30.201507 [INF] JetStream cluster bootstrapping
[1] 2024/05/02 14:25:30.201980 [INF] Listening for client connections on 0.0.0.0:4222
[1] 2024/05/02 14:25:30.202065 [WRN] Detected orphaned stream '$G > BR', will cleanup
[1] 2024/05/02 14:25:30.202342 [INF] Server is ready
[1] 2024/05/02 14:25:30.202457 [INF] Cluster name is gq-nats
[1] 2024/05/02 14:25:30.202537 [INF] Listening for route connections on 0.0.0.0:6222
[1] 2024/05/02 14:25:30.208864 [ERR] Error trying to connect to route (attempt 1): lookup for host "gq-nats-0.gq-nats.generic-queue.svc.cluster.local": lookup gq-nats-0.gq-nats.generic-queue.svc.cluster.local on 10.96.0.10:53: no such host
[1] 2024/05/02 14:25:30.239648 [WRN] Detected orphaned stream '$G > LR', will cleanup
[1] 2024/05/02 14:25:30.240497 [WRN] Detected orphaned stream '$G > OEN', will cleanup
[1] 2024/05/02 14:25:30.243571 [WRN] Detected orphaned stream '$G > OR', will cleanup

Expected behavior

NATS should not delete the "OR" stream - as it and its consumers were recovered

[1] 2024/05/02 14:25:29.868233 [INF]   Restored 447,917,984 messages for stream '$G > OR'
[1] 2024/05/02 14:25:29.870547 [INF]   Recovering 852 consumers for stream - '$G > OR'

Server and client version

2.9.15

Host environment

Running as a Kubernetes statefulset.

Steps to reproduce

The "OR" stream was unavailable at the time of restart. The OR stream runs on a single node - referred to here as gq-nats-1

A series of issues with NATS began after we created a new stream that was tag-located to gq-nats-1:

[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled
[1] 2024/05/02 13:56:33.538942 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'
[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'
This then drove us to restart the service.
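
For context, the tag-located stream mentioned above would have been created through the NATS client roughly as in the sketch below (a minimal nats.go example; the stream name, subjects, replica count, and tag value are illustrative, not the actual ones used):

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to the cluster (URL is illustrative).
	nc, err := nats.Connect("nats://gq-nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// A file-backed stream pinned to a node via a placement tag,
	// similar to the stream that was tag-located to gq-nats-1.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "NEWSTREAM",
		Subjects: []string{"newstream.>"},
		Storage:  nats.FileStorage,
		Replicas: 1,
		Placement: &nats.Placement{
			Tags: []string{"node:gq-nats-1"},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}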

thorntonmc added the defect label on May 2, 2024
@derekcollison
Member

Orphaned means the server could not find any meta assignment from the meta layer after syncing up with it.

thorntonmc changed the title from 'NATS Deleting Recovered Stream as "Orphaned"' to 'NATS Deleting Recovered Stream as Orphaned' on May 3, 2024
@thorntonmc
Author

thorntonmc commented May 3, 2024

Orphaned means the server could not find any meta assignment from the meta layer after syncing up with it.

Trying to understand this a bit more - here's the order of what happened:

  1. NATS restarts in a bad state
  2. The node in question comes up, sees messages and consumers from a stream called "OR", recovers them
  3. Doesn't see that stream in the meta layer, deletes the stream
  4. Later the same stream appears with 0 messages, on that same node.

If the stream doesn't exist in the meta layer after syncing up - why does the stream appear on the same node moments later?
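
(For anyone comparing notes: one way to cross-check what the meta layer currently holds for a stream is to ask a healthy node with the nats CLI; a sketch, with the server address being illustrative:

nats stream info OR --server nats://gq-nats-2:4222
nats server report jetstream --server nats://gq-nats-2:4222

If the meta leader has no assignment, the first command would typically report the stream as not found even though the data still sits on the node's disk.)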

@derekcollison
Member

Could you add some more information about "restarts in a bad state"?

@thorntonmc
Author

thorntonmc commented May 3, 2024

Could you add some more information about "restarts in a bad state"?

Here's the timeline - at 13:56 UTC a new stream is created using the NATS client, bound to that node (gq-nats-1). We then notice these logs, with what appears to be the node re-detecting every consumer for the stream - this happens several hundred times:

[1] 2024/05/02 13:56:20.039173 [INF] JetStream cluster new consumer leader for '$G > OR > [redacted]

After the "new consumer" logs stop - wee see these errors:

[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'
[1] 2024/05/02 13:56:27.402715 [WRN] Internal subscription on "$JS.API.STREAM.INFO.BR" took too long: 3.508628383s
[1] 2024/05/02 13:56:27.402701 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:27.402561 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.402500 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:24.593972 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'
[1] 2024/05/02 13:56:24.593812 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'
[1] 2024/05/02 13:56:24.363958 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:23.863979 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:23.364262 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:22.864354 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:22.363929 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:21.863344 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:21.363622 [INF] Scaling down '$G > LR' to [gq-nats-1]

Followed by repeated logging of the following:

[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled

and

[1] 2024/05/02 13:56:33.533104 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error apply commit for 2: raft: could not load entry from WAL

At this point the node is unavailable, as are all the streams located on it, which is what prompted the restart of the cluster using kubectl rollout restart.
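
The restart itself was a standard StatefulSet rollout, roughly like this (the StatefulSet name and namespace are inferred from the route hostname gq-nats-0.gq-nats.generic-queue.svc.cluster.local in the logs above, so treat them as assumptions):

kubectl rollout restart statefulset/gq-nats -n generic-queue
kubectl rollout status statefulset/gq-nats -n generic-queue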

@derekcollison
Member

It says it ran out of resources and shut down JetStream. We should address that first.
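
For reference, the per-server storage ceiling reported in the startup banner (Max Storage: 1000.00 GB above) comes from the jetstream block in nats.conf. A minimal sketch with illustrative values (not the reporter's actual config) looks like this; note the configured limit only helps if the backing volume really has that much free space, otherwise the server can still hit "out of resources":

jetstream {
  store_dir: "/nats/jetstream"
  # Illustrative limits; they must fit within the actual volume size
  max_memory_store: 1GB
  max_file_store: 1000GB
}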

github-actions bot added the stale label on Jun 29, 2024
@VeryStrongFingers

VeryStrongFingers commented Aug 19, 2024

@derekcollison I've just run into a similar occurrence on NATS 2.9.20 with a cluster of 3 nodes, where my single-replica stream got wiped.
Not sure if it's the same issue as OP though, let me know if you'd rather I create a separate issue.

tl;dr: Stream "AAAAA" was recovered but then wiped in the same process by nats-server after a restart (due to underlying jetstream volume was 100% full)

Cluster:

  • nats-1 (stream "AAAAA" data persists here (leader?))
  • nats-2
  • nats-3

Streams:

  • "AAAAA" = 1 replica, jetstream data lives on nats-1
  • "BBBBB" = 3 replica, was not affected by nats-1 outage

Sequence of events:

  1. JetStream volume (disk) hits 100% used on the nats-1 server
  2. Attempted to restart the nats-1 pod (before checking anything, obviously a restart typically fixes everything)
    • Failed: nats-1 was stuck in a startup crash loop with the logs below
nats-1 logs on crash/bootloop
❯ k logs nats-1 --container nats
[547] 2024/08/19 02:04:44.222427 [INF] Starting nats-server
[547] 2024/08/19 02:04:44.222464 [INF]   Version:  2.9.20
[547] 2024/08/19 02:04:44.222467 [INF]   Git:      [97dd7cb]
[547] 2024/08/19 02:04:44.222469 [INF]   Cluster:  nats
[547] 2024/08/19 02:04:44.222471 [INF]   Name:     nats-1
[547] 2024/08/19 02:04:44.222474 [INF]   Node:     yrzKKRBu
[547] 2024/08/19 02:04:44.222477 [INF]   ID:       ~
[547] 2024/08/19 02:04:44.222483 [INF] Using configuration file: /etc/nats-config/nats.conf
[547] 2024/08/19 02:04:44.223363 [INF] Starting http monitor on 0.0.0.0:8222
[547] 2024/08/19 02:04:44.223397 [INF] Starting JetStream
[547] 2024/08/19 02:04:44.223506 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[547] 2024/08/19 02:04:44.223508 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[547] 2024/08/19 02:04:44.223510 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[547] 2024/08/19 02:04:44.223511 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[547] 2024/08/19 02:04:44.223512 [INF]
[547] 2024/08/19 02:04:44.223513 [INF]          https://docs.nats.io/jetstream
[547] 2024/08/19 02:04:44.223514 [INF]
[547] 2024/08/19 02:04:44.223515 [INF] ---------------- JETSTREAM ----------------
[547] 2024/08/19 02:04:44.223517 [INF]   Max Memory:      0 B
[547] 2024/08/19 02:04:44.223519 [INF]   Max Storage:     10.00 GB
[547] 2024/08/19 02:04:44.223521 [INF]   Store Directory: "/data/jetstream"
[547] 2024/08/19 02:04:44.223522 [INF] -------------------------------------------
[547] 2024/08/19 02:04:44.223828 [INF]   Starting restore for stream '$G > AAAAA'
[547] 2024/08/19 02:04:45.007605 [INF]   Restored 19,792,763 messages for stream '$G > AAAAA'
[547] 2024/08/19 02:04:45.007761 [INF]   Starting restore for stream '$G > BBBBB'
[547] 2024/08/19 02:04:45.008076 [INF]   Restored 7,386 messages for stream '$G > BBBBB'
[547] 2024/08/19 02:04:45.008110 [INF]   Recovering 8 consumers for stream - '$G > AAAAA'
[547] 2024/08/19 02:04:45.008955 [INF] Starting JetStream cluster
[547] 2024/08/19 02:04:45.008964 [INF] Creating JetStream metadata controller
[547] 2024/08/19 02:04:45.009279 [INF] JetStream cluster recovering state
[547] 2024/08/19 02:04:45.016651 [WRN] Could not start metadata controller: could not create snapshots directory - mkdir /data/jetstream/$SYS/_js_/_meta_/snapshots: no space left on device
[547] 2024/08/19 02:04:45.016663 [FTL] Can't start JetStream: could not create snapshots directory - mkdir /data/jetstream/$SYS/_js_/_meta_/snapshots: no space left on device
  3. Volume was resized (see the resize sketch at the end of this comment)
  4. nats-1 was restarted again, which is when the "AAAAA" stream was recovered then deleted (nats-1 is online with the cluster now)
nats-1 logs on restart (stream recover+wipe)
❯ k logs nats-1 --container nats
[52] 2024/08/19 02:26:52.121966 [INF] Starting nats-server
[52] 2024/08/19 02:26:52.122022 [INF]   Version:  2.9.20
[52] 2024/08/19 02:26:52.122038 [INF]   Git:      [97dd7cb]
[52] 2024/08/19 02:26:52.122051 [INF]   Cluster:  nats
[52] 2024/08/19 02:26:52.122064 [INF]   Name:     nats-1
[52] 2024/08/19 02:26:52.122084 [INF]   Node:     yrzKKRBu
[52] 2024/08/19 02:26:52.122098 [INF]   ID:       ~
[52] 2024/08/19 02:26:52.122330 [INF] Using configuration file: /etc/nats-config/nats.conf
[52] 2024/08/19 02:26:52.123241 [INF] Starting http monitor on 0.0.0.0:8222
[52] 2024/08/19 02:26:52.123273 [INF] Starting JetStream
[52] 2024/08/19 02:26:52.123360 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[52] 2024/08/19 02:26:52.123362 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[52] 2024/08/19 02:26:52.123363 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[52] 2024/08/19 02:26:52.123365 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[52] 2024/08/19 02:26:52.123366 [INF]
[52] 2024/08/19 02:26:52.123367 [INF]          https://docs.nats.io/jetstream
[52] 2024/08/19 02:26:52.123368 [INF]
[52] 2024/08/19 02:26:52.123369 [INF] ---------------- JETSTREAM ----------------
[52] 2024/08/19 02:26:52.123371 [INF]   Max Memory:      0 B
[52] 2024/08/19 02:26:52.123373 [INF]   Max Storage:     50.00 GB
[52] 2024/08/19 02:26:52.123374 [INF]   Store Directory: "/data/jetstream"
[52] 2024/08/19 02:26:52.123376 [INF] -------------------------------------------
[52] 2024/08/19 02:26:52.123667 [INF]   Starting restore for stream '$G > AAAAA'
[52] 2024/08/19 02:26:52.909718 [INF]   Restored 19,792,763 messages for stream '$G > AAAAA'
[52] 2024/08/19 02:26:52.909834 [INF]   Starting restore for stream '$G > BBBBB'
[52] 2024/08/19 02:26:52.910189 [INF]   Restored 7,386 messages for stream '$G > BBBBB'
[52] 2024/08/19 02:26:52.910323 [INF]   Recovering 8 consumers for stream - '$G > AAAAA'
[52] 2024/08/19 02:26:52.911199 [INF] Starting JetStream cluster
[52] 2024/08/19 02:26:52.911208 [INF] Creating JetStream metadata controller
[52] 2024/08/19 02:26:52.911540 [INF] JetStream cluster recovering state
[52] 2024/08/19 02:26:52.911954 [INF] Listening for client connections on 0.0.0.0:4222
[52] 2024/08/19 02:26:52.911978 [WRN] Detected orphaned stream '$G > AAAAA', will cleanup
[52] 2024/08/19 02:26:52.912125 [INF] Server is ready
[52] 2024/08/19 02:26:52.912145 [INF] Cluster name is nats
[52] 2024/08/19 02:26:52.912176 [INF] Listening for route connections on 0.0.0.0:6222
[52] 2024/08/19 02:26:52.916458 [INF] 10.2.4.154:6222 - rid:28 - Route connection created
[52] 2024/08/19 02:26:52.916895 [INF] 10.2.3.147:6222 - rid:29 - Route connection created
[52] 2024/08/19 02:26:52.934425 [INF] 10.2.3.147:60874 - rid:30 - Route connection created
[52] 2024/08/19 02:26:52.935030 [INF] 10.2.3.147:6222 - rid:29 - Router connection closed: Duplicate Route
[52] 2024/08/19 02:26:52.936331 [INF] 10.2.3.147:60874 - rid:30 - Router connection closed: Client Closed
[52] 2024/08/19 02:26:52.972944 [INF] 10.2.3.147:60902 - rid:31 - Route connection created
[52] 2024/08/19 02:26:52.973135 [INF] 10.2.3.147:60890 - rid:32 - Route connection created
[52] 2024/08/19 02:26:52.973422 [INF] 10.2.3.147:60902 - rid:31 - Router connection closed: Duplicate Route
[52] 2024/08/19 02:26:53.012803 [WRN] Waiting for routing to be established...
[52] 2024/08/19 02:26:53.018662 [INF] 10.2.3.147:60904 - rid:33 - Route connection created
[52] 2024/08/19 02:26:53.019046 [INF] 10.2.3.147:60904 - rid:33 - Router connection closed: Duplicate Route
[52] 2024/08/19 02:26:53.075148 [INF] 10.2.4.154:48354 - rid:34 - Route connection created
[52] 2024/08/19 02:26:53.075391 [INF] 10.2.4.154:48354 - rid:34 - Router connection closed: Duplicate Route
[52] 2024/08/19 02:26:53.231558 [WRN] Detected orphaned stream '$G > BBBBB', will cleanup
[52] 2024/08/19 02:26:53.232347 [INF] JetStream cluster new metadata leader: nats-2/nats
[52] 2024/08/19 02:26:53.235063 [INF] JetStream cluster new stream leader for '$G > AAAAA'
[52] 2024/08/19 02:26:53.236201 [INF] JetStream cluster new consumer leader for '$G > AAAAA > logger-vsfds'
[52] 2024/08/19 02:26:53.244121 [WRN] RAFT [yrzKKRBu - _meta_] Falling behind in health check, commit 3128454 != applied 0
[52] 2024/08/19 02:26:53.244132 [WRN] JetStream is not current with the meta leader
[52] 2024/08/19 02:26:53.246372 [INF] JetStream cluster new consumer leader for '$G > BBBBB > PMGQkC39SrLs0pvkd6zCDB_1'
[52] 2024/08/19 02:26:53.249819 [INF] JetStream cluster new consumer leader for '$G > AAAAA > wW0rW6BmVAXZTthq2HA5dc_3'
[52] 2024/08/19 02:26:53.754579 [INF] 10.2.3.147:60908 - rid:102 - Route connection created
[52] 2024/08/19 02:26:53.754732 [INF] 10.2.3.147:60908 - rid:102 - Router connection closed: Duplicate Route
[52] 2024/08/19 02:26:53.965672 [INF] 10.2.3.147:6222 - rid:103 - Route connection created
[52] 2024/08/19 02:26:53.966182 [INF] 10.2.3.147:6222 - rid:103 - Router connection closed: Duplicate Route
[52] 2024/08/19 02:26:57.951420 [INF] JetStream cluster new consumer leader for '$G > BBBBB > EO8RDR3Ob3zTwDuxVizAiI_1'
  1. "AAAAA" was effectively restarted, back to sequence 0, no data on disk despite the Restored 19,792,763 messages for stream '$G > AAAAA' on the curent process logs

Warning logs excerpt:

[52] 2024/08/19 02:26:52.911978 [WRN] Detected orphaned stream '$G > AAAAA', will cleanup
[52] 2024/08/19 02:26:53.012803 [WRN] Waiting for routing to be established...
[52] 2024/08/19 02:26:53.231558 [WRN] Detected orphaned stream '$G > BBBBB', will cleanup
[52] 2024/08/19 02:26:53.244121 [WRN] RAFT [yrzKKRBu - _meta_] Falling behind in health check, commit 3128454 != applied 0
[52] 2024/08/19 02:26:53.244132 [WRN] JetStream is not current with the meta leader
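
For reference, the volume resize in step 3 was a standard PVC expansion; a hedged sketch (the PVC name and size are illustrative, and the StorageClass needs allowVolumeExpansion: true):

kubectl patch pvc nats-js-nats-1 -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
kubectl delete pod nats-1    # the StatefulSet recreates the pod against the larger volume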

@wallyqs
Member

wallyqs commented Aug 19, 2024

Fixed via #5767

@wallyqs wallyqs closed this as completed Aug 19, 2024