Multiple simultaneous snapshots result in silent failure and/or corruption of at least one snapshot #10371
Comments
brandond changed the title from "Multiple simultaneous snapshots result in silent failure of at least one snapshot" to "Multiple simultaneous snapshots result in silent failure and/or corruption of at least one snapshot" on Jun 18, 2024
FYI, this is NOT a regression introduced by #9816; older releases do this too:
root@k3s-server-1:/# k3s --version
k3s version v1.29.2+k3s1 (86f10213)
go version go1.21.7
root@k3s-server-1:/# k3s etcd-snapshot save & k3s etcd-snapshot save; sleep 5
INFO[0000] Saving etcd snapshot to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459
{"level":"info","ts":"2024-06-19T16:44:19.419289Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459.part"}
INFO[0000] Saving etcd snapshot to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459
{"level":"info","ts":"2024-06-19T16:44:19.419976Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459.part"}
{"level":"info","ts":"2024-06-19T16:44:19.42158Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2024-06-19T16:44:19.421614Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
{"level":"info","ts":"2024-06-19T16:44:19.422379Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2024-06-19T16:44:19.422565Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
{"level":"info","ts":"2024-06-19T16:44:19.444205Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2024-06-19T16:44:19.444299Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2024-06-19T16:44:19.453012Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"3.8 MB","took":"now"}
{"level":"info","ts":"2024-06-19T16:44:19.45302Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"3.8 MB","took":"now"}
{"level":"info","ts":"2024-06-19T16:44:19.453083Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459"}
ERRO[0000] Failed to take etcd snapshot: could not rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459.part to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459 (rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459.part /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459: no such file or directory)
INFO[0000] Reconciling ETCDSnapshotFile resources
INFO[0000] Reconciling ETCDSnapshotFile resources
INFO[0000] Reconciliation of ETCDSnapshotFile resources complete
INFO[0000] Reconciliation of ETCDSnapshotFile resources complete
[1]+ Done k3s etcd-snapshot save
Validated on Version:
-$ k3s version v1.30.2-rc3+k3s1 (aa4794b3)
Environment Details
Infrastructure
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
Steps to validate the fix
Reproduction Issue:
Validation Results:
When running with embedded etcd, the semaphore that is supposed to prevent more than one snapshot from running at a time does not work. If multiple snapshots are started at the same time, one succeeds while the other fails silently from the CLI's perspective. The failure is only reported as an error on the server side:
CLI:
Service log:
ERRO[0116] Failed to take etcd snapshot: could not rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355.part to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355 (rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355.part /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355: no such file or directory)
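For illustration, here is a minimal sketch of the behaviour one would expect from a working guard: the second caller gets an explicit error back instead of failing silently. This is not the k3s implementation; `snapshotGuard` and `errSnapshotInProgress` are hypothetical names, and the sketch assumes both requests are handled inside the same server process.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errSnapshotInProgress is a hypothetical sentinel error, not a k3s identifier.
var errSnapshotInProgress = errors.New("snapshot already in progress")

// snapshotGuard allows at most one snapshot at a time within one process.
type snapshotGuard struct {
	mu sync.Mutex
}

// run calls take only if no other snapshot is running. A concurrent caller
// is rejected with errSnapshotInProgress so the failure is visible to it.
func (g *snapshotGuard) run(take func() error) error {
	if !g.mu.TryLock() { // TryLock requires Go 1.18 or newer
		return errSnapshotInProgress
	}
	defer g.mu.Unlock()
	return take()
}

func main() {
	var g snapshotGuard
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan error, 1)

	// First snapshot: holds the guard until release is closed.
	go func() {
		done <- g.run(func() error {
			close(started)
			<-release
			return nil
		})
	}()

	<-started
	// Second snapshot arrives while the first is still running.
	err := g.run(func() error { return nil })
	fmt.Println("second snapshot:", err)

	close(release)
	fmt.Println("first snapshot:", <-done)
}
```

Whether a rejected request should error out or wait for the running snapshot is a design choice; the point here is only that the caller learns what happened.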
The two snapshots use the same timestamp in the file name, and race to create the same snapshot and temp files.
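The collision can be sketched as follows. The name pattern is inferred from the log lines in this issue (`on-demand-<node>-<unix seconds>`), and `snapshotName`, `createPartExclusive`, and `demo` are hypothetical helpers, not k3s or etcd functions. Opening the `.part` file with `O_EXCL` is shown as one possible way to make the second writer fail up front instead of sharing the file; it is not necessarily how this should be fixed.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// snapshotName follows the pattern seen in the logs:
// <prefix>-<node>-<unix seconds>. Second resolution means two snapshots
// started within the same second get the same name.
func snapshotName(prefix, node string, unixSeconds int64) string {
	return fmt.Sprintf("%s-%s-%d", prefix, node, unixSeconds)
}

// createPartExclusive creates <dir>/<name>.part and fails with an
// "already exists" error if another writer created it first.
func createPartExclusive(dir, name string) (*os.File, error) {
	path := filepath.Join(dir, name+".part")
	return os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
}

// demo reports whether two same-second names collide and whether the
// second exclusive create is rejected.
func demo() (namesCollide, secondRejected bool, err error) {
	dir, err := os.MkdirTemp("", "snapshots")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	a := snapshotName("on-demand", "k3s-server-1", 1718815459)
	b := snapshotName("on-demand", "k3s-server-1", 1718815459)

	f, err := createPartExclusive(dir, a)
	if err != nil {
		return false, false, err
	}
	defer f.Close()

	_, err2 := createPartExclusive(dir, b)
	return a == b, errors.Is(err2, fs.ErrExist), nil
}

func main() {
	collide, rejected, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("names collide:", collide)
	fmt.Println("second create rejected:", rejected)
}
```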
starting kubernetes: preparing server: start managed database: expected sha256 [255 75 90 78 90 111 68 23 86 217 93 179 107 108 119 123 137 66 118 159 13 222 254 55 165 179 240 118 42 106 52 227], got [78 59 101 77 122 31 226 10 108 170 71 20 56 213 159 175 121 211 215 158 194 45 204 220 42 9 244 230 144 201 194 32]
https://github.com/etcd-io/etcd/blob/v3.5.11/etcdutl/snapshot/v3_snapshot.go#L450
https://github.com/etcd-io/etcd/blob/v3.5.11/etcdutl/snapshot/v3_snapshot.go#L308
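For context on the `expected sha256 ... got ...` error: as far as I understand the etcd snapshot format, a SHA-256 digest of the database contents is appended to the end of the snapshot file on save and re-checked on restore. The sketch below illustrates that kind of check under that assumption; `verifyTrailingHash` and `appendHash` are hypothetical helpers and not the etcd functions linked above.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
)

// appendHash returns body followed by its SHA-256 digest.
func appendHash(body []byte) []byte {
	sum := sha256.Sum256(body)
	return append(append([]byte{}, body...), sum[:]...)
}

// verifyTrailingHash treats the last 32 bytes of data as the SHA-256
// digest of everything before them and reports a mismatch as an error.
func verifyTrailingHash(data []byte) error {
	if len(data) < sha256.Size {
		return errors.New("snapshot too short to contain a sha256 digest")
	}
	body := data[:len(data)-sha256.Size]
	want := data[len(data)-sha256.Size:]
	got := sha256.Sum256(body)
	if !bytes.Equal(want, got[:]) {
		return fmt.Errorf("expected sha256 %v, got %v", want, got[:])
	}
	return nil
}

func main() {
	good := appendHash([]byte("snapshot contents"))
	fmt.Println("intact:", verifyTrailingHash(good))

	// Flip one byte to stand in for two writers clobbering the same file.
	corrupt := append([]byte{}, good...)
	corrupt[0] ^= 0xff
	fmt.Println("corrupt rejected:", verifyTrailingHash(corrupt) != nil)
}
```

If two snapshot processes write to the same file, the body and the trailing digest may no longer match, which would be consistent with the restore failure shown above.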
The full stack on the restore error is: