Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple simultaneous snapshots result in silent failure and/or corruption of at least one snapshot #10371

Closed
brandond opened this issue Jun 18, 2024 · 2 comments

Comments

@brandond
Copy link
Contributor

brandond commented Jun 18, 2024

When running with embedded etcd, the semaphore that prevents more than one snapshot from being executed at a time is not working. If multiple snapshots are run at the same time, one will succeed, while the other will silently fail - from the CLI's perspective. There is an error reported on the server side:

CLI:

root@k3s-server-1:~# k3s etcd-snapshot save & k3s etcd-snapshot save; sleep 5
INFO[0000] Snapshot on-demand-k3s-server-1-1718750355 saved.
[1]+  Done                       k3s etcd-snapshot save
root@k3s-server-1:~#

Service log:

INFO[0116] Saving etcd snapshot to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355
{"level":"info","ts":"2024-06-18T22:39:15.15636Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355.part"}
{"level":"info","ts":"2024-06-18T22:39:15.158226Z","logger":"etcd-client.client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2024-06-18T22:39:15.158266Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
{"level":"info","ts":"2024-06-18T22:39:15.158438Z","caller":"v3rpc/maintenance.go:126","msg":"sending database snapshot to client","total-bytes":2936832,"size":"2.9 MB"}
INFO[0116] Saving etcd snapshot to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355
{"level":"info","ts":"2024-06-18T22:39:15.172618Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355.part"}
{"level":"info","ts":"2024-06-18T22:39:15.173781Z","caller":"v3rpc/maintenance.go:166","msg":"sending database sha256 checksum to client","total-bytes":2936832,"checksum-size":32}
{"level":"info","ts":"2024-06-18T22:39:15.173823Z","caller":"v3rpc/maintenance.go:175","msg":"successfully sent database snapshot to client","total-bytes":2936832,"size":"2.9 MB","took":"now"}
{"level":"info","ts":"2024-06-18T22:39:15.173926Z","logger":"etcd-client.client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2024-06-18T22:39:15.174275Z","logger":"etcd-client.client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2024-06-18T22:39:15.174303Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
{"level":"info","ts":"2024-06-18T22:39:15.174516Z","caller":"v3rpc/maintenance.go:126","msg":"sending database snapshot to client","total-bytes":2936832,"size":"2.9 MB"}
{"level":"info","ts":"2024-06-18T22:39:15.177892Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"2.9 MB","took":"now"}
{"level":"info","ts":"2024-06-18T22:39:15.177961Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355"}
INFO[0116] Reconciling ETCDSnapshotFile resources
I0618 22:39:15.183024      55 event.go:307] "Event occurred" object="local-on-demand-k3s-server-1-1718750355-f4beb3" fieldPath="" kind="ETCDSnapshotFile" apiVersion="k3s.cattle.io/v1" type="Normal" reason="ETCDSnapshotCreated" message="Snapshot on-demand-k3s-server-1-1718750355 saved on k3s-server-1"
{"level":"info","ts":"2024-06-18T22:39:15.192288Z","caller":"v3rpc/maintenance.go:166","msg":"sending database sha256 checksum to client","total-bytes":2936832,"checksum-size":32}
{"level":"info","ts":"2024-06-18T22:39:15.192392Z","caller":"v3rpc/maintenance.go:175","msg":"successfully sent database snapshot to client","total-bytes":2936832,"size":"2.9 MB","took":"now"}
{"level":"info","ts":"2024-06-18T22:39:15.192535Z","logger":"etcd-client.client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2024-06-18T22:39:15.199162Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"2.9 MB","took":"now"}
ERRO[0116] Failed to take etcd snapshot: could not rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355.part to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355 (rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355.part /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355: no such file or directory)
INFO[0116] Reconciliation of ETCDSnapshotFile resources complete
INFO[0116] Reconciling ETCDSnapshotFile resources
I0618 22:39:15.214006      55 event.go:307] "Event occurred" object="local-on-demand-k3s-server-1-1718750355-dcf0a5" fieldPath="" kind="ETCDSnapshotFile" apiVersion="k3s.cattle.io/v1" type="Warning" reason="ETCDSnapshotFailed" message="Failed to save snapshot on-demand-k3s-server-1-1718750355 on k3s-server-1: could not rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355.part to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355 (rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355.part /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355: no such file or directory)"
INFO[0116] Reconciliation of ETCDSnapshotFile resources complete

ERRO[0116] Failed to take etcd snapshot: could not rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355.part to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355 (rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355.part /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718750355: no such file or directory)

The two snapshots use the same timestamp in the file name, and race to create and the same snapshot and temp files.

The full stack on the restore error is:

go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore
	/go/pkg/mod/github.com/k3s-io/etcd/etcdutl/v3@v3.5.9-k3s1/snapshot/v3_snapshot.go:254
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).Restore
	/go/pkg/mod/github.com/k3s-io/k3s@v1.28.10-0.20240517003114-1c33ee685998/pkg/etcd/etcd.go:1486
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).Reset
	/go/pkg/mod/github.com/k3s-io/k3s@v1.28.10-0.20240517003114-1c33ee685998/pkg/etcd/etcd.go:410
github.com/k3s-io/k3s/pkg/cluster.(*Cluster).start
	/go/pkg/mod/github.com/k3s-io/k3s@v1.28.10-0.20240517003114-1c33ee685998/pkg/cluster/managed.go:71
github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start
	/go/pkg/mod/github.com/k3s-io/k3s@v1.28.10-0.20240517003114-1c33ee685998/pkg/cluster/cluster.go:91
github.com/k3s-io/k3s/pkg/daemons/control.prepare
	/go/pkg/mod/github.com/k3s-io/k3s@v1.28.10-0.20240517003114-1c33ee685998/pkg/daemons/control/server.go:261
github.com/k3s-io/k3s/pkg/daemons/control.Server
	/go/pkg/mod/github.com/k3s-io/k3s@v1.28.10-0.20240517003114-1c33ee685998/pkg/daemons/control/server.go:35
github.com/k3s-io/k3s/pkg/server.StartServer
	/go/pkg/mod/github.com/k3s-io/k3s@v1.28.10-0.20240517003114-1c33ee685998/pkg/server/server.go:56
github.com/k3s-io/k3s/pkg/cli/server.run
	/go/pkg/mod/github.com/k3s-io/k3s@v1.28.10-0.20240517003114-1c33ee685998/pkg/cli/server/server.go:498
github.com/k3s-io/k3s/pkg/cli/server.RunWithControllers
	/go/pkg/mod/github.com/k3s-io/k3s@v1.28.10-0.20240517003114-1c33ee685998/pkg/cli/server/server.go:48
github.com/rancher/rke2/pkg/rke2.Server
	/source/pkg/rke2/rke2.go:123
github.com/rancher/rke2/pkg/cli/cmds.ServerRun
	/source/pkg/cli/cmds/server.go:168
github.com/urfave/cli.HandleAction
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:524
github.com/urfave/cli.Command.Run
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:175
github.com/urfave/cli.(*App).Run
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:277
main.main
	/source/main.go:23
runtime.main
	/usr/local/go/src/runtime/proc.go:267
@brandond
Copy link
Contributor Author

brandond commented Jun 19, 2024

FYI this is NOT a regression introduced by #9816, older releases would do this too:

root@k3s-server-1:/# k3s --version
k3s version v1.29.2+k3s1 (86f10213)
go version go1.21.7

root@k3s-server-1:/# k3s etcd-snapshot save & k3s etcd-snapshot save; sleep 5
INFO[0000] Saving etcd snapshot to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459
{"level":"info","ts":"2024-06-19T16:44:19.419289Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459.part"}
INFO[0000] Saving etcd snapshot to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459
{"level":"info","ts":"2024-06-19T16:44:19.419976Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459.part"}
{"level":"info","ts":"2024-06-19T16:44:19.42158Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2024-06-19T16:44:19.421614Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
{"level":"info","ts":"2024-06-19T16:44:19.422379Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2024-06-19T16:44:19.422565Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
{"level":"info","ts":"2024-06-19T16:44:19.444205Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2024-06-19T16:44:19.444299Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2024-06-19T16:44:19.453012Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"3.8 MB","took":"now"}
{"level":"info","ts":"2024-06-19T16:44:19.45302Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"3.8 MB","took":"now"}
{"level":"info","ts":"2024-06-19T16:44:19.453083Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459"}
ERRO[0000] Failed to take etcd snapshot: could not rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459.part to /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459 (rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459.part /var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-1-1718815459: no such file or directory)
INFO[0000] Reconciling ETCDSnapshotFile resources
INFO[0000] Reconciling ETCDSnapshotFile resources
INFO[0000] Reconciliation of ETCDSnapshotFile resources complete
INFO[0000] Reconciliation of ETCDSnapshotFile resources complete
[1]+  Done                       k3s etcd-snapshot save

@fmoral2
Copy link
Contributor

fmoral2 commented Jun 21, 2024

Validated on Version:

-$  k3s version v1.30.2-rc3+k3s1 (aa4794b3)

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
ubuntu
AMD

Cluster Configuration:
-3 node server
-1 node agents

Steps to validate the fix

  1. Install k3s etcd embedded
  2. Take etcd snapshot more than once at the same time
  3. validate error response

Reproduction Issue:

 

 k3s -v
k3s version v1.27.15-rc1+k3s1 (102e42a5)
go version go1.21.11


k3s etcd-snapshot save & k3s etcd-snapshot save; sleep 5

WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping 
WARN[0000] Unknown flag --tls-san found in config.yaml, skipping 
WARN[0000] Unknown flag --cluster-init found in config.yaml, skipping 
WARN[0000] Unknown flag --protect-kernel-defaults found in config.yaml, skipping 
WARN[0000] Unknown flag --selinux found in config.yaml, skipping 
WARN[0000] Unknown flag --node-external-ip found in config.yaml, skipping 
WARN[0000] Unknown flag --node-ip found in config.yaml, skipping 
WARN[0000] Unknown flag --secrets-encryption found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-controller-manager-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-controller-manager-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --node-label found in config.yaml, skipping 
WARN[0000] Unknown flag --node-label found in config.yaml, skipping 
WARN[0000] Unknown flag --node-label found in config.yaml, skipping 
FATA[0000] see server log for details: Internal error occurred: etcd-snapshot error ID 20267 
INFO[0000] Snapshot on-demand-ip- .us-east-2.compute.internal-1719002637 saved. 
[1]+  Done                    k3s etcd-snapshot save



$ journalctl -xeu k3s.service | grep "snapshot"

Jun 21 20:23:09 ip-1  k3s[13884]: time="2024-06-21T20:23:09Z" level=error msg="Failed to take etcd snapshot: could not rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- .us-east-2.compute.internal-1719001390.part to /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- .us-east-2.compute.internal-1719001390 (rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- .us-east-2.compute.internal-1719001390.part /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- .us-east-2.compute.internal-1719001390: no such file or directory)"
Jun 21 20:23:09 ip-1  k3s[13884]: I0621 20:23:09.912916   13884 event.go:307] "Event occurred" object="local-on-demand-ip- .us-east-2.compute.internal-1719001390-c93945" fieldPath="" kind="ETCDSnapshotFile" apiVersion="k3s.cattle.io/v1" type="Warning" reason="ETCDSnapshotFailed" message="Failed to save snapshot on-demand-ip-172-31-1-64.us-east-2.compute.internal-1719001390 on ip-1 us-east-2.compute.internal: could not rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip-1 .us-east-2.compute.internal-1719001390.part to /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- .us-east-2.compute.internal-1719001390 (rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip-1 us-east-2.compute.internal-1719001390.part /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- us-east-2.compute.internal-1719001390: no such file or directory)"

Validation Results:

 
 




k3s etcd-snapshot save & k3s etcd-snapshot save; sleep 5
FATA[0000] see server log for details: Internal error occurred: etcd-snapshot error ID 14571 
INFO[0000] Snapshot on-demand-ip .us-east-2.compute.internal-1719002785 saved. 
[1]+  Done                    k3s etcd-snapshot save



$ journalctl -xeu k3s.service | grep "snapshot"

 Jun 21 20:20:01 ip-172-31-7-229 k3s[13083]: time="2024-06-21T20:20:01Z" level=info msg="Starting managed etcd snapshot ConfigMap controller"
Jun 21 20:20:02 ip-172-31-7-229 k3s[13083]: time="2024-06-21T20:20:02Z" level=info msg="Reconciling snapshot ConfigMap data"
Jun 21 20:22:52 ip-172-31-7-229 k3s[13083]: time="2024-06-21T20:22:52Z" level=info msg="Saving etcd snapshot to /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- us-east-2.compute.internal-1719001372"
Jun 21 20:22:52 ip-172-31-7-229 k3s[13083]: {"level":"info","ts":"2024-06-21T20:22:52.098617Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-ip-1 -229.us-east-2.compute.internal-1719001372.part"}
Jun 21 20:22:52 ip-  k3s[13083]: {"level":"info","ts":"2024-06-21T20:22:52.101602Z","logger":"etcd-client.client","caller":"v3@v3.5.13-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Jun 21 20:22:52 ip-  k3s[13083]: {"level":"info","ts":"2024-06-21T20:22:52.101781Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Jun 21 20:22:52 ip- k3s[13083]: {"level":"info","ts":"2024-06-21T20:22:52.111894Z","caller":"v3rpc/maintenance.go:126","msg":"sending database snapshot to client","total-bytes":5713920,"size":"5.7 MB"}
Jun 21 20:22:52 ip-  k3s[13083]: time="2024-06-21T20:22:52Z" level=error msg="etcd-snapshot error ID 11914: snapshot save already in progress"
Jun 21 20:22:52 ip- k3s[13083]: time="2024-06-21T20:22:52Z" level=error msg="Sending HTTP 500 response to 127.0.0.1:50686: etcd-snapshot error ID 11914"
Jun 21 20:22:52 ip-  k3s[13083]: {"level":"info","ts":"2024-06-21T20:22:52.160236Z","caller":"v3rpc/maintenance.go:175","msg":"successfully sent database snapshot to client","total-bytes":5713920,"size":"5.7 MB","took":"now"}


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done Issue
Development

No branches or pull requests

3 participants