[Release-1.28] - Multiple simultaneous snapshots result in silent failure and/or corruption of at least one snapshot #10374

Closed
brandond opened this issue Jun 19, 2024 · 1 comment

@brandond
Member

Backport the fix for "Multiple simultaneous snapshots result in silent failure and/or corruption of at least one snapshot".

@fmoral2
Contributor

fmoral2 commented Jun 21, 2024

Validated on Version:

-$ k3s version v1.28.11-rc3+k3s1 (617b0e84)

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
Ubuntu
AMD

Cluster Configuration:
- 3 server nodes
- 1 agent node

Steps to validate the fix

  1. Install k3s with embedded etcd.
  2. Start more than one etcd snapshot save at the same time.
  3. Validate the error response.
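
The steps above can be sketched as a small shell helper that starts the same command twice in parallel and reports both exit codes. The function name `run_pair` is hypothetical and not part of k3s. The only k3s command assumed is `k3s etcd-snapshot save`, as used in this comment.

```shell
# run_pair CMD [ARGS...]: start the same command twice in parallel
# and print the exit code of each run. Hypothetical helper name.
run_pair() {
  "$@" & pid1=$!
  "$@" & pid2=$!
  rc1=0; wait "$pid1" || rc1=$?
  rc2=0; wait "$pid2" || rc2=$?
  echo "first=$rc1 second=$rc2"
}

# Usage on a server node (not executed here):
#   run_pair k3s etcd-snapshot save
```

Going by the validation output in this comment, a fixed build should report one run exiting 0 and the other exiting 1. Which of the two is rejected likely depends on timing.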

Reproducing the issue:

 

 k3s -v
k3s version v1.27.15-rc1+k3s1 (102e42a5)
go version go1.21.11


k3s etcd-snapshot save & k3s etcd-snapshot save; sleep 5

WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping 
WARN[0000] Unknown flag --tls-san found in config.yaml, skipping 
WARN[0000] Unknown flag --cluster-init found in config.yaml, skipping 
WARN[0000] Unknown flag --protect-kernel-defaults found in config.yaml, skipping 
WARN[0000] Unknown flag --selinux found in config.yaml, skipping 
WARN[0000] Unknown flag --node-external-ip found in config.yaml, skipping 
WARN[0000] Unknown flag --node-ip found in config.yaml, skipping 
WARN[0000] Unknown flag --secrets-encryption found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-apiserver-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-controller-manager-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kube-controller-manager-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --kubelet-arg found in config.yaml, skipping 
WARN[0000] Unknown flag --node-label found in config.yaml, skipping 
WARN[0000] Unknown flag --node-label found in config.yaml, skipping 
WARN[0000] Unknown flag --node-label found in config.yaml, skipping 
FATA[0000] see server log for details: Internal error occurred: etcd-snapshot error ID 20267 
INFO[0000] Snapshot on-demand-ip- .us-east-2.compute.internal-1719002637 saved. 
[1]+  Done                    k3s etcd-snapshot save



$ journalctl -xeu k3s.service | grep "snapshot"

Jun 21 20:23:09 ip- k3s[13884]: time="2024-06-21T20:23:09Z" level=error msg="Failed to take etcd snapshot: could not rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- .us-east-2.compute.internal-1719001390.part to /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- us-east-2.compute.internal-1719001390 (rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- us-east-2.compute.internal-1719001390.part /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- .us-east-2.compute.internal-1719001390: no such file or directory)"
Jun 21 20:23:09 ip-172-31-1-64 k3s[13884]: I0621 20:23:09.912916   13884 event.go:307] "Event occurred" object="local-on-demand-ip- .us-east-2.compute.internal-1719001390-c93945" fieldPath="" kind="ETCDSnapshotFile" apiVersion="k3s.cattle.io/v1" type="Warning" reason="ETCDSnapshotFailed" message="Failed to save snapshot on-demand-ip- .us-east-2.compute.internal-1719001390 on ip-172-31-1-64.us-east-2.compute.internal: could not rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip-172-31-1-64.us-east-2.compute.internal-1719001390.part to /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip-172-31-1-64.us-east-2.compute.internal-1719001390 (rename /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip-172-31-1-64.us-east-2.compute.internal-1719001390.part /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip-172-31-1-64.us-east-2.compute.internal-1719001390: no such file or directory)"
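
The `no such file or directory` error on the `.part` rename is what you would see if two saves shared one temporary file and the first to finish renamed it away. This thread does not confirm that this is the exact mechanism inside k3s, so the following is only an illustration under that assumption. The function name and file names are hypothetical, and it works in a throwaway directory rather than a k3s data directory.

```shell
# Illustration only: two writers that share one temporary ".part" path.
demo_part_collision() {
  dir=$(mktemp -d)
  part="$dir/snapshot.part"   # hypothetical temporary file name
  final="$dir/snapshot"       # hypothetical final file name
  echo "writer A" > "$part"
  echo "writer B" > "$part"   # second writer reuses the same temp file
  mv "$part" "$final"         # first rename succeeds and removes the .part file
  if mv "$part" "$final" 2>/dev/null; then
    echo "second rename succeeded"
  else
    echo "second rename failed"   # the .part file is already gone
  fi
  rm -rf "$dir"
}
```

Note that in this sketch the surviving file holds whatever the last writer put there, which is one way a snapshot could end up silently wrong.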

Validation Results:

k3s etcd-snapshot save & k3s etcd-snapshot save; sleep 5
 FATA[0000] see server log for details: Internal error occurred: etcd-snapshot error ID 11914 
INFO[0000] Snapshot on-demand-ip- .us-east-2.compute.internal-1719001372 saved. 
[1]+  Exit 1                  k3s etcd-snapshot save


$ journalctl -xeu k3s.service | grep "snapshot"

 Jun 21 20:20:01 ip-  k3s[13083]: time="2024-06-21T20:20:01Z" level=info msg="Starting managed etcd snapshot ConfigMap controller"
Jun 21 20:20:02 ip-1  k3s[13083]: time="2024-06-21T20:20:02Z" level=info msg="Reconciling snapshot ConfigMap data"
Jun 21 20:22:52 ip-  k3s[13083]: time="2024-06-21T20:22:52Z" level=info msg="Saving etcd snapshot to /var/lib/rancher/k3s/server/db/snapshots/on-demand-ip-1 us-east-2.compute.internal-1719001372"
Jun 21 20:22:52 ip- k3s[13083]: {"level":"info","ts":"2024-06-21T20:22:52.098617Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-ip- us-east-2.compute.internal-1719001372.part"}
Jun 21 20:22:52 ip- k3s[13083]: {"level":"info","ts":"2024-06-21T20:22:52.101602Z","logger":"etcd-client.client","caller":"v3@v3.5.13-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
Jun 21 20:22:52 ip-  k3s[13083]: {"level":"info","ts":"2024-06-21T20:22:52.101781Z","logger":"etcd-client","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
Jun 21 20:22:52 ip-  k3s[13083]: {"level":"info","ts":"2024-06-21T20:22:52.111894Z","caller":"v3rpc/maintenance.go:126","msg":"sending database snapshot to client","total-bytes":5713920,"size":"5.7 MB"}
Jun 21 20:22:52 ip-  k3s[13083]: time="2024-06-21T20:22:52Z" level=error msg="etcd-snapshot error ID 11914: snapshot save already in progress"
Jun 21 20:22:52 ip-172-31-7-229 k3s[13083]: time="2024-06-21T20:22:52Z" level=error msg="Sending HTTP 500 response to 127.0.0.1:50686: etcd-snapshot error ID 11914"
Jun 21 20:22:52 ip-172-31-7-229 k3s[13083]: {"level":"info","ts":"2024-06-21T20:22:52.160236Z","caller":"v3rpc/maintenance.go:175","msg":"successfully sent database snapshot to client","total-bytes":5713920,"size":"5.7 MB","took":"now"}
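
To tell the two outcomes apart without reading the journal by eye, a small filter can look for the two messages quoted in this comment. This is a minimal sketch: the function name `snapshot_log_verdict` and its output labels are hypothetical, and only the matched message text comes from the logs above.

```shell
# Reads log text on stdin and prints a one-word verdict.
# Hypothetical helper; the matched strings are taken from the logs in this comment.
snapshot_log_verdict() {
  log=$(cat)
  case "$log" in
    *"could not rename"*)
      echo "rename-failure" ;;      # behavior seen in the reproduction
    *"snapshot save already in progress"*)
      echo "rejected-cleanly" ;;    # behavior seen in the validation run
    *)
      echo "no-conflict-seen" ;;
  esac
}

# Usage on a server node (not executed here):
#   journalctl -xeu k3s.service | snapshot_log_verdict
```

The rename check comes first so that a journal containing both messages is still flagged as a failure.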


fmoral2 closed this as completed Jun 21, 2024