
k3s etcd-snapshot commands run against server specified in config file, instead of local server #10513

Closed
brandond opened this issue Jul 14, 2024 · 3 comments
Labels: area/etcd, kind/bug

@brandond
Contributor

brandond commented Jul 14, 2024

Environmental Info:
K3s Version: v1.30.2+k3s2

Node(s) CPU architecture, OS, and Version:
n/a

Cluster Configuration:
Any cluster using embedded etcd with more than one server

Describe the bug:
This is a regression introduced by

When running k3s etcd-snapshot commands, the server flag defaults to the local server address, so etcd snapshots are created/listed/deleted on the local node. However, if the local server was joined to a cluster by specifying a server in the config file, the etcd-snapshot commands are executed against THAT server instead of the local server.

This was reported in rancher/rke2#6284, but it took me a moment to realize what the user meant - I thought they were expecting the snapshot commands to be able to delete snapshots taken by other nodes (which is essentially what this is actually doing).

This is also likely the root cause of the multiple concurrent snapshot requests from #10371 - Rancher's snapshot save commands were all being sent to the init node instead of running locally on the individual servers.

Steps To Reproduce:

  1. Start a server with embedded etcd
  2. Start a second server, with the server: address of the first node specified in the config file (see the example config after these steps).
  3. Take a snapshot on the second server
  4. Note that the snapshot is actually taken on the first server
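
For reference, a minimal config.yaml on the second server that reproduces this (it matches the config shown in the logs below):

  # /etc/rancher/k3s/config.yaml on the second server
  server: https://172.17.0.8:6443
  token: token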

Expected behavior:
etcd-snapshot commands work against the local server by default, even when a server address is present in the config file

Actual behavior:
As described above

Additional context / logs:

root@systemd-node-2:/# kubectl get node -o wide
NAME             STATUS   ROLES                       AGE     VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
systemd-node-1   Ready    control-plane,etcd,master   5m28s   v1.30.2+k3s2   172.17.0.8    <none>        openSUSE Leap 15.4   6.6.0-1001-aws   containerd://1.7.17-k3s1
systemd-node-2   Ready    control-plane,etcd,master   5s      v1.30.2+k3s2   172.17.0.9    <none>        openSUSE Leap 15.4   6.6.0-1001-aws   containerd://1.7.17-k3s1

root@systemd-node-2:/# k3s etcd-snapshot save
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
INFO[0000] Snapshot on-demand-systemd-node-1-1720916780 saved.

root@systemd-node-2:/# k3s etcd-snapshot list
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
Name                                Location                                                                            Size    Created
on-demand-systemd-node-1-1720916780 file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-systemd-node-1-1720916780 3588128 2024-07-14T00:26:20Z

root@systemd-node-2:/# k3s etcd-snapshot save --help | grep server
   --token value, -t value                                      (cluster) Shared secret used to join a server or agent to a cluster [$K3S_TOKEN]
   --server value, -s value                                     (cluster) Server to connect to (default: "https://127.0.0.1:6443") [$K3S_URL]

root@systemd-node-2:/# cat /etc/rancher/k3s/config.yaml
server: https://172.17.0.8:6443
token: token

root@systemd-node-2:/# k3s etcd-snapshot save --server https://localhost:6443
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
INFO[0000] Snapshot on-demand-systemd-node-2-1720916809 saved.

root@systemd-node-2:/# k3s etcd-snapshot list --server https://localhost:6443
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
Name                                Location                                                                            Size    Created
on-demand-systemd-node-2-1720916809 file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-systemd-node-2-1720916809 3584032 2024-07-14T00:26:49Z
@rancher-max
Contributor

Some additional testing considerations, beyond what has already been listed:

  1. Check the file location itself on both nodes (see the example command after this list)
  2. Specify the --server arg in the command as the other node. For example, using the examples above, from node-2, run: k3s etcd-snapshot save --server https://172.17.0.8:6443
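
For example, the default on-disk snapshot location (matching the file:// URIs in the output above) can be checked on each node with something like:

  ls -l /var/lib/rancher/k3s/server/db/snapshots/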

@brandond
Contributor Author

brandond commented Jul 15, 2024

I am fixing this by changing the server/token flags for these commands to etcd-server/etcd-token. The use case for these flags was primarily folks who, for some reason, changed the bind address or supervisor port and needed to override the server address to match. We weren't really expecting folks to run the command against other nodes.
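
With that change, explicitly overriding the target looks roughly like the sketch below (flag names as described above; the token value is a placeholder):

  k3s etcd-snapshot save --etcd-server https://127.0.0.1:6443 --etcd-token <token>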

@fmoral2
Contributor

fmoral2 commented Jul 18, 2024

Validated on Version:

$ k3s version v1.30.2+k3s-37830fe1 (37830fe1)

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
Ubuntu, AMD

Cluster Configuration:
- 3 server nodes
- 1 agent node

Steps to validate the fix

  1. Install k3s with embedded etcd
  2. Take an etcd snapshot on the second server
  3. Validate that it is taken in the correct place (the second server)

Reproducing the issue:

 
k3s version v1.30.2+k3s-58ab2592 (58ab2592)

Server 1 - ip 172-test1
Server 2 - ip 172-test2
Server 3 - ip 172-test3


On Server 2:
k3s etcd-snapshot save
INFO[0000] Snapshot on-demand-ip-172-test1.us-east-2.compute.internal-1721316392 saved. 

- Saved on the first server, demonstrating the bug



k3s etcd-snapshot list
Name                                                             Location                                                                                                         Size    Created
on-demand-ip-172-test1.us-east-2.compute.internal-1721316392 file://{redacted}ip-172-test1.us-east-2.compute.internal-    2024-07-18T15:26:32Z

- List shows the snapshot is saved on the first server

Validation Results:

 
k3s version v1.30.2+k3s-37830fe1 (37830fe1)

Server 1 - ip 172-test1
Server 2 - ip 172-test2
Server 3 - ip 172-test3


On Server 2:
$ sudo k3s etcd-snapshot save 

INFO[0001] Snapshot on-demand-ip-172-test2.us-east-2.compute.internal-1721321853 saved. 

k3s etcd-snapshot list

Name                                                             Location                                                                                                         Size    Created
on-demand-ip-172-test2.us-east-2.compute.internal-1721321853 file://{redacted}ip-172-test2.us-east-2.compute.internal-    2024-07-18T15:26:32Z


$ sudo k3s etcd-snapshot save --etcd-server https://localhost:6443
INFO[0000] Snapshot on-demand-ip-172-test2.us-east-2.compute.internal-1721322104 saved. 


Explicitly pointing the snapshot at the first server also works:
~$ sudo k3s etcd-snapshot save --etcd-server https://172-test1:6443
INFO[0000] Snapshot on-demand-ip-172-test1.us-east-2.compute.internal-1721322170 saved. 




On Server 1:
$ sudo k3s etcd-snapshot save 
INFO[0003] Snapshot on-demand-ip-172-test1.us-east-2.compute.internal-1721321767 saved. 


k3s etcd-snapshot list

Name                                                             Location                                                                                                         Size    Created
on-demand-ip-172-test1.us-east-2.compute.internal-1721321853 file://{redacted}ip-172-test2.us-east-2.compute.internal-    2024-07-18T15:26:32Z

@fmoral2 fmoral2 closed this as completed Jul 18, 2024