
k3s etcd-snapshot save fails on host with IPv6 only #9214

Closed
PeterBarczi opened this issue Jan 11, 2024 · 27 comments

@PeterBarczi

PeterBarczi commented Jan 11, 2024

Environmental Info:
K3s Version:
v1.26.9+k3s1

Node(s) CPU architecture, OS, and Version:
Linux test-n1 6.1.13-gardenlinux-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.13-0gardenlinux~0 (2023-03-17) x86_64 GNU/Linux

Cluster Configuration:

root@test-n1:~# kubectl get no
NAME          STATUS   ROLES                       AGE    VERSION
test-n1   Ready    control-plane,etcd,master   209d   v1.26.9+k3s1

Describe the bug:
k3s etcd-snapshot save fails on a host with IPv6-only networking
(automated snapshots are created successfully)

Steps To Reproduce:
Run k3s etcd-snapshot save on a host with IPv6-only networking.

Expected behavior:
on-demand snapshot successfully created

Actual behavior:
unable to create on-demand snapshot via etcd-snapshot save

Additional context / logs:

INFO[0000] Managed etcd cluster bootstrap already complete and initialized
INFO[0000] Applying CRD helmcharts.helm.cattle.io
INFO[0000] Applying CRD helmchartconfigs.helm.cattle.io
INFO[0000] Applying CRD addons.k3s.cattle.io
INFO[0001] Creating k3s-supervisor event broadcaster
{"level":"warn","ts":"2024-01-11T09:09:32.377709Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0014c9a40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"level":"info","ts":"2024-01-11T09:09:32.377947Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused
@PeterBarczi PeterBarczi changed the title k3s etcd-snapshot fails on host with IPv6 k3s etcd-snapshot save fails on host with IPv6 Jan 11, 2024
@PeterBarczi PeterBarczi changed the title k3s etcd-snapshot save fails on host with IPv6 k3s etcd-snapshot save fails on host with IPv6 only Jan 11, 2024
@brandond
Contributor

brandond commented Jan 11, 2024

We've made some changes to the snapshot functionality in recent releases; can you try with the latest v1.26 patch release?

@PeterBarczi
Author

> We've made some changes to the snapshot functionality in recent releases; can you try with the latest v1.26 patch release?

Thanks for the hint, will give it a try...

@PeterBarczi
Author

Hi @brandond, I checked with v1.26.12+k3s1, but the behaviour is still the same:

root@lab1-test-n1:~# kubectl get no
NAME           STATUS   ROLES                       AGE   VERSION
lab1-test-n1   Ready    control-plane,etcd,master   34m   v1.26.12+k3s1

root@lab1-test-n1:~# /opt/local/bin/k3s etcd-snapshot ls
Name Location Size Created

root@lab1-test-n1:~# /opt/local/bin/k3s etcd-snapshot save
{"level":"warn","ts":"2024-01-12T16:10:11.766974Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000b3c700/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"level":"info","ts":"2024-01-12T16:10:11.767141Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}

Any ideas? Thanks.

@brandond
Contributor

brandond commented Jan 12, 2024

Can you show how your node is configured for IPv6-only? What is the output of kubectl get node -o yaml | grep node-args?

Generally speaking, k3s should detect IPv4 or IPv6 and set the loopback address appropriately, but for some reason that's not happening for you.
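
Illustratively, the detection in question comes down to checking the family of the configured node IP. A minimal Go sketch of that idea (not the actual k3s code; isIPv6 is a made-up helper name):

```go
package main

import (
	"fmt"
	"net"
)

// isIPv6 reports whether the given node IP is an IPv6 address.
// Hypothetical helper, only illustrating the family-detection idea.
func isIPv6(nodeIP string) bool {
	ip := net.ParseIP(nodeIP)
	return ip != nil && ip.To4() == nil
}

func main() {
	// Expectation: an IPv6 node IP should lead to the IPv6 loopback ([::1])
	// being used for the local etcd endpoint, and 127.0.0.1 otherwise.
	fmt.Println(isIPv6("2a00:da8:ffef:1428::")) // true
	fmt.Println(isIPv6("10.0.0.5"))             // false
}
```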

@PeterBarczi
Author

Here it is:

root@lab1-test-n1:~# kubectl get node -o yaml | grep node-args
      k3s.io/node-args: '["server","--cluster-init","--tls-san","etcd.kube-system.svc","--tls-san","api.dev-test-vm1.lab1.osc","--tls-san","2a00:da8:ffef:1428::","--kube-apiserver-arg","--enable-bootstrap-token-auth=true","--kube-controller-manager-arg","allocate-node-cidrs=false","--node-ip","2a00:da8:ffef:1428::","--service-cidr","2a00:da8:ffef::1040:0/112","--disable-cloud-controller","--disable","traefik","--disable","servicelb","--kube-apiserver-arg","--enable-bootstrap-token-auth=true","--flannel-backend","none","--disable-network-policy","--disable-kube-proxy","--data-dir","/media/data/k3s"]'

@brandond
Contributor

brandond commented Jan 12, 2024

"--node-ip","2a00:da8:ffef:1428::"

Did you redact that, or are you really setting the node IP to that address? It is somewhat unusual to use a .0 or :: address for a host.

@PeterBarczi
Author

No, it is not redacted. It's the way we configure hosts.

@brandond
Contributor

Hmm. That is a little odd. Can you show the output of ip addr for the interface associated with that address?

@PeterBarczi
Author

PeterBarczi commented Jan 13, 2024

Hi @brandond, to be on the safe side I spun up a new AWS EC2 instance with IPv6 only (and an address formatted the way you'd expect), but the symptoms are the same. See here:

Server info:

root@test:~# uname -a
Linux test 6.2.0-1017-aws #17~22.04.1-Ubuntu SMP Fri Nov 17 21:07:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux


root@test:~# ip a s lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

root@test:~# ip a s ens5
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0e:93:73:d8:a8:ab brd ff:ff:ff:ff:ff:ff
    altname enp0s5
    inet6 2600:1f18:4228:4600:3dc4:3e1:a89d:d892/128 scope global dynamic noprefixroute
       valid_lft 383sec preferred_lft 73sec
    inet6 fe80::c93:73ff:fed8:a8ab/64 scope link
       valid_lft forever preferred_lft forever

K3s cluster info: (latest k3s version used)

root@test:~# k3s kubectl get no
NAME   STATUS   ROLES                       AGE     VERSION
test   Ready    control-plane,etcd,master   7m18s   v1.28.5+k3s1

root@test:~# k3s kubectl get no -o wide
NAME   STATUS   ROLES                       AGE     VERSION        INTERNAL-IP                              EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
test   Ready    control-plane,etcd,master   7m24s   v1.28.5+k3s1   2600:1f18:4228:4600:3dc4:3e1:a89d:d892   <none>        Ubuntu 22.04.3 LTS   6.2.0-1017-aws   containerd://1.7.11-k3s2

root@test:~# k3s kubectl get node -o yaml | grep node-args
      k3s.io/node-args: '["server","-","--cluster-init","--disable","traefik","--etcd-snapshot-schedule-cron","*/30

On-demand snapshot triggered, but it fails with the same error:

root@test:~# k3s etcd-snapshot save
{"level":"warn","ts":"2024-01-13T11:28:51.145689Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0009156c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"level":"info","ts":"2024-01-13T11:28:51.14581Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}

Automated snapshots created properly:

root@test:~# ls -ltrh /var/lib/rancher/k3s/server/db/snapshots/
total 2.0M
-rw------- 1 root root 2.0M Jan 13 11:30 etcd-snapshot-test-1705145403

So it seems this issue is not specific to our environment, but affects IPv6-only nodes in general.

@lukas016

@brandond I checked the code and the main problem is the getEndpoints function.

The problem in @PeterBarczi's setup is that runtime.EtcdConfig.Endpoints is empty and control.ServiceIPRange isn't set either, so he will always get the IPv4 loopback for the etcd DB as the default value from the control.Loopback method.

Of course, the underlying root cause will be something else.
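
A rough Go sketch of that fallback chain (the names below are hypothetical stand-ins for runtime.EtcdConfig.Endpoints, control.ServiceIPRange, and control.Loopback, not the actual k3s source):

```go
package main

import (
	"fmt"
	"net"
)

// endpointsSketch mimics the behaviour described above: if the runtime
// endpoint list is empty and no service IP range is configured, the code
// falls back to a loopback default that ends up being IPv4.
func endpointsSketch(runtimeEndpoints []string, serviceIPRange *net.IPNet) []string {
	if len(runtimeEndpoints) > 0 {
		return runtimeEndpoints
	}
	if serviceIPRange != nil && serviceIPRange.IP.To4() == nil {
		return []string{"https://[::1]:2379"}
	}
	// Neither source is populated when the standalone snapshot CLI runs,
	// so the IPv4 loopback is always chosen -- even on IPv6-only hosts.
	return []string{"https://127.0.0.1:2379"}
}

func main() {
	fmt.Println(endpointsSketch(nil, nil)) // [https://127.0.0.1:2379]
}
```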

@brandond
Contributor

Hmm, that is interesting. If you don't set --service-cidr and --cluster-cidr to IPv6, wouldn't this leave you with an IPv4 cluster running on IPv6-only nodes? Does this work properly?

We can take a look at putting both the IPv4 and IPv6 loopback addresses in the endpoint list; I think gRPC should do the right thing and connect to whichever works.
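
As a sketch of that idea, using the upstream go.etcd.io/etcd/client/v3 API directly (illustrative only; the real change would live inside k3s and use the cluster's TLS material, which is omitted here):

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Offer both loopback families; the client's balancer connects to
	// whichever endpoint is actually listening.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379", "https://[::1]:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "health"); err != nil {
		fmt.Println("etcd request failed:", err)
	}
}
```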

@PeterBarczi
Author

PeterBarczi commented Jan 16, 2024

@brandond
Here you can see a setup without specifying --service-cidr and --cluster-cidr on an IPv6-only node:

root@test:~# kubectl get no -o wide
NAME   STATUS   ROLES                       AGE    VERSION        INTERNAL-IP                              EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
test   Ready    control-plane,etcd,master   2d9h   v1.29.0+k3s1   2a05:d01c:2c7:d803:b5a8:d794:4e1c:789a   <none>        Ubuntu 20.04.6 LTS   5.15.0-1051-aws   containerd://1.7.11-k3s2

root@test:~# k3s kubectl get node -o yaml | grep node-args
      k3s.io/node-args: '["server","-","--cluster-init","--disable","traefik","--etcd-snapshot-schedule-cron","*/30

with the same behavior:

root@test:~# k3s etcd-snapshot save
{"level":"warn","ts":"2024-01-13T11:28:51.145689Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0009156c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"level":"info","ts":"2024-01-13T11:28:51.14581Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
dial tcp 127.0.0.1:2379: connect: connection refused

@brandond
Contributor

What do you get for kubectl get pod -A -o wide?

@PeterBarczi
Author

Here is the output:

root@test:~# kubectl get po -A -o wide
NAMESPACE     NAME                                      READY   STATUS    RESTARTS        AGE    IP            NODE   NOMINATED NODE   READINESS GATES
kube-system   coredns-7b5bbc6644-ssfsn                  1/1     Running   4 (2m14s ago)   2d9h   fd00:42::41   test   <none>           <none>
kube-system   local-path-provisioner-687d6d7765-6jfrh   1/1     Running   4 (2m14s ago)   2d9h   fd00:42::3f   test   <none>           <none>
kube-system   metrics-server-667586758d-q5plb           1/1     Running   4 (2m13s ago)   2d9h   fd00:42::42   test   <none>           <none>

@brandond brandond self-assigned this Jan 16, 2024
@brandond
Contributor

Ah right, I forgot that we made improvements for IPv6-only; it is only dual-stack that requires manual configuration.
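
For anyone hitting this with dual-stack rather than IPv6-only nodes, the manual configuration being referred to is essentially the comma-separated dual-stack CIDRs, e.g. in /etc/rancher/k3s/config.yaml (the addresses below are example values only):

```yaml
# Example dual-stack server config (illustrative addresses and CIDRs).
node-ip: "10.0.10.7,2a00:da8:ffef:1428::7"
cluster-cidr: "10.42.0.0/16,2001:cafe:42::/56"
service-cidr: "10.43.0.0/16,2001:cafe:43::/112"
```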

@lukas016

This comment was marked as off-topic.

@brandond
Contributor

brandond commented Feb 9, 2024

@lukas016 It looks like you have something else going on there... the data dir extraction suggests that you hadn't actually run this release of k3s before? You should upgrade the running k3s server instance to this release before trying to use the CLI to take a snapshot.

The etcdsnapshotfiles.meta.k8s.io resource that it's complaining about is also odd; that is not the group and kind used by k3s.

Please open a new issue if you can reproduce problems when actually running a release of k3s that includes this change.

@fmoral2
Contributor

fmoral2 commented Feb 16, 2024

Validated on Version:

$ k3s version v1.29.2+k3s-085ccbb0 (085ccbb0)



Environment Details

Infrastructure:
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
SUSE Linux Enterprise Server 15 SP4

Cluster Configuration:
1 server node

Steps to validate the fix

  1. Manually downloaded the k3s binary to the node, since S3 could not be used to download it with curl.
  2. Installed k3s on an IPv6-only node with the arguments in the config file, not on the CLI (see the config sketch after this list):
     k3s.io/node-args: '["server","--cluster-cidr","2001:cafe:42::/56","--service-cidr","2001:cafe:43::/108","--cluster-init","true","--node-ip","2600:1f1c:ab4:ee32:c44c:a8b3:4319:dad7","--write-kubeconfig-mode","644"]'
  3. Validated that the etcd snapshot works.
  4. Validated nodes and pods.
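
For reference, expressing those same arguments in the config file rather than on the CLI looks roughly like this (a sketch of /etc/rancher/k3s/config.yaml using the values above):

```yaml
# /etc/rancher/k3s/config.yaml -- the same flags as the node-args above,
# written as config keys instead of CLI arguments.
cluster-init: true
node-ip: "2600:1f1c:ab4:ee32:c44c:a8b3:4319:dad7"
cluster-cidr: "2001:cafe:42::/56"
service-cidr: "2001:cafe:43::/108"
write-kubeconfig-mode: "644"
```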

Reproduction Issue:

 
  k3s -v
k3s version v1.29.1+k3s-8224a3a7 (8224a3a7)
go version go1.21.6

 kubectl get node -o yaml | grep node-args
      k3s.io/node-args: '["server","--cluster-cidr","2001:cafe:42::/56","--service-cidr","2001:cafe:43::/108","--cluster-init","true","--node-ip","2600:1f1c:ab4:ee32:c44c:a8b3:4319:dad7","--write-kubeconfig-mode","644"]'

 sudo k3s etcd-snapshot save 
WARN[0000] Unknown flag --cluster-cidr found in config.yaml, skipping 
WARN[0000] Unknown flag --service-cidr found in config.yaml, skipping 
WARN[0000] Unknown flag --cluster-init found in config.yaml, skipping 
WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping 
^C{"level":"warn","ts":"2024-02-15T19:39:34.996891Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00136e000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Canceled desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"level":"warn","ts":"2024-02-15T19:39:34.996862Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00136e000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Canceled desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}

 
 

Validation Results:

       
k3s -v
k3s version v1.29.2+k3s-085ccbb0 (085ccbb0)
go version go1.21.7



  kubectl get nodes
NAME                  STATUS   ROLES                       AGE   VERSION
i-041ae49edb4c36e85   Ready    control-plane,etcd,master   27s   v1.29.2+k3s-085ccbb0


$ kubectl get pods -A
NAMESPACE     NAME                                      READY   STATUS      RESTARTS   AGE
kube-system   coredns-6799fbcd5-vj5mt                   1/1     Running     0          4m19s
kube-system   helm-install-traefik-6kt7z                0/1     Completed   1          4m19s
kube-system   helm-install-traefik-crd-tbtkf            0/1     Completed   0          4m19s
kube-system   local-path-provisioner-6c86858495-nt7jj   1/1     Running     0          4m19s
kube-system   metrics-server-67c658944b-wkf2p           1/1     Running     0          4m19s
kube-system   svclb-traefik-dd9befe3-r95jc              2/2     Running     0          4m2s
kube-system   traefik-f4564c4f4-g9g29                   1/1     Running     0          4m2s



 kubectl get node -o yaml | grep node-args
      k3s.io/node-args: '["server","--cluster-cidr","2001:cafe:42::/56","--service-cidr","2001:cafe:43::/108","--cluster-init","true","--node-ip","2600:1f1c:ab4:ee32:c44c:a8b3:4319:dad7","--write-kubeconfig-mode","644"]'

sudo k3s etcd-snapshot save
WARN[0000] Unknown flag --cluster-cidr found in config.yaml, skipping 
WARN[0000] Unknown flag --service-cidr found in config.yaml, skipping 
WARN[0000] Unknown flag --cluster-init found in config.yaml, skipping 
WARN[0000] Unknown flag --node-ip found in config.yaml, skipping 
WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping 
INFO[0000] Saving etcd snapshot to /var/lib/rancher/k3s/server/db/snapshots/on-demand-i-041ae49edb4c36e85-1708098230 
{"level":"info","ts":"2024-02-16T15:43:50.351029Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-i-041ae49edb4c36e85-1708098230.part"}
{"level":"info","ts":"2024-02-16T15:43:50.353419Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2024-02-16T15:43:50.353665Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://[::1]:2379"}
{"level":"info","ts":"2024-02-16T15:43:50.397491Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2024-02-16T15:43:50.405272Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://[::1]:2379","size":"2.9 MB","took":"now"}
{"level":"info","ts":"2024-02-16T15:43:50.405496Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-i-041ae49edb4c36e85-1708098230"}
INFO[0000] Reconciling ETCDSnapshotFile resources       
INFO[0000] Reconciliation of ETCDSnapshotFile resources complete 


 

@fmoral2
Contributor

fmoral2 commented Feb 16, 2024

Working as expected when the args are in the config file, but not when passed on the CLI. After talking with @brandond, we are setting the CLI case aside for now so the rest of the fix can be released.

@brandond
Contributor

Confirming; moving this out to the next release to extend the fix to CLI args, not just the config file.

@brandond
Contributor

brandond commented Mar 7, 2024

Moving this out to the next release; I am going to do some fairly invasive refactoring to move on-demand snapshots into a request/response model, where the running server process actually takes the snapshot instead of the on-demand snapshot CLI trying to set up the complete server context.
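
In other words, instead of the CLI building its own etcd client, k3s etcd-snapshot save would ask the already-running server process to take the snapshot and return the result. A very rough Go sketch of that request/response shape (the endpoint path, port, payload, and auth handling are hypothetical, purely to illustrate the model):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical: POST a snapshot request to the running k3s supervisor,
	// which already holds a working etcd client, and print its response.
	client := &http.Client{Transport: &http.Transport{
		// Sketch only; a real client would use the node's certificates.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Post(
		"https://[::1]:6443/db/snapshot", // hypothetical endpoint
		"application/json",
		strings.NewReader(`{"name":"on-demand"}`), // hypothetical payload
	)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```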

@brandond brandond modified the milestones: v1.29.3+k3s1, v1.30.0+k3s1 Mar 7, 2024
@PeterBarczi

This comment was marked as off-topic.

@brandond

This comment was marked as off-topic.

@PeterBarczi

This comment was marked as off-topic.

@brandond

This comment was marked as off-topic.

@PeterBarczi

This comment was marked as off-topic.

@fmoral2
Contributor

fmoral2 commented Apr 15, 2024

IPv6-only validation:
#9816 (comment)

@fmoral2 fmoral2 closed this as completed Apr 16, 2024