rke2-server isn't starting up after restarting cluster node #6644
Comments
The etcd and kube-apiserver static pods do not appear to be running. Check the logs under /var/log/pods to see why. You might also check kubelet.log.
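A minimal sketch of that log check, assuming the default RKE2 paths (/var/log/pods for static pod logs, kubelet.log under /var/lib/rancher/rke2/agent/logs). The function name and the directory argument are illustrative, not part of RKE2 itself:

```shell
# inspect_static_pod_logs: tail the etcd and kube-apiserver static pod logs.
# The log directory defaults to /var/log/pods (default kubelet location).
inspect_static_pod_logs() {
    logdir=${1:-/var/log/pods}
    if [ ! -d "$logdir" ]; then
        echo "no $logdir on this host"
        return 0
    fi
    # etcd and kube-apiserver run as static pods in the kube-system namespace
    for d in "$logdir"/kube-system_etcd-* "$logdir"/kube-system_kube-apiserver-*; do
        [ -d "$d" ] || continue
        echo "== $d =="
        tail -n 50 "$d"/*/*.log
    done
}

inspect_static_pod_logs
# kubelet.log lives outside /var/log/pods on RKE2 (path assumes the default data dir)
if [ -f /var/lib/rancher/rke2/agent/logs/kubelet.log ]; then
    tail -n 100 /var/lib/rancher/rke2/agent/logs/kubelet.log
fi
```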
Yeah, I didn't detect anything that was very helpful.
Command:
Output:
Command:
Output:
Please just attach the files; a few lines from the tail aren't really enough to figure out what's going on. The messages from etcd suggest that it is running EXTREMELY slowly, though. What are the CPU and IO load on this node? If this is a single-node cluster, there should be no delays in RAFT operations.
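A quick way to answer the CPU/IO question, for what it's worth. `iostat` comes from the sysstat package and may not be installed, so it is guarded here:

```shell
# One-shot CPU / IO load survey for the node.
uptime                       # 1/5/15-minute load averages
if command -v iostat >/dev/null 2>&1; then
    iostat -x 1 3            # per-device utilization, 3 one-second samples
else
    echo "iostat not installed (sysstat package)"
fi
```

Sustained load averages well above the core count, or devices pinned near 100% utilization in `iostat`, would line up with etcd running slowly.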
I was having the same issue today, but I kept looking and found that I had been using the gzipped rke2 binary, which didn't have any suffix attached to it. Run the Linux `file` command on your rke2 binary to verify that you're not doing the same. If it's a gzip file, the solution is to simply extract it and then move the bin/rke2 into place.
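A sketch of that check. It tests for the gzip magic bytes (0x1f 0x8b) directly so it works even where `file` isn't installed; the function name is made up for illustration, and the binary path below assumes a common install location — adjust for yours:

```shell
# check_rke2_binary: report whether the given path is a gzip archive
# (the compressed release artifact) rather than the actual executable.
check_rke2_binary() {
    bin=$1
    if [ ! -e "$bin" ]; then
        echo "$bin: not found"
        return 0
    fi
    # gzip files always start with the two magic bytes 0x1f 0x8b
    magic=$(od -An -tx1 -N2 "$bin" | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo "$bin: gzip archive - extract it and install the inner bin/rke2"
    else
        echo "$bin: not a gzip archive"
    fi
}

check_rke2_binary /usr/local/bin/rke2   # assumed path; adjust for your install
```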
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
Environmental Info:
RKE2 Version:
rke2 version v1.26.15+rke2r1 (a413a7f)
go version go1.21.8 X:boringcrypto
Node(s) CPU architecture, OS, and Version:
Rocky 8.6
Cluster Configuration:
1 server. 5 agents.
Describe the bug:
The API server seems to be failing to come up on port 6443.
Steps To Reproduce:
sudo systemctl restart rke2-server
Expected behavior:
rke2-server starts up normally and connects to agent nodes.
Actual behavior:
Command:
netstat -ltn
Output:
Shows that etcd is listening on port 2379
tcp 129 0 127.0.0.1:2379 0.0.0.0:* LISTEN
Command:
curl -k https://127.0.0.1:2379/health
Output:
Times out
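Worth noting: RKE2's managed etcd serves TLS with client-certificate authentication, so a bare `curl -k` wouldn't get a useful answer even from a healthy etcd (though a timeout, rather than a TLS rejection, still points at etcd itself). A hedged probe using the client certs RKE2 manages — the TLS directory below is the default RKE2 layout and is an assumption worth verifying on your install:

```shell
# etcd_health: probe etcd's /health endpoint with RKE2's managed client certs.
etcd_health() {
    tlsdir=${1:-/var/lib/rancher/rke2/server/tls/etcd}
    if [ ! -d "$tlsdir" ]; then
        echo "no etcd TLS dir at $tlsdir"
        return 0
    fi
    curl -s --connect-timeout 5 \
        --cacert "$tlsdir/server-ca.crt" \
        --cert   "$tlsdir/server-client.crt" \
        --key    "$tlsdir/server-client.key" \
        https://127.0.0.1:2379/health
}

etcd_health
```

A healthy member returns JSON along the lines of `{"health":"true"}`.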
Command:
journalctl -u rke2-server -f
Output:
Aug 27 16:31:32 OurServer rke2[132272]: {"level":"warn","ts":"2024-08-27T16:31:32.345951-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d29500/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Aug 27 16:31:32 OurServer rke2[132272]: time="2024-08-27T16:31:32-04:00" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Aug 27 16:31:32 OurServer rke2[132272]: time="2024-08-27T16:31:32-04:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Aug 27 16:31:35 OurServer rke2[132272]: time="2024-08-27T16:31:35-04:00" level=error msg="Sending HTTP 503 response to 192.168.1.111:53002: runtime core not ready"
Aug 27 16:31:37 OurServer rke2[132272]: time="2024-08-27T16:31:37-04:00" level=error msg="Sending HTTP 503 response to 192.168.1.115:50932: runtime core not ready"
Aug 27 16:31:37 OurServer rke2[132272]: time="2024-08-27T16:31:37-04:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Aug 27 16:31:40 OurServer rke2[132272]: time="2024-08-27T16:31:40-04:00" level=error msg="Sending HTTP 503 response to 192.168.1.111:53158: runtime core not ready"
Aug 27 16:31:41 OurServer rke2[132272]: {"level":"warn","ts":"2024-08-27T16:31:41.599955-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000d29500/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: i/o timeout\""}
Aug 27 16:31:41 OurServer rke2[132272]: {"level":"info","ts":"2024-08-27T16:31:41.600009-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Aug 27 16:31:41 OurServer rke2[132272]: time="2024-08-27T16:31:41-04:00" level=info msg="Waiting for etcd server to become available"
Aug 27 16:31:41 OurServer rke2[132272]: time="2024-08-27T16:31:41-04:00" level=info msg="Waiting for API server to become available"
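Since journalctl only shows the supervisor side, it can also help to look at the container runtime directly. A sketch using the crictl binary RKE2 bundles and RKE2's containerd socket — both paths assume the default data dir, and the function name is illustrative:

```shell
# list_runtime_containers: show pods and containers known to RKE2's containerd.
list_runtime_containers() {
    crictl_bin=${1:-/var/lib/rancher/rke2/bin/crictl}
    sock=${2:-unix:///run/k3s/containerd/containerd.sock}
    if [ ! -x "$crictl_bin" ]; then
        echo "crictl not found at $crictl_bin"
        return 0
    fi
    "$crictl_bin" --runtime-endpoint "$sock" pods
    "$crictl_bin" --runtime-endpoint "$sock" ps -a   # -a includes exited containers
}

list_runtime_containers
```

If the etcd container shows as exited or is absent entirely, the container's own logs (via `crictl logs <id>` or /var/log/pods) are the next place to look.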
Additional context / logs:
We have disabled firewalld on all the server and agent nodes.
/etc/rancher/rke2/rke2.yaml
/etc/rancher/rke2/registries.yaml