aws-node is getting crashed continously after EKS node reboot #2133

Closed
panchm opened this issue Nov 9, 2022 · 3 comments

panchm commented Nov 9, 2022

After rebooting an EKS node, we observed that the aws-node pod does not come up and crashes continuously.

We tried troubleshooting by following this user guide, but the issue persists:
https://aws.amazon.com/premiumsupport/knowledge-center/eks-cni-plugin-troubleshooting/

Please find the attached logs
eks_i-0eb7ccae605e1767c_2022-11-09_0712-UTC_0.7.2.tar.gz

Environment:

  • Kubernetes version:
    Client Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.6-eks-7d68063", GitCommit:"f24e667e49fb137336f7b064dba897beed639bad", GitTreeState:"clean", BuildDate:"2022-02-23T19:32:14Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.13-eks-fb459a0", GitCommit:"55bd5d5cb7d32bc35e4e050f536181196fb8c6f7", GitTreeState:"clean", BuildDate:"2022-10-24T20:35:40Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}

  • CNI Version: v1.10.4-eksbuild.1

  • OS:
    NAME="Ubuntu"
    VERSION="20.04.5 LTS (Focal Fossa)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 20.04.5 LTS"
    VERSION_ID="20.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=focal
    UBUNTU_CODENAME=focal

  • Kernel:
    Linux ip-10-0-100-90 5.15.0-1022-aws #26~20.04.1-Ubuntu SMP Sat Oct 15 03:22:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Userdata:

#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh
sysctl -w vm.nr_hugepages=4000
echo "vm.nr_hugepages=4000" >> /etc/sysctl.conf
echo 4000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

cp /etc/kubernetes/kubelet/kubelet-config.json /etc/kubernetes/kubelet/kubelet-config.json.back

jq '. += { "cpuManagerPolicy":"static"}' /etc/kubernetes/kubelet/kubelet-config.json.back > /etc/kubernetes/kubelet/kubelet-config.json

jq '. += { "reservedCpus": "0-1"}' /etc/kubernetes/kubelet/kubelet-config.json.back > /etc/kubernetes/kubelet/kubelet-config.json

rm /var/lib/kubelet/cpu_manager_state
systemctl restart snap.kubelet-eks.daemon.service

  • EKS version
    1.23
  • Virtual server type (instance type)
    m5.4xlarge
  • Firewall (security group)
    default
  • Storage (volumes)
    1 volume(s) - 20 GiB
  • Region
    US West (Oregon), us-west-2
  • Architecture
    x86_64

Addons:

  • coredns v1.8.7-eksbuild.2
  • kube-proxy v1.23.7-eksbuild.1
  • vpc-cni v1.10.4-eksbuild.1

kubectl get pods -nkube-system
NAME                       READY   STATUS              RESTARTS        AGE
aws-node-snlbt             0/1     CrashLoopBackOff    39 (4m5s ago)   156m
coredns-57ff979f67-87f65   0/1     ContainerCreating   0               153m
coredns-57ff979f67-8tlgg   0/1     ContainerCreating   0               153m
kube-proxy-wxk5n           1/1     Running             1 (21h ago)     21h

aws-node pod logs:
{"level":"info","ts":"2022-11-09T08:51:47.548Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-11-09T08:51:47.549Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-11-09T08:51:47.568Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-11-09T08:51:47.569Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-11-09T08:51:49.576Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:51:51.582Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:51:53.588Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:51:55.595Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:51:57.601Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:51:59.607Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:01.614Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:03.621Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:05.627Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:07.633Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:09.640Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:11.647Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:13.653Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:15.659Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:17.665Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:19.672Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:21.679Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:23.685Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:25.691Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:27.698Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:29.704Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:31.711Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:33.717Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:35.724Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:37.730Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:39.737Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:41.743Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:43.750Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:45.756Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:47.763Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:49.769Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:51.776Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:53.782Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:55.789Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:57.795Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:52:59.801Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:53:01.808Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:53:03.815Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:53:05.821Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:53:07.828Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-11-09T08:53:09.834Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

While debugging the node, we observed that before the reboot the aws-k8s-agent process was listening on port 50051.
After the reboot, nothing is listening on port 50051.
entrypoint.sh probes port 50051 and, because nothing is listening there, fails once the wait times out:

# Check for ipamd connectivity on localhost port 50051
wait_for_ipam() {
    while :
    do
        if ./grpc-health-probe -addr 127.0.0.1:50051 >/dev/null 2>&1; then
            return 0
        fi
        log_in_json info "Retrying waiting for IPAM-D"
    done
}

# (lines 134-169 of entrypoint.sh omitted from this excerpt)

if ! wait_for_ipam; then
    log_in_json error "Timed out waiting for IPAM daemon to start:"
    cat "$AGENT_LOG_PATH" >&2
    exit 1
fi
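
As quoted, the retry loop has no visible attempt limit. For anyone reproducing the wait by hand on a node, a bounded version may be easier to reason about. This is only a minimal sketch: the helper name `wait_for_probe` and its arguments are hypothetical and are not part of the real entrypoint.sh.

```shell
# Hypothetical helper, not taken from entrypoint.sh.
# Runs the given probe command until it succeeds or max_attempts is reached.
# Usage: wait_for_probe MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for_probe() {
    max_attempts="$1"
    delay_seconds="$2"
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Probe output is discarded; only the exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        echo "attempt ${attempt}/${max_attempts} failed, retrying" >&2
        sleep "$delay_seconds"
        attempt=$((attempt + 1))
    done
    return 1
}

# Example (assumes grpc-health-probe is in the current directory, as in the
# excerpt above):
#   wait_for_probe 30 2 ./grpc-health-probe -addr 127.0.0.1:50051
```

With this shape the caller can tell "never became healthy" apart from "still waiting", because the function returns 1 after the last attempt.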

Kindly help us identify which part of the configuration needs to be corrected, if any.

jayanthvn (Contributor) commented

I see the socket is not reachable. Do you have a dockershim symlink pointing to /var/run/containerd/containerd.sock?

{"level":"info","ts":"2022-11-09T07:10:08.333Z","caller":"ipamd/ipamd.go:509","msg":"Reading ipam state from CRI"}
{"level":"debug","ts":"2022-11-09T07:10:08.333Z","caller":"datastore/data_store.go:374","msg":"Getting running pod sandboxes from \"unix:///var/run/dockershim.sock\""}

It looks similar to this issue - awslabs/amazon-eks-ami#921
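
The symlink question above can be checked directly on the node. The following is a rough sketch; the function name `socket_link_ok` is hypothetical, and it assumes GNU `readlink -f` is available, as it is on Ubuntu.

```shell
# Hypothetical helper: succeeds only if LINK_PATH is a symlink that resolves
# to the same file as EXPECTED_TARGET.
# Usage: socket_link_ok LINK_PATH EXPECTED_TARGET
socket_link_ok() {
    link_path="$1"
    expected_target="$2"
    # A regular file or a missing path is not an acceptable answer.
    [ -L "$link_path" ] || return 1
    # Resolve both sides so /var/run and /run compare equal where one is a
    # symlink to the other.
    [ "$(readlink -f "$link_path")" = "$(readlink -f "$expected_target")" ]
}

# Example on the node:
#   socket_link_ok /var/run/dockershim.sock /var/run/containerd/containerd.sock \
#       && echo "symlink present" || echo "symlink missing or wrong target"
```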

panchm commented Nov 9, 2022

Thanks for the quick response and for pointing to the root cause.

We were able to fix this issue by adding the following to the userdata:

sudo mkdir -p /etc/containerd
sudo mkdir -p /etc/cni/net.d
mkdir -p /etc/systemd/system/containerd.service.d
cat <<EOF > /etc/systemd/system/containerd.service.d/10-compat-symlink.conf
[Service]
ExecStartPre=/bin/ln -sf /run/containerd/containerd.sock /run/dockershim.sock
EOF
systemctl daemon-reload
systemctl enable containerd
systemctl restart containerd
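
To confirm after a reboot that the agent is listening on port 50051 again, the output of `ss -ltn` can be filtered. This is only a sketch: `has_listener` is a hypothetical helper name, and it assumes the usual `ss` column layout where the first field is the state and the fourth is the local address and port.

```shell
# Hypothetical helper: reads `ss -ltn` output on stdin and succeeds if any
# LISTEN socket's local address ends in the given port.
# Usage: ss -ltn | has_listener PORT
has_listener() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# Example on the node:
#   ss -ltn | has_listener 50051 && echo "ipamd port is open"
```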

panchm closed this as completed Nov 9, 2022