Master nodes on upgraded cluster do not come ready after restart #3628
Are the master VMs running Ubuntu 16.04-LTS?
Do you see the
Yes, it's Ubuntu 16.04-LTS. "distro" in the apimodel is "ubuntu", as the cluster was originally installed with ACS-Engine 0.21.2.
Access with kubectl is not possible, neither via the load balancer nor directly on each of the three master nodes (connection refused). Here is the docker ps output of one of the masters:

```
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
```
I found the reason: etcd was stopped on all masters. After manually restarting etcd on all three master nodes, the cluster formed and became ready again. But why did etcd not start automatically after the restart of the nodes? Does that have anything to do with the previous upgrade?
I did a shutdown and restart of all master nodes and agent scale sets again. Same problem, etcd does not come up automatically:

```
root@k8s-master-11480702-0:~# systemctl status etcd
Jul 23 08:00:35 k8s-master-11480702-0 systemd[1]: Stopped etcd - highly-available key value store.
```

After restarting etcd manually on all three masters, the cluster is up and healthy again. (See also journalctl -u etcd around the reboot.)
So why does etcd not come up automatically after a system restart?
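The manual workaround described here can be sketched as a small script. It only prints the commands it would run rather than executing them; SSH access to the masters is an assumption, and the hostnames simply follow the k8s-master-&lt;clusterid&gt;-&lt;n&gt; pattern seen in this thread:

```shell
#!/bin/sh
# Sketch: restart etcd on each of the three masters and report its state.
# Prints the commands instead of running them; pipe the output to "sh" to
# actually execute (assumes working SSH access and sudo on the masters).
restart_etcd_cmds() {
  for i in 0 1 2; do
    printf 'ssh k8s-master-11480702-%s "sudo systemctl start etcd && systemctl is-active etcd"\n' "$i"
  done
}
restart_etcd_cmds
```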
Yes, that is strange. Do you get this same result from this command?
Yes, it's enabled, but it does not start.
Here again the log after the reboot:
Only that one line, that etcd is stopped. Here the status output:
I successfully upgraded the cluster to 1.17.7 now. Had some problems with AAD auth during and after the upgrade (see #3637), but the cluster is running.
@jackfrancis Any idea how to solve this problem with etcd not starting after a master node reboot? We're still not confident to upgrade our PROD cluster while this problem is active, in addition to the problem with AAD not working after the upgrade to 1.17.7 (#3637).
Just installed a brand-new cluster, k8s version 1.18.8 with AKS-Engine 0.55.1, using our API model (private cluster, RBAC enabled, 3 master nodes as an availability set, 3 agent node pools as VMSS). Additionally, in this cluster AAD login also does not work (#3637).
Just tried to reboot a single master instance (master-1) on this new cluster. Again, etcd does not come up on its own on the rebooted node and shows as dead in systemctl status:
A manual start of etcd does work.
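The checks discussed in this thread (is the unit enabled, is it active, what did it log during boot) can be gathered in one place. This sketch prints the diagnostic commands rather than executing them, since they only make sense on the affected master itself:

```shell
#!/bin/sh
# Sketch: the etcd diagnostics used in this thread, to be run on a master node.
# Prints the commands instead of running them (they require systemd/journald).
etcd_diag_cmds() {
  printf '%s\n' \
    'systemctl is-enabled etcd' \
    'systemctl is-active etcd' \
    'systemctl status etcd --no-pager' \
    'journalctl -u etcd -b --no-pager'
}
etcd_diag_cmds
```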
@chreichert The AAD not working starting w/ 1.17 is a known issue: #3637. I'll try to repro the etcd not coming up after reboot (we don't see this in our tests), though I'll exclude the AAD configuration. What else besides private cluster should I configure the cluster for?
@jackfrancis I could repro it even with a brand-new cluster, k8s version 1.18.8 with AKS-Engine 0.55.1, using our API model (private cluster, RBAC enabled, AAD enabled, network: own subnet in a separate RG with the kubenet plugin and Calico policy, 3 master nodes as an availability set, 3 agent node pools as VMSS). The kubernetes.json used for aks-engine generate is as follows (redacted):
I wasn't able to repro on a private 1.17.11 cluster:
I'll build a cluster that more precisely looks like yours, except with no
I built a private 1.18.8 cluster in a cluster VNET w/ calico + kubenet; it looks like this:
I'll see what happens if I reboot the master VMs.
Confirmed that the above cluster does not repro. If you'd like to prove it definitively, you can build out a cluster without the RBAC + AAD config and see whether that repros or not; I strongly suspect this symptom is related to that issue.
@jackfrancis I will need some time to test that scenario, as my Azure test env is currently blocked with other tests. |
@jackfrancis I doubt it has anything to do with the AAD config, as it started for me when updating from 1.16.4 to 1.16.11 (with 0.53.0) or 1.16.14 (with 0.55.1), respectively. Both versions still support AAD logins.
@jackfrancis Just did a test with our apimodel with the AAD and RBAC settings removed. The problem could still be reproduced. When I shut down all agent VMSS and master VMs (using the Azure portal) and then restart the master VMs and agent VMSS (also in the portal), the masters come up with the etcd service dead. Here is the exact model I used for aks-engine generate (redacted):
The cluster is deployed in the North Europe region. We also use a Basic load balancer, as our clusters were initially created with ACS-Engine, where this was the default.
@jackfrancis Could this and #3807 be related?
I don't know for sure, but with some more details it could be. We found that it matters how the nodes are restarted and which parts of cloud-init run as a result of the restart mechanism. A simple "soft" reboot (initiated by the OS itself) tends not to have this problem. A harder reboot via Azure, however, causes more cloud-init operations, specifically to the point where it triggers the extra fsck of the etcd disk and a manual mount of it. That breaks the Requires chain, and etcd fails to start because systemd does not consider the mount to have been executed (since it was not executed by systemd). During an upgrade you are most likely doing more work like that, so without a soft reboot, the problem stated here could be it. Your symptoms look right from what you wrote. (The key items were the logs of the /var/lib/etcddisk mount and the fsck of that disk in journalctl.)
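The broken Requires chain described here can be illustrated with a simplified sketch of the systemd units involved. The mount unit name follows from the /var/lib/etcddisk path mentioned in this thread; the device path and the exact layout of the real units generated by aks-engine are assumptions:

```ini
# var-lib-etcddisk.mount -- sketch; normally generated by systemd from /etc/fstab
[Mount]
# Hypothetical device path for the etcd data disk.
What=/dev/disk/azure/scsi1/lun0-part1
Where=/var/lib/etcddisk
Type=ext4

# etcd.service -- sketch of the relevant dependency lines
[Unit]
Requires=var-lib-etcddisk.mount
After=var-lib-etcddisk.mount
```

If cloud-init mounts /var/lib/etcddisk with a plain mount(8) call instead of going through systemd, var-lib-etcddisk.mount never transitions to active from systemd's point of view, so the Requires= dependency fails and etcd is left stopped, which matches the symptom seen in this thread.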
@jackfrancis I just tested upgrading from 1.16.4 to 1.17.11 with the newest AKS-Engine v0.56.0:
@chreichert Thanks for the verification!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Describe the bug
After upgrading an older cluster (initially created with ACS-Engine 0.21.2) from 1.16.4 to 1.16.11, everything worked fine at first. We usually shut down all master VMs and agent VMSS overnight and on weekends, when we do not need the cluster for testing. After the first restart following the upgrade, the cluster was no longer reachable with kubectl. Looking at docker ps on the master nodes, it appears that Calico networking was not started; only api-server, controller-manager, scheduler and addon-manager show up in the list of running containers.
Steps To Reproduce
The latest upgrade of the cluster was done with AKS-Engine version 0.45.0.
Resulting API-Model:
api-model
The cluster does not come up and cannot be reached via kubectl. Agent nodes do not join the cluster.
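The shutdown-and-restart cycle that triggers the bug was done through the Azure portal in this report; it can also be sketched with the Azure CLI. This sketch prints the az commands rather than running them, and the resource group and VM/VMSS names are hypothetical placeholders:

```shell
#!/bin/sh
# Sketch: reproduce the shutdown/restart cycle with the Azure CLI.
# Prints the commands instead of running them; pipe the output to "sh" to
# actually execute. Resource group and node names are hypothetical.
repro_cmds() {
  rg=my-cluster-rg
  for i in 0 1 2; do
    printf 'az vm deallocate -g %s -n k8s-master-11480702-%s\n' "$rg" "$i"
  done
  printf 'az vmss deallocate -g %s -n k8s-agentpool1-11480702-vmss\n' "$rg"
  # ...later, bring everything back up:
  for i in 0 1 2; do
    printf 'az vm start -g %s -n k8s-master-11480702-%s\n' "$rg" "$i"
  done
  printf 'az vmss start -g %s -n k8s-agentpool1-11480702-vmss\n' "$rg"
}
repro_cmds
```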
Expected behavior
Cluster can be shut down and restarted without problems.
AKS Engine version
0.53.0
Kubernetes version
1.16.4
Additional context