kubelet failed to start on windows nodes with accelerated networking #2281
Comments
Yes. I would bump to 3 minutes. |
@marosset Thanks a lot for making the fix. When will it be released? We are currently blocked: we recreated our cluster using the latest 0.43.1 and are seeing missing nodes in all our clusters. We tried to revert to 0.41.5, the last release before #2049 was introduced, but 0.41.5 doesn't have the latest security patches for the Windows image, which were added in #2176. It would be good to know when this fix will be released, and whether there are any other options we can take in the meantime. |
We are planning on releasing a v0.44.0 mid next week which should include the fix to increase the timeout. |
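For readers following along, the timeout bump discussed above amounts to polling for a condition for longer before giving up. Below is a minimal Python sketch of that pattern, not the actual code in kubeletstart.ps1. The name `wait_until` and its parameters are hypothetical, and the 180-second default only mirrors the "3 minutes" suggested above.

```python
import time


def wait_until(check, timeout_s=180.0, interval_s=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns a truthy value or timeout_s elapses.

    check is any zero-argument callable, for example one that looks up
    the management IP address (an assumption made for illustration).
    """
    deadline = clock() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(
                f"condition not met within {timeout_s} seconds")
        sleep(interval_s)
```

As the later comments point out, a longer timeout only helps when the condition eventually becomes true. It does not help if the lookup can never succeed.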
@pradipd have you learned anything from your investigations as to why this is taking a long time? |
@yixwang Once everything is up and running, can you do the following on a node? If that is not the issue, my guess is that it takes a while to create the vswitch. We are working with a few teams to figure out why. But that will not be a quick investigation and it won't be a quick fix. @marosset Is there any way we can collect logs? We have a script we use to debug these issues: |
Awesome. Can we run starthnstrace.cmd? That will start hns tracing. Do we want to do that on all deployments? Or should we hide behind some sort of debug flag? |
I don't think we want to start hns tracing in all deployments. |
This is going to happen only on the initial deployment when we create the external switch. |
@yixwang can you share the entire kubelet.err.log . |
Creating the external switch happens in c:\k\kubeletstart.ps1, which runs whenever the kubelet service starts. I believe you should be able to repro this by stopping the kubeproxy/kubelet services, deleting the existing 'ext' hns network/switch, and then starting kubelet. |
@marosset @pradipd So, to collect kubelet.err.log, my understanding is that I need to do the following:
Let me know if I am missing anything above. Once confirmed, I will collect the log and upload it here. |
@yixwang. You should already have kubelet.err.log. You mentioned it in your original post. |
Yes, here it is |
What is the output of "ipconfig /all" on the node |
Note, the output below is collected on a different node from the one where kubelet_log is collected since the previous cluster is already gone. Nevertheless, both nodes are having the same error where kubelet failed to start. If needed, I can collect the kubelet log from the same node where the
|
Can I have the kubelet logs from this node? |
@pradipd Here's the one from the same node, 2005k8s00000001: |
It looks like the script can't find the IP address for Ethernet 3. Can you run the following in powershell? |
Hi @pradipd, sorry for taking a while to get back. I upgraded to 0.44.0, which has the fix for increasing the timeout, but I am still seeing 2 of 3 Windows nodes missing. Note, I noticed that we do have Here's the kubelet error log from one of the missing nodes Here's the output of
Here's the output of the commands from your last comment:
|
This issue is not because of the timeout. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This is still an issue for some teams. @jackfrancis and @marosset can we fix this? |
The Windows team determined this was an issue with the Mellanox drivers and we need to get new drivers for accelerated networking to work reliably in these scenarios. Let me try and get the status of the drivers and see if there is a date the new drivers would be included in the Windows VHDs available in Azure. |
I have not. @daschott do you know when new Mellanox drivers will be available in the Azure marketplace images? |
ping @daschott |
The driver itself has been released: I forwarded our last discussions on this over e-mail. I unfortunately don't have many insights into Azure qualification procedure :( Let me know if there is anything in addition that is being asked. |
@marosset @craiglpeters @jsturtevant any update when this will be fixed since the Windows drivers are fixed? |
@immuzz I think we are waiting for the drivers to be included in the images published in the Azure Marketplace. Do you know if these components are in those images? |
@jsturtevant any news on the new images containing the drivers for Windows node pools where accelerated networking can be enabled? |
@romina2001 the driver has been released but has not been validated in k8s and does not come installed in the base images AFAIK. If you have the ability, you could give it a test run and report back. This is currently on our backlog to test via Azure cluster api. |
Closing as this was fixed in #4585. |
Describe the bug
In one of our newly created clusters, we noticed that some Windows nodes are missing from the cluster when running kubectl get nodes. However, these nodes do exist and are healthy, as shown on the Portal. Example:
Only one node shows up. However, all three Windows nodes in the VMSS node pool are running fine according to the Portal.
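The comparison described above (instances that exist in the VMSS versus nodes registered with Kubernetes) can be sketched in Python as follows. This is an illustrative helper, not part of AKS Engine. The function name `missing_nodes` is hypothetical, and it assumes the input is the JSON printed by `kubectl get nodes -o json`, a list object whose `items` each carry `metadata.name`. The expected names would be copied from the Portal or the VMSS instance list.

```python
import json


def missing_nodes(expected_names, kubectl_nodes_json):
    """Return expected node names absent from `kubectl get nodes -o json`.

    Names are compared case-insensitively, since VM computer names and
    Kubernetes node names may differ in case.
    """
    data = json.loads(kubectl_nodes_json)
    registered = {
        item["metadata"]["name"].lower()
        for item in data.get("items", [])
    }
    return sorted(
        name for name in expected_names
        if name.lower() not in registered
    )
```

The names it returns are the nodes whose kubelet.err.log is worth inspecting first.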
After looking into kubelet.err.log on the Windows nodes that are not recognized by Kubernetes, we found that the following errors repeat many times:
Line 50 corresponds to the 3rd line in the following in the kubeletstart.ps1 file
Can someone help identify the root cause in this case?
Steps To Reproduce
Create the cluster with the following apimodel.json
Expected behavior
All nodes should be recognized by the cluster
AKS Engine version
0.43.0
Kubernetes version
1.15.4
Additional context