Regression in Windows 11 Build 22621.x - pod networking can not reach any destination #322
Comments
This issue has been open for 30 days with no updates.
Looking into this. I created an internal ticket (#44339550) for tracking.
This issue has been open for 30 days with no updates.
Any update on the internal ticket for this? Would be nice to know if there is an ETA for a fix, since locking installs to 22000 and not upgrading is not ideal for various reasons.
This issue has been open for 30 days with no updates.
As of 30/07/2023, locking a Windows installation to 22000 via the registry no longer works and you'll be force upgraded past it. Is there any update on this issue @fady-azmy-msft?
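For reference, the registry lock being referred to is presumably the Windows Update target-release policy, set from an elevated prompt roughly like this (the exact values are an assumption, not confirmed in this thread):

```
:: Pin feature updates to the 21H2 (22000.x) release of Windows 11
reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate" /v ProductVersion /t REG_SZ /d "Windows 11" /f
reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate" /v TargetReleaseVersion /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate" /v TargetReleaseVersionInfo /t REG_SZ /d "21H2" /f
```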
This issue has been open for 30 days with no updates.
@hach-que, would you be able to try reproducing this issue on Windows Server 2022? This scenario isn't supported on client SKUs.
Windows Containers aren't supported on current Windows 11 clients? That's news to me. Last I heard, the rationale for not publishing 22000 and later client versions of https://hub.docker.com/_/microsoft-windows was that it was no longer necessary - because Microsoft was saying that ltsc2022 images would run on Windows 11 (#117 (comment)). If the ltsc2022 image is no longer compatible with Windows 11, does that mean we'll get a 22621 client image published to Docker for use with Windows 11?
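One way to sanity-check that compatibility claim on a given Windows 11 build, assuming Docker is switched to Windows containers, is to run the ltsc2022 image directly:

```
:: Hyper-V isolation avoids the host/container build-match requirement;
:: drop --isolation=hyperv to exercise process isolation instead.
docker run --rm --isolation=hyperv mcr.microsoft.com/windows/nanoserver:ltsc2022 cmd /c ver
```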
@hach-que, you are correct, Windows Containers are supported on Windows 11 clients. I meant to say that overlay and l2bridge networking are both server-specific scenarios. I'm assuming you are using one of these networking options with Flannel VXLAN.
How would we test Kubernetes in a development scenario without Calico? Are we expected to license Datacenter just to do development on Kubernetes Windows support? Calico currently works on 22000, both server and client. There's really no practical reason for Calico not to work on 22621 - it's probably a pretty good indicator there's a regression in the kernel or networking components that would break Calico on Server 2025, which would thus mean it needs to get fixed anyway.
Hi @hach-que, I tried cloning your repo to try reproducing the issue. However, I get permission errors (while trying both the https link and the git link). I tried both ssh (by adding a public key) and an access token. I am probably doing something wrong. Is there a wiki you have that I can refer to in order to clone your repo?
Got the repo cloned. Built rkm in Visual Studio. Tried starting the dotnet application in an Ubuntu VM. Getting the below error:

`Oct 19 17:40:40 systemd[1]: rkm.service: Scheduled restart job, restart counter is at 1.`
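If it helps, the standard systemd tooling should show the full failure reason for that unit:

```shell
# Show current status and recent log lines for the rkm unit
systemctl status rkm.service
journalctl -u rkm.service -n 100 --no-pager
```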
@hach-que, if you would still like us to debug this, to start with, could you please share the output of `(Get-HnsNetwork | ConvertTo-Json)` from one of the Windows VMs?
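For convenience, one way to capture that output to a file for attaching here (assuming the HNS PowerShell helpers are available on the node):

```powershell
# Dump the full HNS network configuration to a JSON file
Get-HnsNetwork | ConvertTo-Json -Depth 10 | Out-File -FilePath hns-networks.json -Encoding utf8
```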
I can help with this from the Linux side if needed.
@hach-que is this still an issue? I checked on the rkm project and noticed nothing has been committed in over 7 months. If this is still an issue, do you have specific requirements on VXLAN or could something else be used?
Closing issue because it's going stale.
Hi @fady-azmy-msft @MikeZappa87, apologies for the non-response here. RKM is pretty much on hold at the moment since containerd 1.7.0 / hcsshim broke the ability to mount virtual filesystem volumes in host-process containers (microsoft/hcsshim#1699). That's prevented me from upgrading our RKM-based cluster past containerd 1.6.x and blocked any further adoption of regular containers, since even host-process containers don't work on the newer versions. Even if this issue were fixed in the kernel, practical usage of regular containers is also blocked on winfsp/winfsp#498, and there hasn't been progress on that issue in over 6 months either. So unfortunately I haven't had the time to try setting RKM up on a bunch of new boxes to see if I can reproduce the issues you were having here. It's likely I won't get a chance to revisit RKM until mid-2024 at the earliest.
(This bug was originally posted to microsoft/SDN#563, but I've moved it here because this feels like a more appropriate avenue to report it)
I've been experimenting with getting Windows 11 to work as a Kubernetes node for dev/testing purposes, and I've run into a regression between Build 22000 and Build 22621 where pods can no longer reach any destination (Internet, service addrs, other pod addrs).
Background
I have been working on a tool called RKM which does all the required setup for Kubernetes in development mode (without requiring VMs). It does all of the configuration, starts components, configures networking, etc.
Now I started out trying to get things working with Flannel VXLAN; I got Linux working fine but no luck on Windows - pods could not reach anywhere. After a little discussion in the sig-windows Slack, I was convinced to try setting up Calico instead. All of the testing results from here on out are from the calico branch of RKM (linked above).
For "Does not work", it means "pods can not reach the outside world or any other services or pods (the only thing that works is pinging 127.0.0.1)".
For "Works", it means "everything works, including DNS resolution of service names inside the Kubernetes cluster".
Reproduction Steps
The reproduction steps are fairly simple, but you will need a Linux VM to act as the Kubernetes master. All of these VMs should be on the same subnet (e.g. `10.7.0.0/16`).

Build RKM on the calico branch
You should end up with a `net7.0` directory inside `src\rkm-daemon\Debug`.
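A rough sketch of that step (the repository URL and project path are placeholders; use the actual RKM repo linked above):

```shell
# Placeholder URL/path; adjust to the real RKM repository and project layout
git clone -b calico https://github.com/<owner>/rkm.git
cd rkm
dotnet build src/rkm-daemon -c Debug
```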
Set up the Linux VM

- Install the `conntrack` package.
- Run `dotnet rkm-daemon.dll` and leave it running. It'll take a moment to start up all the components.
- Run `/opt/rkm/$(hostname)*/kubectl get pods --all-namespaces`. Once that's working and you can see the core pods, you're ready to move on to setting up the Windows VMs.

Set up the Windows VMs
You're going to set up two VMs here to contrast the build numbers. How you get 22000 and 22621 on the machines isn't very important (I'm sure internally Microsoft has easier access to ISOs than I do), as long as you get one machine on 22000.x and the other on 22621.x.
Then on each machine:

- Run `dotnet rkm-daemon.dll`.
- Run `dotnet rkm-daemon.dll` again to continue the set up process.

Checking that everything is working
Back on the Linux VM, you should now be able to run `/opt/rkm/$(hostname)*/kubectl get nodes -o wide` and see output that looks similar to this (your hostnames will vary):
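Illustratively (hostnames, IPs, versions and columns here are abridged placeholders, not real output), the listing should show the Linux control-plane node plus both Windows nodes, with the build visible in the KERNEL-VERSION column:

```text
NAME        STATUS   ROLES           VERSION   INTERNAL-IP   OS-IMAGE           KERNEL-VERSION
linux-cp    Ready    control-plane   v1.2x.x   10.7.0.10     Ubuntu 22.04 LTS   5.15.0-xx-generic
win-22000   Ready    <none>          v1.2x.x   10.7.0.11     Windows 11 Pro     10.0.22000.xxxx
win-22621   Ready    <none>          v1.2x.x   10.7.0.12     Windows 11 Pro     10.0.22621.xxxx
```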
Deploy the testing manifest
Copy the testing manifest to the Linux VM, and then apply it to the cluster with `/opt/rkm/$(hostname)*/kubectl apply -f ...`.
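The manifest itself isn't shown above; a rough equivalent, assuming one busybox pod for the Linux side and a nanoserver DaemonSet to put a pod on each Windows node (names and image tags are placeholders), would be something like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox-test
spec:
  nodeSelector:
    kubernetes.io/os: linux
  containers:
    - name: busybox
      image: busybox
      command: ["sleep", "3600"]
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nano-test
spec:
  selector:
    matchLabels:
      app: nano-test
  template:
    metadata:
      labels:
        app: nano-test
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
        - name: nano
          image: mcr.microsoft.com/windows/nanoserver:ltsc2022
          command: ["cmd", "/c", "ping -t 127.0.0.1"]
```

A DaemonSet is just a convenient way to get one Windows pod per node; two explicitly scheduled pods would work equally well.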
You can then get the names of the pods for further testing steps with `/opt/rkm/$(hostname)*/kubectl get pods -o wide`.

Test that Linux networking is working
This should pass easily, but as a sanity check, use `/opt/rkm/$(hostname)*/kubectl attach -it <name of busybox pod>` and then run `ping 1.1.1.1`. You should be able to get to the Internet.

You could also deploy an Ubuntu container and install `curl` and test connectivity that way, but the Linux side of things is stable and doesn't really need further confirmation that it works beyond running a simple ping.

Compare Windows networking
Ok, time for the part where we actually see the problem. When you ran `/opt/rkm/$(hostname)*/kubectl get pods -o wide` you will have seen which node each of the nano containers is running on. You need to pair this up with the kernel versions shown in `/opt/rkm/$(hostname)*/kubectl get nodes -o wide` to know which pod is running under which kernel version.

For the pod that is running on 22000.x, use `/opt/rkm/$(hostname)*/kubectl attach -it <pod name>` and then run some basic connectivity checks (see the sketch below). This should all be working, and you should get responses.
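As a sketch of the kind of checks involved (addresses in angle brackets are placeholders to fill in from `kubectl get pods -o wide`):

```
:: Run inside the Windows pod: Internet, pod-to-pod, and cluster DNS
ping 1.1.1.1
ping <IP of the busybox pod>
ping kubernetes.default.svc.cluster.local
```

The last command mainly confirms that the service name resolves; the ClusterIP behind it typically won't answer ICMP even on a healthy cluster.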
For the pod that is running on 22621.x, repeat the same checks. You will see that you just get timeouts and no connectivity, and thus you have reproduced the issue.
Workaround
There's no known workaround at the moment, because it's impossible to downgrade Windows 11 (and I'm not sure you can even hold off updates that long if Windows decides it wants to update).