Dual stack: unable to communicate between nodes via ipv6 #8794
Comments
@manuelbuil any ideas?
Hey, could you provide the following output:
Hey @manuelbuil, of course, thank you for your help. For completeness, I'll include what you asked for from both nodes.
Node YAML
Node 1
Node 2
/run/flannel/subnet.env
Node 1
Node 2
Routes
Node 1
Node 2
Thanks for the output! One thing that stands out is that both are k3s servers but I don't see how they are being connected together to create an HA control plane. How did you deploy both nodes?
I have an ansible playbook that does it, using an approach inspired by this. One node runs
When installing k3s you get a systemd service running with the configured parameters. If I understand correctly, you stop that service, change the config parameters and restart it or create a different systemd service, right? Why don't you stay with the created systemd service and the original config parameters?
I asked internally: if etcd db files exist on disk, those parameters are indeed ignored. Do you see any flannel log that provides extra information? It seems the flannel instance on the node is not aware of being part of a cluster.
@manuelbuil right, it's at the end of that ha-embedded doc:
As for your question:
Because I don't like to treat any of my nodes as special, so they all end up with exactly the same systemd unit. Leaving bootstrap-only options like
Note that kubectl shows all the nodes:
And again, if I swap the order of ipv6,ipv4 in these params (such that ipv4 becomes the primary), this works just fine (although ipv6 still doesn't work, to be clear 😛 ). I'm happy to try adding the bootstrapping options back though, if you feel like it will change anything.
The only place I know to look is the journal for my systemd unit, can you confirm? I enabled debug mode, and when I fire it up on node 1 I see this:
It doesn't appear to be unhappy. Is there a way to get more information from flannel?
I was trying to reproduce the issue but I have been unsuccessful so far. In my env I always see the ipv6 route you are missing. Even when I stop k3s, remove
BTW, I'd recommend moving to
Okay I upgraded to v1.27.7+k3s2, but I'm afraid I must report no change. From node 1 I can ping node 2's flannel.1 address, but not node 2's flannel-v6.1 address (that traffic is still going to the gateway). Here are the routes:
node 1:
node 2:
I wonder if maybe this could have something to do with the routes we add to our private interface. Our public interface is the one with the gateway, so we use routes on the private interface. Could flannel be picking up on those and deciding not to install a more specific route? Here are my network configs, in case it's useful:
bond0 (private interface)
bond1 (public interface)
The flannel IP routes for multinode communication are not there. You should see something like this:
Something weird is happening when executing https://github.com/flannel-io/flannel/blob/master/pkg/backend/vxlan/vxlan_network.go#L141. Could you check
What OS are you running? I can't reproduce the issue with opensuse or ubuntu 22.
Could you also try adding the route manually? Let's see if we get more information. In node1:
In node2:
Not that I see. I'm not sure what to grep for, though. "route" gave me no results. Neither did "flannel". Here's some stuff about cni that doesn't look unhappy:
Sorry about that, I should have included it in the OP (I've updated it now): Debian 12 (Bookworm).
Actually no, this is interesting. Here's the result of trying to do that on node 1:
That's coming from here in the kernel. I don't quite know what it means, I'm researching.
Okay I still can't claim full understanding of what's happening here, but I made some good progress today. Two important commits touched that line in the kernel, and their commit messages added some good context (I love a good commit message). It really made me start wondering about that
Node 1
Node 2
So what is the deal, here? Why can't I use a subset of my private network for the cluster/service networks? Assuming there's some technical limitation that I don't understand, is there something flannel could do to better communicate this issue to me, instead of just not adding routes?
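To make the "subset of my private network" relationship concrete, here is a rough Python sketch (not anything flannel or the kernel actually runs). It computes the smallest network that contains both bond0's gateway and node 2's flannel-v6.1 address, using the addresses from the original post. The helper name smallest_common_network is made up for illustration, and bond0's real prefix length is an unknown here.

```python
import ipaddress


def smallest_common_network(a: str, b: str) -> ipaddress.IPv6Network:
    """Return the longest-prefix IPv6 network that contains both addresses."""
    addr_a = int(ipaddress.IPv6Address(a))
    addr_b = int(ipaddress.IPv6Address(b))
    # XOR leaves a 1 wherever the addresses differ; the shared prefix ends
    # just before the highest differing bit.
    diff = addr_a ^ addr_b
    prefix_len = 128 - diff.bit_length()
    # strict=False masks off the host bits of addr_a.
    return ipaddress.IPv6Network((addr_a, prefix_len), strict=False)


gateway = "fda5:8888:9999:310::1"          # bond0's gateway, from the issue
remote_flannel = "fda5:8888:9999:311:2::"  # node 2's flannel-v6.1 address

common = smallest_common_network(gateway, remote_flannel)
print(common)

# Node 1's pod subnet (from its cni0 address) sits inside that network too.
node1_pods = ipaddress.IPv6Network("fda5:8888:9999:311::/80")
print(node1_pods.subnet_of(common))
```

In other words, any bond0 prefix of /63 or shorter that includes the gateway also covers the pod addresses, which is the overlap being asked about.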
Thanks for the investigation and sharing it here :)
I already discussed with a colleague that we should improve flannel logs because when strange kernel stuff happens we are blind.
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
This is still an issue.
@manuelbuil this is still an issue, but I have a workaround so I won't stand in your way if you want to ignore it.
Environmental Info:
K3s Version:
$ k3s -v
k3s version v1.27.6+k3s1 (bd04941)
go version go1.20.8
Node(s) CPU architecture, OS, and Version:
Debian 12 (Bookworm)
Linux s1 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
Cluster Configuration:
3 servers
Describe the bug:
The cluster is dual stack: both ipv4 and ipv6. Each node has two NICs: one public (bond1), one private (bond0). I'm using --node-ip and --node-external-ip to specify which is which. I'm also using --flannel-ipv6-masq, since the ipv6 block I'm using isn't publicly routable (I can make it that way, I'm just experimenting at this point). As an example, here's the args for one of my nodes:
While the pods can communicate with each other via ipv4, they cannot via ipv6. To simplify, let's talk about nodes 1 and 2:
Node 1
flannel.1: 10.3.128.0/32
flannel-v6.1: fda5:8888:9999:311::/128
cni0: fda5:8888:9999:311::1/80, 10.3.128.1/24
Node 2
flannel.1: 10.3.130.0/32
flannel-v6.1: fda5:8888:9999:311:2::/128
cni0: fda5:8888:9999:311:2::1/80, 10.3.130.1/24
Ignoring pods entirely, from node 1, I can ping node 2's flannel.1 IP address:
However, I cannot ping node 2's flannel-v6.1 IP address:
Interestingly, note the response is coming from fda5:8888:9999:310::1, which is bond0's gateway. It seems like this should stay within flannel, no? ipv4 does:
But ipv6, obviously, does not:
Here are the routes on Node 1:
It seems pretty clear that flannel isn't putting all the routes in here that it should. I expect I've made a mistake in my configuration, but I'm not sure how to debug this any further. Note that I have no firewall enabled. Does anyone have some insight into what's happening here?
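To illustrate why the ping reply would come from bond0's gateway when the flannel route is absent, here is a minimal sketch of longest-prefix-match route selection in Python. It is not how the kernel or flannel is implemented, and the table is hypothetical: the cni0 entry follows from node 1's cni0 address, but the bond0 route's prefix (fda5:8888:9999:310::/63) and the form of the flannel-v6.1 route are assumptions chosen for illustration.

```python
import ipaddress
from typing import List, Optional, Tuple

# (destination network, description of where the packet is sent)
Route = Tuple[ipaddress.IPv6Network, str]


def lookup(routes: List[Route], dest: str) -> Optional[str]:
    """Pick the matching route with the longest prefix, as a routing table would."""
    addr = ipaddress.IPv6Address(dest)
    matches = [(net, via) for net, via in routes if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


node2_flannel_v6 = "fda5:8888:9999:311:2::"  # node 2's flannel-v6.1 address

# Hypothetical node 1 table WITHOUT a flannel route to node 2.
without_flannel: List[Route] = [
    (ipaddress.IPv6Network("fda5:8888:9999:311::/80"), "dev cni0"),
    # Assumed prefix: the real bond0 route is not known here.
    (ipaddress.IPv6Network("fda5:8888:9999:310::/63"),
     "via fda5:8888:9999:310::1 dev bond0"),
]

# Same table plus a route of the kind flannel is expected to add (assumed form).
with_flannel: List[Route] = without_flannel + [
    (ipaddress.IPv6Network("fda5:8888:9999:311:2::/80"), "dev flannel-v6.1"),
]

print(lookup(without_flannel, node2_flannel_v6))
print(lookup(with_flannel, node2_flannel_v6))
```

Under these assumptions, the first lookup falls through to the bond0 route and its gateway, while the second picks the more specific flannel-v6.1 route, which matches the behavior described for ipv4.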