Improve the documentation examples and advice for native IPv6 Docker networking #19556

Open

daryll-swer opened this issue Mar 2, 2024 · 28 comments

Labels: area/engine, area/guides, area/networking, lifecycle/frozen

@daryll-swer

daryll-swer commented Mar 2, 2024

Is this a docs issue?

  • My issue is about the documentation content or website

Type of issue

Other

Description

A few issues:

  1. The bare-minimum “example” should generally be a /64 (whereby we route a /64 or larger to the Docker host) instead of a /112 or potentially some other prefix length outside the nibble boundary.
  2. The IPv6 notation used in the documentation is not correct, in direct violation of RFC 5952.
  3. ULAs shouldn't be promoted or advised; sources below.
  4. An additional problem with ULA is that it forces the use of NAT66, which in turn breaks native IPv6 networking and the benefits of the end-to-end principle; the usage of NAT66 also introduces the need for ALGs (called NAT helpers in the Linux world), which reduces the benefits of IPv6 to zero, on par with traditional NAT44 or NAT444 environments.
  5. Overall, the page lacks a network-engineering point of view and operational insight.

Location

https://docs.docker.com/config/daemon/ipv6/

Suggestion

For point 1:

We simply replace the “2001:0DB8::/112” string with “2001:db8::/64”.
The idea of a /64 minimum comes from the fact that IPv6 networking is based on prefix length, not on “number of addresses”, since the address space is 128 bits. This is reflected in the original SLAAC specifications, and there is additional operational guidance in BCOP-690. We should not be promoting an archaic IPv4-centric mentality in native IPv6 networking.
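
As a concrete sketch (assuming the documentation prefix stands in for whatever /64 is actually routed to the host), the daemon.json example could then read:

```json
{
  "ipv6": true,
  "fixed-cidr-v6": "2001:db8::/64"
}
```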

For point 2:

First, the letters in an IPv6 address are always lower-case; second, all leading zeros are removed in the compressed IPv6 notation format, meaning in effect:

  • “2001:0DB8::/64” is not recommended
  • “2001:db8::/64” is recommended

Please refer to Section 4 of RFC 5952.

For points 3, 4 & 5:

I am willing to help improve this aspect of the Docker IPv6 documentation by integrating network engineering perspective and operational insights directly into the Docker docs.

The basic idea of native IPv6 networking is: No NAT66/NPTv6 or ULA.

I'm of course aware of poorly implemented IPv6 at popular cloud providers/IaaS companies, where the user is forced to rely on ULA/NAT66 or on hacks with NDP proxying or MACVLAN, but of course this is not a valid reason for the official Docker documentation to push ULA/NAT66/NPTv6.

I authored an extensive native IPv6 best-practices guide, linked below, which folks may want to read for thorough information that simply cannot be reproduced in a tiny GitHub issue:
https://blog.apnic.net/2023/04/04/ipv6-architecture-and-subnetting-guide-for-network-engineers-and-operators/

I've written extensively on various topics in network engineering, in particular IPv6. I'm personally willing to help improve the Docker docs to push for native IPv6 networking using some realistic examples. I'm not sure how the Docker docs writing/improvement process is handled, but if I could get into a direct discussion with the relevant folks, it would be much appreciated.

On a side note, I've been happily using Docker for a few years now, with native IPv6 networking (no ULA/NAT66/NPTv6) 🙂

@daryll-swer daryll-swer added the status/triage Needs triage label Mar 2, 2024
@dvdksn dvdksn added area/engine Issue affects Docker engine/daemon area/networking Relates to anything around networking labels Mar 2, 2024
@dvdksn
Collaborator

dvdksn commented Mar 2, 2024

Thank you @daryll-swer for raising this issue, and for providing so much detail and offering to help out!

Let me loop in @robmry and @akerouanton, maintainers of libnetwork in moby. I think describing some of these concepts from a network engineer's perspective sounds like a great enhancement, if we can pull that off!

@NiKiZe

NiKiZe commented Mar 3, 2024

I can only agree with what @daryll-swer has brought up.
The current Docker documentation has also caused conflict more than once, with users/developers saying that the Docker documentation is to be followed instead of the RFCs and actual best practice.

Thanks!

@robmry
Contributor

robmry commented Mar 4, 2024

Yes, these sound like good changes!

@dvdksn - what's the best approach? I guess Daryll can create PRs for us to review and discuss?

@daryll-swer
Author

I guess Daryll can create PRs for us to review and discuss?

While a PR would make sense for the actual documentation change, I think we should probably get on a call or something first to discuss some ideas and concepts I have in mind about native IPv6 networking on Docker, and see if we can then, from there, make a plan of action to update the Docker IPv6 docs (through the PRs?).

@robmry
Contributor

robmry commented Mar 4, 2024

Yes, sure - let's do that. I'll try to sync with Albin, and we'll get something set up ...

@caliban511

So how can the IPv6 network of the docker host communicate with the IPv6 subnet in docker? I haven't seen any documentation on this.

@daryll-swer
Author

So how can the IPv6 network of the docker host communicate with the IPv6 subnet in docker?

I'm not sure what you mean @caliban511.

@dvdksn dvdksn added lifecycle/frozen and removed status/triage Needs triage labels Mar 30, 2024
@caliban511

I'm not sure what you mean @caliban511.

What I mean is that the IPv6 subnet inside Docker cannot communicate with the outside.
Although the containers inside Docker have obtained addresses from the IPv6 subnet I added, they still cannot reach IPv6-only websites on the public internet.
I have not found a way to let the IPv6 subnet communicate with the public network.

@NiKiZe

NiKiZe commented Mar 30, 2024

By using DHCPv6-PD, every subnet gets its own /64.
Traffic is routed through the host.

The host should be able to request at least a /60 for such usage. And yes, that requires that the network is set up correctly; with v6 you need to be able to get more than a /64.
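
As a rough sketch (assuming dhcpcd as the DHCPv6 client; the interface names and exact option syntax should be checked against dhcpcd.conf(5) before use):

```
# /etc/dhcpcd.conf -- sketch only
interface eth0
  # request a delegated /60 and assign one /64 from it (SLA ID 0) to the Docker bridge
  ia_pd 1/::/60 docker0/0/64
```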

@daryll-swer
Author

daryll-swer commented Mar 30, 2024

What I mean is that the IPv6 subnet inside Docker cannot communicate with the outside. Although the containers inside Docker have obtained addresses from the IPv6 subnet I added, they still cannot reach IPv6-only websites on the public internet. I have not found a way to let the IPv6 subnet communicate with the public network.

What you are referring to is just routing. Docker's job is not network architecture and design; Docker's job is container abstraction tooling and deployment. So it's no surprise that they didn't document how to design an IPv6-native network and how to route a prefix to the Docker host.

That being said, I am talking to the Docker net dev team via email, and I did suggest they add some docs on routing basics for IPv6. This is copied/pasted from the email I sent them:

As I explained in the GitHub issue, the minimum assignment (the smallest subnet, i.e. the longest prefix length) should be a /64. This /64 is to be routed to the Docker host; there are three methods for routing a prefix to the host:

  1. By using DHCPv6 IA_PD

    • Fairly simple to configure and deploy, for both the upstream DHCPv6 server and the DHCPv6 client daemon running on the host.
    • Downside? The network operator (whoever provides the network underlay to the Docker host owner) now needs to maintain and operate a DHCPv6 server and, on top of that, maintain AAA to ensure the host gets the correct prefix statically and that it survives reboots etc.
  2. By using BGP (eBGP with private ASNs) with FRR or BIRD (FRR seems to be more popular); a minimal FRR sketch follows this list

    • A more scalable method; it allows advanced traffic engineering with BGP features such as communities, path manipulation etc.
    • The Docker host (in a properly architected network) can actually influence the path a packet takes from the host towards the upstream network (or the internet) through BGP communities. Think of a large data-centre network that spans from Amsterdam all the way to Spain: BGP communities allow the app developer to influence (not hard-control) the path a packet takes based on network analytics (bandwidth, latency, packet loss, customer patterns etc.).
    • Also my favourite method.
  3. By using static routes

    • Doesn't scale; prone to errors/failures when a next-hop dies or flaps. Not recommended for production.
    • I would go as far as to say it's not recommended even for a home-office network; at a bare minimum a person should opt for OSPF (if they aren't capable of BGP operations).
This problem is more generalised and not limited to the Docker situation. I don't know why, but network engineering knowledge seems to be considered obsolete/unnecessary by many people in this industry, to the point that everyone just knows “NAT” [as they don't know how to route from the edge router all the way down to the host using an eBGP-driven design (for example)] and overlapping RFC1918 ranges (as they do not even know how to subnet).

The host should be able to request at least a /60 for such usage. And yes, that requires that the network is set up correctly; with v6 you need to be able to get more than a /64.

@NiKiZe for a Docker-only host, you don't really need more than a single /64 routed to that host, because you will in any case just create a Linux bridge, put the /64 on that bridge, and all containers then reside behind that bridge and get a /128 out of the routed /64 address space. I've never found a use case for Docker networking to go beyond one bridge or beyond a single cleanly routed /64.
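
For example (a sketch, with a documentation prefix standing in for the real routed /64):

```console
# create an IPv6-enabled bridge network backed by the /64 routed to this host
$ docker network create --ipv6 --subnet 2001:db8:1::/64 ip6net

# containers attached to it get a /128 out of that /64
$ docker run --rm --network ip6net busybox ip -6 addr
```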

Docker Swarm is something I've not used, but I suppose there may be a use case for more than one /64 there.

@daryll-swer
Author

An additional recommendation: use docker-compose to configure the networking instead of daemon.json; it's more flexible and allows you to fine-tune the networking to your needs. Example of what I do (I routed the /64 to the host with eBGP, using private ASN numbering, with FRR running on the host):
[screenshot of a docker-compose networking configuration]
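
(The screenshot isn't reproduced here; as a rough sketch of what such a Compose networking section can look like, with a documentation prefix standing in for the real routed /64 and a made-up service name:)

```yaml
# sketch only
services:
  web:
    image: nginx
    networks:
      - ip6net

networks:
  ip6net:
    enable_ipv6: true
    ipam:
      config:
        - subnet: 2001:db8:1::/64
```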

@caliban511

An additional recommendation: use docker-compose to configure the networking instead of daemon.json; it's more flexible and allows you to fine-tune the networking to your needs.

Thank you for your patience

I am not a network engineer, just an ordinary Docker user. I tried to create an IPv6 subnet for Docker following some online tutorials, but a container connected to that IPv6 subnet never passes an IPv6 connectivity test. How can I pass the test?

Is 2400:7060 a prefix of your public network? Apart from directly using the public network prefix to build a subnet, is there no other way to connect the container to the public IPv6 network?

Because our public IPv6 prefix is changed every few days on a cycle, using this method means having to manually delete and rebuild the IPv6 subnet every few days.

This is obviously not the right way...

@NiKiZe

NiKiZe commented Mar 30, 2024

An additional recommendation: use docker-compose to configure the networking instead of daemon.json; it's more flexible and allows you to fine-tune the networking to your needs.

Thank you for your patience

I am not a network engineer, just an ordinary Docker user. I tried to create an IPv6 subnet for Docker following some online tutorials, but a container connected to that IPv6 subnet never passes an IPv6 connectivity test. How can I pass the test?

Is 2400:7060 a prefix of your public network? Apart from directly using the public network prefix to build a subnet, is there no other way to connect the container to the public IPv6 network?

Because our public IPv6 prefix is changed every few days on a cycle, using this method means having to manually delete and rebuild the IPv6 subnet every few days.

This is obviously not the right way...

That is exactly the right way: you have a prefix from your ISP that you route internally; no more private IP ranges, and no more NAT. The issue here is that many have learnt an IPv4 NAT mindset, which is not the right way.
With IPv6 you also always have multiple IP addresses; use this to your advantage, and don't be afraid of using multiple IPs to handle your transition between ISPs.

@daryll-swer
Author

I am not a network engineer, just an ordinary Docker user. I tried to create an IPv6 subnet for Docker following some online tutorials, but a container connected to that IPv6 subnet never passes an IPv6 connectivity test. How can I pass the test?

Is 2400:7060 a prefix of your public network? Apart from directly using the public network prefix to build a subnet, is there no other way to connect the container to the public IPv6 network?

2400:7060::/32 is a Global Unicast Address block that I own. If I were to use this block in production, I would subnet it according to my architecture model here.

Internet <> edge router <> layer 3 spine <> layer 3 leaf <> server node
The server node has a /48 routed to it; from there, subnet into a /56 per VM or a /64 per VM and route that to the VM using one of the methods I described here.

Because our public IPv6 prefix is changed every few days on a cycle, using this method means having to manually delete and rebuild the IPv6 subnet every few days.

IPv6 prefixes are NOT supposed to be "changed every few days on a cycle". This is not a Docker problem, it is a network problem. Your network provider has failed to adopt IPv6 standards and best practices.

Ask them to deploy BCOP-690-compliant IPv6, in addition to asking them to read my IPv6 guide. If they do not comply with BCOP-690, you'll have broken IPv6 forever.

@daryll-swer
Author

The issue here is that many have learnt an IPv4 NAT mindset, which is not the right way.
With IPv6 you also always have multiple IP addresses; use this to your advantage, and don't be afraid of using multiple IPs to handle your transition between ISPs.

He's talking about lack of BCOP-690 compliance, forcing the imposition of NAT66 upon the user by the network operator who's providing him IPv6 connectivity.

@caliban511

caliban511 commented Mar 30, 2024

IPv6 prefixes are NOT supposed to be "changed every few days on a cycle". This is not a Docker problem, it is a network problem. Your network provider has failed to adopt IPv6 standards and best practices.

Thank you,

It seems that I can't use IPv6 for Docker containers for the time being, then. That's it, thank you again.

@caliban511

He's talking about lack of BCOP-690 compliance, forcing the imposition of NAT66 upon the user by the network operator who's providing him IPv6 connectivity.

Thank you,
It seems that the network environment I am in has caused these problems, so this is unavailable for now. That's it.

@daryll-swer
Author

It seems that the network environment I am in has caused these problems, so this is unavailable for now. That's it.

Hostile “expert” network providers are unfortunately the norm in this industry, from Tier 1 cloud providers down to small networks. A lot of app developers and end-users have complained about hostile network teams for a very long time.

I'm of the opinion that software engineers, network engineers and app devs should work together in an organisation as a Venn-diagram intersection, but sadly, that's wishful thinking.

@olljanat
Contributor

olljanat commented Apr 4, 2024

Important topic. In my opinion, IPv6 is not actually as complex as people think; it is IPv4/IPv6 dual stack which causes all the problems.

That is why I would start tackling this issue by adding support for creating Docker networks where IPv4 is disabled.
In that scenario we can then also disable NAT (--ip-masq=false) and PAT (--publish).

Docker Swarm is something I've not used, but I suppose there may be a use case for more than one /64 there.

With default settings, Swarm reserves 10.0.0.0/8 and takes /24 slices from it for overlay networks.

However, service discovery inside Swarm is DNS-based, so IP addresses do not matter that much.
It would make sense for Swarm workers to simply tell the Swarm managers "I have subnet xx/64" and for virtual-network subnetting to be done from that.

So instead of having just one subnet, each Swarm-scoped network would have as many subnets as there are nodes.
In the K8s world, Calico uses a /122 for this purpose and allows configuration between /116 and /128, but it also allows one host to have multiple subnets.

That way there only ever needs to be one /64 routed to each host.

By using BGP (eBGP with private ASNs) with FRR or BIRD (FRR seems to be more popular)

There is also GoBGP, which might be easier to integrate with Docker.

@daryll-swer
Author

Important topic. In my opinion, IPv6 is not actually as complex as people think; it is IPv4/IPv6 dual stack which causes all the problems.

That is why I would start tackling this issue by adding support for creating Docker networks where IPv4 is disabled.
In that scenario we can then also disable NAT (--ip-masq=false) and PAT (--publish).

Unless I'm missing something, Docker IPv6 is NAT/PAT-disabled by default, is it not? So that's a non-issue for starters anyway. I have Docker running right now with native IPv6, no NAT/PAT for v6.

Not sure what you mean by dual stack causing problems; the two stacks are independent of each other by default anyway.

With default settings, Swarm reserves 10.0.0.0/8 and takes /24 slices from it for overlay networks.

However, service discovery inside Swarm is DNS-based, so IP addresses do not matter that much.
It would make sense for Swarm workers to simply tell the Swarm managers "I have subnet xx/64" and for virtual-network subnetting to be done from that.

So instead of having just one subnet, each Swarm-scoped network would have as many subnets as there are nodes.
In the K8s world, Calico uses a /122 for this purpose and allows configuration between /116 and /128, but it also allows one host to have multiple subnets.

That way there only ever needs to be one /64 routed to each host.

I never worked with K8s. In theory, I think that for Docker Swarm you could simplify this further by making use of VXLAN/EVPN, whereby all swarm worker nodes' “Docker bridges” are members of the same layer 2 domain across all nodes. On a layer 3 basis, this would mean the single /64 is anycast-routed; however, it would also mean the nodes need hypercube interconnection in order for layer 3 routing to be natively routed between the hosts based on the routing table (FRR or GoBGP etc.).

A single /64 can hold billions of containers, so if we had 9,000 containers across 100 nodes it's not an issue, as each container has a non-overlapping address anyway, and you could always advertise the unique /128 addresses over BGP back to the layer 3 leaf (ToR) switch. Or you could avoid the hypercube this way as well, since each unique container on each unique host is advertised as a /128 and the more specific address wins in the routing table, so a single /64 serves all nodes.

However, this is just my understanding of Swarm, and it needs labbing to actually verify that the concept works.

@olljanat
Contributor

olljanat commented Apr 4, 2024

Not sure what you mean by dual stack causing problems; the two stacks are independent of each other by default anyway.

I mean that as long as you have IPv4 included (dual stack), you need to choose between:

  1. Inconsistent behaviour for IPv4 and IPv6 clients: IPv4 with NAT and PAT, and IPv6 without them.
    1.1 This also has a huge effect on security, because by default Docker's IPv4 addresses are internal while its IPv6 addresses are reachable by everyone.
  2. Forcing the system to use a consistent configuration (either disable NAT and PAT for IPv4, or enable them for IPv6).
    2.1 Because people are more familiar with IPv4, most of them will first try to configure IPv6 to match their IPv4 networking.

That is why I think it would be much simpler to have a pure IPv6-only configuration option.

Additionally, according to the documentation, the experimental ip6tables setting "enables additional IPv6 packet filter rules, providing network isolation and port mapping", which is once again confusing, because people should not use port mapping with IPv6, but they certainly should use packet filtering and network isolation.

I have Docker running right now with native IPv6, no NAT/PAT for v6.

But internally Docker still adds IPv4 addresses to all those networks, and each container also has an IPv4 address, right?
This is also the background to those Swarm-mode IPv6 issues: because there is no option to totally disable IPv4, there can be code which expects those addresses to always be available.

I never worked with K8s.

When it comes to the Kubernetes world, I found that the same limitation existed in k3s and k0s and fixed it with k3s-io/k3s#4450 and k0sproject/k0s#1292, so it is now possible to run both of those systems in a way where there is no IPv4 address at all on the host or in any of the containers. That is what I call native IPv6.

EDIT: I just remembered that I actually tried to implement an IPv6-only mode for Docker two years ago, but failed because too many places in the code expected IPv4 addresses to be available. The old draft is available (not sure if useful) at https://github.com/olljanat/moby/commits/disable-ipv4/ But it might be a different story now, because there has been some refactoring in libnetwork.

EDIT2: I made a rebased and updated version of that, which looks to be working quite well (it still needs some work to make sure it does not break the IPv4 use case). It can be found at olljanat/moby@4534a75, and the commit message contains examples of how to use it locally and with Swarm.

@daryll-swer
Author

I mean that as long as you have IPv4 included (dual stack), you need to choose between:

Inconsistent behaviour for IPv4 and IPv6 clients: IPv4 with NAT and PAT, and IPv6 without them.
1.1 This also has a huge effect on security, because by default Docker's IPv4 addresses are internal while its IPv6 addresses are reachable by everyone.
Forcing the system to use a consistent configuration (either disable NAT and PAT for IPv4, or enable them for IPv6).
2.1 Because people are more familiar with IPv4, most of them will first try to configure IPv6 to match their IPv4 networking.
That is why I think it would be much simpler to have a pure IPv6-only configuration option.

I disagree on point 1. How are they “inconsistent”? They are completely independent protocols; therefore, for legacy, exhausted IPv4 there's NAT, and for v6 there's none. That is consistent.

On security: v4 may be NATted; v6 should just accept established, related and untracked (I explained why we should NOTRACK control-plane, management and BUM traffic here), accept ICMPv6, and finally accept only the “exposed” ports, say 80/443 or whatever service you're running. Problem solved.
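
As a rough sketch of that filtering policy (assuming a documentation prefix for the container network; in practice this would have to be reconciled with the rules Docker itself manages):

```console
# accept return traffic and anything deliberately left untracked
$ ip6tables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED,UNTRACKED -j ACCEPT
# never filter ICMPv6 (PMTUD, NDP, etc.)
$ ip6tables -A FORWARD -p ipv6-icmp -j ACCEPT
# allow only the "exposed" services towards the containers
$ ip6tables -A FORWARD -d 2001:db8:1::/64 -p tcp -m multiport --dports 80,443 -j ACCEPT
# drop everything else destined for the container network
$ ip6tables -A FORWARD -d 2001:db8:1::/64 -j DROP
```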

I would strongly note that NAT is not and never was a security feature:
https://www.f5.com/resources/white-papers/the-myth-of-network-address-translation-as-security

That is why I think it would be much simpler to have a pure IPv6-only configuration option.

I agree that an IPv6-only mode should be a feature for both Docker and Docker Swarm. However, you cannot force everyone to use IPv6-only mode, as IPv4 is still part of the internet ecosystem whether we like it or not. Please bear in mind that I'm a public IPv6 advocate and wrote a detailed IPv6 guide from a network-architecture standpoint; I love IPv6, but I'm not delusional about an IPv6-only world in 2024.

Additionally, according to the documentation, the experimental ip6tables setting "enables additional IPv6 packet filter rules, providing network isolation and port mapping", which is once again confusing, because people should not use port mapping with IPv6, but they certainly should use packet filtering and network isolation.

This is/was part of my original plan to correct the wording/naming/concepts in the docs. I agree that “port mapping” nonsense shouldn't be part of ANY IPv6 doc/talk/implementation at all.

But internally Docker still adds IPv4 addresses to all those networks, and each container also has an IPv4 address, right?
This is also the background to those Swarm-mode IPv6 issues: because there is no option to totally disable IPv4, there can be code which expects those addresses to always be available.

Like I said earlier:

I love IPv6, but I'm not delusional about an IPv6-only world in 2024.

When it comes to the Kubernetes world, I found that the same limitation existed in k3s and k0s and fixed it with k3s-io/k3s#4450 and k0sproject/k0s#1292, so it is now possible to run both of those systems in a way where there is no IPv4 address at all on the host or in any of the containers. That is what I call native IPv6.

Can we do load balancing with native-only IPv6, without any destination NAT/NAT66, on K8s implementations? I'm interested to see some docs or examples of this if you have any. On the underlay network, I would of course have BGP peering with each host and make good use of BGP multipathing and the BGP link-bandwidth community.

refactoring in libnetwork

I was (am?) communicating with the Docker net dev team via email, and I was supposed to get an invitation to their IPv6-specific meeting(s) to discuss this further; however, I haven't yet received the invite.

@olljanat
Contributor

olljanat commented Apr 4, 2024

finally accept only the “exposed” ports, say 80/443 or whatever service you're running. Problem solved.

Ah, now I get what you mean. That is a good idea, and I would like to have support for it in IPv4 too, especially when NAT is not used and there are direct routes between hosts.

That would provide consistent --publish parameter behaviour even for dual stack.
However, it would be a breaking change for existing use cases, so some kind of new setting is needed to enable it.

I would strongly note that NAT is not and never was a security feature:

Agreed that it was not supposed to be a security feature, but technically in some places it is one, and it is used like that by many.

In Docker it actually depends on which network driver is used. When bridge is used, it is possible to skip those PAT rules by adding a route on the source machine; it will then be able to connect directly to all the ports the container is listening on, even when the ports are not published.
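
For example (a sketch with made-up addresses): if the bridge network is 172.18.0.0/16 and the Docker host is reachable at 192.0.2.10, the source machine only needs:

```console
# reach the containers directly, bypassing published ports
$ ip route add 172.18.0.0/16 via 192.0.2.10
```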

However, when you are using the overlay driver, it is not possible to communicate directly with those containers except through the ports which are published (PAT).

you cannot force everyone to use IPv6-only mode,

I said that adding support for IPv6-only networks would be one way to start solving this issue, not that we would force everyone to use only that configuration. Also, if NAT is needed, then I would prefer to have some network device doing IPv4 -> IPv6 NAT instead of configuring all of them with dual stack, but that is just my preference.

However, your idea is better: if --publish (or some new similar flag) can be used to control iptables instead of the port map, then we can improve security also for those who are using IPv4 only.

Can we do load balancing with native-only IPv6, without any destination NAT/NAT66, on K8s implementations?

You can at least get very near to that with this Calico feature and by having externalTrafficPolicy: Local configured on the services. Then you will see /128 routes distributed with BGP for your infra. Technically there is still kube-proxy running on all K8s nodes, which you can get rid of by switching to the eBPF dataplane.
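
For reference, a minimal sketch of such a Service (the field names are standard Kubernetes; the service name, selector and port are made up):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  # keep traffic on the node that hosts the pod, so per-node /128s can be advertised
  externalTrafficPolicy: Local
  ipFamilyPolicy: SingleStack
  ipFamilies: [IPv6]
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80
```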

I was (am?) communicating with the Docker net dev team via email, and I was supposed to get an invitation to their IPv6-specific meeting(s) to discuss this further

I mean all those libnetwork pull requests in https://github.com/moby/moby/pulls?q=is%3Apr+libnetwork which started from moby/moby#42262; after that, a lot of refactoring has been done.
I don't know more about IPv6 support, except rumours that the plan is to improve it, so you are here at the right time.

@daryll-swer
Author

Ah, now I get what you mean. That is a good idea, and I would like to have support for it in IPv4 too, especially when NAT is not used and there are direct routes between hosts.

That would provide consistent --publish parameter behaviour even for dual stack.
However, it would be a breaking change for existing use cases, so some kind of new setting is needed to enable it.

Exactly, yes. --publishv6 or something? Or re-write the whole stack to be cleaner and create a migration guide for users? I'm not a developer and definitely don't deal with the layer 8 issues of dev work, so this is a call the devs will need to make. As long as the underlying network behaviour is similar to my proposed method, it will work cleanly: PMTUD will work (ICMPv6 is permitted), the conntrack table won't flood (NOTRACK is in use) and established/related traffic is accepted by default.

Agreed that it was not supposed to be a security feature, but technically in some places it is one, and it is used like that by many.

In Docker it actually depends on which network driver is used. When bridge is used, it is possible to skip those PAT rules by adding a route on the source machine; it will then be able to connect directly to all the ports the container is listening on, even when the ports are not published.

Yes and yes, but IPv4 is scarce and nobody will really route a /24 or whatever to each node; they will just NAT IPv4, then route a /64 (or larger) to the host and put it on the bridge. This is what I do anyway.

I would suggest avoiding the term “PAT”, as it confuses many people. Yes, I know NAT and PAT are two different things, but in today's world they've become synonymous in terms of the actual config on the system. I haven't seen stateless NAT (real NAT) anywhere in production for IPv4, short of NPTv6 for Provider Aggregatable address space.

However, when you are using the overlay driver, it is not possible to communicate directly with those containers except through the ports which are published (PAT).

Do you have access to a network infrastructure that is IPv6-native, where you control the entire network? I would potentially like to work with you and run a Docker/Docker Swarm lab to test out some ways to achieve IPv6-native-only load balancing etc. It would require the underlay network to support VXLAN/EVPN, and at least two physical server nodes would have to be plain bare metal so that we could run FRR (or GoBGP or whatever) between the node and the layer 3 ToR switch (leaf). We could also play with a hypercube network topology and run BGP between the nodes directly.

I no longer work on DC networks and now work full-time on SP networks, so I do not have access to a DC environment.

However, your idea is better: if --publish (or some new similar flag) can be used to control iptables instead of the port map, then we can improve security also for those who are using IPv4 only.

Yes, you can create 1:1 iptables rules for v4/v6: accept established, related and untracked traffic, accept ICMPv4/v6, and finally accept the “exposed” ports, say 80/443, 1:1 in both tables. This removes the false NAT-as-a-security-service model completely from the Docker paradigm. However, it requires the Docker net devs to come together and work on such a rework of the underlying code base.

You can at least get very near to that with this Calico feature and by having externalTrafficPolicy: Local configured on the services. Then you will see /128 routes distributed with BGP for your infra. Technically there is still kube-proxy running on all K8s nodes, which you can get rid of by switching to the eBPF dataplane.

We are going off-topic here, but I'd like to discuss K8s IPv6-native-only load balancing using eBGP-driven networking (underlay network + host + network topology design) further; can you please email me at contact@daryllswer.com?

@olljanat
Contributor

I'd like to discuss K8s IPv6-native-only load balancing using eBGP-driven networking (underlay network + host + network topology design) further

I have used the AS Per Rack model.

Anyway, I got inspiration from this discussion to try to build a similar solution for Docker; it is now available at https://github.com/olljanat/docker-bgp-lb and I just added an IPv6-support tracking issue there. Feel free to try it and provide feedback there, so we don't go too much off-topic here.

@daryll-swer
Author

I have used the AS Per Rack model.

The problem with that Meta+Arista RFC is that it complicates route filters and creates potential loop-avoidance issues at scale; it leads down the horrible path of eBGP over eBGP and iBGP over eBGP. It's further explained here (also read the comments).

Anyway, I got inspiration from this discussion to try to build a similar solution for Docker; it is now available at https://github.com/olljanat/docker-bgp-lb and I just added an IPv6-support tracking issue there. Feel free to try it and provide feedback there, so we don't go too much off-topic here.

How does this differ from simply using FRR to advertise the ranges to the upstream router or layer 3 leaf/ToR switch? And is your plugin for Docker Swarm mode or something else? It needs a better readme with a network topology example.

@daryll-swer
Author

It appears Docker v27.x.x, including its IPv6-specific documentation, has been updated and improved. Happy to see IPv6-native fixes and changes for Docker. I hope it keeps getting better and better over time, as more and more IPv6-specific features/sub-protocols get introduced into the networking world.

@daryll-swer
Author

Just a quick update to let everyone in the community know that I am collaborating with Docker Inc.'s team on writing the networking (IPv6) documentation, which will cover some key implementation details. However, due to a busy schedule with my consulting work, it may take some time before I can fully focus on this.

It will happen, and I will do my best to make it happen. I'll keep you all posted as things progress. Thanks for your patience!
