Improve documentation #2
Most data centre networks are homogenous in design, because use-cases, scalability needs, and traffic flow patterns are similar if not identical. They all use variants of Clos design. However, there's a design I worked with that simplified things further; it is similar to an SP network. But again, this is outside the scope of either GitHub issue here, as we are now discussing network architecture problems unrelated to the application at hand (Docker or K8s).
The IP subnetting model has nothing to do with RFC 7938 or eBGP-driven designs in general; it is completely independent. And no, people did not misunderstand it. What this RFC promotes is eBGP-only everywhere, including for establishing neighbour adjacency, where every device has a unique private ASN, as we can see in section 5.2.1. This is a design approach I personally do not recommend, and I have worked full-time on a data centre network spanning countries. We use an eBGP design, but not the way Meta and Arista sold it. I am a big promoter of eBGP-driven designs, but never eBGP-only; that is terrible for scale and overhead. This design is the same one mentioned at the beginning of this comment: it's SP-like.
Ah, I had missed that they actually propose a separate ASN for every device. The model in the Calico documentation looks to be an already modified version of it, as they use one ASN per rack, and I actually went even further and used only one ASN per data centre, as the ones I'm dealing with are quite small, just a couple of racks per DC anyway.
Looks like we actually ended up at a similar conclusion about the best possible configuration for this kind of environment. I'm just a bit less sure about it at the moment.
Have you written publicly about this? I would assume that IPv6 is part of it, as that is where our discussion started?
I'm not actually sure it is out of scope, at least here, as I would like to document examples based on best practices, at both small and larger scale. For clarity, this plugin is about standalone Docker hosts and Docker Swarm mode. For K8s there is already enough documentation on the Calico etc. sites, but K8s is a very complex solution, so I would like to see the same simplified networking on the Docker side, which is why this plugin now exists.
Yeah, they used a modified version, easier than the RFC variant. But not something I'd recommend or ever use myself.
Similar yes.
I wrote a short comment here, but this is not enough to build a network from. Writing network architecture guides takes a lot of time (years) when you add up job time/personal time/blog writing time/no-work time etc. I'm currently trying to finish an article on comprehensive carrier-scale out-of-band network design that's never been shared in the public domain before (vendors/network consultants don't share knowledge of such an intricate nature for free; they bill you to design such things). My plan was something like this:
But as you can probably understand, I'm just a guy who writes this stuff in my free time; it pays me no money and takes years to write. For reference, my IPv6 Architecture guide took about 15 months to complete.
Everything is IPv6-native, IPv6-first and IPv6-only (where possible) in all of my professional work, writings and designs. The out-of-band piece is probably the most IPv6-specific blog post on network architecture that I've ever written (yet) or ever will. And yes, in the case of Docker/K8s (i.e. enterprise and data centre networking), all of these elements of the network tie together down to layer 7 on the hosts: from edge routers, to core, to OOB/MGMT, to DWDM/fibre transport networks, to website.com running on nginx on server node 1 under Docker/K8s. However, I should make it clear that I'm not an expert on K8s/Docker/application-specific software, but the network concepts are largely the same (everything is TCP/IP, the OSI model, etc.). I don't write code (too much work/too lazy), but I can tell you whether a network implementation in code is broken or not if it's within my area of expertise. This work of IPv6-only/IPv6-native (NAT-less) K8s/Docker will likely take months if not years to properly realise over long-term discussions.
You are right. K8s/Docker networking requires a properly architected underlay network to be truly NAT-free, overlay-free, L3-only, IPv6-native/first. So I think we cannot solve this problem without discussing network architecture and routing.
I know the feeling. Most of my contributions to open source projects also happen in my free time. It is the same with this plugin: it started as a free-time project, but luckily it looks like it will be good enough to solve some real-world challenges for customers I'm working with, so maybe I will be able to charge for at least part of that work.
A network-related question then: what do you think about my way of avoiding NAT in this solution? At least it works and avoids NAT completely, but I'm not sure it is the correct way to solve it. The idea which I'm even a bit proud of is that routes are only added after the container is up and running (and in a healthy state, if a health check is enabled for the container), and that on shutdown the routes are removed first and only then is the container stopped. This gives the application an opportunity to finalise the processing it was doing and makes sure there is zero downtime from the user's point of view (assuming there is another working copy of the application running on another Docker host).
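As a rough shell sketch of that sequence (the service name, bridge name, address and health check here are assumptions for illustration, not the plugin's actual implementation):

```sh
# Startup: wait for the container to report healthy, only then add the route
# that attracts traffic to this host.
docker compose up -d web
until [ "$(docker inspect -f '{{.State.Health.Status}}' web)" = "healthy" ]; do
  sleep 1
done
ip -6 route replace 2001:db8:0:ff::80/128 dev br-lb

# Shutdown: withdraw the route first, let in-flight requests drain,
# then stop the container.
ip -6 route del 2001:db8:0:ff::80/128 dev br-lb
sleep 5
docker compose stop web
```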
You cannot avoid NAT for IPv4, because how many people have a /8? For IPv6, there's no NAT in the default Docker behaviour other than what you get by just configuring it without NAT as is: using the bridge, every container is attached to the bridge via one veth and gets a /128 GUA. One bridge per host, all container veths connect to that bridge, and the bridge has a /64 GUA assigned to it. There's no need for custom code to advertise /128s; FRR, GoBGP or any BGP daemon would do that if you configure the route filters correctly. This means that if a /64 is routed to the Docker host, you simply create a route filter for that /64 to advertise only /128s, and obviously those /128s will only get advertised if the container is live; if the container isn't, then the /128 does not exist in the local routing table of the host. Something like this using FRR:
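(A minimal sketch with assumed prefixes and ASNs, not the author's original snippet: redistribute kernel routes into BGP, but permit only /128s within the container /64, never the /64 itself.)

```
! Assumed: 2001:db8:0:1::/64 is routed to this Docker host,
! 2001:db8:ffff::1 / AS 65000 is the upstream peer.
router bgp 65001
 neighbor 2001:db8:ffff::1 remote-as 65000
 address-family ipv6 unicast
  neighbor 2001:db8:ffff::1 activate
  redistribute kernel route-map DOCKER-ONLY-128S
 exit-address-family
!
! Match only host routes (/128) inside the container /64
ipv6 prefix-list DOCKER-128S seq 10 permit 2001:db8:0:1::/64 ge 128
!
route-map DOCKER-ONLY-128S permit 10
 match ipv6 address prefix-list DOCKER-128S
```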
This is probably the only part that needs custom implementation.
Hmm, this is not the behaviour I see. There is only one /64 route in the host's local routing table, pointing to
That is exactly what I currently do in my solution for the host-specific subnet, and that part you can replace with any BGP daemon. But then there are those /128 load balancer addresses on top of that, which are the same on all hosts that have a running copy of that container. The alternative would of course be to have static IPs on those containers and an external load balancer, but then you need NAT...
Ah, I missed that. Yes, we can't see the /128s because the /64 is a connected route to a layer 2 domain (the bridge), and anything behind that is just a “connected route” behind the /64 bridge. What could be done is for your plugin to artificially inject the /128s into the local routing table with the next-hop set as the bridge itself; that should in theory work. I'm not labbing any of this out and am working off theory, so it needs to be tested. Container comes live > inject the /128 route.

I personally do not use the default Docker bridge. IIRC, their official docs recommend you use a custom-made bridge so that you can control how the IPAM works if required. It's especially useful for native IPv6, because I like to assign the bridge's IP as ::1 instead of a random address. I use Docker compose like below for my personal use:
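As a rough sketch of both ideas (bridge name, addresses and service name are assumed, not taken from this thread), the /128 injection would amount to a host route pointing at the Docker bridge:

```sh
# Once the container is live, add a host route for its address via the bridge.
ip -6 route replace 2001:db8:0:1::10/128 dev br-docker
```

And a minimal compose file along the lines described, with a custom IPv6-enabled bridge whose gateway is ::1 of the /64:

```yaml
# Illustrative only; prefixes are documentation ranges, not real allocations.
services:
  web:
    image: nginx:alpine
    networks:
      - v6bridge

networks:
  v6bridge:
    driver: bridge
    enable_ipv6: true
    ipam:
      config:
        - subnet: 172.20.0.0/24          # IPv4 subnet still expected by default Docker
        - subnet: 2001:db8:0:1::/64
          gateway: 2001:db8:0:1::1       # bridge address pinned to ::1
```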
I'm not following: what do you mean by “host-specific subnet” and “those /128 load balancer subnets”? As far as I know, even on K8s, there's no true NAT-less way to do ECMP/load balancing across multiple nodes like that, because the pod's IP is ephemeral (just like a Docker container's).
I really enjoy our discussion; however, I have a feeling it might be beneficial to take a couple of days' break from it and meanwhile study each other's work. You have written a lot of good stuff about IPv6, but it takes a while and multiple readings to really understand it. Yes, it took me only one week to write this plugin, but that was possible only because of years of studying this stuff. That being said, some comments on your latest message:
This is exactly what it does at the moment. On the load balancing bridge there is no IP at all on the host, just a route to that interface.
Me neither, normally, but I got the feeling from your last message that there might be some feature in it which I have missed.
True, in K8s there is always kube-proxy, which eventually does NAT from the LB IP to the pod IP. I never found a solution to it when I played with K8s. The challenge is that a route to the container/pod does not help if it does not listen on that IP. However, as you asked in docker/docs#19556 (comment)
it got me thinking that it must be possible, and as a matter of fact it is in Docker, but not in K8s. That is because K8s pods are limited to one NIC per pod, while Docker containers can have multiple NICs, and I utilise that feature in my plugin. So here LB is possible without NAT at all.
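For illustration of the multi-NIC point (network and container names are assumed, and this is not the plugin's own mechanism), attaching a second, load-balancer network to a container is plain Docker CLI:

```sh
# Create a separate IPv6-enabled network for the LB address range...
docker network create --ipv6 --subnet 2001:db8:0:ff::/64 lb-net
# ...and attach it to an already running container as a second NIC.
docker network connect lb-net web1
```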
In general, on Linux it is possible to have just one NIC inside with two IPs: the normal bridge IP from the /64 and the LB IP as a /128. However, it would need heavy refactoring of the logic of how Docker does networking, so I don't see that as a realistic option. Have a nice weekend 😃
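For comparison, the single-NIC alternative described above would amount to something like this inside the container's network namespace (addresses assumed, purely illustrative):

```sh
ip -6 addr add 2001:db8:0:1::10/64 dev eth0    # normal address from the bridge /64
ip -6 addr add 2001:db8:0:ff::80/128 dev eth0  # shared load balancer address as a /128
```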
Your README doesn't explain at all the underlying network assumptions you've made in your mind. Let's break this down:
Why is the router-id 192.168.8.137? Why is the router ID within the prefix 192.168.8.0/24 to begin with? What is 192.168.8.0/24? Is it the link-prefix of the Docker host? Or is it a routed prefix that's routed from the upstream network to the Docker host and utilised for Docker containers? There's no explanation of the assumptions you've made. If it is a routed prefix, then the router-ID should really just be the link-prefix's actual address between the host and the underlying network.
What's the relationship between 172.23.1.0/24 and 192.168.8.0/24? Is the former a “local-only”, un-routed NATting pool (default Docker behaviour)? What is 2001:0db8:0000:1001::/64? Is it a routed prefix from the underlying network to the Docker host? If the host is IPv6-capable anyway, why is the BGP session IPv4? Use RFC 8950. You need to explain your assumptions clearly. And I would also encourage you to share all/any examples as Docker Compose files.
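For reference, a minimal FRR sketch of the RFC 8950 suggestion: one IPv6 BGP session carrying both IPv4 and IPv6 routes, with IPv6 next hops for the IPv4 routes via the extended-next-hop capability (ASNs and the peer address are assumed):

```
router bgp 65001
 neighbor 2001:db8:ffff::1 remote-as 65000
 ! RFC 8950: allow IPv4 NLRI with an IPv6 next hop over this session
 neighbor 2001:db8:ffff::1 capability extended-nexthop
 address-family ipv4 unicast
  neighbor 2001:db8:ffff::1 activate
 exit-address-family
 address-family ipv6 unicast
  neighbor 2001:db8:ffff::1 activate
 exit-address-family
```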
Calico DSR might be the solution. Maybe you can test and confirm.
Yes, that's the fundamental nature of IP networking. You can always assign multiple link-prefixes or IP addresses to an interface.
Honestly, now I don't understand at all. I thought I was talking with a network expert who can read a simple BGP lab config without needing every single parameter explained.
Based on their documentation there is still DNAT in that scenario too.
Yes, but it is not allowed by Docker. A container can have multiple networks, but only one IP per network, which you can easily see by checking
A “network expert” cannot read minds, and no genuine network engineer likes mind reading; we rely on network documentation, topology examples, scenario examples and our favourite PCAPs (when troubleshooting) to know what goes where, how, and generally (but not always) why. Don't believe me? Try asking anyone else to read undocumented assumptions/scenarios/topologies based off a config-only dump/example. If you want to resort to personal attacks, then this “collaboration” stops here; I don't build apps and I get nothing from this project of yours anyway. Good luck.
@daryll-swer picking up your comment from docker/docs#19556 (comment) and continuing here, as it is a bit off-topic there (I can add a summary there later when we have some conclusion).
Topology + workload examples will come later, as I'm once again studying what the best possible topology would be nowadays, and it differs a bit depending on the applications used anyway.
The biggest advantage is direct integration with Docker, so the user doesn't need to deal with FRR configuration.
Additionally, before this plugin you weren't able to advertise a load balancer IP only from the nodes which have a working copy of the container running, the way it works in K8s.
It should be possible to use this with and without Swarm mode (I need to add Swarm mode examples later). Most importantly, you easily get L3 connectivity between multiple Docker nodes, even when they are in different Docker Swarms. That is needed by applications which require a multi-datacenter setup, as it is not supported by Swarm (moby/moby#38748 (comment)). Technically it works, but you end up with a VXLAN overlay, which I'm trying to get rid of.
It has also been a long-running issue that you cannot reserve a static IP for a service in Swarm mode (moby/moby#24170), which is now possible with this.
There might for sure be some large-scale challenges which I cannot imagine, but if we follow your IPv6 subnetting guideline
the route filters should be quite simple. You only need to allow those and some subnet for the load balancer IPs which is common to all of them.
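A hedged example of such a filter in FRR terms, using documentation prefixes for one node's container /64 plus a shared load balancer range (none of these values come from the actual setup):

```
! This node's container prefix, plus the shared LB range as /128s only
ipv6 prefix-list ADVERTISE seq 10 permit 2001:db8:0:10::/64
ipv6 prefix-list ADVERTISE seq 20 permit 2001:db8:0:ff::/64 ge 128
!
route-map ADVERTISE-OUT permit 10
 match ipv6 address prefix-list ADVERTISE
```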
Looking at my old config from a real-world implementation which was fully functional but never actually went live with IPv6: I had reserved a /56 for the K8s cluster and took a /64 for each node from it. For load balancing with ECMP there was one cluster-level /112 subnet (because of some bug/limitation in K8s it cannot be bigger, IIRC).
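Laid out with documentation prefixes (the real allocation wasn't shared, so this is purely illustrative), that scheme looks like:

```
2001:db8:aa00::/56         cluster allocation
  2001:db8:aa00::/64         node 1 container prefix
  2001:db8:aa00:1::/64       node 2 container prefix
  ...
  2001:db8:aa00:ff::/112     cluster-wide ECMP load balancer range
```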
Sounds like people have misunderstood the concept. The RFC mentions the possibility of using overlay technologies for those applications which require layer 2 connectivity; it does not recommend it. However, all that EVPN stuff goes over my head, as I don't need to deal with it.