Improve documentation #2

Closed
olljanat opened this issue Apr 12, 2024 · 11 comments

Comments

@olljanat
Owner

@daryll-swer picking your comment from docker/docs#19556 (comment) and continuing in here as it is a bit off-topic in there (can add summary there later when we have some conclusion).

> Needs better readme with network topology example.

Topology and workload examples will come later, as I'm once again studying what the best possible topology would be nowadays, and it differs a bit anyway depending on the applications used.

> How does this differ from simply using FRR for advertising the ranges to the upstream router or layer 3 leaf/ToR switch?

The biggest advantage is direct integration with Docker, so the user doesn't need to deal with FRR configuration.
Additionally, before this plugin you weren't able to advertise a load balancer IP only from the nodes which have a working copy of the container running, like you can in K8s.

> And is your plugin for Docker Swarm mode or something else?

It should be possible to use this with and without Swarm mode (I need to add Swarm mode examples later). Most importantly, you easily get L3 connectivity between multiple Docker nodes, even when they are in different Docker Swarms. That is needed by applications which require a multi-datacenter setup, as that is not supported by Swarm: moby/moby#38748 (comment). Technically it works, but you end up having a VXLAN overlay, which I'm trying to get rid of.

Also, it has been a long-running issue that you cannot reserve a static IP for a service in Swarm mode (moby/moby#24170), which is now possible with this.

> The problem with that Meta+Arista RFC is that it complicates route filters and potential loop avoidance issues at scale

There might be some large-scale challenges for sure which I cannot imagine, but if we follow your IPv6 subnetting guideline

> Top-of-Rack (ToR) switches – one /48 for all ToRs in a site, /55 per rack, then /56 per ToR device.

route filters should be quite simple. You only need to allow those, plus some subnet for load balancer IPs which is common to all of them.

Looking at my old config from a real-world implementation, which was fully functional but never actually went live with IPv6: I had reserved a /56 for the K8s cluster and took a /64 for each node from it. For load balancing with ECMP there was one cluster-level /112 subnet (because of some bug/limitation in K8s it cannot be bigger, IIRC).
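
As a rough illustration of the addressing plan described above, here is a minimal Python sketch. The `2001:db8:0:100::/56` prefix is a documentation placeholder, and reserving the last /64 of the cluster range for load balancer addresses is an assumption made for the example; neither comes from the original config. The `allowed_by_filter` helper is likewise hypothetical.

```python
import ipaddress

# Placeholder cluster prefix from the IPv6 documentation range (assumption).
cluster = ipaddress.ip_network("2001:db8:0:100::/56")

# A /56 splits into 2**(64 - 56) = 256 subnets of size /64.
subnets = list(cluster.subnets(new_prefix=64))

# Assumption: keep the last /64 for load balancing, give the rest to nodes.
node_subnets = subnets[:-1]
lb_pool = next(subnets[-1].subnets(new_prefix=112))

def allowed_by_filter(prefix: str) -> bool:
    """Accept node /64s from the cluster range and /128s inside the LB /112."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen == 64:
        return net in node_subnets
    if net.prefixlen == 128:
        return net.subnet_of(lb_pool)
    return False

print(len(node_subnets), node_subnets[0], lb_pool)
```

With a plan like this the upstream filter only has to know two ranges: the cluster prefix and the shared load balancer pool.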

> it leads down the horrible path of eBGP over eBGP and iBGP over eBGP

It sounds like people have misunderstood the concept. The RFC mentions the possibility of using overlay technologies for those applications which require layer 2 connectivity; it does not recommend it. However, all that EVPN stuff goes over my head, as I don't need to deal with it.

@daryll-swer

> Topology and workload examples will come later, as I'm once again studying what the best possible topology would be nowadays, and it differs a bit anyway depending on the applications used.

Most data centre networks are homogenous in designs, because use-cases, scalability needs, and traffic flow patterns are similar if not identical. They all use variants of clos design.

However, there's a design I worked with that simplified things further; it is similar to an SP network. But again, this is outside the scope of either GitHub issue here, as we are now discussing network architecture problems, unrelated to the application at hand (Docker or K8s).

> There might be some large-scale challenges for sure which I cannot imagine, but if we follow your IPv6 subnetting guideline

> route filters should be quite simple. You only need to allow those, plus some subnet for load balancer IPs which is common to all of them.

> Looking at my old config from a real-world implementation, which was fully functional but never actually went live with IPv6: I had reserved a /56 for the K8s cluster and took a /64 for each node from it. For load balancing with ECMP there was one cluster-level /112 subnet (because of some bug/limitation in K8s it cannot be bigger, IIRC).

> It sounds like people have misunderstood the concept. The RFC mentions the possibility of using overlay technologies for those applications which require layer 2 connectivity; it does not recommend it. However, all that EVPN stuff goes over my head, as I don't need to deal with it.

The IP subnetting model has nothing to do with RFC 7938 or eBGP-driven designs in general; it is completely independent. And no, people did not misunderstand it. What this RFC promotes is eBGP-only everywhere, including to establish neighbour adjacency, where every device has a unique private ASN, as we can see in section 5.2.1.

This is a design approach I personally do not recommend, and I have worked full-time in a data centre network spanning countries. We use eBGP design, but not the way Meta and Arista sold it. I am a big promoter of eBGP-driven designs, but never eBGP-only, this is terrible for scale and overhead, this design is the same one mentioned in the beginning of this comment, it's SP-like.

@olljanat
Owner Author

> What this RFC promotes is eBGP-only everywhere, including to establish neighbour adjacency, where every device has a unique private ASN, as we can see in section 5.2.1.

Ah, I had missed that they actually propose a separate ASN for every device. The model in the Calico documentation looks to be an already modified version of it, as they use an ASN per rack, and I actually went even further and used only one ASN per datacenter, as the ones I'm dealing with are quite small, just a couple of racks per DC anyway.

> I am a big promoter of eBGP-driven designs, but never eBGP-only, this is terrible for scale and overhead, this design is the same one mentioned in the beginning of this comment, it's SP-like.

It looks like we have actually ended up at a similar conclusion about the best possible configuration for this kind of environment. I'm just a bit less sure about it at the moment.

> However, there's a design I worked with that simplified things further; it is similar to an SP network.

Have you written something publicly about this? I would assume that IPv6 is part of it, as that is where our discussion started?

> But again, this is outside the scope of either GitHub issue here, as we are now discussing network architecture problems, unrelated to the application at hand (Docker or K8s).

I'm not actually sure it is out of scope, at least here, as I would like to document examples based on best practices, at both small and larger scale. For clarity, this plugin is about standalone Docker hosts and Docker Swarm mode. For K8s there is already enough documentation on the Calico etc. sites, but K8s is a very complex solution, so I would like to see the same simplified networking on the Docker side, which is why this plugin now exists.

@daryll-swer

daryll-swer commented Apr 12, 2024

> Ah, I had missed that they actually propose a separate ASN for every device. The model in the Calico documentation looks to be an already modified version of it, as they use an ASN per rack, and I actually went even further and used only one ASN per datacenter, as the ones I'm dealing with are quite small, just a couple of racks per DC anyway.

Yeah, they used a modified version, easier than the RFC variant. But not something I'd recommend or ever use myself.

> It looks like we have actually ended up at a similar conclusion about the best possible configuration for this kind of environment. I'm just a bit less sure about it at the moment.

Similar yes.

> Have you written something publicly about this?

I wrote a short comment here, but this is not enough to build a network from. Writing network architecture guides takes a lot of time (years) when you add up job time/personal time/blog writing time/no work time etc, I'm currently trying to finish an article piece on comprehensive carrier-scale out-of-band network design that's never been shared in the public domain before (vendors/network consultants don't share knowledge of such intricate nature for free, they bill you to design such things).

My plan was something like this:

  1. Publish blog post on OOB Design
  2. Publish blog post of layer 3/eBGP driven SP networks with SR-MPLS/EVPN
  3. Possibly blog post on DC networks that are SP-Like, i.e. SR-MPLS/EVPN is used for interconnects of DCs for example.

But as you can probably understand, I'm just a guy who writes this stuff in my free time, it pays me no money, and takes years to write.

My IPv6 Architecture guide took about 15 months to complete for reference.

> I would assume that IPv6 is part of it, as that is where our discussion started?

Everything is IPv6-native, IPv6-first and IPv6-only (where possible) in all of my professional works and writings and designs. The out-of-band piece is the most IPv6-specific blog post on network architecture that probably I've ever written (yet) or ever will.

And yes, in the case of Docker/K8s (aka enterprise and data centre networking), all of these elements of the network tie together down to layer 7 on the hosts, from edge routers, to core, to OOB/MGMT, to DWDM/fibre transport networks, to website.com running on nginx in server node 1 over Docker/K8s.

However, I should make it clear, I'm not an expert on K8s/Docker/Application-specific software, but the network concept is largely the same (everything is TCP/IP, OSI model etc). I don't write code (too much work/too lazy), but can tell you if a network implementation in the code is broken or not if it's within my area of expertise.

This work of IPv6-only/IPv6-native (NAT-less) K8s/Docker will likely take months if not years to properly realise over long-term discussions.

> I'm not actually sure it is out of scope, at least here, as I would like to document examples based on best practices, at both small and larger scale. For clarity, this plugin is about standalone Docker hosts and Docker Swarm mode. For K8s there is already enough documentation on the Calico etc. sites, but K8s is a very complex solution, so I would like to see the same simplified networking on the Docker side, which is why this plugin now exists.

You are right. K8s/Docker networking requires a properly architected underlay network to be truly NAT-free, overlay-free, L3-only and IPv6-native/first. So I think we cannot solve this problem without discussing network architecture and routing.

@olljanat
Owner Author

> But as you can probably understand, I'm just a guy who writes this stuff in my free time, it pays me no money, and takes years to write.

I know the feeling. Most of my contributions to open source projects also happen in my free time. It is the same with this plugin: it started as a free-time project, but luckily it looks like it will be good enough to solve some real-world challenges for customers I'm working with, so maybe I will be able to charge for at least part of that work.

> However, I should make it clear, I'm not an expert on K8s/Docker/Application-specific software, but the network concept is largely the same (everything is TCP/IP, OSI model etc). I don't write code (too much work/too lazy), but can tell you if a network implementation in the code is broken or not if it's within my area of expertise.

A network-related question, then. What do you think about my way of avoiding NAT in this solution?
I mean that containers have two NICs, one for normal communication and a second one with /32 + /128 masks, plus a local route on the host pointing to the bridge interface which is connected to the container. That route is eventually advertised with BGP.

At least it works and avoids NAT completely, but I'm not sure if that is the correct way to solve it.

An idea which I'm even a bit proud of is that those routes are only added after the container is up and running (and in a healthy state, if a health check is enabled for the container), and that on shutdown the routes are removed first and only after that is the container stopped. This gives the application an opportunity to finish the processing it was doing and makes sure that there is zero downtime from the user's point of view (assuming that there is another working copy of the application running on another Docker host).
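
The ordering described above can be sketched in a few lines of Python. This is only an illustration of the sequence, not the plugin's actual code: `bring_up`, `bring_down` and the callbacks passed to them are hypothetical names, and the callbacks here just record the call order instead of touching Docker or the routing table.

```python
from typing import Callable, List

def bring_up(start: Callable[[], None], is_healthy: Callable[[], bool],
             add_route: Callable[[], None]) -> bool:
    """Start the container, then add the route only once it reports healthy."""
    start()
    if not is_healthy():
        return False  # no route, so no traffic reaches an unready container
    add_route()
    return True

def bring_down(remove_route: Callable[[], None], stop: Callable[[], None]) -> None:
    """Withdraw the route first so new traffic goes elsewhere, then stop."""
    remove_route()
    stop()

# Demonstration with stand-in callbacks that only record the call order.
log: List[str] = []
bring_up(lambda: log.append("start"), lambda: True, lambda: log.append("add route"))
bring_down(lambda: log.append("remove route"), lambda: log.append("stop"))
print(log)
```

The point of the design is that the route exists strictly inside the window in which the container can serve traffic.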

@daryll-swer

daryll-swer commented Apr 13, 2024

> A network-related question, then. What do you think about my way of avoiding NAT in this solution?
> I mean that containers have two NICs, one for normal communication and a second one with /32 + /128 masks, plus a local route on the host pointing to the bridge interface which is connected to the container. That route is eventually advertised with BGP.

> At least it works and avoids NAT completely, but I'm not sure if that is the correct way to solve it.

You cannot avoid NAT for IPv4, because how many people have a /8?

For IPv6, there's no NAT in the default Docker behaviour: you just configure it without NAT as is, using the bridge. Every container is attached to the bridge with one veth, and it gets a /128 GUA. One bridge per host. All container veths connect to that bridge. The bridge has a /64 GUA assigned to it.

There's no need for custom code to advertise /128s; FRR, GoBGP or any BGP daemon would do that if you configure the route filters correctly.

This means that if a /64 is routed to the Docker host, you simply create a route filter for that /64 to advertise only /128s, and obviously those /128s will only get advertised if the container is live; if it isn't, then the /128 does not exist in the local routing table of the host.

Something like this using FRR:

ipv6 route 2400:7060:2:108::/64 blackhole 254
ipv6 prefix-list eBGP-OUT seq 1 permit 2400:7060:2:108::/64 le 64
ipv6 prefix-list eBGP-OUT seq 2 permit 2400:7060:2:108::/64 le 128
route-map eBGP-OUT permit 10
match ipv6 address prefix-list eBGP-OUT
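
To make the matching behaviour of such a prefix-list entry easier to reason about, here is a small Python model of a single `permit BASE [ge N] [le M]` entry. It is a sketch of the usual prefix-list semantics, not FRR code, and the `permits` function is a hypothetical helper.

```python
import ipaddress
from typing import Optional

BASE = "2400:7060:2:108::/64"

def permits(route: str, base: str, ge: Optional[int] = None,
            le: Optional[int] = None) -> bool:
    """Model one 'permit BASE [ge N] [le M]' prefix-list entry."""
    net = ipaddress.ip_network(route)
    base_net = ipaddress.ip_network(base)
    if not net.subnet_of(base_net):
        return False  # route is not inside the base prefix at all
    if ge is None and le is None:
        return net.prefixlen == base_net.prefixlen  # exact match only
    low = ge if ge is not None else base_net.prefixlen
    high = le if le is not None else net.max_prefixlen
    return low <= net.prefixlen <= high

print(permits("2400:7060:2:108::4/128", BASE, le=128))  # container host route
print(permits("2400:7060:2:108::/64", BASE, le=128))    # the aggregate itself
print(permits("2400:7060:2:108::/64", BASE, ge=128))    # aggregate with ge 128
```

In this model `le 128` lets both the /64 aggregate and the /128 host routes through, while an entry with `ge 128` would match only the host routes.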

> An idea which I'm even a bit proud of is that those routes are only added after the container is up and running (and in a healthy state, if a health check is enabled for the container), and that on shutdown the routes are removed first and only after that is the container stopped. This gives the application an opportunity to finish the processing it was doing and makes sure that there is zero downtime from the user's point of view (assuming that there is another working copy of the application running on another Docker host).

This is probably the only part that needs a custom implementation.

@olljanat
Owner Author

> This means that if a /64 is routed to the Docker host, you simply create a route filter for that /64 to advertise only /128s, and obviously those /128s will only get advertised if the container is live; if it isn't, then the /128 does not exist in the local routing table of the host.

Hmm, this is not the behaviour I see. There is only one /64 route in the host's local routing table, to docker0 (the default bridge). There are no /128 routes even when there are containers running, so a BGP daemon on the host can only advertise that /64 (the Docker daemon is started with the parameters --ipv6 --fixed-cidr-v6 2001:db8:1::/64):

$ ip route show table all | grep docker0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
local 172.17.0.1 dev docker0 table local proto kernel scope host src 172.17.0.1 
broadcast 172.17.255.255 dev docker0 table local proto kernel scope link src 172.17.0.1 
2001:db8:1::/64 dev docker0 proto kernel metric 256 pref medium
fe80::/64 dev docker0 proto kernel metric 256 pref medium
anycast 2001:db8:1:: dev docker0 table local proto kernel metric 0 pref medium
local 2001:db8:1::1 dev docker0 table local proto kernel metric 0 pref medium
anycast fe80:: dev docker0 table local proto kernel metric 0 pref medium
local fe80::1 dev docker0 table local proto kernel metric 0 pref medium
local fe80::42:8dff:fe64:ece8 dev docker0 table local proto kernel metric 0 pref medium
multicast ff00::/8 dev docker0 table local proto kernel metric 256 pref medium

$ docker exec -it test ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.2  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 2001:db8:1::242:ac11:2  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::42:acff:fe11:2  prefixlen 64  scopeid 0x20<link>
        ether 02:42:ac:11:00:02  txqueuelen 0  (Ethernet)
        RX packets 16  bytes 1824 (1.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 12  bytes 936 (936.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

That is exactly what I currently do with my solution for the host-specific subnet, and that part you can replace with any BGP daemon.

But then there are those /128 load balancer subnets on top of that, which are the same on all hosts which have a running copy of that container. The alternative, of course, would be to have static IPs on those containers and an external load balancer, but then you need NAT...

@daryll-swer

daryll-swer commented Apr 13, 2024

Ah, I missed that, yes, we can't see /128 because the /64 is a connected route to a layer 2 domain (the bridge) and anything behind that is just a “connected route” behind the /64-bridge.

What could be done is for your plugin to artificially inject the /128s into the local routing table and set next-hop as the bridge itself, that should in theory work. I'm not labbing any of this out and working off theory. But it needs to be tested.

Container comes live > inject the /128 route:
ip route add 2400:7060:2:108::4/128 dev br-4a5a71f92364
If this works, there's no need for any double veth/bridge etc.
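
As a sketch of how a tool could generate that command per container, the Python snippet below builds (but does not run) the `ip -6 route` invocation and checks that the address really belongs to the bridge's subnet. The `host_route_cmd` helper is hypothetical; the address and bridge name are the ones from the example above.

```python
import ipaddress
from typing import List

def host_route_cmd(action: str, address: str, bridge: str, subnet: str) -> List[str]:
    """Build (not run) the ip(8) command for a container's /128 host route."""
    if action not in ("add", "del"):
        raise ValueError("action must be 'add' or 'del'")
    addr = ipaddress.IPv6Address(address)
    if addr not in ipaddress.IPv6Network(subnet):
        raise ValueError(f"{addr} is outside {subnet}")
    return ["ip", "-6", "route", action, f"{addr}/128", "dev", bridge]

cmd = host_route_cmd("add", "2400:7060:2:108::4", "br-4a5a71f92364",
                     "2400:7060:2:108::/64")
print(" ".join(cmd))
```

On a real host the returned list could be passed to `subprocess.run(cmd, check=True)` with root privileges; like the command above, this is untested theory.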

I personally do not use “default Docker bridge”, IIRC, in their official docs, they recommend you use custom-made bridge so that you can control how the IPAM works if required, it's especially useful for native IPv6, because I like to assign the bridge's IP as ::1 instead of a random address. I use Docker compose like below for my personal use:

networks:
  docker_bridge:
    driver: bridge
    enable_ipv6: true
    ipam:
      driver: default
      config:
        - subnet: 2400:7060:2:108::/64
          gateway: 2400:7060:2:108::1

#Watchtower (auto-update of containers)
services:
    watchtower:
        restart: unless-stopped
        image: containrrr/watchtower
        container_name: watchtower
        networks:
            docker_bridge:
                ipv6_address: 2400:7060:2:108::5
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock

> That is exactly what I currently do with my solution for the host-specific subnet, and that part you can replace with any BGP daemon.

> But then there are those /128 load balancer subnets on top of that, which are the same on all hosts which have a running copy of that container. The alternative, of course, would be to have static IPs on those containers and an external load balancer, but then you need NAT...

I'm not following, what do you mean by “host specific subnet” and “those /128 load balancer subnets”?

As far as I know, even on K8s, there's no true NAT-less way to do ECMP/Load balancing across multiple nodes like that, because the pod's IP is ephemeral (just like on Docker container).

@olljanat
Owner Author

I really enjoy our discussion; however, I get the feeling that it might be beneficial to take a couple of days' break from it and meanwhile study each other's work.

You have written a lot of good stuff about IPv6, but it takes a while and multiple readings to really understand it.
I, however, have spent a lot of time in the Docker and K8s codebases and know their capabilities and limitations quite well, and am probably better at solving those in the form of code than at explaining the concepts.

Yes, I took only one week to write this plugin, but that was possible only because of years of studying this stuff.
So perhaps you should read the readme a little more carefully, really test it, and pay attention to the interface and route configs in different states.

That being said, some comments on your latest message:

> What could be done is for your plugin to artificially inject the /128s into the local routing table and set next-hop as the bridge itself, that should in theory work.

This is exactly what it does at the moment. On the load balancing bridge there is no IP at all on the host, just a route to that interface.

> I personally do not use “default Docker bridge”,

Me neither, normally, but I got the feeling from your last message that there might be some feature in it which I had missed.

> As far as I know, even on K8s, there's no true NAT-less way to do ECMP/Load balancing across multiple nodes like that, because the pod's IP is ephemeral (just like on Docker container).

True, in K8s there is always kube-proxy, which eventually does NAT from the LB IP to the pod IP. I never found a solution to it when I played with K8s. The challenge is that a route to the container/pod does not help if it does not listen on that IP. However, as you asked in docker/docs#19556 (comment)

> Can we do load-balancing with native-only IPv6 without any Destination-NAT/NAT66 on K8s implementations?

it made me think that it must be possible, and as a matter of fact it is in Docker but not in K8s. That is because K8s pods are limited to one NIC per pod, whereas Docker containers can have multiple NICs, and I utilize that feature in my plugin. So here LB is possible without NAT at all.

> If this works, there's no need for any double veth/bridge etc.

In general on Linux it is possible to have just one NIC inside the container with two IPs: the normal bridge IP with a /64 and the LB IP with a /128. However, it would need heavy refactoring of the logic of how Docker does networking, so I don't see that as a realistic option.

Have a nice weekend 😃

@daryll-swer

daryll-swer commented Apr 14, 2024

> So perhaps you should read the readme a little more carefully, really test it, and pay attention to the interface and route configs in different states.

Your readme doesn't explain the underlying network assumption at all that you've made in your mind.

Let's break this down:

[global.config]
  as = 64512
  router-id = "192.168.8.137"

[[dynamic-neighbors]]
  [dynamic-neighbors.config]
    prefix = "192.168.8.0/24"
    peer-group = "bgp-lb"

Why is router-id 192.168.8.137? Why is the router ID within the prefix 192.168.8.0/24 to begin with? What is 192.168.8.0/24? Is this the link-prefix of the Docker host? Or is this a routed prefix that's routed from the upstream network to the Docker host and this is utilised for Docker container? There's no explanation of the assumption you've made.

If this is a routed prefix, then router-ID should really just be the link-prefix's actual address between the Host and the underlying network.

docker network create \
  --driver bridge \
  --subnet 172.23.1.0/24 \
  --gateway 172.23.1.1 \
  --ipv6 \
  --subnet 2001:0db8:0000:1001::/64 \
  --gateway 2001:0db8:0000:1001::1 \
  -o com.docker.network.bridge.name=bgplb_gwbridge \
  -o com.docker.network.bridge.enable_icc=false \
  -o com.docker.network.bridge.enable_ip_masquerade=false \
  --label bgplb_advertise=true \
  bgplb_gwbridge

What's the relationship between 172.23.1.0/24 and 192.168.8.0/24? Is the former “local-only” un-routed NATting pool (Default Docker behaviour)?

What is 2001:0db8:0000:1001::/64? Is this a routed prefix from the underlying network to the Docker host?

If the host is IPv6-capable anyway, why is BGP session IPv4? Use RFC8950.

You need to explain your assumptions clearly. And also I would encourage you to share all/any examples with Docker compose.

> True, in K8s there is always kube-proxy, which eventually does NAT from the LB IP to the pod IP.

Calico DSR might be the solution. Maybe you can test and confirm.

> In general on Linux it is possible to have just one NIC inside the container with two IPs: the normal bridge IP with a /64 and the LB IP with a /128.
> However, it would need heavy refactoring of the logic of how Docker does networking, so I don't see that as a realistic option.

Yes, that's the fundamental nature of IP networking. You can always assign multiple link-prefixes or IP addresses to an interface.

@olljanat
Owner Author

Honestly, now I don't understand at all. I thought that I was talking with a network expert who can read a simple BGP lab config without needing every single parameter explained.

192.168.8.0/24 just happens to be my home network, where DHCP has given the IP 192.168.8.137 to the server which acts as the BGP router while I'm developing this plugin, and with that simple config it just accepts anyone in the same network connecting to it and advertising any prefixes to it. Another server, where I do the actual coding/testing of the plugin, has got the IP 192.168.8.40 from DHCP, and for simplicity I use each server's own IP as the router ID on both of them. The config is just shared as is, so it gives more context for those BGP debug messages.

172.23.1.0/24 and 2001:0db8:0000:1001::/64 are just randomly selected networks which I used for testing. And as mentioned in #1 (comment), I don't have an IPv6-capable test environment at the moment, but that is irrelevant, as this plugin will accept any networks the user chooses as long as those prefixes are allowed by the BGP peer. The actual magic happens with those driver options, which are already explained.

> Calico DSR might be the solution.

Based on their documentation there is still DNAT in that scenario too.

> Yes, that's the fundamental nature of IP networking. You can always assign multiple link-prefixes or IP addresses to an interface.

Yes, but it is not allowed by Docker. A container can have multiple networks but only one IP per network, which you can easily see by checking docker inspect on any container. So I stick to my conclusion that the solution I chose is the only possibility to completely avoid NAT.
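
For readers who want to check this on their own containers, a minimal Python sketch follows. It parses JSON shaped like `docker inspect` output and lists the addresses per attached network. The sample data is trimmed and hypothetical (the `lb` network and its addresses are made up for illustration); only the `NetworkSettings.Networks.<name>.IPAddress` and `GlobalIPv6Address` fields are relied on.

```python
import json

# Trimmed, hypothetical sample in the shape of `docker inspect` output.
# The "lb" network name and its addresses are invented for illustration.
SAMPLE = """
[{"NetworkSettings": {"Networks": {
  "bridge": {"IPAddress": "172.17.0.2",
             "GlobalIPv6Address": "2001:db8:1::242:ac11:2"},
  "lb":     {"IPAddress": "192.0.2.10",
             "GlobalIPv6Address": "2001:db8:ffff::10"}
}}}]
"""

def addresses_per_network(inspect_json: str) -> dict:
    """Map each attached network to its (at most one) IPv4 and IPv6 address."""
    container = json.loads(inspect_json)[0]
    result = {}
    for name, net in container["NetworkSettings"]["Networks"].items():
        candidates = (net.get("IPAddress"), net.get("GlobalIPv6Address"))
        result[name] = [a for a in candidates if a]
    return result

print(addresses_per_network(SAMPLE))
```

Each network entry carries a single IPv4 field and a single global IPv6 field, which is why a second address of the same family needs a second network attachment.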

@daryll-swer

daryll-swer commented Apr 14, 2024

> Honestly, now I don't understand at all. I thought that I was talking with a network expert who can read a simple BGP lab config without needing every single parameter explained.

“Network expert” cannot read minds and any genuine network engineer does not like mind reading, we rely on network documentation/topology examples/scenario examples and our favourite PCAPs (when troubleshooting) on what goes where and how and generally (but not always) why. Don't believe me? Try asking anyone else to read undocumented assumptions/scenarios/topologies based off config-only dump/example.

If you want to resort to personal attacks, then this “collaboration” stops here, I don't build apps and I get nothing from this project of yours anyway. Good luck.
