Spike
Spike is a load balancer based on Google's Maglev (https://research.google.com/pubs/pub44824.html). There is as yet no open-source software that does this for us, so we're writing Spike ourselves.
For performance reasons, packets are randomly dispersed to a set of Spike nodes (equal-cost multi-path routing), which then route the packets to containers (called backends) over flannel. Spike's job is to ensure that packets associated with the same connection are routed to the same backend. This is necessary to maintain connection coherency: for example, an HTTP request is often split up into multiple TCP packets.
This is not a trivial task, since a service can run on multiple backends. When a packet is sent to a service, it is sent to a virtual IP address that identifies the service, not the backend. Spike needs a way to consistently choose the same backend from the backend pool associated with the service, despite the fact that packets from the same connection could be sent to different Spike nodes.
Theoretically, we could do this by coordinating a connection-to-backend assignment table across all Spike nodes, but that would be computationally expensive. Maglev instead hashes the connection-identifying information (a five-tuple consisting of the source IP address and port, the destination IP address and port, and whether IPv4 or IPv6 is used) and uses the hash to choose a backend. The backend choices are cached to improve performance.
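To make this concrete, here is a minimal Python sketch of per-connection backend selection with a connection cache. The names, the tuple layout, and the hash function are illustrative assumptions rather than Spike's actual code, and the plain modulo mapping is exactly the naive step that the next paragraph replaces with Maglev's lookup table.

```python
# A minimal sketch of per-connection backend selection with a connection
# cache; names, tuple layout, and hash function are illustrative assumptions.
import hashlib
from typing import Dict, List, Tuple

# (src_ip, src_port, dst_ip, dst_port, ip_version)
FiveTuple = Tuple[str, int, str, int, str]

def hash_five_tuple(ft: FiveTuple) -> int:
    """Hash the connection-identifying five-tuple to a stable integer."""
    key = "|".join(str(field) for field in ft).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

class ConnectionRouter:
    def __init__(self, backends: List[str]) -> None:
        self.backends = backends
        self.cache: Dict[FiveTuple, str] = {}  # remembered per-connection choices

    def pick_backend(self, ft: FiveTuple) -> str:
        # Reuse the cached choice so every packet of a connection goes to
        # the same backend, even when the backend pool changes underneath.
        if ft in self.cache:
            return self.cache[ft]
        # Naive modulo mapping; the next paragraph explains why Maglev
        # replaces this with its own lookup table.
        backend = self.backends[hash_five_tuple(ft) % len(self.backends)]
        self.cache[ft] = backend
        return backend
```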
There is, however, a problem with this approach. Simply mapping the hash to a backend by taking it modulo the number of backends would cause the mapping to change almost completely whenever the number of backends changes. This means that many connections would be dropped whenever we add containers to or remove them from a service, which we expect to do often. Maglev solves this with a special mapping algorithm that preserves as large a fraction of the mappings as possible when the number of backends changes.
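The following is a sketch of that mapping algorithm, the lookup-table population described in the Maglev paper, written with assumed hash functions; the table size m must be a prime larger than the number of backends, and all names here are illustrative.

```python
# A sketch of Maglev's lookup-table population (Algorithm 1 in the paper),
# with assumed hash functions; m must be a prime larger than the backend count.
import hashlib
from typing import List

def _h(name: str, salt: str) -> int:
    """Derive a stable pseudo-random integer from a backend name."""
    digest = hashlib.sha256((salt + name).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_maglev_table(backends: List[str], m: int = 65537) -> List[str]:
    if not backends:
        raise ValueError("need at least one backend")
    n = len(backends)
    # Each backend walks its own permutation of the table slots, derived
    # from its name, and claims the first slot that is still empty.
    offsets = [_h(b, "offset") % m for b in backends]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in backends]
    next_j = [0] * n                # how far each backend has walked
    table: List[str] = [""] * m     # slot index -> backend name
    filled = 0
    while True:
        for i, backend in enumerate(backends):
            slot = (offsets[i] + next_j[i] * skips[i]) % m
            while table[slot]:
                next_j[i] += 1
                slot = (offsets[i] + next_j[i] * skips[i]) % m
            table[slot] = backend
            next_j[i] += 1
            filled += 1
            if filled == m:
                # A packet's five-tuple hash modulo m indexes this table.
                # Rebuilding after adding or removing a backend leaves most
                # slots pointing at the backend they had before.
                return table
```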
Configuration changes, such as the set of backends in each backend pool, are propagated from the supervisor node to all Spike nodes as YAML files, over a protocol that does not yet exist.
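The wire protocol is still undecided, but a propagated configuration file might look roughly like the following; the field names and layout are purely hypothetical and shown only to illustrate the kind of information involved.

```yaml
# Hypothetical shape of a propagated configuration file; the field names
# and layout are illustrative, not a settled Spike format.
services:
  - name: web
    virtual_ip: 10.0.0.10
    backends:            # flannel addresses of the containers in this pool
      - 172.16.1.2
      - 172.16.1.3
      - 172.16.2.5
```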
When a machine receives a packet, it usually goes through a packet processing pipeline in the operating system. Since Spike operates in a rather non-traditional fashion, it has no need for this automatic packet processing. To improve performance, Spike bypasses the kernel, operating on the raw packets as they are received by the machine. This is implemented through Snabb, which also provides other packet processing utilities.
Health checking is done by Spike itself. Periodically, each Spike node will poll each backend to check if it is still online. If there is no response, Spike will automatically remove the backend from the routing map.
[If, for example, two backends go offline and come back online, and two Spike nodes learn about these events in different orders, could this leave their routing tables inconsistent?]
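Below is a minimal sketch of what the per-node health-check loop could look like, assuming a plain TCP connect probe; the probe type, interval, and data structures are illustrative assumptions, not Spike's implementation.

```python
# A minimal sketch of a per-node health-check loop, assuming a plain TCP
# connect probe; probe type, interval, and data structures are illustrative.
import socket
import time
from typing import List, Set, Tuple

Backend = Tuple[str, int]  # (address, port)

def is_alive(backend: Backend, timeout: float = 1.0) -> bool:
    """Probe a backend by attempting a TCP connection."""
    try:
        with socket.create_connection(backend, timeout=timeout):
            return True
    except OSError:
        return False

def health_check_loop(backends: List[Backend], healthy: Set[Backend],
                      interval: float = 5.0) -> None:
    """Periodically poll every backend and keep the healthy set up to date."""
    while True:
        for backend in backends:
            if is_alive(backend):
                healthy.add(backend)
            else:
                # Unresponsive backends are dropped from the healthy set; the
                # node would then rebuild its Maglev table without them.
                healthy.discard(backend)
        time.sleep(interval)
```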
An annoying caveat to Maglev's base connection-coherency algorithm is IP fragmentation. Maglev relies on source and destination port information to identify connections, and these ports live in the TCP/UDP header, not the IP header. When an IP payload is fragmented, it is split across multiple IP packets; since the payload includes the TCP/UDP header, the port information is only visible to the Spike node that receives the first fragment.
To overcome this, Maglev treats IP fragments as a special case. It hashes only the three-tuple (the five-tuple without the port information) and uses that to choose a Spike node to forward the fragment to. After this extra hop, all IP fragments belonging to the same connection will have arrived at the same Spike node, which can then reassemble them into the full packet.
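A sketch of this special case, under the same illustrative assumptions as above: fragments are hashed on the three-tuple and forwarded to a deterministically chosen Spike node rather than directly to a backend.

```python
# A sketch of the fragment special case: hash only the three-tuple and pick
# a Spike node rather than a backend; all names are illustrative assumptions.
import hashlib
from typing import List, Tuple

# (src_ip, dst_ip, ip_version) -- the five-tuple minus the ports
ThreeTuple = Tuple[str, str, str]

def hash_three_tuple(tt: ThreeTuple) -> int:
    key = "|".join(tt).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def pick_spike_node(tt: ThreeTuple, spike_nodes: List[str]) -> str:
    # Every Spike node computes the same answer, so all fragments of a
    # connection converge on one node, which can reassemble them and then
    # apply the normal five-tuple routing.
    return spike_nodes[hash_three_tuple(tt) % len(spike_nodes)]
```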