Fix machine-controller CNI setup for fully joining nodes to kind control plane #1462
Labels
kind/cleanup
kind/failing-test
This is a follow-up from the issues worked around in #1459.
Background
In #1304, we changed our CI setup to get rid of the Hetzner VMs built for each CI job and to mirror what we have been doing in KKP CI for some time now. The big difference between the KKP tests and the MC tests is that in KKP, only the KKP control plane runs in kind, while in the MC tests, we join Machines to the kind control plane directly. Because of that, the default CNI (kindnet) did not work, and we opted for flannel instead. This appeared to work fine: tests were passing and Nodes were marked as ready upon joining the kind control plane.
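For context, the flannel-based setup roughly corresponds to creating the kind cluster without its default CNI and installing flannel afterwards. A sketch, not our exact CI scripts; it assumes the upstream flannel release manifest:

```shell
# Create a kind cluster without kindnet, so an alternative CNI can be installed.
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
EOF

# Install flannel as the CNI (manifest published by the flannel project).
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
```

`disableDefaultCNI` is the documented kind option for bringing your own CNI; until one is installed, Nodes stay NotReady.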
Current Problem
Recently, we upgraded our CI environment and the underlying container runtime switched from docker to containerd. That apparently broke the kind control plane using flannel as CNI. Specifically, requests from machine-controller-webhook to cloud provider APIs failed, and after some investigation the problem appeared to be DNS resolution through the in-cluster DNS service IP. This only happens with flannel and it's not clear why, but our nested container setup is probably fairly unique in its problems.

So the idea was to replace the CNI. Both Calico and Cilium were tried. After some more investigation, the following underlying problem was identified in the test architecture:
The Kubernetes API is not accessible as an in-cluster service from any of the nodes, because the Endpoints resource backing the "kubernetes" Service points to a 172.16.0.0/16 address. That means calling the kubernetes.default.svc.cluster.local endpoint from a Pod on a Node joined to the kind control plane cannot work. Overriding the advertised IP address is also not possible, because the "kubernetes" Service is exposed as a NodePort to allow the whole cluster-exposer logic we use to make it accessible to Nodes in the first place. If you update the advertised IP to something publicly accessible, you create a loop: the Service endpoint points to the public IP, the public IP plus port point back to the "kubernetes" Service, which uses the public IP as its endpoint, and so on.

CNI pods therefore cannot talk to the Kubernetes API to properly initialise Nodes into the pod overlay network, so Nodes can never become ready, which is something we want to verify in machine-controller e2e tests (but will be removed via #1459).

Why is this not a problem with KKP tests?
For KKP tests, no Nodes are joined to the kind cluster. Instead, kind is used to host KKP user cluster control planes, which are built for this purpose and can be used from the outside without the same set of problems, because the Kubernetes API endpoints are routed directly to the kube-apiserver instances running for a user cluster control plane.
Why does this work with some CNIs?
A good question. The gist seems to be that CNIs handle Node initialisation in different ways: some appear to work but do not actually provide a functional network, and these differences in initialisation explain the different behaviour. Networking to services running on the kind control plane cannot work in the current setup, so we never had functional Nodes, even when they were marked as ready. Calico just uncovers the problem by crashing early.
How to solve
We need to properly solve exposing the control plane. There might be options for that with the current kind setup, but an alternative would be to launch a user cluster control plane via KKP. The question then becomes whether we want to make MC e2e jobs depend on KKP functionality, and the answer is probably no.
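Whichever approach is taken, the broken state described above is straightforward to confirm against the kind cluster. A sketch using standard kubectl queries (it assumes access to the cluster's kubeconfig; image and Pod name are arbitrary):

```shell
# Inspect where the in-cluster API endpoint actually points.
kubectl -n default get endpoints kubernetes \
  -o jsonpath='{.subsets[*].addresses[*].ip}'
# On the affected setup this prints a 172.16.0.0/16 address that is not
# routable from Machines joined to the kind control plane.

# The same problem shows up as a failing in-cluster lookup from a Pod
# scheduled on a joined Node:
kubectl run api-check --rm -it --restart=Never --image=busybox:1.36 \
  -- nslookup kubernetes.default.svc.cluster.local
```

A fixed setup should make both the endpoint address routable from joined Nodes and the in-cluster lookup succeed.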
Acceptance Criteria