Use an alternative machine bootstrap flag probing strategy (no SSH) #230
Comments
Since this issue is a huge blocker for us, I'd be really happy to discuss this topic and actively work with you to find any possible alternative or optional strategy to the Sentinel File Check. |
This comment seems to show that the authors are aware of the limitation posed by the SSH strategy:
I'm working now on a proposal to (first) allow CAPK to optionally check a generic HTTP endpoint on every VM (using TLS 1.3 / a PSK key for each VM, replicating the same model used for SSH keys), and to define a new (optional) TLS contract between CAPK and the VM. We'll (first) skip the server-side implementation & injection in the VM and only focus on a CAPK check feature, leaving the server side up to integration teams & final users. |
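For illustration only, here is a minimal Go sketch of what such a controller-side check could look like. It assumes a hypothetical read-only endpoint (https://&lt;vm-ip&gt;:8443/bootstrap-status) served by each VM, and it uses mutual TLS with per-VM client certificates instead of the TLS 1.3 / PSK scheme mentioned above, because Go's standard crypto/tls package does not expose external PSKs. The endpoint path, port, file names and secret layout are all assumptions, not an existing CAPK contract.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
	"time"
)

// checkBootstrapEndpoint polls a hypothetical read-only HTTPS endpoint exposed
// by the VM and treats a 200 OK answer as "bootstrap succeeded", standing in
// for the sentinel file check. Endpoint, port and file names are assumptions.
func checkBootstrapEndpoint(vmAddr string, caPEM, certPEM, keyPEM []byte) (bool, error) {
	// Per-VM client certificate, mirroring the "one secret per VM" model used
	// today for SSH keys (an assumption, not an existing CAPK contract).
	clientCert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return false, fmt.Errorf("loading client cert: %w", err)
	}
	caPool := x509.NewCertPool()
	if !caPool.AppendCertsFromPEM(caPEM) {
		return false, fmt.Errorf("invalid CA bundle")
	}

	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				MinVersion:   tls.VersionTLS13,
				RootCAs:      caPool,
				Certificates: []tls.Certificate{clientCert},
			},
		},
	}

	// Hypothetical contract: the VM serves its bootstrap status read-only here.
	resp, err := client.Get(fmt.Sprintf("https://%s:8443/bootstrap-status", vmAddr))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	ca, _ := os.ReadFile("ca.pem")
	cert, _ := os.ReadFile("vm-client.pem")
	key, _ := os.ReadFile("vm-client-key.pem")
	ok, err := checkBootstrapEndpoint("10.0.0.12", ca, cert, key)
	fmt.Println(ok, err)
}
```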
Hey, I'm back after a (not so) long silence :)! We spent some time investigating more ways to poll the Sentinel file status, and I think we now have a really elegant candidate to propose. No SSH involved, no HTTP either, nor any other remote call to the VM to make it work! If you dive deep enough into KubeVirt you'll see that it provides some sort of direct probing into the VM: guest-agent ping & exec probes. It relies on the presence of qemu-guest-agent in the VM (kv guest agent). With this feature, the virt-launcher pod wraps & relays the probe execution up to the VM. My proposal is now to have the CAPK controller ask the KubeVirt apiserver to execute the exact same kind of check to probe the Sentinel File right inside the VM. Everything is handled by the apiserver with nothing else involved. This is really simple & elegant IMO. You can already validate the feasibility by running:
bash-5.1$ virt-probe --command cat /run/cluster-api/bootstrap-success.complete --domainName kaas_capiovn-cp-2dnzx
success
And it also works perfectly directly with:
kubectl exec -n kaas virt-launcher-capiovn-cp-2dnzx-hp9zj -- virt-probe --command cat /run/cluster-api/bootstrap-success.complete --domainName kaas_capiovn-cp-2dnzx
success
Do you think this approach is relevant? Milestones to be achieved to make this proposal OK:
|
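As a rough sketch of the approach described in the comment above, the same check can be replayed from a Go controller by calling the pod exec subresource on the virt-launcher pod, exactly like the kubectl exec command shown. The namespace, pod and domain names are copied from that example and purely illustrative; virt-probe is an internal KubeVirt tool whose flags may change, and a real CAPK implementation might prefer a dedicated KubeVirt API over pod exec.

```go
package main

import (
	"bytes"
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

// probeSentinelFile replays the `kubectl exec ... virt-probe ...` call shown
// above: it execs virt-probe inside the virt-launcher pod so that the guest
// agent reads the CAPI sentinel file inside the VM.
func probeSentinelFile(ctx context.Context, cfg *rest.Config, cs kubernetes.Interface,
	namespace, launcherPod, domain string) (string, error) {

	req := cs.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(namespace).
		Name(launcherPod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Command: []string{
				"virt-probe",
				"--command", "cat", "/run/cluster-api/bootstrap-success.complete",
				"--domainName", domain,
			},
			Stdout: true,
			Stderr: true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return "", err
	}
	var stdout, stderr bytes.Buffer
	if err := exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	}); err != nil {
		return "", fmt.Errorf("exec failed: %v (stderr: %s)", err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Names below are copied from the example in the comment and are illustrative.
	out, err := probeSentinelFile(context.Background(), cfg, cs,
		"kaas", "virt-launcher-capiovn-cp-2dnzx-hp9zj", "kaas_capiovn-cp-2dnzx")
	fmt.Println(out, err)
}
```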
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
Another use case: we are interested in using the Talos bootstrap/control-plane providers with this infrastructure provider. Since Talos does not use SSH, any dependency on SSH would be a hurdle for this idea. |
maybe it's time to go forward on this topic and finally remove any ssh requirement in capk ? |
For now I am setting |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
/remove-lifecycle rotten |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What steps did you take and what happened:
At our company, we are building a highly secure infrastructure with several constraints imposed by the French/European sovereign cloud label. These constraints led us to place the CAPI management cluster in a dedicated network and the managed clusters in other ones.
In line with these rules, we recently blocked all traffic between the CAPI cluster network and the target managed cluster networks (and also disabled the SSH daemons in all our VMs).
To schedule and manage cluster lifecycles, we expected CAPI/CAPK to reach each managed cluster's apiserver only through its exposed load-balanced endpoint, which is open to the rest of the network via the underlying KubeVirt LB capabilities.
In fact, we discovered (here) that CAPK requires direct SSH access to the VM IP in order to validate CAPI Machine bootstrap success (using the CAPI sentinel file convention).
This also seems to be the only SSH command I've found in the whole CAPK source code.
With this restriction in place, CAPK is never able to correctly provision a single KubeVirt VM, because the VM bootstrap is never acknowledged.
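For readers who have not seen it, the check being discussed has roughly this shape: open an SSH session to the VM and read the CAPI sentinel file. The Go snippet below is a simplified illustration of that pattern, not the actual CAPK code; the SSH user name and the relaxed host key handling are assumptions made to keep the sketch short.

```go
package main

import (
	"fmt"
	"os"
	"time"

	"golang.org/x/crypto/ssh"
)

// sentinelFileExists shows the general shape of an SSH-based check: dial the
// VM, run one command, and treat success as "bootstrap done". This is a
// simplified illustration, not the actual CAPK implementation.
func sentinelFileExists(vmIP string, signer ssh.Signer) (bool, error) {
	cfg := &ssh.ClientConfig{
		User:            "capk", // assumption: the real user name may differ
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // simplification for the sketch
		Timeout:         5 * time.Second,
	}
	client, err := ssh.Dial("tcp", vmIP+":22", cfg)
	if err != nil {
		return false, err
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		return false, err
	}
	defer session.Close()

	// The CAPI sentinel file written on the node when bootstrap succeeds.
	if _, err := session.CombinedOutput("cat /run/cluster-api/bootstrap-success.complete"); err != nil {
		return false, nil // file missing or unreadable: bootstrap not acknowledged yet
	}
	return true, nil
}

func main() {
	keyPEM, err := os.ReadFile("vm-ssh-key.pem") // illustrative key location
	if err != nil {
		panic(err)
	}
	signer, err := ssh.ParsePrivateKey(keyPEM)
	if err != nil {
		panic(err)
	}
	ok, err := sentinelFileExists("10.0.0.12", signer)
	fmt.Println(ok, err)
}
```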
The CAPI specification leaves the infrastructure providers free to choose the sentinel file verification methodology.
So I'd like to start a discussion and try to find solutions that avoid such SSH connections, which are a very sensitive topic for us.
In the end, I'd love to have a more "read-only" & auditable/secure way to check Machine bootstrap status.
Possible answers could be:
What did you expect to happen:
In order to comply with "government-tier" security rules, we'd expect CAPK not to use any SSH remote access to check machine bootstrap success. Allowing a single component to hold SSH keys and reach every VM of the Kubernetes infrastructure breaks our required legal security compliance.
We think that retrieving the sentinel file status should rather be done with a read-only remote strategy, using a less privileged and less interactive protocol than SSH.
Environment:
- Kubernetes version (use kubectl version): 1.26.2
- OS (e.g. from /etc/os-release): Ubuntu 22.04.2 LTS
/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api-provider-kubevirt/labels?q=area for the list of labels]