Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Definition of the HostClaim resource #408

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

pierrecregut
Copy link

The HostClaim resource is introduced to address multi-tenancy in Metal3 and the definition of hybrid clusters with cluster-api.

Introduces four scenarios representing the three main use cases and how pivot is made.

Presentation of the design will be addressed in another merge request.

@metal3-io-bot
Copy link
Contributor

Hi @pierrecregut. Thanks for your PR.

I'm waiting for a metal3-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@metal3-io-bot metal3-io-bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 3, 2024
@dtantsur
Copy link
Member

dtantsur commented Apr 3, 2024

/ok-to-test

@metal3-io-bot metal3-io-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 3, 2024
The HostClaim resource is introduced to address multi-tenancy in
Metal3 and the definition of hybrid clusters with cluster-api.

Introduces four scenarios representing the three main use cases and
how pivot is made.

Presentation of the design will be addressed in another merge request.

Co-authored-by: Pierre Crégut <pierre.cregut@orange.com>
Co-authored-by: Laurent Roussarie <laurent.roussarie@orange.com>
Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
Copy link
Member

@lentzi90 lentzi90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for driving this!
I see many things that I like. My main concern now is that we need to clarify the interaction between HostClaims and BareMetalHosts. Where each lives and which controllers would handle them. This is something I feel we must focus on before we can think about potential other resources (non-baremetal).


There is another model where a single management cluster is used to create and
manage several clusters across a set of bare-metal servers. This is the focus
<!-- cSpell:ignore Sylva Schiff -->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add these in the cSpell config instead?

Comment on lines 38 to 43
So far, the primary use case of cluster-api-baremetal is the creation of a
single target cluster from a temporary management cluster. The pivot process
transfers the resources describing the target cluster from the management
cluster to the target cluster. Once the pivot process is complete, the target
cluster takes over all the servers. It can scale based on its workload but it
cannot share its servers with other clusters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we can say this is the primary use case. It is just they bootstrap process to get a "self managed" management cluster. Once you have a management cluster (no matter if it is self managed or not) it can be used to create multiple other clusters. These can be created from BareMetalHosts in the same or other namespaces.

What I think we should bring up here instead is the following.
The Metal3 infrastructure provider for Cluster API as of today, does not implement the multi-tenancy contract. While some hacks and workarounds exists (e.g. running multiple namespaced instances), they are far from ideal and do not address all use cases.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure the reference to the multi-tenancy contract is the right one because this contract has been designed for disposable compute resources not recyclable ones. With disposable compute resource, you have an API to create them usually protected by credentials and the problem is just to be able to isolate each set of credentials and make it usable for example in a single namespace. With recyclable resources the set of credentials is attached to a long living object.

You could trade tenant credentials to gain access to the real object. This is what was done in Kanod with the baremetalpool approach. It is powerful but also very complex and we depended on Redfish.

Here we rather hide the credentials and use the HostClaim controller to exchange control and information between the BareMetalHost in a protected namespace and a Host in the tenant namespace. It is true that we need limits on the binding performed:

  • the BareMetalHost should be able to limit the namespaces of the HostClaims that can be associated
  • at some point a mechanism to limit the number of compute resources a namespace can acquire is needed.

I think the second point can be addressed later. It can be implemented through a quota mechanism (see https://gitlab.com/Orange-OpenSource/kanod/host-quota in the scope of Kanod).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed a new version of motivations. It also solve the previous remark as I have removed explicit mention to Sylva and das Schiff

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure the reference to the multi-tenancy contract is the right one because this contract has been designed for disposable compute resources not recyclable ones. With disposable compute resource, you have an API to create them usually protected by credentials and the problem is just to be able to isolate each set of credentials and make it usable for example in a single namespace. With recyclable resources the set of credentials is attached to a long living object.

My goal would be to achieve the exact same process for both disposable and recyclable resources. The API for creating is the Kubernetes API of the cluster holding the BareMetalHosts. The credentials are the credentials for accessing that API. In this way we achieve the same way of working as the other infrastructure providers, which simplifies a lot and also gives us multi-tenancy.

The proposal for how to handle interaction between namespaces (with HostClaims in one namespace and BareMetalHosts in another) worries me. It is generally not a good idea to allow cross namespace references. If we have to do this I think it must be one way only. Then all BareMetalHosts would be in a special namespace and no namespace reference is needed from the HostClaim side. This would be similar to ClusterIssuers from cert-manager.

If I understand correctly, the need to have the HostClaim in a separate namespace is so that we could keep it together with the Metal3Machine. I don't understand why this is needed though. Would it not make more sense to keep the HostClaim with the BareMetalHost and propagate the status back to the Metal3Machine. This is much closer to other providers (CAPA/CAPO, etc). The user has no insights at all into the actual cloud resources outside of what is propagated to the InfrastructureMachine (unless they check it through a separate API). I think propagating more of the BareMetalHost status back to the Metal3Machine makes perfect sense and can hopefully simplify this a lot.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the need to have the HostClaim in a separate namespace is so that we could keep it together with the Metal3Machine. I don't understand why this is needed though. Would it not make more sense to keep the HostClaim with the BareMetalHost and propagate the status back to the Metal3Machine. This is much closer to other providers (CAPA/CAPO, etc). The user has no insights at all into the actual cloud resources outside of what is propagated to the InfrastructureMachine (unless they check it through a separate API). I think propagating more of the BareMetalHost status back to the Metal3Machine makes perfect sense and can hopefully simplify this a lot.

Except that if you do this, you have:

  • a clusterapi only solution (ie I cannot use it for a workload that is not defining a node).
  • a solution that will only work with bmh and cannot be so easily extended to other resources (even if this can be challenged: see other comment).

Comment on lines +221 to +222
associated with the cluster. The associated BareMetalHost is not in the same
namespace as the HostClaim. The exact definition of the BareMetalHost remains
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we should clarify I think.
How should the BareMetalHost be picked? Across all namespaces? Should the HostClaim specify where they are? We should be able to have the BareMetalHosts in a completely different cluster than the Metal3Machines IMO. Where should the HostClaims be then?

My idea would be to add an identityRef to the Metal3Cluster. It references a Secret with credentials that would be used for accessing the BareMetalHosts or HostClaims. The same identityRef field could be added to Metal3MachineTemplate also. These could be propagated to the HostClaims so we get something like this.

apiVersion: metal3.io/v1alpha1
kind: HostClaim
metadata:
  name: my-host
spec:
  identityRef:
    name: my-credentials

Then the HostClaim can be in the same namespace as the Metal3Machine. However, it means that the user must be in possession of credentials that can access and modify the BareMetalHosts! This is because the controller (CAPM3) would need to use the credentials of the HostClaim to reach out to the BareMetalHost and modify it.

The other option is that the HostClaim itself lives next to the BareMetalHost. We then set the identityRef on the Metal3Cluster and Metal3MachineTemplate. Then CAPM3 creates HostClaims with the help of the provided credentials. It does not require access to modify the BareMetalHosts directly. This option seems more reasonable to me, but it has the drawback the the user cannot easily inspect the HostClaim.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another approach. For your approach a HostClaim represents a right to use a specific host and is the private part in the BMH realm. It is associated to a set of credentials representing the cluster. Then you will need a mechanism to create new HostClaims. I believe that your proposal is not so far from the Kanod BareMetalPool approach except that you never give access to the real credentials of the BMH: you only expose the standard API offered by the BMH.

I think there are different objectives that can be addressed in different ways in both solutions:

  • Having the BareMetalHosts in a different cluster. It is a legitimate need as hardware is often handled by different teams. But I did not put it in the objective of this document. It can be addressed later as experimented in Kanod. We use the capability of HostClaim to target different kind of compute resources (hybrid) to implement virtual HostClaim synchronized with a remote HostClaim in another cluster (https://gitlab.com/Orange-OpenSource/kanod/host-remote-operator). There is an equivalent of the identityRef in that solution. You have HostClaims on both sides. They must be synchronized. The HostClaim in the cluster namespace has its spec copied to the remote HostClaim and it inherit the status from the remote and yes the contract is tricky on metadata (labels and annotations).
  • Ensuring multi-tenancy isolation. This is done by the fact that HostClaim are not in the same namespace as BareMetalHost and only offer an API with the credentials hidden. The user and the metal3machine controller can completely inspect the HostClaim object. When there is a remote HostClaim, the one on the BMH cluster could be in the same namespace, but it may be harder to avoid name collision between objects and if we want to use quotas, namespaces are also useful to distinguish tenants.
    Your solution will not address the first use case (use without capm3) because your version of HostClaim is invisible to the end user and you rely on capm3 controller to talk directly with the service hiding the BareMetalHost.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lentzi90 I think the main point now is to decide if having BMH in different clusters is a goal and needs a scenario or a non goal. If it is a goal, then I also need to bring the remote hostclaim in the scope of this document, or we keep it as a non goal and mention that it will be addressed later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the capability of HostClaim to target different kind of compute resources (hybrid) to implement virtual HostClaim synchronized with a remote HostClaim in another cluster

This does not make sense to me and sounds too complicated. Why not just propagate the status back to the Metal3Machine instead? Then there is no need for the remote HostClaim.

Your solution will not address the first use case (use without capm3) because your version of HostClaim is invisible to the end user and you rely on capm3 controller to talk directly with the service hiding the BareMetalHost.

Can you explain this a bit more? The way I see it, we can propagate the status of the HostClaim back to the Metal3Machine. Then the user will get the information. If the user wants to manually handle HostClaims and skip CAPM3 then they already have access to the HostClaim and can see the status there.

I think the main point now is to decide if having BMH in different clusters is a goal and needs a scenario or a non goal. If it is a goal, then I also need to bring the remote hostclaim in the scope of this document, or we keep it as a non goal and mention that it will be addressed later.

I think it must be a goal because it is ultimately a consequence of multi-tenancy. It is possible to implement it so that it only works in one cluster, but it should be trivial to support a remote cluster if the local cluster already works. Then I do not see a reason to limit it.

Copy link
Author

@pierrecregut pierrecregut Apr 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First HostClaim as I defined it is very similar to Metal3Machine. In fact the only difference in the Spec is the providerID (only in m3m) and online (only in host) but in practice it is a little more than that as there are a few things handled through annotations that are specific to cluster-api in the metal3 machine controller. HostClaim are a way to decouple what is specific to cluster-api from what is generic to a compute resource.

Multi-tenancy for BareMetalHost is not specific to the use with cluster-api. I would like to be able to launch workloads on servers that are shared with other users of the infrastructure even if the workload is not being a node of a cluster. So I think that the implementation of multi-tenancy should be between Hostclaim and BareMetalHost, so in the Hostclaim controller.

To just implement multi-tenancy in the same cluster, I just need to make sure that I hide the credentials but provide sufficient feed back on the baremetal host. If I want to use it without capm3, this feedback cannot be propagated to the m3machine that does not exist in that scenario.

I can extend the solution to work with BareMetalHost on a remote cluster. As in your proposal, it involves having a set of credentials for a remote api that gives access to the bmh. Again I could implement it directly directly in the HostClaim controller but if I want to support hybrid clusters and if I need the same mechanism to access kubevirt VM on a remote cluster, then I prefer an implementation that let me control any kind of HostClaim on a remote cluster. That is exactly what remote hostclaim do.

Your solution trivially supports remote cluster because it uses the mechanisms to access a remote cluster even when it is not needed.

On the other hand, we could also decide that the real controllers are always implemented by a service implementing the HostClaim API. The notion of kind we have could disappear: you would simply target another endpoint. Technically this would be almost the same implementation but the HostClaim on the remote cluster would only exist "on the wire" and not as a resource. This means that the service implementation must be able to rebuild the model (internally you need it) from the compute resource only (through labels and annotations added to mark how they are associated to the source HostClaim resources).

This is probably closer to what you had in mind but the resource would still be in the cluster (tenant) namespace not in the baremetalhost namespace.

The only drawback from my point of view is the complexity of this server when it is not needed (single cluster). Especially with metadata, the flow of information between clusters is not easy and having a single implementation may ensure more coherence. We also loose the ability to implement Quota for HostClaims as this must be done on a resource existing on the target cluster, not on the tenant cluster.

@pierrecregut
Copy link
Author

@lentzi90 I think I need to write down the alternative/risks sections right now. I have a draft but it is not ready yet (pierrecregut@5ed5f5f). I will push it here when it is ready and I will try to faithfully incorporate your approach.

Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
@metal3-io-bot metal3-io-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 12, 2024
@pierrecregut
Copy link
Author

I have added the 'alternative solutions' part and 'risk and mitigation' so that it is more clear how HostClaim compare with alternative solutions. I think I have captured @lentzi90 proposal in "HostClaims as a right to consume BareMetalHosts" but I probably missed part of it. If some obvious alternatives are also missing it is time to add them. The advantage/drawbacks are biased and I am open to improvements. Unfortunately the size of the PR is now clearly XL.

* security risks that must be addressed
* alternatives for the different use cases (multi tenancy or hybrid).

Co-authored-by: Pierre Crégut <pierre.cregut@orange.com>
Co-authored-by: Laurent Roussarie <laurent.roussarie@orange.com>

Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
Copy link
Member

@lentzi90 lentzi90 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again for the work on this!

As mentioned in comments above, my main motivation for a HostClaim is to achieve multi-tenancy the way CAPI defines it. I think we need to shape it so that we get a similar flow as other infrastructure providers. This means that CAPM3 would deal with disposable resource (HostClaims) through an API (the Kubernetes API of the cluster holding the BareMetalHosts) protected by credentials. This was the original goal of the architecture but then CAPM3 grew to become completely coupled to BMO and this is something I hope to fix now.

One thing that is not clear to me is how you intend things to work with VMs instead of BareMetalHosts? Are you expecting to create a separate controller instead of BMO?

@pierrecregut
Copy link
Author

This means that CAPM3 would deal with disposable resource (HostClaims) through an API (the Kubernetes API of the cluster holding the BareMetalHosts) protected by credentials. This was the original goal of the architecture but then CAPM3 grew to become completely coupled to BMO and this is something I hope to fix now.

I completely agree with this shift from recyclable to disposable resource. The main question is where we expose the API and what is kept and manipulated as a resource and what is just message over a link.

One thing that is not clear to me is how you intend things to work with VMs instead of BareMetalHosts? Are you expecting to create a separate controller instead of BMO?

Yes and we have already done it with the Host resource in Kanod (which is a HostClaim with a few defects that we intend to correct for the final version). The implementation for kubevirt VM is https://gitlab.com/Orange-OpenSource/kanod/host-kubevirt, the version for having hosts on a remote cluster is https://gitlab.com/Orange-OpenSource/kanod/host-remote-operator. The implementation for BMH is in https://gitlab.com/Orange-OpenSource/kanod/host-operator but mixing the generic definition of the custom resource (with quota webhook) and the specific implementation for BMH was not necessarily the best idea.

@lentzi90
Copy link
Member

Ok I think I understand now how you want the HostClaim to work. Let me see if I can suggest something that would work both for the hybrid, non-CAPI case and for CAPI without making things too complicated.

This is what I want to avoid. It is too complicated, and I see no reason (from CAPI/CAPM3 perspective) to have the LocalClaim.

                
┌────────┐      
│Machine │      
├────────┴┐     
│M3Machine│     
├─────────┴┐    
│LocalClaim│    
└──────────┘    
                
                
┌─────────────┐ 
│RemoteClaim  │ 
├─────────────┤ 
├─────────────┤ 
│BareMetalHost│ 
└─────────────┘ 

Previously I suggested something like this, where the M3Machine fills the same need as the LocalClaim. I understand that this is not ideal for your use-case because there is no M3Machine then.

 ┌────────┐      
 │Machine │      
 ├────────┴┐     
 │M3Machine│     
 └─────────┘     
                 
                 
 ┌─────────────┐ 
 │RemoteClaim  │ 
 ├─────────────┤ 
 ├─────────────┤ 
 │BareMetalHost│ 
 └─────────────┘ 

Now I suggest instead this:

 ┌────────┐                                              
 │Machine │                                              
 ├────────┴┐         ┌─────────┐        ┌─────────┐      
 │M3Machine│         │HostClaim│        │HostClaim│      
 └────┬────┘         └────┬────┘        └────┬────┘      
      │                   │                  │           
      │                   │                  │           
 ┌────▼────────┐     ┌────▼────────┐    ┌────▼─────────┐ 
 │BMHClaim     │     │BMHClaim     │    │VMClaim       │ 
 ├─────────────┤     ├─────────────┤    ├──────────────┤ 
 ├─────────────┤     ├─────────────┤    ├──────────────┤ 
 │BareMetalHost│     │BareMetalHost│    │VirtualMachine│ 
 └─────────────┘     └─────────────┘    └──────────────┘ 

This way you can implement the HostClaim controller separately and get all functionality you want. For Metal3, we would add to BMO the BareMetalHostClaim. No separate controller is needed for this. BMO can reconcile them and match them with BareMetalHosts.
CAPM3 would some similar parts to the HostClaim controller but only for BareMetalHostClaims. I don't see that we would make it work with VMs for example.

What do you think about this?

@pierrecregut
Copy link
Author

pierrecregut commented Apr 18, 2024

This is what I want to avoid. It is too complicated, and I see no reason (from CAPI/CAPM3 perspective) to have the LocalClaim.

LocalClaim serves two purposes:

  • It avoids duplication of work between an M3M controller and what you call a HostClaim controller in last figure.
  • It freezes the contract offered by the API accessing the BMH side which is important if we want alternative implementations.

As an additional bonus point, I think it would give node-reuse a cleaner implementation because it means that we keep the
host binding rather than creating a new one.

CAPM3 would some similar parts to the HostClaim controller but only for BareMetalHostClaims. I don't see that we would make it work with VMs for example.

The whole point of supporting hybrid is to make CAPM3 work with VM. And if the contract exposed by HostClaim is clear this is exactly what we have. With our PoC, we can create workload cluster with either control-plane or some of the machine deployments on kubevirt VM (we use multus and a bridge on each node of the management cluster to attach VM to the network of the baremetal part of target cluster). Then you can even change the compute resource kind of machine deployments between upgrades seamlessly.

If we are not sharing the CR on compute resources, I would rather avoid the BMHClaim/VMClaim parts. The only thing you need to know is for each user, a mapping from LocalClaim ids to BMH (credentials being associated to the user). The reason we have a claim on the remote part is that we distinguish the implementation of the wire protocol (and with it the notions of users) from the handling of compute resources.

In our implementation of remote HostClaim, we have a very crude model of user (just a token representing more or less a namespace). A production grade version would probably distinguish a declarative part (which cluster, which namespace) from the authenticated part (the user) and would provide interface to identity management frameworks (keycloak others) that handle the definition of users and credential protection. We have not looked seriously at that part and one of the advantage of our approach is that it was very easy to replace our Local HostClaim <-> Remote HostClaim controller... or not use any when BMH and clusters are in the same management cluster.

The compromise I suggested yesterday was rather the following where circles represent services not resources. The notion of user is not represented. The important part is that the dialog between HostClaim controller and BMH / VM servers is standardized.

graph TD;
   m3m1[Metal3Machine] --> hc1[HostClaim 1];
   m3m2[Metal3Machine] --> hc2[HostClaim 2];
   hc1 --> bms((BMH Server));
   hc3[HostClaim 3] --> bms;
   hc2 --> vms((VM server));
   hc4[HostClaim 4] --> vms;
   subgraph BareMetal Domain
   bms --hc1--> bmh1[BMH 1];
   bms --hc3--> bmh2[BMH 2];
   bms --> bmh3[BMH 3];
   end
   subgraph KubeVirt Domain
   vms --hc2--> vm1[VM 1];
   vms --hc4--> vm2[VM 2];
   end
Loading

A last warning is that the pivot of the bootstrap cluster which is still simple when we have M3M, Host and BMH in the same namespace during initialization becomes much more complex if we always have a service that migrates.

@lentzi90
Copy link
Member

I'm not convinced that we need a separate HostClaim controller. That is why I see no issue with duplicating the work of that controller in CAPM3.

If the whole point is to get CAPM3 to work with VMs and especially KubeVirt, what about this? It seems like a very tempting alternative to me. We just add support for BMCs in KubeVirt and the whole Metal3 stack can work with it directly.
This is what I hope to use for our CI in the future.

The multi-tenancy / decoupling of CAPM3 from the BMHs is still very much needed, but for that I see no need for the LocalClaim. The Metal3Machines are supposed to fill that same function, just like OpenStackMachines does in CAPO. They are the local resources that the user can inspect.

@pierrecregut
Copy link
Author

pierrecregut commented Apr 19, 2024

Kubevirt is just an example. You can add a BMC to most VM. This exists for Openstack for testing purpose for a long time. Sylva project has already developed an equivalent of the mentioned blueprint for Kubvirt for their CI but going directly to the libvirt layer: : https://gitlab.com/sylva-projects/sylva-elements/container-images/libvirt-metal#libvirt-metal . Unfortunately Redfish/IPMI manages hardware, it does not create it. So we completely loose the advantages of being a disposable resource or we need again a notion of pools synchronized with cluster size. Things quickly become very complex again and that is probably why nobody has pushed it further. Networking and VM definitions are really mixed in Kubevirt and as networking must be reconfigured for each cluster, pre-definition of VM is also harder.

@lentzi90
Copy link
Member

Ok, makes sense. So you will need some code in CAPM3 to make it possible. (Yes it could be in the HostClaim controller instead, but I hope you see why we may want to include that in CAPM3/BMO directly.)

I think the core issue remaining here is about how the HostClaim is bound to the BareMetalHost (or VM). If I understand correctly you would like a HostClaim controller (HCC) that is independent of BMO and CAPM3. It would bind the HostClaims to BMHs or VMs and propagate status back. I think that should be part of BMO directly. It would be such an integral part of how to work with BMO.

I'll be back with more comments on how to bind the HostClaim later

@pierrecregut
Copy link
Author

If I understand correctly you would like a HostClaim controller (HCC) that is independent of BMO and CAPM3. It would bind the HostClaims to BMHs or VMs and propagate status back.

In the proposal, there is no piece of code that handles both BMH and VM at the same time. I will try to summarize what I think our mental models. Please correct me what I misrepresent. If we forget about the single cluster case where my proposal did not need an intermediate protocol, the core of what we agree on is:

graph TD;
  Metal3Machine --> i1[...];
  i1 -->p[/REST API for Host: ↓ selector + workload -  ↑ resource status + most metadata/];
  p --> c((server));
  c --> i2[...];
  i2 --> BareMetalHost
Loading

Rectangles are custom resources. Parallelograms are API and circles, http endpoint implementing the API.

My intial proposal was:

graph TD;
   Metal3Machine --> hc0[HostClaim - remote];
   hc0 --> p[/REST API/];
   p --> c((HostClaim server));
   c --> hc1[HostClaim - baremetal];
   c --> hc2[HostClaim - kubevirt];
   hc1 --> BareMetalHost;
   hc2 --> vm[Kubevirt VM];
Loading

The word after HostClaim is something that can act as a compute kind selector for controllers. To do the selection on the K8S APIserver side, a label should be used but it is just a code optimization. We have as many controllers as there are selectors.

If I undestand well, your proposal is rather something like:

graph TD;
  Metal3Machine --> p[/REST API/];
  p --> c((BMH server));
  c --> BareMetalHost;
Loading

You have a resource between the BMH server and the BareMetalHost, but if I am not wrong, it only either uniquely identify a user or create a unique password for the BareMetalHost depending on the way you restrict the access. Note that you probably want
a notion of user to limit the scope of "watch" (something you need to get a single channel to be notified of status changes). With the user approach, I would just use a custom resource representing the user as consumerRef and use the name of the BMH as id (I don't think it reveals anything secret to the other end).

A custom resource hides both an API and the K8S apiserver that implements this API and delegates actions to the resource controller watching the resource.

So, what I am not sure about is if, in the above drawing, the pair protocol and server is kubernetes apiserver (you mentionned directly using kubeconfig at some point) or if, as recommended in k8s documentation, you implement your own server. With Kubernetes RBAC, it may be difficult to restrict the access by a given user to only the BMH it owns unless you have separate namespaces.

If we agree on a public, versionned API, I think we can gain back more or less what is needed for hybrid clusters and non k8s workload with:

graph TD;
  Metal3Machine --> p[/REST API/];
  Metal3Host --> p;
  p --> c((BMH server));
  c --> BareMetalHost;
  p --> c2((VM server));
  c2 --> vm[Kubevirt VM];
Loading

There is no real difference between Metal3Host and HostClaim - remote from a previous drawing (I have used the name given in a previous design proposal).

@lentzi90
Copy link
Member

Sorry for my slow response. There are too many other things on-going.
I think we can get away with something simpler still.

With Kubernetes RBAC, it may be difficult to restrict the access by a given user to only the BMH it owns unless you have separate namespaces.

Yes, to get proper isolation we would need separate namespaces. However, that does not mean we will necessarily be limited to separate BMH pools. I suggest BMO would be responsible to associating BMHs to HostClaims. This means that we can give the power to control how that happens to the same entity that creates the BMHs.

  • CAPM3 user needs access to create HostClaims in the cluster where the BMHs are
  • BMO owner creates BMHs and policies for what HostClaims they should be associated with
flowchart LR;
  CAPM3 --creates--> c[HostClaim];
  BMO--associates--> c;
  c --associated with--> BareMetalHost;
  c --approved by--> BareMetalHostPolicy;
Loading

This way, the BMO owner can write a policy to approve or deny HostClaim associations. (I stole this idea from cert-manager approver policies.) They can allow associations from multiple namespaces to get a common pool or they can allow only one specific namespace. We could even the policies more fine grained if needed.

@pierrecregut
Copy link
Author

Yes, to get proper isolation we would need separate namespaces. However, that does not mean we will necessarily be limited to separate BMH pools. I suggest BMO would be responsible to associating BMHs to HostClaims. This means that we can give the power to control how that happens to the same entity that creates the BMHs.

If we have separate namespaces and the HostClaim resource contains the requirements and the selectors, I think that is exactly the HostClaim resource as it is described. The only question where your suggested implementation and our differs is the location of the controller. I totally agree for putting the controller in the BMO project. The only reason we did it differently is that we tried to limit places where we forked an existing metal3 project in the scope of the PoC.

@pierrecregut
Copy link
Author

  • CAPM3 user needs access to create HostClaims in the cluster where the BMHs are

I think that is the real point where we differed. In the case of cross cluster, you assume that the user will somehow give a kubeconfig with a restricted scope in a secret to the capm3 controller. I guess that if we are in the same cluster, we could just assume that the HostClaim must be created in the same namespace as the M3M and in that case we are again back to the existing proposal.

We differ because:

  • we implemented the cross cluster as a separate operator so that we could use it directly for the case where capm3 is not involved. There is no real problem to push that code inside capm3 except that we loose the ability to share it with other custom resources. This is less a priority if we already have HostClaim as a usable resource on a given cluster.
  • we do not rely on kubeconfig. We want to automate user and cluster creation. We do not want to have an entity that can create a role over any namespace and a user. This is equivalent to have a controller with full admin power. But my implementation is not great in the sense that it reimplements a lot of apiserver in a hacky way.

A colleague suggested implementing a proxy. That would be more or less transparent to the capm3 side and would solve our security concern. The proxy would have full privileges on hostclaim resources in any namespace. It would replace user identities with its own after verifying that the user is entitled to do the request it asks for. That way we are sure that we do not create a security hole that goes beyond the scope of Host and BMH management.

I will modify the proposal to introduce the kubeconfig credential that will be used in place of the standard service account. It could use the namespace defined in the context as target namespace. We will keep the implementation of a proxy in mind to avoid giving an agent too many rights when we want to automate the management of users.

@pierrecregut
Copy link
Author

This way, the BMO owner can write a policy to approve or deny HostClaim associations. (I stole this idea from cert-manager approver policies.) They can allow associations from multiple namespaces to get a common pool or they can allow only one specific namespace. We could even the policies more fine grained if needed.

BMOpolicies are closer to quota on pods than cert managers policies: quantitative clearly associated to a user/namespace. That is why Laurent first restricted HostQuotas to namespaces using k8s quota implementation as a reference (https://gitlab.com/Orange-OpenSource/kanod/host-quota). Openshift has a notion of cluster wide quotas for projects but I believe that it expresses constraints on each namespace.

@lentzi90
Copy link
Member

I think that is the real point where we differed. In the case of cross cluster, you assume that the user will somehow give a kubeconfig with a restricted scope in a secret to the capm3 controller. I guess that if we are in the same cluster, we could just assume that the HostClaim must be created in the same namespace as the M3M and in that case we are again back to the existing proposal.

We can default to use "in-cluster credentials" (i.e. service account and RBAC). It would not have to be in the same namespace but it would probably make sense for single cluster scenarios anyway.

we do not rely on kubeconfig. We want to automate user and cluster creation. We do not want to have an entity that can create a role over any namespace and a user. This is equivalent to have a controller with full admin power. But my implementation is not great in the sense that it reimplements a lot of apiserver in a hacky way.

I struggle to understand this. If you create the cluster you are usually cluster admin anyway? An entity that manages the users does not need to be coupled to Kubernetes API and RBAC depending on what you use for authentication. You could handle the users separately and use OIDC or JWT with user groups to get correct privileges in the cluster.

I don't mean that we have to use kubeconfigs directly, but considering that the format is well tested and contains everything we need I think it is the obvious choice. We can consider other alternatives of course, but BMO provides a Kubernetes API and then kubeconfig files makes a lot of sense.

A colleague suggested implementing a proxy. That would be more or less transparent to the capm3 side and would solve our security concern. The proxy would have full privileges on hostclaim resources in any namespace. It would replace user identities with its own after verifying that the user is entitled to do the request it asks for. That way we are sure that we do not create a security hole that goes beyond the scope of Host and BMH management.

Not sure I understand this. To me it sounds like the proxy is the security issue here, with full access to all BMHs through HostClaims. Who would control this proxy? Where would it run? How would authentication work and is there a reason to do that in the proxy instead of in the Kubernets API server?

I will modify the proposal to introduce the kubeconfig credential that will be used in place of the standard service account. It could use the namespace defined in the context as target namespace.

Sounds good!

BMOpolicies are closer to quota on pods than cert managers policies: quantitative clearly associated to a user/namespace. That is why Laurent first restricted HostQuotas to namespaces using k8s quota implementation as a reference (https://gitlab.com/Orange-OpenSource/kanod/host-quota). Openshift has a notion of cluster wide quotas for projects but I believe that it expresses constraints on each namespace.

I'm not after limiting the amount or type of hosts at this point. I do see the usefulness of that and we should definitely add it. However, I think we must first make sure that authorization is in place. Not "you cannot use so much", rather "you have no access here".

I think there are two scenarios, with one special case:

  1. Full isolation: BMHs are grouped per user in separate namespaces or even clusters. BMO/BMHs potentially configured to only associate HostClaims in the same namespace. If they are in separate clusters there is no need for that.
  2. Common pool(s): A pool of BMHs in one namespace are shared by multiple users, but not all. A policy is needed to tell BMO what HostClaims are allowed to be associated. E.g. BMHs in namespace A can be associated with HostClaims in namespaces X, Y and Z. No other HostClaims are allowed.
  3. Free for all: All BMHs are in a single namespace with no limitations on what HostClaims can be associated. This is a special case of scenario 2.

@pierrecregut
Copy link
Author

The case I want to address is when you have an infrastructure wide system for managing users and identities. You create a new user X that should be able to describe a new cluster in management cluster A using BareMetalHosts in cluster B. How do I automate the creation of a namespace and the associated kubeconfig in B without a process that have full admin rights in B which should not be strictly necessary.

The idea was to have just a proxy in cluster B that has only rights on HostClaims (so more or less what a HostClaim controller already has) and that uses the centralized authentication/authorization management system to check if the credentials it has received can be used to perform an action in a specific namespace. We still need some glue around to create and distribute such credentials but at least there is a path where we do not create new kubernetes user and a RoleBinding (to a cluster role granting control over HostClaims).

The issue can be left open for the moment as it can be addressed separately.

@pierrecregut
Copy link
Author

However, I think we must first make sure that authorization is in place. Not "you cannot use so much", rather "you have no access here".

I think this is addressed by section Security Impact of Making BareMetalHost Selection Cluster-wide with the use of an annotation on BareMetalHost. One of the advantage is that it is mandatory, so when we upgrade a deployment that did not use HostClaim, there is no risk that BareMetalHost are bound to HostClaims in another namespace.

@pierrecregut
Copy link
Author

I have pushed a new commit that should reflect the fact that capm3 controllers handle the access to the remote cluster using a kubeconfig. There is a new scenario and I have modified goals/non goals. My approach with a "remote" HostClaim is now an alternative implementation.

@pierrecregut
Copy link
Author

I have modified the last commit. I overlooked the fact that HostClaim uses secrets to define the cloud-init configuration although those secrets are an important part of the protocol implemented for "remote" HostClaim. This means that the role binding must give admin rights over secrets in the remote namespace.

@Rozzii
Copy link
Member

Rozzii commented May 6, 2024

It is a very long proposal and discussion so bare with me if I have missed something :D .
First of all thank you for driving this discussion!

  • I would like to separate this discussion into two different topics, if possible into two different proposal.

  • One topic would be the multytenancy where I agree with the approach of @lentzi90 that we would
    need a combination of "HostClaim" paired with a sort of "Policy object" to facilitate and indirection.
    Indirection would be implemented such a way that neither the CAPM3 nor the BMO would have access to each others
    Namespaces but they would have access to the Namespace where the HostClaim is created and based on the "Policy object"
    BMO would either pair the claim with a BMH or not.

  • I do not like the "Proxy approach", that proxy would be a "third party" with very high privileges.

  • But in general I agree to support Multy tenancy for BMHs and I would like to also advocate for implementing the
    https://cluster-api.sigs.k8s.io/developer/architecture/controllers/multi-tenancy#contract + "HostClaim" + "PolicyObject"



  • The second topic is whether we should allow CAPM3 to create M3Machines on top of VMs via the "HostClaim", I very much
    oppose this approach. I oppose it for multiple reason, Metal3 has a very specific scope of managing e2e life cycle of
    bare metal machines, there are CAPI providers for VMware, Kubevirt, BYOH and other providers that manage either
    "BM hosts that don't require life cycle management" or VMs.
  • I understand that using more than 1 provider for creating 1 homogeneous cluster is not trivial or even fully supported use-case in the CAPI ecosystem, but extending Metal3 to handle VMs is not the way to solve this problem IMO.
  • My counter proposal would be that if you need a Multy tenant environment where users can create clusters from a homogeneous set of BMHs based on VMs and BMs then let's implement the "Multy tenancy" first, then for your specific use-case you should use something called "Virtual Baremetal solution". Virtual baremetal is supported already with Openstack via Sushy-tools and it is planned to be supported in kubevirt (https://github.com/kubevirt/community/blob/main/design-proposals/kubevirtbmc.md). Without going into the implementation details of the different virtual baremetal solutions the main takeaway is that from BMO perspective these virtual baremtal solutions provide BMC addresses and they emulate the behavior of BMCs thus BMO can handle the VMs the same way as it would handle regular BMs such a way that no extra logic is needed in the Metal3 stack.

@pierrecregut
Copy link
Author

  • One topic would be the multytenancy where I agree with the approach of @lentzi90 that we would
    need a combination of "HostClaim" paired with a sort of "Policy object" to facilitate and indirection.
    Indirection would be implemented such a way that neither the CAPM3 nor the BMO would have access to each others
    Namespaces but they would have access to the Namespace where the HostClaim is created and based on the "Policy object"
    BMO would either pair the claim with a BMH or not.

I mostly agree on this part. The scope of the Policy object is more complex. Hardware owners need to restrict the access to their BMH objects. Cluster owners want to to control on which hardware they are deployed. Finally at some point we will also need a way to do accounting and restrict the resources a cluster owner can acquire. If we forget the last point we still have two stake owners with different needs and one object. We also want to make sure that servers used to build clusters without hostclaim are not automatically promoted to be used by hostclaims from other namespaces. So I still do not know what is the right Policy object. In the proposal, only the first part is handled with an annotation on BMH to authorize the other namespaces to which they can be bound. The content could be a regular expression, or something like a list with catchall expansion.

@pierrecregut
Copy link
Author

  • I do not like the "Proxy approach", that proxy would be a "third party" with very high privileges.

The proxy does not need higher privileges than the regular controller (view over BMH and... their associated resources). Without the proxy at some point, you will need to create kubernetes users and their associated roles. This is a high privilege task and it needs to be automated at some point. I think it is safer to implement a custom protocol (or at least have a very custom proxy that restrict secrets to a given type) because the associated resources of the BMH are secrets and it may be hard to restrict which secrets can be viewed by the proxy but still a proxy is better than an agent creating roles on the fly.

In any case, we do not necessarily need it in the first implementation. Some tasks will be manual...

@pierrecregut
Copy link
Author

  • The second topic is whether we should allow CAPM3 to create M3Machines on top of VMs via the "HostClaim", I very much
    oppose this approach. I oppose it for multiple reason, Metal3 has a very specific scope of managing e2e life cycle of
    bare metal machines, there are CAPI providers for VMware, Kubevirt, BYOH and other providers that manage either
    "BM hosts that don't require life cycle management" or VMs.

  • I understand that using more than 1 provider for creating 1 homogeneous cluster is not trivial or even fully supported use-case in the CAPI ecosystem, but extending Metal3 to handle VMs is not the way to solve this problem IMO.

  • My counter proposal would be that if you need a Multy tenant environment where users can create clusters from a homogeneous set of BMHs based on VMs and BMs then let's implement the "Multy tenancy" first, then for your specific use-case you should use something called "Virtual Baremetal solution". Virtual baremetal is supported already with Openstack via Sushy-tools and it is planned to be supported in kubevirt (https://github.com/kubevirt/community/blob/main/design-proposals/kubevirtbmc.md). Without going into the implementation details of the different virtual baremetal solutions the main takeaway is that from BMO perspective these virtual baremtal solutions provide BMC addresses and they emulate the behavior of BMCs thus BMO can handle the VMs the same way as it would handle regular BMs such a way that no extra logic is needed in the Metal3 stack.

As long as hostclaims are supported for the multi tenancy approach, the proposal does not "extend" Metal3 beyond the support for a rather opaque field "kind" that can be used to choose which controller handles which hostclaim. The important point is that HostClaim implemented correctly necessarily extract the life-cycle management part out of the capm3 provider.

In the previous answers, I already explained why "virtual baremetal" is a complex solution that requires a lot of additional parts like a kind of autoscaler to have the right number of VMs with the appropriate characteristics. Also Sushy used as a virtual BMC has always been considered as somewhat experimental and only good for CI from a security perspective. And the fact that other providers "know" how to handle the network does not help.

There is an additional reason: having capm3 as a unique "node manager" simplifies the life of the cluster managers. There are difference in the way that capm3, capo, capk handle things such as node reuse, node validation, etc. and these difference do not provide anything useful to projects like Sylva or Kanod that try to provide clusters on demand on various compute resources.

@lentzi90
Copy link
Member

How about we split out the CAPI multi-tenancy contract to its own proposal? I could push that hopefully this week already. Perhaps it will be easier then to make progress on HostClaims here separately?
I would focus the multi-tenancy proposal on just implementing the CAPI contract, without any HostClaim or changes to how CAPM3 interacts with BMHs. That should be a smaller change and with it in place I hope it will be easier to reason about the HostClaims here.

@lentzi90
Copy link
Member

Pushed the multi-tenancy proposal: #429

Copy link
Member

@mboukhalfa mboukhalfa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts and questions

OS image with an initial configuration (typically cloud-init or ignition
configuration files) on a compute resource that meets a set of requirements
(host selectors). These requirements can be interpreted either as labels
for a recyclable resource (such as a bare-metal server) or as characteristics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the scope of metal3 is bare-metal servers that's make all our infra recyclable we can have some disposable resource to simulate the bare-metal but cannot treat them from metal3 (BMO or CAPM3) as disposable all resource seen in CAPM3 is recyclable if we want hybrid clusters then this might be solved in higher abstraction layer or we manage the infra in lower layer and encapsulate the disposable under a layer that hide the disposable features from metal3 and virtualize it as recyclable (baremetal3)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is not to change how BareMetalHost work and the proposal does not modify the BMH controller. The idea is to split capm3 between a part that generate the workload to execute (with all the templating required to create a cloud-init) and the association part that binds the machine by introducing an object in the middle. After what amounts to mostly code shuffling you have a new resource HostClaim which is disposable and abstract enough. This is essentially code shuffling that introduces new nice capabilities and the other approach section states the limitations of the other approach you mention.

be put in different namespaces and the BareMetalHost objects must be defined
in those namespaces. In this setup, the tenants are the owners of the servers
and it becomes difficult to share the same server between different clusters if
they belong to different tenants.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the second approach the tenant can reuse (share) the same Hardware how we define from the hardware perspective can two different tenant share Hardware behind the same BMC how the Hardware network would be defined what is isolated between two tenants and what can be shared

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sharing only happen when one cluster scale down release a resource and another one uses it. As said previously, network isolation between clusters is outside the scope of the proposal and BMC network must not be accessible to tenants.


In order to improve server usage, we would like to have a pool of servers that
clusters can lease depending on their workload. If we maintain the Metal3
constraints, all clusters must be defined in the same namespace. Unless very
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
constraints, all clusters must be defined in the same namespace. Unless very
constraints, all clusters must be defined in the same namespace.

Does not seem to be precise I would say cluster must be defined in the same namespace hosting BMHs !
Are we here trying to solve the selecting BMHs from pool without considering the multi-tenancy I would refer to some initiatives :
metal3-io/cluster-api-provider-metal3#795 (comment)
metal3-io/cluster-api-provider-metal3#1071

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add that this is the BMH namespace. The main difference with the proposals/issues mentioned is that the hostClaim is in the same namespace as the m3m and is owned by it but not the BMH. And it works because BMH are not pivoted.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My last comment is wrong in the current version of the proposal. When credentials are used, the hostClaim is not owned by the m3m. This was only true when there was a separate notion of remote hostClaim pointing to the hostClaim where the BMH lives, but we agreed to put the capability to access remote resources in capm3. So the hostClaim are not pivoted.

Due to the unique challenges of managing bare-metal, the Metal3 project has
developed a set of abstractions and tools that could be used in different
settings. The main mechanism is the selection process that occurs between
the Metal3Machine and the BareMetalHost which assigns a disposable workload
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the Metal3Machine and the BareMetalHost which assigns a disposable workload
the Metal3Machine and the BareMetalHost which assigns a disposable workload

Now that confuses me disposable here is higher level k8s node same as capi machine

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OS image and the cloud-init created by CAPM3 defines a workload for the underlying baremetal server which has a single purpose : being a K8S node of the target cluster. When the cluster is deleted/scaled down, this workload is removed from the server that goes back to ready state so it is disposable.


This proposal introduces a new resource called HostClaim that solves
both problems by decoupling the definition of the workload performed by
the Metal3 machine controller from the actual compute resource.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How hiding the bmh can protect the the tenant I think we are not conceding the uniqueness of the baremetal3 resource hiding the BMH resource does not protect the real server the baremetal3 uniqueness AFAIK as you own the hardware how can it be protect since it can access the infra layer eventually! Except if it is physically isolated can't be protected if the hardware network is connected to other servers

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean access to the BMC ? It must not be given to tenants. In the same way that they must not see the BMC credentials. An important note is that if you pivot a cluster with hostclaim, you are not pivoting ironic (unless everybody is in the same namespace and this is essentially for bootstrap). Because bootstrap is a special case, there is a dedicated scenario that tells essentially that for bootstrap, the old behaviour is kept because everything is moved.

Do you mean network isolation between servers: maintaining it is outside the scope of this proposal. I think DT has a proposal where each cluster ensure its own isolation through "signed BGP" advertisment. Somehow the newly booted server comes with some credentials that make that it will have access to/publish some routes. In our integration project, we will probably try to make an evolution of the networking solution we had with Kanod baremetalpools: https://orange-opensource.gitlab.io/kanod/reference/kiab/network.html#kiab-network. But I think it is safe to say that it is outside the scope of Kanod.


### Goals

* Split responsibilities between infrastructure teams, who manage servers, and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

infrastructure teams and cluster administrators roles need to be defined in typical bare-metal scenario and the user (tenant)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should probably state more explicitly that administrators of the target clusters are the tenants.

The access to the remote cluster is pushed from a specific "remote"
HostClaim to the metal3 machine controller.
Risks of using a kubeconfig for automation are exposed.

Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
@metal3-io-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zaneb for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pierrecregut
Copy link
Author

I have pushed a new version with the following modifications:

  • state clearly that network isolation is outside the scope of the proposal
  • two small modifications to make clear why all clusters would end up in the same namespace (it is where the baremetalhosts are) and that the workload mentioned is being a node in the target cluster.

Copy link
Member

@zaneb zaneb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!
It looks like you may be unaware that I have a longstanding proposal for decomposing the BMH API in a way that supports multitenancy.

Comparing the two proposals, the major differences I see are:

  • I propose a resource type (BareMetalInventory) that allows the user to get a list of available Hosts that match a selector, rather than request one and not know whether it will be fulfilled or not. I've observed that users often care deeply about which host is which in a data center, even though you'd think we'd be past the point where that matters.
  • You propose to make a Host available in multiple different namespaces, and allow any one of them to claim it. While my proposal could easily accomodate that, I started from the assumption that Hosts would be pre-allocated to individual namespaces by some other process or administrative action. It's not clear to me what the right thing to do here is - being able to allocate dynamically from a pool of hosts is appealing, but if this is done within Metal³ then there will be pressure to have Metal³ automatically reconfigure the network. Whereas if the allocation process is externalised, the network reconfiguration can be as well. In general, users tend to want to utilize all of their hosts; it's rare to see bare metal left sitting idle. It's also rare for private clouds to implement chargeback schemes that give tenants an incentive not to just grab all of the available hosts in the pool; more common would be to allocate a fixed amount of capacity to a tenant based on their contribution to a capex budget.
  • I propose a first-class API (BareMetalAllocation) for specifying which namespace(s) can provision a host. You should do this too.
  • My proposal incorporates a detailed breakdown of all the fields in the BMH and which ones should be configurable by which actor (including the ability for an administrator to set defaults that can be overridden by the user), rather than just the 3 you care about (Image URL, User data, and Network data). This is more flexible (especially, but not only, for non-CAPI use cases) and could eventually lead to a future version of the API where a BMH resource is more like just a "BMC" resource, and not just collecting every field used by every actor together in one CRD as it has historically done.

I'd be interested to hear your comments.

recycled and are bound to new HostClaims, potentially belonging to other
clusters.

#### Hybrid Clusters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems clearly out of scope for Metal³ to me. If you want to mix infrastructure types within a cluster, then the other types should be handled by their respective CAPI infrastructure providers. Dealing with multiple infrastructure types in a cluster is CAPI's job, and if improvements in CAPI or in individual infrastructure providers are needed they should be driven there, not by creating an inner-platform inside Metal³.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I had the same concern. It's clearly something CAPI should provide. If they don't want to provide that, is it our job to basically invent a substitute CAPI feature? We need to keep in mind that the Metal3 team is rather small. We already struggle with reviews on some projects. We need to make sure we're focused on the primary mission - bringing bare-metal provisioning to kubernetes - even if it requires making some tough calls.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zaneb I do not think that the proposal fit the inner platform pattern:

  • It does not add new core functionality, there is nothing new in the proposal that fits the notion of "universal interpreter" to embed in Metal3
  • It is simply putting API (contracts) at new places so that the framework can be used differently and other project can use part of its functionalities instead of reimplementing the same thing

On the other hand trying to pretend that everything has a BMC and pre starting/configuring VM so that they can be used by Metal3 looks closer to an anti pattern for me.

@dtantsur, I perfectly understand your fear of having to address new feature requests with a limited number of developers but hostclaim does not add new needs for capm3 and it should help to expand the community. Metal3 must not implement the other back-ends for disposable compute resources (kubevirt, openstack, vmware or other). It is up to the hostclaim controller implementation to respect the host-claim contract in the same way as they accept the capi contract and if not possible they can always stay as a standard infra provider.

Why hybrid cluster use case is not handled above at capi level ? I think there are at least three reasons:

  • The initial design did not consider it. This is not a good reason but the separate evolution of the infra providers makes it harder to reconsider the initial design choices and the next two factors are probably the most important.
  • Most providers handle networking. For the hybrid case, the usual "one size fits all" approach to networking the providers implement just do not work. Metal3 does not address networking and should not.
  • providers for public cloud and providers for bare-metal/private cloud do not necessarily access the same kind of API. For those it is not clear that the hostclaim approach would be easy/wise to adopt. But it does not mean that putting a branching point at the level of the host claim rather than at the level of the machine for all the use cases that support it is not the right software engineering choice.

A lot of infra providers implement exactly the same functions in different ways. Not only it means developers redoing and maintaining different implementations but for end-user and especially frameworks orchestrating cluster creations it means slightly different semantics and completely different ways of passing parameters. Two examples:

  • the way for a node to tell it is up uses an ssh connection with kubevirt (see open issue complaining on it). capm3 implementation goes through the cluster API. That code would be shared and implemented once with hostclaims
  • the feature to access compute resource on another cluster, implemented with either Multi-tenancy proposal #429 or the remote hostclaim, would also be useful for the kubevirt provider to be able to create VM based clusters over other clusters. Unfortunately as long as they are completely different providers they would have to reimplement it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zaneb unfortunately the google document is private, so I cannot read it. But from your explanation I see a few drawbacks:

  • knowing the tenant does not mean knowing the network configuration of nodes and I do not see why it should be so static. Not all cluster nodes need the same connectivity.
  • we are looking for ways to dynamically share infrastructure for clusters that have different activity profiles and reduce hardware over provisioning.

It is possible to reconfigure network dynamically either from a centralized manager (the approach we chose in Kanod) or better from the cluster themselves with Host based routing. On the other hand I think we should always distinguish the provisioning network seen by Ironic and OS image repositories which is star shaped (servers can only communicate with Ironic) and the network underlay for the workload which is the one that is reconfigured. On those network, capm3, potentially not in the same network as Ironic, needs access to the k8s endpoint (the reverse is not true).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CAPI provides an abstraction layer over different platforms' ways of provisioning machines.
This proposal provides and abstraction layer over different platforms' ways of provisioning machines, inside one of the CAPI platform providers.
It's exactly the inner platform effect. It's the most canonical example I've seen since people started writing document layout programs in JavaScript.

It would be naive to think that it will end with adding this one field. If anybody uses this, then we will inevitably be called upon to reimplement all of the features of CAPI.

From the fact that we are still speculating about the reasons that CAPI doesn't support this, I would infer that you have not actually approached the CAPI project and explained your requirements to them. I certainly understand why taking a more direct route to implementing the thing you want is appealing for you, but it rarely ends there and this is why open source projects must be protective of their scope.

The API decomposition doc is shared with the metal3-development mailing list, so if you join the project mailing list you'll be able to read and comment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you pointed the wikipedia article, I cite this explanation: "The inner-platform effect occurs when this library expands to include general purpose functions that duplicate functionality already available as part of the programming language or platform." and you also said: "we will inevitably be called upon to reimplement all of the features of CAPI."

I really fail to see where the proposal reimplements anything from CAPI. It is just a different branching point allowing more code reuse than was previously possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the CAPI topic: have you heard a definite "no, we won't do this" from them?

The solution is to enforce that BareMetalHost that can be bound to a
HostClaim have a label (proposed name: ``hostclaims.metal3.io/namespaces``)
restricting authorized HostClaims to specific namespaces. The value could be
either ``*`` for no constraint, or a comma separated list of namespace names.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make a proper API for this? There's no reason it needs to be so janky.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes a proper field would be better. But I do not think that a complex authorization framework is a good idea.

Another unrelated problem is that Cluster-API has been designed
to define clusters using homogeneous compute resources: it is challenging to
define a cluster with both bare-metal servers and virtual machines in a private
or public cloud.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning the CAPI developers give for not supporting heterogeneous clusters? Are they aware of the use cases?

use hostSelectors. Controllers for a "bare-metal as a service" service
may use selectors.

#### Compute Resources in Another Cluster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a feature that could be implemented separately from the larger HostClaim discussion.

Copy link
Author

@pierrecregut pierrecregut Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be #429 and be included in capm3 code. For HostClaim you need to add the resource to the list of tracked remote resources (BMH in the proposal). I suggested to use another hostclaim controller but it was rejected. An example implementation along #429 principle is here (for hostclaims only, not bmh).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants