
fix(webhook): more robust cidr check for ippool #29

Merged
merged 2 commits into harvester:main on Mar 12, 2024

Conversation

@starbops (Member) commented Mar 8, 2024

IMPORTANT: Please do not create a Pull Request without creating an issue first.

Problem:

We relied on a node annotation called rke2.io/node-args to extract the cluster-wide service CIDR from the --service-cidr argument. It's an RKE2-specific annotation, and there is currently no other good way to obtain this information via Kubernetes API calls. In addition, the implementation has a flaw when iterating through all the nodes: worker nodes do not carry that flag, so the validation procedure fails if the cluster contains any pure worker nodes.

We also need a way for administrators to specify the cluster-wide service CIDR when the cluster is not RKE2-based.

Solution:

Load the cluster-wide service CIDR from the following sources:

  • rke2.io/node-args annotation in management Node objects
  • Webhook command argument --service-cidr
  • Default value 10.53.0.0/16

Compare the CIDR of the user-input IPPool object with the cluster-wide one. If they overlap, reject the create/update request.
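
For context, the overlap test itself reduces to parsing both CIDR strings and checking containment. A minimal, self-contained sketch in Go (the cidrsOverlap helper is illustrative, not the webhook's actual code):

    package main

    import (
        "fmt"
        "net"
    )

    // cidrsOverlap reports whether two CIDR blocks share any addresses.
    // Aligned CIDR blocks overlap exactly when one contains the other's
    // network address.
    func cidrsOverlap(a, b string) (bool, error) {
        _, netA, err := net.ParseCIDR(a)
        if err != nil {
            return false, fmt.Errorf("invalid cidr %s: %w", a, err)
        }
        _, netB, err := net.ParseCIDR(b)
        if err != nil {
            return false, fmt.Errorf("invalid cidr %s: %w", b, err)
        }
        return netA.Contains(netB.IP) || netB.Contains(netA.IP), nil
    }

    func main() {
        overlap, _ := cidrsOverlap("10.53.0.0/16", "10.53.0.0/16")
        fmt.Println(overlap) // true: identical ranges obviously overlap
    }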

Related Issue:

harvester/harvester#5153

Test plan:

  1. Install and enable the harvester-vm-dhcp-controller add-on
    apiVersion: harvesterhci.io/v1beta1
    kind: Addon
    metadata:
      labels:
        addon.harvesterhci.io/experimental: "true"
      namespace: harvester-system
      name: harvester-vm-dhcp-controller
    spec:
      chart: harvester-vm-dhcp-controller
      enabled: true
      repo: https://charts.harvesterhci.io
      valuesContent: |
        image:
          repository: starbops/harvester-vm-dhcp-controller
          tag: fix-5153-head
        agent:
          image:
            repository: starbops/harvester-vm-dhcp-agent
            tag: fix-5153-head
        webhook:
          image:
            repository: starbops/harvester-vm-dhcp-webhook
            tag: fix-5153-head
      version: 0.3.0
  2. Prepare a VM Network (NAD) named test-net
  3. Create, using kubectl, an IPPool object associated with the VM Network whose CIDR overlaps the cluster service CIDR
    apiVersion: network.harvesterhci.io/v1alpha1
    kind: IPPool
    metadata:
      namespace: default
      name: test-net
    spec:
      ipv4Config:
        serverIP: 10.53.0.2
        cidr: 10.53.0.0/16
        pool:
          start: 10.53.0.100
          end: 10.53.0.200
      networkName: default/test-net
  4. The creation request should be rejected by the validating admission webhook:
    Error from server (InternalError): error when creating "STDIN": admission webhook "validator.harvester-system.harvester-vm-dhcp-controller-webhook" denied the request: Internal error occurred: could not create IPPool default/test-net because cidr 10.53.0.0/16 overlaps cluster service cidr 10.53.0.0/16
    

@w13915984028 (Member) left a comment

LGTM, thanks.

var serviceCIDR string
serviceCIDR, err = util.GetServiceCIDRFromNode(node)
if err != nil {
    logrus.Warningf("could not find service cidr from node annotation")

Please wrap the err into the log so that it carries the node name and more detailed information.
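
For illustration only, attaching both the error and the node name with logrus could look like this (the exact fields are an assumption, not necessarily the change made in this PR):

    logrus.WithFields(logrus.Fields{
        "node": node.Name,
    }).WithError(err).Warn("could not find service cidr from node annotation")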

sets := labels.Set{
    util.ManagementNodeLabelKey: "true",
}
mgmtNodes, err := v.nodeCache.List(sets.AsSelector())

When the webhook has just come up, the nodeCache can be empty for a while, so it seems essential to list from the remote API server.

@starbops (Member, Author) Mar 8, 2024

I'm curious: if nodeCache can be empty initially, will other resource types, e.g., NAD, VM, etc., have the same issue? Changing all the caches to clients for the webhooks seems too heavy. Or is there a good way to force the cache to be filled when the webhook comes up?

@w13915984028 (Member) Mar 8, 2024

Others will fail, and the reconciler will resolve it; but here you have a fallback path ...

Alternatively, when the list returns zero items, retry by listing from the remote API server.
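
A rough sketch of that fallback, assuming a hypothetical v.nodeClient that lists Node objects directly from the API server (the client name and its List signature are illustrative, not the actual code):

    mgmtNodes, err := v.nodeCache.List(sets.AsSelector())
    if err != nil {
        return err
    }
    if len(mgmtNodes) == 0 {
        // The informer cache may still be empty right after the webhook starts,
        // so fall back to listing from the API server directly.
        nodeList, err := v.nodeClient.List(metav1.ListOptions{
            LabelSelector: sets.AsSelector().String(),
        })
        if err != nil {
            return err
        }
        for i := range nodeList.Items {
            mgmtNodes = append(mgmtNodes, &nodeList.Items[i])
        }
    }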

@starbops (Member, Author)

Hmmm, that makes sense. Invalid ippool objects could slip in at that specific moment. Thanks!

        },
    },
    expected: output{
        err: fmt.Errorf("could not create IPPool %s/%s because cidr %s overlaps cluster service cidr %s", testIPPoolNamespace, testIPPoolName, testCIDROverlap, testServiceCIDR),

"cannot" or "could not"? (same for a few following occurrences)

@starbops (Member, Author)

After discussion, I'll drop the runtime determination of the service CIDR from the nodes' annotations because the footprint is too heavy: it queries every Node object whenever an IPPool is created or updated. Instead, I will keep 10.53.0.0/16 as the default service CIDR and allow users to configure it via the webhook binary's argument (also configurable through a chart value).

cc @w13915984028
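
A minimal sketch of exposing such an argument with Go's standard flag package, assuming only the flag name and default value described in this PR (the real webhook wires this through its own CLI setup in cmd/webhook/root.go):

    package main

    import (
        "flag"
        "fmt"
    )

    func main() {
        // --service-cidr defaults to 10.53.0.0/16; in the chart, the value is
        // rendered into the webhook's command-line arguments.
        serviceCIDR := flag.String("service-cidr", "10.53.0.0/16",
            "cluster-wide service CIDR used when validating IPPool objects")
        flag.Parse()
        fmt.Println("validating IPPools against service CIDR:", *serviceCIDR)
    }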

@bk201 bk201 self-requested a review March 11, 2024 04:39
Load the cluster-wide service CIDR from the following sources:

- "rke2.io/node-args" annotation in management Node objects
- Webhook command argument "--service-cidr"
- Default value "10.53.0.0/16"

Compare the CIDR of the user-input IPPool object with the cluster-wide
one. If they overlap, reject the create/update requests.

Signed-off-by: Zespre Chang <zespre.chang@suse.com>
@w13915984028 (Member) left a comment

LGTM, thanks.

@bk201 (Member) left a comment

lgtm!

@mingshuoqiu left a comment

LGTM

It's overkill to retrieve the cluster's service CIDR at runtime since it
rarely changes and is almost the same in every Harvester deployment.
Revert the relevant code and let users provide the service CIDR string
via the webhook's command-line argument to retain flexibility. The
default value is still `10.53.0.0/16`.

Signed-off-by: Zespre Chang <zespre.chang@suse.com>
@starbops starbops merged commit 5d790ef into harvester:main Mar 12, 2024
5 checks passed