From 34cd21d01e95f96c326a8fd1c93e7586ff1560e8 Mon Sep 17 00:00:00 2001 From: Antonio Ojea Date: Wed, 8 Jun 2022 17:18:33 +0200 Subject: [PATCH 1/2] KEP-1880 Multiple Service CIDRs Kubernetes networking is based fundamentally on two kind of networks, the Pod network aka Cluster Network and the Service Network. These networks are configured statically, using flags, causing a lot of limitations and pain on users and administrators. The Pod network is consumed by the CNI plugins, Kubernetes already provides a simple IPAM that will be extended by the KEP-2593 "Multiple Cluster CIDRs", allowing users to use an API to configure it. This KEP provides a similar API for the Service Network, allowing users to dynamically add more networks or resize the existing ones. It also remove some of the limitations of current implementation. --- keps/prod-readiness/sig-network/1880.yaml | 3 + .../1880-multiple-service-cidrs/README.md | 981 ++++++++++++++++++ .../1880-multiple-service-cidrs/kep.yaml | 54 + 3 files changed, 1038 insertions(+) create mode 100644 keps/prod-readiness/sig-network/1880.yaml create mode 100644 keps/sig-network/1880-multiple-service-cidrs/README.md create mode 100644 keps/sig-network/1880-multiple-service-cidrs/kep.yaml diff --git a/keps/prod-readiness/sig-network/1880.yaml b/keps/prod-readiness/sig-network/1880.yaml new file mode 100644 index 00000000000..4be500d1ba9 --- /dev/null +++ b/keps/prod-readiness/sig-network/1880.yaml @@ -0,0 +1,3 @@ +kep-number: 1880 +alpha: + approver: "@johnbelamaric" diff --git a/keps/sig-network/1880-multiple-service-cidrs/README.md b/keps/sig-network/1880-multiple-service-cidrs/README.md new file mode 100644 index 00000000000..aaca46261d7 --- /dev/null +++ b/keps/sig-network/1880-multiple-service-cidrs/README.md @@ -0,0 +1,981 @@ +# KEP-1880: Multiple Service CIDRs + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Story 3](#story-3) + - [Story 4](#story-4) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Current implementation details](#current-implementation-details) + - [New allocation model](#new-allocation-model) + - [The kube-apiserver bootstrap process and the service-cidr flags](#the-kube-apiserver-bootstrap-process-and-the-service-cidr-flags) + - [The special "default" ServiceCIDRConfig](#the-special-default-servicecidrconfig) + - [Service IP Allocation](#service-ip-allocation) + - [Service IP Reservation](#service-ip-reservation) + - [Edge cases](#edge-cases) + - [Resizing Service IP Ranges](#resizing-service-ip-ranges) + - [API](#api) + - [Allocator](#allocator) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Deprecation](#deprecation) + - [Upgrade / Downgrade / Version Skew Strategy](#upgrade--downgrade--version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback 
Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Alternative 1](#alternative-1) + - [Alternative 2](#alternative-2) + - [Alternative 3](#alternative-3) + - [Alternative 4](#alternative-4) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in + [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input + (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance + Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by + [Conformance + Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to + [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list + discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +Allow to dynamically expand the number of IPs available for Services. + +## Motivation + +Services are an abstract way to expose an application running on a set of Pods. Some type of +Services: ClusterIP, NodePort and LoadBalancer use cluster-scoped virtual IP address, ClusterIP. +Across your whole cluster, every Service ClusterIP must be unique. Trying to create a Service with a +specific ClusterIP that has already been allocated will return an error. + +Current implementation of the Services IP allocation logic has several limitations: + +- users can not resize or increase the ranges of IPs assigned to Services, causing problems when + there are overlapping networks or the cluster run out of available IPs. +- the Service IP range is not exposed, so it is not possible for other components in the cluster to + consume it +- the configuration is per apiserver, there is no consensus and different apiservers can fight and delete + others IPs. +- the allocation logic is racy, with the possibility of leaking ClusterIPs + +- the size of the Service Cluster CIDR, for IPv4 the prefix size is limited to /12, however, for + IPv6 it is limited to /112 or fewer. 
This restriction is causing issues for IPv6 users, since /64
  is the standard and minimum recommended prefix length.
- reserving a range of Service IP addresses for static allocation is only possible via the feature gate
  described in "keps/sig-network/3070-reserved-service-ip-range"

### Goals

Implement a new allocation logic for Service IPs that:

- scales well with the number of reservations
- allows the number of available allocations to be tuned
- is completely backwards compatible

### Non-Goals

- Any change unrelated to Services
- Any generalization of the API model that can evolve into something different, like an IPAM API, or
  collide with existing APIs like Ingress and Gateway API.
- NodePorts use the same allocation model as ClusterIPs, but they are totally out of scope of this
  KEP. However, this KEP can serve as proof that a similar model would work for NodePorts too.
- Changing the default IPFamily used for Service IP allocation; the defaulting depends on
  services.spec.IPFamily and services.spec.IPFamilyPolicy, and a simple webhook or an admission plugin
  can set these fields to the desired default, so the allocation logic doesn't have to handle it.
- Removing the apiserver flags that define the service IP CIDRs, though that may be possible in the future.

## Proposal

The proposal is to implement a new allocator logic that uses 2 new API objects, ServiceCIDRConfig
and IPAddress, and allows users to dynamically increase the number of Service IPs available by
creating new ServiceCIDRConfigs.

The new allocator will be able to "automagically" consume IPs from any available ServiceCIDRConfig; we can
think of this model as adding more disks to a storage system to increase its capacity.

To simplify the model, make it backwards compatible and to avoid that it can evolve into something
different and collide with other APIs, like Gateway APIs, we are adding the following constraints:

- a ServiceCIDRConfig will be immutable after creation (to be revisited before Beta).
- a ServiceCIDRConfig can only be deleted if there are no Service IP associated to it (enforced by finalizer).
- there can be overlapping ServiceCIDRConfigs.
- the apiserver will periodically ensure that a "default" ServiceCIDRConfig exists to cover the service CIDR flags
  and the "kubernetes.default" Service.
- any IPAddress existing in a cluster has to belong to a ServiceCIDRConfig.
- any Service with a ClusterIP assigned is expected to have always a IPAddress object associated.
- a ServiceCIDRConfig which is being deleted can not be used to allocate new IPs

  This creates a 1-to-1 relation between Service and IPAddress, and a 1-to-N relation between
  ServiceCIDRConfig and IPAddress.

  The new allocator logic can be used by other APIs, like Gateway API.

### User Stories

#### Story 1

As a Kubernetes user, I want to be able to dynamically increase the number of IPs available for
Services.

#### Story 2

As a Kubernetes admin, I want to have a process that allows me to renumber my Service IPs.

#### Story 3

As a Kubernetes developer, I want to be able to know the current Service IP ranges and the IP
addresses already allocated.

#### Story 4

As a Kubernetes admin that wants to use IPv6, I want to be able to follow IPv6 address planning
best practices and assign /64 prefixes to my end subnets.
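
As an illustration of Story 1 and Story 4, the snippet below builds the kind of ServiceCIDRConfig objects an administrator could create: an extra IPv4 range and a full IPv6 /64. The types are a minimal local mirror of the API sketched in the [API](#api) section; the object names, CIDR values, and the final group/version are purely illustrative and may change before implementation.

```go
package main

import "fmt"

// Minimal local mirror of the ServiceCIDRConfig spec sketched later in this
// KEP; these are not the real generated Kubernetes types.
type ServiceCIDRConfigSpec struct {
	IPv4 string // IPv4 block in CIDR notation, e.g. "10.0.1.0/24"
	IPv6 string // IPv6 block in CIDR notation, e.g. "2001:db8:1234:5678::/64"
}

type ServiceCIDRConfig struct {
	Name string
	Spec ServiceCIDRConfigSpec
}

func main() {
	// Story 1: add a second IPv4 range because the original one is exhausted.
	extraV4 := ServiceCIDRConfig{
		Name: "extra-ipv4-range",
		Spec: ServiceCIDRConfigSpec{IPv4: "10.0.1.0/24"},
	}

	// Story 4: expose a full /64, something the current bitmap allocator cannot do.
	subnetV6 := ServiceCIDRConfig{
		Name: "ipv6-subnet",
		Spec: ServiceCIDRConfigSpec{IPv6: "2001:db8:1234:5678::/64"},
	}

	fmt.Printf("%+v\n%+v\n", extraV4, subnetV6)
}
```
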

### Notes/Constraints/Caveats (Optional)

Current API machinery doesn't consider transactions. The Services API is special in this regard,
since it already performs allocations inside the Registry pipeline, using the Storage to keep
consistency on an obscure API object that stores the bitmap with the allocations. This proposal still
maintains this behavior, but instead of modifying the shared bitmap, it will create a new IPAddress object.

Changing Service IP allocation to be asynchronous with regard to the Service creation would be a MAJOR
semantic change and would almost certainly break clients.

### Risks and Mitigations

Cross-validation of objects is not common in the Kubernetes API; sig-apimachinery should verify that a
bad precedent is not introduced.

Current Kubernetes clusters have a very static network configuration. Allowing the Service
IP ranges to be expanded gives more flexibility to users, with the risk of problematic or incongruent
network configurations with overlapping ranges. This is not really a new problem, though: users need to do
proper network planning before adding new ranges to the Service IP pool.

Service implementations, like kube-proxy, can be impacted if they were making assumptions about the ranges assigned to the Services.
Those implementations should implement logic to watch the configured Service CIDRs.

Kubernetes DNS implementations, like CoreDNS, need to know the Service CIDRs for PTR lookups (and to know which PTR lookups they are NOT authoritative for). Those implementations may be impacted, but they can also benefit from the
new API to automatically configure the Service CIDRs.

## Design Details

### Current implementation details

[Kubernetes
Services](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#service-v1-core) need
to be able to dynamically allocate resources from a predefined range/pool for ClusterIPs.

ClusterIP is the IP address of the Service and is usually assigned randomly. If an address is
specified manually, is in-range, and is not in use, it will be allocated to the Service; otherwise
creation of the Service will fail. The Service ClusterIP range is defined in the kube-apiserver with
the following flag:

```
--service-cluster-ip-range string
A CIDR notation IP range from which to assign service cluster IPs. This must not overlap with any IP ranges assigned to nodes or pods. Max of two dual-stack CIDRs is allowed.
```

And in the controller-manager (this is required for the node-ipam controller to avoid overlapping with the podCIDR ranges):

```
--service-cluster-ip-range string
CIDR Range for Services in cluster. Requires --allocate-node-cidrs to be true
```

The allocator is implemented using a bitmap that is serialized and stored in an ["opaque" API
object](https://github.com/kubernetes/kubernetes/blob/b246220/pkg/apis/core/types.go#L5311).

To avoid leaking resources, the apiservers run a repair loop that keeps the bitmaps in sync with the IPs currently allocated to Services.

The order of the IP families defined in the service-cluster-ip-range flag defines the primary IP family used by the allocator.
This default IP family is used in cases where a Service creation doesn't provide the necessary information, defaulting
the Service to Single Stack with an IP chosen from the default IP family.

The current implementation doesn't guarantee consistency for Service IP ranges and default IP families across
multiple apiservers.
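
To illustrate why this model caps the range size, here is a deliberately simplified sketch of a bitmap-based allocator in the spirit of the current implementation (it is not the actual kube-apiserver code, and the size cap chosen here is arbitrary): every address in the configured CIDR maps to one bit of a bitmap that is serialized into a single API object, so the number of addresses must stay small enough for that object to be practical, which rules out ranges like an IPv6 /64.

```go
package main

import (
	"fmt"
	"math/big"
	"net"
)

// bitmapAllocator models the current approach: one bit per address in the
// configured CIDR, serialized into a single API object.
type bitmapAllocator struct {
	base   net.IP   // first IP of the range
	size   uint64   // number of addresses in the range
	bitmap *big.Int // bit i set => base+i is allocated
}

func newBitmapAllocator(cidr string) (*bitmapAllocator, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	ones, bits := ipnet.Mask.Size()
	if bits-ones > 20 { // cap the bitmap size, as the real allocator does
		return nil, fmt.Errorf("range %s is too large for a bitmap allocator", cidr)
	}
	return &bitmapAllocator{
		base:   ipnet.IP,
		size:   1 << uint(bits-ones),
		bitmap: new(big.Int),
	}, nil
}

// allocate marks the address at the given offset within the range as used.
func (a *bitmapAllocator) allocate(offset uint64) error {
	if offset >= a.size {
		return fmt.Errorf("offset %d out of range", offset)
	}
	if a.bitmap.Bit(int(offset)) == 1 {
		return fmt.Errorf("address already allocated")
	}
	a.bitmap.SetBit(a.bitmap, int(offset), 1)
	return nil
}

func main() {
	a, _ := newBitmapAllocator("10.0.0.0/24")
	fmt.Println(a.allocate(10)) // <nil>
	fmt.Println(a.allocate(10)) // address already allocated
	_, err := newBitmapAllocator("2001:db8::/64")
	fmt.Println(err) // too large for a bitmap allocator
}
```

The new model described below removes this constraint, because each allocation becomes its own IPAddress object instead of a bit in one shared, size-limited bitmap.
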
+ +[![](https://mermaid.ink/img/pako:eNqFU0FuwjAQ_MrKZ_iAD0gpFIlbJKSectnaG1jJsVPbQaKIv9dpEhICtM4tM7OZnXEuQjlNQopAXw1ZRRvGg8eqsJAOqug8ZIAB1p4wEuzJn1hRB9foIyuu0UZ4owPbjvQMLJ2nV2iW7z7QsMbIzj7C--QBD536KWHLlsPx1fQNqaRPM4pemsFytZr6lU-Xsy69cSfy99Sd5cjJ7TfBLoctVmzOUDIZHf7UZcY41X5kbZoQye_yMBia8GDZeRvjkhBisk-H8-P4KSv3lLamrfPTIF6xN97VoDngpyF9sz_YGZmdn7uCJJxmZd3BnWLWmQSKSvd3ykQIjVIU-sDaM-N3Q27NyeSwrXjk36DfLjMJHUQmEJTI5p_J0xsjwVMKKI6SMbQ5zxCGaYPAJZD37d0axFPJzJzVbcTDA5MjFqIiXyHr9CdeWqwQ8UgVFSKphaYSGxMLUdhrojZ1ipreNafVhCwxLb0Q2ES3P1slZPQNDaT-b-5Z1x-6EVE6)](https://mermaid.live/edit#pako:eNqFU0FuwjAQ_MrKZ_iAD0gpFIlbJKSectnaG1jJsVPbQaKIv9dpEhICtM4tM7OZnXEuQjlNQopAXw1ZRRvGg8eqsJAOqug8ZIAB1p4wEuzJn1hRB9foIyuu0UZ4owPbjvQMLJ2nV2iW7z7QsMbIzj7C--QBD536KWHLlsPx1fQNqaRPM4pemsFytZr6lU-Xsy69cSfy99Sd5cjJ7TfBLoctVmzOUDIZHf7UZcY41X5kbZoQye_yMBia8GDZeRvjkhBisk-H8-P4KSv3lLamrfPTIF6xN97VoDngpyF9sz_YGZmdn7uCJJxmZd3BnWLWmQSKSvd3ykQIjVIU-sDaM-N3Q27NyeSwrXjk36DfLjMJHUQmEJTI5p_J0xsjwVMKKI6SMbQ5zxCGaYPAJZD37d0axFPJzJzVbcTDA5MjFqIiXyHr9CdeWqwQ8UgVFSKphaYSGxMLUdhrojZ1ipreNafVhCwxLb0Q2ES3P1slZPQNDaT-b-5Z1x-6EVE6) + +### New allocation model + +The new allocation mode requires: + +- 2 new API objects ServiceCIDRConfig and IPAddress in networking.k8s.io/v1alpha1, see . Both will be protected with finalizers. +- 1 new allocator implementing current `allocator.Interface`, that runs in each apiserver, and uses the new objects to allocate IPs for Services. +- 1 new controller that participates on the ServiceCIDRConfig deletion, the guarantee that each IPAddresses has + a ServiceCIDRConfig associated. It also handles the special case for the `default` ServiceCIDRConfig. +- 1 new controller that participates in the IPAddress deletion, that guarantees that each Service IP + allocated has its corresponding IPAddress object, recreating it if necessary. +- 1 new controller that handles the bootstrap process and the default ServiceCIDRConfig. + +[![](https://mermaid.ink/img/pako:eNp9UstqwzAQ_BWxx5IcCqUHUwrFacGXYJLeqh620tpV0SNIcsAk_vfKjpNiJ3QvYmdGs7NIBxBOEmQgNIawUlh7NNyyVAPCtuT3StDhhPWV6yZE8kXJQvTK1jeYwD4-52RRvqFRWlFPjk17Rbel00q07G7av7c7Omm7G-HyYrXJna1UfYm5RkOzfEW5f7iGHifQxL0oX6T0FMJ_rhuqmPv6ScfE4VynbszJONxzYE_H5fL4PDaXIVkCUGsnMFLgMLn4t-DcYj23EM5GVHZwgAUY8gaVTA88LMEhfpMhDr1UUoWNjr2yS9JmJ9PoV6mi85BVqAMtAJvotq0VkEXf0Fk0_pNR1f0CyVS2dw)](https://mermaid.live/edit#pako:eNp9UstqwzAQ_BWxx5IcCqUHUwrFacGXYJLeqh620tpV0SNIcsAk_vfKjpNiJ3QvYmdGs7NIBxBOEmQgNIawUlh7NNyyVAPCtuT3StDhhPWV6yZE8kXJQvTK1jeYwD4-52RRvqFRWlFPjk17Rbel00q07G7av7c7Omm7G-HyYrXJna1UfYm5RkOzfEW5f7iGHifQxL0oX6T0FMJ_rhuqmPv6ScfE4VynbszJONxzYE_H5fL4PDaXIVkCUGsnMFLgMLn4t-DcYj23EM5GVHZwgAUY8gaVTA88LMEhfpMhDr1UUoWNjr2yS9JmJ9PoV6mi85BVqAMtAJvotq0VkEXf0Fk0_pNR1f0CyVS2dw) + +#### The kube-apiserver bootstrap process and the service-cidr flags + +Currently, the Service CIDRs are configured independently in each kube-apiserver using flags. During +the bootstrap process, the apiserver uses the first IP of each range to create the special +"kubernetes.default" Service. It also starts a reconcile loop, that synchronize the state of the +bitmap used by the internal allocators with the assigned IPs to the Services. + +With current implementation, each kube-apiserver can boot with different ranges configured. +There is no conflict resolution, each apiserver keep writing and deleting others apiservers allocator bitmap +and Services. 

In order to be completely backwards compatible, the bootstrap process will remain the same; the
difference is that, instead of creating a bitmap based on the flags, it will create a new
ServiceCIDRConfig object from the flags (removing the flag-based configuration is out of scope of this KEP)
with a special label `networking.kubernetes.io/service-cidr-from-flags` set to `"true"`.

It now has to handle the possibility of multiple ServiceCIDRConfigs with the special label, and
also configuration updates, for example from single-stack to dual-stack.

The new bootstrap process will be:

```
at startup:
  read_flags
  if invalid flags
    exit
  run default-service-ip-range controller
  run apiserver

controller:
  watch all ServiceCIDR objects labelled from-flags
  ignore being-deleted ranges
  wait for first sync

controller on_event:
  if no default range matching exactly my flags
    log
    create a ServiceCIDR from my flags
      generateName: "from-flags-"
      from-flags label: "true"
  else if multiple
    log
    if multiple ranges match exactly my flags (or a single-family subset of them)
      log
      delete all subsets, leaving the largest set that exactly matches on at least one family
    endif
  endif
  if kubernetes.default does not exist
    create it
```

#### The special "default" ServiceCIDRConfig

The `kubernetes.default` Service is expected to be covered by a valid range. Each apiserver will
ensure that a ServiceCIDRConfig object exists to cover its own flag-defined ranges, so this should
be true in normal cases. If someone were to force-delete the ServiceCIDRConfig covering `kubernetes.default`, it
would be treated the same as any Service in the repair loop, which will generate warnings about
orphaned Service IPs.

#### Service IP Allocation

When a new Service is going to be created and already has its ClusterIPs defined, the allocator will
just try to create the corresponding IPAddress objects; any error creating an IPAddress object will
cause an error on the Service creation. The allocator will also set the reference to the Service on the
IPAddress object.

The new allocator will keep a local list of all the Service CIDRs and the IPs already allocated, so it only
has to find one free IP in any of these ranges and try to create the corresponding object.
There can be races between two allocators trying to allocate the same IP, but the storage guarantees consistency:
the first one will win, while the other will have to retry (see the sketch below).

Another racy situation is when the allocator is full and an IPAddress is deleted, but the allocator takes
some time to receive the notification. One solution could be to perform a List to obtain the current state, but
it is simpler to just fail to create the Service and ask the user to try to create the Service again.

If the apiserver crashes during a Service create operation, the IPAddress may be allocated while the
Service is not created, leaving the IPAddress orphaned. To avoid leaks, a controller will use the
`metadata.creationTimestamp` field to detect orphan objects and delete them.
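
A minimal sketch of this allocation flow follows (the `ipRegistry` interface is a hypothetical stand-in for the apiserver storage, not an existing client API): the allocator picks a free IP from its locally cached view of the ranges and tries to create the corresponding IPAddress object; an "already exists" error means another apiserver won the race, so it moves on to the next candidate.

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("ipaddress already exists")

// ipRegistry stands in for the apiserver storage: Create fails with
// errAlreadyExists if another allocator already created the same IPAddress.
type ipRegistry interface {
	Create(name string) error
}

// allocateNext walks the allocator's local view of free IPs (built from its
// ServiceCIDRConfig and IPAddress informers) and returns the first IP for
// which it manages to create the IPAddress object.
func allocateNext(reg ipRegistry, candidates []string) (string, error) {
	for _, ip := range candidates {
		err := reg.Create(ip) // the IPAddress object is named after the IP itself
		if err == nil {
			return ip, nil // we won the race for this IP
		}
		if errors.Is(err, errAlreadyExists) {
			continue // another apiserver got here first, try the next IP
		}
		return "", err // storage error: surfaced as a Service creation error
	}
	return "", errors.New("range is full, retry the Service creation")
}

// fakeRegistry is an in-memory registry used only for this example.
type fakeRegistry struct{ used map[string]bool }

func (f *fakeRegistry) Create(name string) error {
	if f.used[name] {
		return errAlreadyExists
	}
	f.used[name] = true
	return nil
}

func main() {
	reg := &fakeRegistry{used: map[string]bool{"10.0.0.1": true}} // .1 already taken
	ip, err := allocateNext(reg, []string{"10.0.0.1", "10.0.0.2"})
	fmt.Println(ip, err) // 10.0.0.2 <nil>
}
```
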
+ +There is going to be a controller to avoid leaking resources: + +- checking that the corresponding parentReference on the IPAddress match the corresponding Service +- cleaning all IPAddress without an owner reference or, if the time since it was created is greater +than 60 seconds (default timeout value on the kube-apiserver ) + +#### Service IP Reservation + +In [keps/sig-network/3070-reserved-service-ip-range](https://github.com/kubernetes/kubernetes/issues/95570) a new feature was introduced that allow to prioritize one part +of the Service CIDR for dynamically allocation, the range size for dynamic allocation depends +ont the size of the CIDR. + +The new allocation logic has to be compatible, but in this case we have more flexibility and +there are different possibilities, that should be resolved before Beta. + +``` +<<[UNRESOLVED keps/sig-network/3070-reserved-service-ip-range]>> +Option 1: Maintain the same formula and behavior per ServiceCIDRConfig +Option 2: Define a new "allocationMode: (Dynamic | Static | Mixed)" field +Option 3: Define a new "allocationStaticThreshold: int" field +<<[/UNRESOLVED]>> +``` + +#### Edge cases + +Since we have to maintain 3 relationships Services, ServiceCIDRConfigs and IPAddresses, we should be able +to handle edge cases and race conditions. + +- Valid Service and IPAddress without ServiceCIDRConfig: + +This situation can happen if a user forcefully delete the ServiceCIDRConfig, it can be recreated for the +"default" ServiceCIDRConfigs because the information is in the apiserver flags, but for other ServiceCIDRConfigs +that information is no longer available. + +Another possible situation is when one ServiceCIDRConfig has been deleted, but the information takes too long +to reach one of the apiservers, its allocator will still consider the range valid and may allocate one IP +from there. To mitigate this problem, we'll set a grace period of 60 seconds on the servicecidrconfig controller +to remove the finalizer, if an IP address is created during this time we'll be able to block the deletion +and inform the user. + +[![](https://mermaid.ink/img/pako:eNp1kjFvwyAQhf8KYm3ixlZVtQyRbNwhW9WsLCe4JEg2uBi3qqL890Jw67hNGBDifffuieNIpVVIGe3xfUAjsdawd9AKQ8IC6a0jFYGe1NigR7JF96El8k39xq3Z6X0CO3BeS92B8YRHHDrdBxQdKf8T9ZyoLpuVUeMOYWomTAIqslyv7zi7mYXkz0WWPz5lq2x1XzykqrSXsZbU7I_1qBobrmzMEghoGisjM7nlReLqs0vJfsuTm0oqj-qyYleCqXPiiRvDzPPOqSnTRb_NK_nU_mAHf3USwtAFbdG1oFWY6TE6CeoP2KKgLBwV7mBovKDCnAI6dCrEf1E6vDxlO2h6XFAYvN1-GUmZdwP-QOO_GKnTN31Vsn4)](https://mermaid.live/edit#pako:eNp1kjFvwyAQhf8KYm3ixlZVtQyRbNwhW9WsLCe4JEg2uBi3qqL890Jw67hNGBDifffuieNIpVVIGe3xfUAjsdawd9AKQ8IC6a0jFYGe1NigR7JF96El8k39xq3Z6X0CO3BeS92B8YRHHDrdBxQdKf8T9ZyoLpuVUeMOYWomTAIqslyv7zi7mYXkz0WWPz5lq2x1XzykqrSXsZbU7I_1qBobrmzMEghoGisjM7nlReLqs0vJfsuTm0oqj-qyYleCqXPiiRvDzPPOqSnTRb_NK_nU_mAHf3USwtAFbdG1oFWY6TE6CeoP2KKgLBwV7mBovKDCnAI6dCrEf1E6vDxlO2h6XFAYvN1-GUmZdwP-QOO_GKnTN31Vsn4) + +For any Service and IPAddress that doesn't belong to a ServiceCIDRConfig the controller will raise an event +informing the user, keeping the previous behavior + +```go +// cluster IP is out of range +c.recorder.Eventf(&svc, nil, v1.EventTypeWarning, "ClusterIPOutOfRange", "ClusterIPAllocation", "Cluster IP [%v]:%s is not within the service CIDR %s; please recreate service", +``` + +- Valid Service and ServiceCIDRConfig but not IPAddress + +It can happen that an user forcefully delete an IPAddress, in this case, the controller will regenerate the IPAddress, as long as a valid ServiceCIDRConfig covers it. 

During this time, there is a chance that an apiserver tries to allocate this IPAddress, which could result in
two Services having the same IPAddress. In order to avoid this, the allocator will not delete an IP from its local cache
until it verifies that the consumer associated with that IP has been deleted too.

- Valid IPAddress and ServiceCIDRConfig but no Service

The IPAddress will be deleted and an event generated if the controller determines that the IPAddress
is orphaned [(see the Allocator section)](#allocator)

- IPAddress referencing recreated Object (different UID)

1. User created Gateway "foo"
2. Gateway controller allocated IP and ref -> "foo"
3. User deleted Gateway "foo"
4. Gateway controller doesn't delete the IP (leaves it for GC)
5. User creates a new Gateway "foo"
6. apiserver repair loop finds the IP with a valid ref to "foo"

If the new Gateway is created before the apiserver observes the delete, the apiserver will find that Gateway "foo"
still exists and can't release the IP. It can't peek inside "foo" to see if that is the right IP, because it is
a type it does not know. If it knew the UID, it could see that "foo"'s UID was different and release the IP.
The IPAddress will use the UID to reference the parent to avoid problems in this scenario.

#### Resizing Service IP Ranges

One of the most common problems users may have is how to scale up or scale down their Service CIDRs.

Let's see an example of how these operations will work.

Assume we have configured a Service CIDR 10.0.0.0/24 and it is fully used. We can:

1. Add another /24 Service CIDR 10.0.1.0/24 and keep the previous one
2. Add an overlapping larger Service CIDR 10.0.0.0/23

After option 2, the user can remove the first /24 Service CIDR, since the new Service CIDR covers all the existing IPAddresses.

The same applies to scaling down: as long as the new Service CIDR contains all the existing IPAddresses, the old
Service CIDR can be safely deleted.

However, there can be a race condition when scaling down: since the propagation of the
deletion can take some time, one allocator may still allocate an IP address outside the new Service CIDR (but
inside the old Service CIDR).

There is one controller that will periodically check that the 1-to-1 relation between IPAddresses and Services is
correct, and will start sending events to warn the user that they have to fix/recreate the corresponding Service.

#### API

```go
// ServiceCIDRConfig defines a range of IPs using CIDR format (192.168.0.0/24 or 2001:db2::0/64).
type ServiceCIDRConfig struct {
  metav1.TypeMeta   `json:",inline"`
  metav1.ObjectMeta `json:"metadata,omitempty"`

  Spec ServiceCIDRConfigSpec `json:"spec,omitempty"`
}

// ServiceCIDRConfigSpec describes the ServiceCIDRConfig's specification.
type ServiceCIDRConfigSpec struct {
  // IPv4 is an IPv4 block in CIDR notation "10.0.0.0/8"
  IPv4 string `json:"ipv4"`
  // IPv6 is an IPv6 block in CIDR notation "fd12:3456:789a:1::/64"
  IPv6 string `json:"ipv6"`
}

// IPAddress represents an IP used by Kubernetes associated to a ServiceCIDRConfig.
// The name of the object is the IP address in canonical format.
+type IPAddress struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec IPAddressSpec `json:"spec,omitempty"` +} + +// IPAddressSpec describe the attributes in an IP Address, +type IPAddressSpec struct { + // ParentRef references the resources (usually Services) that a IPAddress wants to be attached to. + ParentRef ParentReference +} + +type ParentReference struct { + // Group is the group of the referent. + Group string + // Resource is resource of the referent. + Resource string + // Namespace is the namespace of the referent + Namespace string + // Name is the name of the referent + Name string + // UID is the uid of the referent + UID string +} + +``` + +#### Allocator + +A new allocator will be implemented that implements the current Allocator interface in the apiserver. + +```go +// Interface manages the allocation of IP addresses out of a range. Interface +// should be threadsafe. +type Interface interface { + Allocate(net.IP) error + AllocateNext() (net.IP, error) + Release(net.IP) error + ForEach(func(net.IP)) + CIDR() net.IPNet + IPFamily() api.IPFamily + Has(ip net.IP) bool + Destroy() + EnableMetrics() + + // DryRun offers a way to try operations without persisting them. + DryRun() Interface +} +``` + +This allocator will have an informer watching Services, ServiceCIDRConfigs and IPAddresses, so it can have locally the +information needed to assign new IP addresses to Services. + +IPAddresses can only be allocated from ServiceCIDRConfigs that are available and not being deleted. + +The uniqueness of an IPAddress is guaranteed by the apiserver, since trying to create an IP +address that already exist will fail. + +The allocator will set finalizer on the IPAddress created to avoid that there are Services without the corresponding +IP allocated. + +It also will add a reference to the Service the IPAddress is associated. + +### Test Plan + +This is a very core and critical change, it has to be thoroughly tested on different layers: + +API objects must have unit tests for defaulting and validation and integration tests exercising the +different operations and fields permutations, with both positive and negative cases: Special +attention to cross-reference validation problems, like create IPAddress reference wrong ServiceCIDRConfig or +invalid or missing + +Controllers must have unit tests and integration tests covering all possible race conditions. + +E2E test must be added covering the user stories defined in the KEP. + +In addition to testing, it will require a lot of user feedback, requiring multiple releases to +graduate. + +[x] I/we understand the owners of the involved components may require updates to existing tests to +make this code solid enough prior to committing the changes necessary to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + +#### Alpha + +- Feature implemented behind a feature flag +- Initial unit, integration and e2e tests completed and enabled +- Only basic functionality e2e tests implemented +- Define "Service IP Reservation" implementation + +#### Beta + +- API stability, no changes on new types and behaviors: + - ServiceCIDR immutability + - default IPFamily + - two or one IP families per ServiceCIDRConfig +- Gather feedback from developers and users +- Document and add more complex testing scenarios: scaling out ServiceCIDRConfigs, ... 
+- Additional tests are in Testgrid and linked in KEP +- Scale test to O(10K) services and O(1K) ranges +- Improve performance on the allocation logic, O(1) for allocating a free address +- Allowing time for feedback + +#### GA + +- 2 examples of real-world usage +- More rigorous forms of testing—e.g., downgrade tests and scalability tests +- Allowing time for feedback + +**Note:** Generally we also wait at least two releases between beta and GA/stable, because there's +no opportunity for user feedback, or even bug reports, in back-to-back releases. + +**For non-optional features moving to GA, the graduation criteria must include [conformance +tests].** + +[conformance tests]: + https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md + +#### Deprecation + +### Upgrade / Downgrade / Version Skew Strategy + +The source of truth are the IPs assigned to the Services, both the old and new methods have +reconcile loops that rebuild the state of the allocators based on the assigned IPs to the Services, +this allows to support upgrades and skewed clusters. + +Clusters running with the new allocation model will not keep running the reconcile loops that +keep the bitmap used by the allocators. + +Since the new allocation model will remove some of the limitations of the current model, skewed +versions and downgrades can only work if the configurations are fully compatible, per example, +current CIDRs are limited to a /112 max for IPv6, if an user configures a /64 to their IPv6 subnets +in the new model, and IPs are assigned out of the first /112 block, the old allocator based in +bitmap will not be able to use those IPs creating an inconsistency in the cluster. + +It will be required that those Services are recreated to get IP addresses inside the configured +ranges, for consistency, but there should not be any functional problem in the cluster if the +Service implementation (kube-proxy, ...) is able to handle those IPs. + +Example: + +- flags set to 10.0.0.0/20 +- upgrade to 1.25 with alpha gate +- apiserver create a default ServiceCIDRConfig object for 10.0.0.0/20 +- user creates a new ServiceCIDR for 192.168.1.0/24 +- create a Service which gets 192.168.1.1 +- rollback or disable the gate +- the apiserver repair loops will generate periodic events informing the user that the Service with the IP allocated + is not within the configured range + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: ServiceCIDRConfig + - Components depending on the feature gate: kube-apiserver, kube-controller-manager + +###### Does enabling the feature change any default behavior? + +The time to reuse an already allocated ClusterIP to a Service will be longer, since the ClusterIPs +now depend on an IPAddress object that is protected by a finalizer. + +The bootstrap logic has been updated to deal with the problem of multiple apiservers with different +configurations, making it more flexible and resilient. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +If the feature is disabled the old allocator logic will reconcile all the current allocated IPs +based on the Services existing on the cluster. 
+ +If there are IPAddresses allocated outside of the configured Service IP Ranges in the apiserver flags, +the apiserver will generate events referencing the Services using IPs outside of the configured range. +The user must delete and recreate these Services to obtain new IPs within the range. + +###### What happens if we reenable the feature if it was previously rolled back? + +There are controllers that will reconcile the state based on the current created Services, +restoring the cluster to a working state. + +###### Are there any tests for feature enablement/disablement? + + + +Tests for feature enablement/disablement will be implemented. +TBD later, during the alpha implementation. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + +See Drawbacks section + +###### Will enabling / using this feature result in any new API calls? + +When creating a Service this will require to create an IPAddress object, +previously we updated a bitmap on etcd, so we keep the number of request +but the size is reduced considerable. + +###### Will enabling / using this feature result in introducing new API types? + +See [API](#api) + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +N/A + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +See [Drawbacks](#drawbacks) + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +The apiservers will increase the memory usage because they have to keep a local informer with the new objects ServiceCIDRConfig and IPAddress. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +The feature depends on the API server to work. + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? 
+ +## Implementation History + + + +## Drawbacks + +The number of etcd objects required scales at O(N) where N is the number of services. That seems +reasonable (the largest number of services today is limited by the bitmap size ~32k). etcd memory +use is proportional to the number of keys, but there are other objects like Pods and Secrets that +use a much higher number of objects. 1-2 million keys in etcd is a reasonable upper bound in a +cluster, and this has not a big impact (<10%). The objects described here are smaller than most +other Kube objects (especially pods), so the etcd storage size is still reasonable. + +xref: Clayton Coleman + +## Alternatives + +Several alternatives were proposed in the original PR but discarded by different reasons: + +#### Alternative 1 + + + +#### Alternative 2 + + + +#### Alternative 3 + + + +#### Alternative 4 + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-network/1880-multiple-service-cidrs/kep.yaml b/keps/sig-network/1880-multiple-service-cidrs/kep.yaml new file mode 100644 index 00000000000..752588de698 --- /dev/null +++ b/keps/sig-network/1880-multiple-service-cidrs/kep.yaml @@ -0,0 +1,54 @@ +title: Multiple Service CIDRs +kep-number: 1880 +authors: + - "@aojea" +owning-sig: sig-network +participating-sigs: + - sig-network + - sig-api-machinery +status: implementable +creation-date: 2022-06-08 +reviewers: + - "@thockin" + - "@danwinship" +approvers: + - "@thockin" + - "@lavalamp" + +##### WARNING !!! ###### +# prr-approvers has been moved to its own location +# You should create your own in keps/prod-readiness +# Please make a copy of keps/prod-readiness/template/nnnn.yaml +# to keps/prod-readiness/sig-xxxxx/00000.yaml (replace with kep number) +#prr-approvers: + +see-also: + - "keps/sig-network/3070-reserved-service-ip-range" +replaces: + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.25" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.25" + beta: "v1.27" + stable: "v1.30" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: ServiceCIDRConfig + components: + - kube-apiserver + - kube-controller-manager +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + From dae5e05f219551d337267df05f0dfeccb433cae1 Mon Sep 17 00:00:00 2001 From: Antonio Ojea Date: Tue, 21 Jun 2022 23:28:39 +0200 Subject: [PATCH 2/2] clarify serviceCIDRConfig IPAddress relation --- keps/sig-network/1880-multiple-service-cidrs/README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/keps/sig-network/1880-multiple-service-cidrs/README.md b/keps/sig-network/1880-multiple-service-cidrs/README.md index aaca46261d7..8632e30cd10 100644 --- a/keps/sig-network/1880-multiple-service-cidrs/README.md +++ b/keps/sig-network/1880-multiple-service-cidrs/README.md @@ -154,12 +154,11 @@ different and collide with other APIs, like Gateway APIs, we are adding the foll - there can be overlapping ServiceCIDRConfigs. - the apiserver will periodically ensure that a "default" ServiceCIDRConfig exists to cover the service CIDR flags and the "kubernetes.default" Service. 
-- any IPAddress existing in a cluster has to belong to a ServiceCIDRConfig.
+- any IPAddress existing in a cluster has to belong to a Service CIDR defined by a ServiceCIDRConfig.
 - any Service with a ClusterIP assigned is expected to have always a IPAddress object associated.
 - a ServiceCIDRConfig which is being deleted can not be used to allocate new IPs
 
-  This creates a 1-to-1 relation between Service and IPAddress, and a 1-to-N relation between
-  ServiceCIDRConfig and IPAddress.
+  This creates a 1-to-1 relation between Service and IPAddress, and a 1-to-N relation between the ServiceCIDRs defined by the ServiceCIDRConfig and IPAddress. It is important to clarify that overlapping ServiceCIDRConfigs are merged in memory; an IPAddress doesn't come from a specific ServiceCIDRConfig object, but from any ServiceCIDRConfig that includes that IP.
 
   The new allocator logic can be used by other APIs, like Gateway API.