Support automatic failover for Egress #2128
Comments
@tnqn this looks great to me, thanks for the detailed proposal. A few notes:
@antoninbas Thanks for the quick review.
I was considering compatibility if some Egresses have been persisted in etcd without the field. Will check how this case is handled in K8s. Thanks for the suggestion.
Yes, we need ARP announcement for IPv4 and NDP for IPv6. Will add this to the design.
I recall @jianjuns had a concern that it would require users to duplicate the NodeSelector for all Egresses. We could discuss it more here or in the community meeting.
Good idea. Will check it and get back to you.
@jianjuns do you think "Have a per-Egress NodeSelector in the Spec of Egress" can work for the cases you mentioned, where a subset of Nodes can hold some IPs and another subset of Nodes can hold other IPs? Or do you have concerns about duplicating labelSelectors in Egresses? In a way, specifying a labelSelector (if it's quite simple) is the same as specifying the name of a NodePool?
@tnqn : as we talked in the community meeting, to be most flexible, I think we should separate the Egress IP - Node association. Originally I was thinking of an IPPool and a NodePool CRD (see #667), with one of them referring to the other for the association. Then an Egress IP belonging to a pool should be assigned only to a Node in the associated Node pool. The same design can be reused for other IP allocation cases, like LB VIPs, and maybe routable Pod IP allocation. I think we will eventually need to provide this flexibility, as I have seen that it is a typical deployment pattern for one K8s cluster to span multiple Node network segments. I am fine with starting from a single global NodePool, but considering the future extension, I would suggest the following:
Let me know what you think.
I would associate IPs (not Egresses) with Nodes. The reasons are: 1) the relationship is indeed between IPs and Nodes; 2) later we might do Egress IP allocation for Egresses; 3) multiple Egresses might share one Egress IP; 4) later we might support other IP assignment cases like LB VIPs and might want a common way to define the IP - Node mapping.
@jianjuns I agree IP and Node association makes more sense. About the default NodePool and removing
But I do not think failover policy should be configured on Egress. I thought once you configure a NodePool (a single pool for now), it means you want to enable auto-failover (otherwise why even configure it?). Do you agree?
I meant not creating the default NodePool explicitly. Just like all Nodes can be candidates if you don't specify any
I feel it might be a little obscure, but not too bad in my mind. If you want auto-assignment and failover, then you should define a scope for that, even if it is the whole cluster? What is your proposal then? I do not like to associate the failover policy with Egress, for the reasons described above.
@jianjuns to associate Egress IPs with Nodes, what do you think about this way?
Then we could add an optional pool field to
Currently I don't see the need for a NodePool CRD; it seems a NodeSelector can solve it, which is a common way to define a pool of Nodes in K8s and is simpler to use. Please let me know what you think.
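For illustration only, a minimal sketch (as Kubernetes-style Go types) of the kind of pool-with-NodeSelector API being discussed here; the type name EgressIPPool and all field names are assumptions, since the exact definition proposed in this comment is not reproduced above:

```go
// Hypothetical sketch of an IPPool-style CRD with a NodeSelector, as discussed
// in this thread. Names and fields are illustrative only.
package v1alpha2

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// EgressIPPool defines a pool of Egress IPs and the Nodes that may hold them.
type EgressIPPool struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              EgressIPPoolSpec `json:"spec"`
}

type EgressIPPoolSpec struct {
	// IPRanges lists the IP ranges that can be assigned or allocated from this pool.
	IPRanges []IPRange `json:"ipRanges"`
	// NodeSelector selects the Nodes that are allowed to hold IPs from this pool.
	// An empty selector means all Nodes are eligible.
	NodeSelector *metav1.LabelSelector `json:"nodeSelector,omitempty"`
}

// IPRange is an inclusive range of IP addresses, e.g. 10.10.0.10 - 10.10.0.20.
type IPRange struct {
	Start string `json:"start"`
	End   string `json:"end"`
}
```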
@tnqn : do you mean we will support Egress IP allocation? I thought it would not be in the next release? For the IPPool CRD, I would like to make it generic, so it can be used by other use cases too, like LB VIP allocation. LB VIPs share common requirements with Egress IPs, so your definition should work. However, it is not very clear to me whether we will use the same IPPool to support allocating Pod IPs (e.g. for secondary interfaces, or a pool per Namespace/Service). If so, we should add subnet information to the pool. And should we still add a NodeSelector to the pool? I mean there can be two models: 1) IPPool is just a pool of IPs, but it can be associated with objects in some other way (e.g. annotating the Namespace); 2) IPPool selects the associated objects. What are your thoughts here? Another thing is, if we set up a Node election group because an Egress refers to an IPPool which selects these Nodes, it sounds indirect? Your EgressIPPool way sounds better, but I still like the idea of a generic IPPool CRD for multiple use cases. What do you think?
Yes, I want to make auto-failover enablement more explicit to users when they create an Egress. Defining a standalone IPPool or NodePool but not associating it with an Egress sounds obscure to me, and it seems we would reuse the Pool resources for other features, which would make it even more obscure.
I was thinking of something like a SubnetPool to support Pod IP allocation. It sounds complex to support individual IP allocation and subnet allocation with a single resource. It might also be hard to track its usage/references if we let multiple other resources refer to it to claim their association. For the Egress feature, if we don't add a NodeSelector to the pool, do you mean creating a NodePool resource and letting the NodePool refer to the IPPool?
Could you elaborate more on the question? I thought you preferred to associate IPs with Nodes, not Egresses with Nodes? #2128 (comment)
I feel it is ok if we want to do only failover first. As you proposed originally, we can have a NodeSelector parameter in the ConfigMap, and we can use a name like DefaultEgressNodes, which should make it clear enough that Egress IPs can be assigned to these Nodes. For Pod IP allocation, I feel it is not really different from Egress IP or VIP allocation. It is still a pool of IPs, just that the IPs are associated with a subnet and gateway (which are not really relevant to IP allocation, but just extra information appended). For my last comment, I meant that if we use a generic IPPool CRD which includes a NodeSelector, then the trigger for "creating a failover Node group with these Nodes" will be "the IPPool is referred to by an Egress", which sounds a little indirect to me. But maybe it is not too bad either. I would like to hear your thoughts.
@jianjuns Is a static Egress IP that doesn't need failover a valid case to you? And is a combination of static Egress IPs and failover-able Egress IPs in the same cluster a valid case? For example, some Pods may just use a specified Node's primary IP as the Egress IP, while others use floating IPs.
Since IPPools could have different Node selectors, I think we don't really create a failover group for each Pool, but a global group for all IPPools. An agent assigns the Egress IP to itself only when it has the maximum hash value across the Nodes that are selected by the IPPool's NodeSelector and are active. Does that make sense to you and address the concern?
I feel it is not very important to support this case. Of course I am not saying it is not good to have per-IPPool settings; I am just saying it might be ok to start from a global setting, if we plan to develop this feature across releases. I got what you mean by a single global group. Originally I thought smaller groups could be configured based on failure domains and could scale better (but there can be complexity, like one Node being in multiple groups). What do you think?
We may not be able to catch 1.1 anyway, so I assume we have some time to support the per-IPPool setting, and it doesn't sound complex to implement. Apart from the API change, I imagine we just need antrea-controller to maintain the IP usage of IPPools in memory and allocate an IP from the specified IPPool if an Egress doesn't have an IP specified.
Smaller groups would also introduce extra traffic if a Node can be in multiple groups, and it would be complicated to negotiate the port that will be used by each group. A global group doesn't have to include all Nodes; it could work like this: a Node joins the global group when it's selected by any IPPool.
@tnqn should we close this issue?
Yes, updated the task lists and closing this.
Describe what you are trying to solve
Currently, the Egress feature requires that the Egress IP be manually assigned to an arbitrary interface of one Node, and users have to reassign the IP to another Node if the previous one becomes unavailable. This is not ideal for production, where automatic failover is desired. This proposal describes a solution that supports automatic failover for Egress.
Describe how your solution impacts user flows
When creating an Egress, users can specify its failover policy, which has two options: "None" and "Auto". With "None", the Egress works just like before. With "Auto", Antrea takes care of selecting a Node from the eligible Nodes, assigning the Egress's IP to that Node's network interface, and moving it to another Node if the current Node becomes unavailable.
Describe the main design/architecture of your solution
API change
Add a new field, FailoverPolicy, which allows users to specify whether the Egress IP should fail over to another Node when the Node that holds it becomes unavailable. The reason the proposal keeps the "None" failover policy is that there may be use cases where failover is not desired, for example when the Egress IP is a specific Node's primary IP and should not be moved to another Node.
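As a rough illustration of this API change, a minimal Go sketch of adding FailoverPolicy to the Egress spec; the surrounding EgressSpec fields are simplified assumptions, and only the FailoverPolicy field and its "None"/"Auto" values come from the proposal:

```go
package v1alpha2

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// FailoverPolicyType controls whether Antrea moves the Egress IP automatically.
type FailoverPolicyType string

const (
	// FailoverPolicyNone keeps the current behavior: the user manages the IP assignment.
	FailoverPolicyNone FailoverPolicyType = "None"
	// FailoverPolicyAuto lets Antrea select an Egress Node and move the IP on failure.
	FailoverPolicyAuto FailoverPolicyType = "Auto"
)

// Egress and EgressSpec are shown in simplified form; only FailoverPolicy is new.
type Egress struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              EgressSpec `json:"spec"`
}

type EgressSpec struct {
	// EgressIP is the IP used as the source IP for the selected traffic.
	EgressIP string `json:"egressIP"`
	// FailoverPolicy specifies whether the Egress IP should fail over to
	// another Node when the Node holding it becomes unavailable.
	FailoverPolicy FailoverPolicyType `json:"failoverPolicy,omitempty"`
}
```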
Egress Node failure detection
To support automatic failover, we first need to detect Node failure. There are two kinds of solutions for this:
In practice, the first approach typically takes around 40 seconds to detect Node failure. It strongly relies on kube-controller-manager's node-monitor-grace-period option, which defaults to 40 seconds. Besides, it requires antrea-controller to be healthy so that it can observe the failure and initiate failover, i.e. reassign the Egress IPs that were assigned to the failed Node to other Nodes.
The second approach relies on neither the K8s control plane nor antrea-controller. The antrea-agents talk to each other directly using a dedicated protocol. memberlist is a Go library that implements this based on a gossip protocol. Typically it can detect Node failure in 3 seconds with the default configuration, and it can be tuned. The library is quite reusable; the code that creates and joins a cluster is like below:
memberlist is currently used by MetalLB and Consul as the member failure detection solution.
To provide a faster failover period and have less dependency on the control plane, we propose using the second approach to implement failure detection.
Egress Node selection && IP assignment
Based on the above Node failure detection mechanism, each agent can get the list of healthy Egress Nodes. For each Egress, the agent calculates its owner Node in a deterministic way (hashing), and if the Node that the agent runs on is the owner of the Egress, it assigns the Egress IP to its network interface. In theory, each Egress Node will own num_egresses/num_nodes Egresses.
When determining an Egress's owner Node, simply hashing the Egress and mapping it to a Node via a modulo operation would cause nearly all Egresses to be remapped when a Node joins or leaves the cluster, which would affect all Egress traffic in the cluster. To avoid this, we propose using consistent hashing for Node selection. In this way, only about num_egresses/num_nodes Egresses will be remapped when a Node joins or leaves the cluster. One way of doing that is to follow how MetalLB determines whether a Node should announce a Service's ingress IP: the agent hashes the Egress together with each eligible Node, sorts the hashes, and selects the first one as the owner.
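A minimal sketch of this selection logic, assuming the Egress name and Node names are the hash inputs (the exact hash function and inputs are not specified by the proposal):

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// hashEgressNode combines an Egress name and a Node name into one hash value,
// so every (Egress, Node) pair gets a stable, deterministic score.
func hashEgressNode(egress, node string) uint64 {
	sum := sha256.Sum256([]byte(egress + "/" + node))
	return binary.BigEndian.Uint64(sum[:8])
}

// ownerNode returns the Node that should hold the given Egress's IP: hash the
// Egress together with every healthy eligible Node and pick the Node with the
// highest score. When a Node joins or leaves, only the Egresses whose
// top-scoring Node changes are remapped, which gives the consistent-hashing
// behavior described above.
func ownerNode(egress string, healthyNodes []string) string {
	var owner string
	var best uint64
	for _, node := range healthyNodes {
		if h := hashEgressNode(egress, node); owner == "" || h > best {
			owner, best = node, h
		}
	}
	return owner
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3"}
	// The agent running on the returned Node assigns the Egress IP to itself.
	fmt.Println(ownerNode("egress-web", nodes))
}
```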
Then each agent will do the following:
Limit Egress Nodes
In some cases, users may want only certain Nodes to be Egress Nodes. To support this, there could be 3 options:
Status report (Optional)
To expose an Egress's owner Node to users, we could add EgressStatus to the Egress API. When an agent assigns an Egress's IP to its own Node, it should update the Egress's status with its own NodeName. In addition, it could create an Event associated with the Egress, so users can see the migration history from the K8s API.
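A minimal sketch of what such a status could look like (the field name egressNode is an assumption for illustration; the original definition from the issue is not reproduced here):

```go
package v1alpha2

// EgressStatus reports which Node currently holds the Egress IP. The field
// name is illustrative, not necessarily the final API.
type EgressStatus struct {
	// EgressNode is the name of the Node to which the Egress IP is currently assigned.
	EgressNode string `json:"egressNode"`
}
```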
Work breakdown
Test plan
TBD