Improve Egress IP scheduling #4627
Conversation
/test-e2e
Codecov Report

@@            Coverage Diff             @@
##             main    #4627      +/-  ##
==========================================
+ Coverage   68.65%   69.64%   +0.98%
==========================================
  Files         402      403       +1
  Lines       59570    58608     -962
==========================================
- Hits        40900    40819      -81
+ Misses      15847    14991     -856
+ Partials     2823     2798      -25

*This pull request uses carry forward flags.
newResults := map[string]*scheduleResult{}
nodeToIPs := map[string]sets.String{}
egresses, _ := s.egressLister.List(labels.Everything())
// Sort Egresses by creation timestamp to make the result deterministic and prioritize objects created earlier.
Question - is it still possible that different agents get different lists for a period? In that case, could different agents decide on different IP assignments? Will that be corrected once all agents converge? Does it mean we may re-assign IPs?
Yes, different agents may get different lists at a given moment, but the diff is most likely the Egresses created most recently, because they have been sent to some agents but not yet to others.
Since we sort Egresses by creation timestamp, Egresses created earlier are prioritized and assigned to Nodes first, and the results are not affected by the Egresses created most recently. For instance, if agent a1 receives Egresses {e1, e2, e3} while agent a2 receives Egresses {e1, e2}:
- their schedule decisions about e1 and e2 will be the same;
- if a1 schedules e3 to itself, it will configure the IP on its own interface, otherwise it does nothing;
- when a2 receives e3, it will get the same result as a1, and either configure e3 on its own interface or do nothing.
During this process no IP is re-assigned.
There are also other cases that cause different agents to get different lists, e.g. Egress delete/update events. However, one Egress's assignment can only affect others when a Node's capacity is reached. If the capacity is sufficient, each Egress is scheduled independently, and the consistent hash should guarantee the Egresses are distributed evenly. So in most cases, when the number of Egresses is not greater than Nodes * maxEgressIPsPerNode, there should be no IP re-assigning.
In all cases, all agents will get the same schedule results and correct IP assignments once their caches converge.
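To make the idea above concrete, here is a minimal, self-contained Go sketch of deterministic scheduling: Egresses are sorted by creation timestamp (with name as a tie-breaker) so every agent walks the list in the same order, and each IP is then placed on a Node chosen by hashing the IP, probing subsequent Nodes when the preferred one is at capacity. The `egress` struct, the FNV hash, and the linear probing are illustrative assumptions; Antrea's scheduler uses its own types and the memberlist-based consistent hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"time"
)

// egress is a simplified stand-in for the fields the scheduler cares about.
type egress struct {
	name              string
	ip                string
	creationTimestamp time.Time
}

// schedule sorts Egresses by creation timestamp so all agents process them in
// the same order, then assigns each IP to a Node chosen by hashing the IP,
// skipping Nodes that are already at capacity.
func schedule(egresses []egress, nodes []string, maxIPsPerNode int) map[string]string {
	sort.Slice(egresses, func(i, j int) bool {
		if egresses[i].creationTimestamp.Equal(egresses[j].creationTimestamp) {
			// Tie-break on name so the order is total and identical everywhere.
			return egresses[i].name < egresses[j].name
		}
		return egresses[i].creationTimestamp.Before(egresses[j].creationTimestamp)
	})

	assignments := map[string]string{} // Egress name -> Node name
	load := map[string]int{}           // Node name -> number of IPs placed so far
	for _, e := range egresses {
		h := fnv.New32a()
		h.Write([]byte(e.ip))
		start := int(h.Sum32() % uint32(len(nodes)))
		// Probe Nodes starting from the hashed one until a Node with spare
		// capacity is found; if every Node is full, the Egress stays unassigned.
		for i := 0; i < len(nodes); i++ {
			node := nodes[(start+i)%len(nodes)]
			if load[node] < maxIPsPerNode {
				assignments[e.name] = node
				load[node]++
				break
			}
		}
	}
	return assignments
}

func main() {
	now := time.Now()
	egresses := []egress{
		{name: "e3", ip: "10.0.0.3", creationTimestamp: now.Add(2 * time.Second)},
		{name: "e1", ip: "10.0.0.1", creationTimestamp: now},
		{name: "e2", ip: "10.0.0.2", creationTimestamp: now.Add(time.Second)},
	}
	// With capacity 1 per Node, e1 and e2 take the two Nodes and e3 is left
	// unassigned; every agent running this computes the same result regardless
	// of when e3 shows up in its cache.
	fmt.Println(schedule(egresses, []string{"n1", "n2"}, 1))
}
```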
My read is that it is still possible for two agents to decide different IP - Node assignments when they get different Egress lists at Egress update/delete events? E.g. one gets {e1, e3}, the other gets {e1, e2, e3}.
It would be good to add comments describing the scenarios.
Added comments to this function, PTAL
}

// addEgress processes Egress ADD events.
func (c *egressIPScheduler) addEgress(obj interface{}) {
Nit: the receiver names are inconsistent. It would be better to rename all receivers to 's'?
fixed, thanks
LGTM
In the commit message:
the cluster consists different instance types of Nodes
consists -> consists of
pkg/agent/types/annotations.go
@@ -24,6 +24,9 @@ const (
// NodeWireGuardPublicAnnotationKey represents the key of the Node's WireGuard public key in the Annotations of the Node.
NodeWireGuardPublicAnnotationKey string = "node.antrea.io/wireguard-public-key"

// NodeMaxEgressIPsAnnotationKey represents the key of the maximum number of Egress IPs in the Annotations of the Node.
How about "the key of maximum Egress IP number"?
updated
s.nodeToMaxEgressIPsMutex.Lock()
defer s.nodeToMaxEgressIPsMutex.Unlock()

oldMaxEgressIPs, exists := s.nodeToMaxEgressIPs[nodeName]
If it doesn't exist, should we return false if the value equals the global value?
Yes, and I skipped inserting it to avoid an extra check when deleting the cached value (as there is no need to reschedule when deleting the same value).
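For reference, a minimal sketch of the per-Node capacity bookkeeping discussed in this thread, assuming a trimmed-down egressIPScheduler struct (the field and method names follow the diff, but the bodies are illustrative rather than Antrea's actual implementation):

```go
package scheduler

import "sync"

// egressIPScheduler is a trimmed-down stand-in used only for this sketch.
type egressIPScheduler struct {
	nodeToMaxEgressIPsMutex sync.Mutex
	// nodeToMaxEgressIPs caches per-Node capacities that differ from the default.
	nodeToMaxEgressIPs map[string]int
	// maxEgressIPsPerNode is the default capacity used when a Node has no annotation.
	maxEgressIPsPerNode int
}

// updateMaxEgressIPsByNode records the annotated capacity of a Node and
// reports whether the effective capacity changed, i.e. whether rescheduling
// is needed. When the Node has no cached value and the annotated value equals
// the default, nothing is stored and false is returned, which also avoids an
// extra comparison when the cached value is deleted later.
func (s *egressIPScheduler) updateMaxEgressIPsByNode(nodeName string, maxEgressIPs int) bool {
	s.nodeToMaxEgressIPsMutex.Lock()
	defer s.nodeToMaxEgressIPsMutex.Unlock()

	oldMaxEgressIPs, exists := s.nodeToMaxEgressIPs[nodeName]
	if !exists {
		if maxEgressIPs == s.maxEgressIPsPerNode {
			return false
		}
		s.nodeToMaxEgressIPs[nodeName] = maxEgressIPs
		return true
	}
	if oldMaxEgressIPs == maxEgressIPs {
		return false
	}
	s.nodeToMaxEgressIPs[nodeName] = maxEgressIPs
	return true
}
```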
@@ -67,18 +71,23 @@ type egressIPScheduler struct {
// eventHandlers is the registered callbacks.
eventHandlers []scheduleEventHandler

-// The maximum number of Egress IPs a Node can accommodate.
+// The global maximum number of Egress IPs a Node can accommodate.
global -> default?
updated
}
}
if s.deleteMaxEgressIPsByNode(node.Name) {
s.queue.Add(workItem)
If we need to trigger rescheduling at Node deletion, shouldn't it have been done even before this commit, when there was no per-Node annotation?
Removed it; there is no need to trigger it because if the Node is selected by any pool, ClusterEventHandler will trigger it.
/test-all
/skip-e2e which failed on a known flaky case
PR #4593 introduced maxEgressIPsPerNode to limit the number of Egress IPs that can be assigned to a Node. However, it used the EgressInformer cache to check whether a Node can accommodate new Egress IPs and did the calculation for different Egresses concurrently, which may cause inconsistent schedule results among agents. For instance:
When the Nodes' capacity is 1 and two Egresses, e1 and e2, are created concurrently, different agents may process them in different orders, with different contexts:
- agent a1 may process Egress e1 first and assign it to Node n1; it then processes Egress e2 and thinks it should be assigned to Node n2 by agent a2 because n1 is out of space.
- agent a2 may process Egresses e1 and e2 faster, before either of their statuses is updated in the Egress API, and would think both Egresses should be assigned to Node n1 by agent a1.
As a result, Egress e2 will be left unassigned.
To fix the problem, the Egress IP scheduling should be deterministic across agents and time. This patch adds an egressIPScheduler, which takes the spec of Egress and ExternalIPPool and the state of the memberlist cluster as inputs, and generates scheduling results deterministically.
According to the benchmark test, scheduling 1,000 Egresses among 1,000 Nodes once takes less than 3ms.
The PR also includes the following improvement:
A global max-egress-ips may not work for a cluster that consists of different instance types of Nodes. The PR adds support for a per-Node max-egress-ips annotation, with which Nodes can be configured with different capacities via their annotations. It also makes it possible to dynamically adjust a Node's capacity at runtime and to configure Node capacity post-deployment.
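As an illustration of how the per-Node annotation could be consumed, here is a small Go sketch that resolves a Node's capacity from its annotation and falls back to the cluster-wide default. The annotation key string used below is an assumption made for the example (the real key is defined by NodeMaxEgressIPsAnnotationKey in pkg/agent/types/annotations.go and may differ), and the helper itself is hypothetical, not Antrea's code.

```go
package scheduler

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// nodeMaxEgressIPsAnnotation is assumed here purely for illustration; the
// actual key string is defined by NodeMaxEgressIPsAnnotationKey.
const nodeMaxEgressIPsAnnotation = "node.antrea.io/max-egress-ips"

// getMaxEgressIPs resolves a Node's Egress IP capacity: the per-Node
// annotation wins when present and valid, otherwise the cluster-wide default
// applies.
func getMaxEgressIPs(node *corev1.Node, defaultMax int) int {
	value, ok := node.Annotations[nodeMaxEgressIPsAnnotation]
	if !ok {
		return defaultMax
	}
	maxEgressIPs, err := strconv.Atoi(value)
	if err != nil || maxEgressIPs < 0 {
		// Ignore malformed or negative values and keep the default.
		return defaultMax
	}
	return maxEgressIPs
}
```

With this shape, changing a Node's annotation at runtime simply changes the value returned for that Node on the next scheduling pass, which is what makes post-deployment capacity tuning possible.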