Improve Egress IP scheduling #4627
Conversation
/test-e2e
Codecov Report

@@            Coverage Diff             @@
##             main    #4627      +/-  ##
==========================================
+ Coverage   68.65%   69.64%   +0.98%
==========================================
  Files         402      403       +1
  Lines       59570    58608     -962
==========================================
- Hits        40900    40819      -81
+ Misses      15847    14991     -856
+ Partials     2823     2798      -25

*This pull request uses carry forward flags.
newResults := map[string]*scheduleResult{}
nodeToIPs := map[string]sets.String{}
egresses, _ := s.egressLister.List(labels.Everything())
// Sort Egresses by creation timestamp to make the result deterministic and prioritize objects created earlier.
Question - is it still possible that different agents get different lists for a period? In that case, could different agents decide on different IP assignments? Will that be corrected once all agents converge? Does it mean we may re-assign IPs?
Yes, different agents may get different lists at a given moment, but the diff is most likely the Egresses created most recently, because they have been sent to some agents but not yet to others.
Since we sort Egresses by creation timestamp, Egresses created earlier are prioritized and assigned to Nodes first, and the results are not affected by the Egresses created most recently. For instance, if agent a1 receives Egresses {e1, e2, e3} while agent a2 receives Egresses {e1, e2}:
- their schedule decisions about e1 and e2 will be the same;
- if a1 schedules e3 to itself, it will configure the IP on its own interface, otherwise it does nothing;
- when a2 receives e3, it will get the same result as a1, and either configure e3 on its own interface or do nothing.
During this process no IP is re-assigned.
There are also other cases that cause different agents to get different lists, e.g. Egress delete/update events. However, one Egress's assignment can only affect others when a Node's capacity is reached. If the capacity is sufficient, each Egress is scheduled independently, and the consistent hash should guarantee the Egresses are distributed evenly. So in most cases, when the number of Egresses is not greater than Nodes * maxEgressIPsPerNode, there should be no IP re-assigning.
In all cases, all agents will get the same schedule results and correct IP assignments once their caches converge.
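To make the idea above concrete, here is a minimal, self-contained Go sketch of deterministic scheduling: Egresses are sorted by creation timestamp (with name as a tie-breaker) so every agent walks the list in the same order, and each IP is then placed on a Node chosen by hashing the IP, probing subsequent Nodes when the preferred one is at capacity. The `egress` struct, the FNV hash, and the linear probing are illustrative assumptions; Antrea's scheduler uses its own types and the memberlist-based consistent hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"time"
)

// egress is a simplified stand-in for the fields the scheduler cares about.
type egress struct {
	name              string
	ip                string
	creationTimestamp time.Time
}

// schedule sorts Egresses by creation timestamp so all agents process them in
// the same order, then assigns each IP to a Node chosen by hashing the IP,
// skipping Nodes that are already at capacity.
func schedule(egresses []egress, nodes []string, maxIPsPerNode int) map[string]string {
	sort.Slice(egresses, func(i, j int) bool {
		if egresses[i].creationTimestamp.Equal(egresses[j].creationTimestamp) {
			// Tie-break on name so the order is total and identical everywhere.
			return egresses[i].name < egresses[j].name
		}
		return egresses[i].creationTimestamp.Before(egresses[j].creationTimestamp)
	})

	assignments := map[string]string{} // Egress name -> Node name
	load := map[string]int{}           // Node name -> number of IPs placed so far
	for _, e := range egresses {
		h := fnv.New32a()
		h.Write([]byte(e.ip))
		start := int(h.Sum32() % uint32(len(nodes)))
		// Probe Nodes starting from the hashed one until a Node with spare
		// capacity is found; if every Node is full, the Egress stays unassigned.
		for i := 0; i < len(nodes); i++ {
			node := nodes[(start+i)%len(nodes)]
			if load[node] < maxIPsPerNode {
				assignments[e.name] = node
				load[node]++
				break
			}
		}
	}
	return assignments
}

func main() {
	now := time.Now()
	egresses := []egress{
		{name: "e3", ip: "10.0.0.3", creationTimestamp: now.Add(2 * time.Second)},
		{name: "e1", ip: "10.0.0.1", creationTimestamp: now},
		{name: "e2", ip: "10.0.0.2", creationTimestamp: now.Add(time.Second)},
	}
	// With capacity 1 per Node, e1 and e2 take the two Nodes and e3 is left
	// unassigned; every agent running this computes the same result regardless
	// of when e3 shows up in its cache.
	fmt.Println(schedule(egresses, []string{"n1", "n2"}, 1))
}
```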
My read is that it is still possible for two agents to decide different IP - Node assignments when they get different Egress lists at Egress update/delete events? E.g. one gets {e1, e3}, the other gets {e1, e2, e3}.
It would be good to add comments describing the scenarios.
Added comments to this function, PTAL
}

// addEgress processes Egress ADD events.
func (c *egressIPScheduler) addEgress(obj interface{}) {
Nit: the receiver names are inconsistent. It would be better to rename all receivers to 's'?
fixed, thanks
LGTM
In the commit message:
the cluster consists different instance types of Nodes
consists -> consists of
pkg/agent/types/annotations.go
@@ -24,6 +24,9 @@ const (
// NodeWireGuardPublicAnnotationKey represents the key of the Node's WireGuard public key in the Annotations of the Node.
NodeWireGuardPublicAnnotationKey string = "node.antrea.io/wireguard-public-key"

// NodeMaxEgressIPsAnnotationKey represents the key of the maximum number of Egress IPs in the Annotations of the Node.
How about "the key of maximum Egress IP number"?
updated
s.nodeToMaxEgressIPsMutex.Lock()
defer s.nodeToMaxEgressIPsMutex.Unlock()

oldMaxEgressIPs, exists := s.nodeToMaxEgressIPs[nodeName]
If it doesn't exist, should we return false if the value equals the global value?
Yes, and I skipped inserting it to avoid an extra check when deleting the cached value (as there is no need to reschedule when deleting the same value).
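For reference, a minimal sketch of the per-Node capacity bookkeeping discussed in this thread, assuming a trimmed-down egressIPScheduler struct (the field and method names follow the diff, but the bodies are illustrative rather than Antrea's actual implementation):

```go
package scheduler

import "sync"

// egressIPScheduler is a trimmed-down stand-in used only for this sketch.
type egressIPScheduler struct {
	nodeToMaxEgressIPsMutex sync.Mutex
	// nodeToMaxEgressIPs caches per-Node capacities that differ from the default.
	nodeToMaxEgressIPs map[string]int
	// maxEgressIPsPerNode is the default capacity used when a Node has no annotation.
	maxEgressIPsPerNode int
}

// updateMaxEgressIPsByNode records the annotated capacity of a Node and
// reports whether the effective capacity changed, i.e. whether rescheduling
// is needed. When the Node has no cached value and the annotated value equals
// the default, nothing is stored and false is returned, which also avoids an
// extra comparison when the cached value is deleted later.
func (s *egressIPScheduler) updateMaxEgressIPsByNode(nodeName string, maxEgressIPs int) bool {
	s.nodeToMaxEgressIPsMutex.Lock()
	defer s.nodeToMaxEgressIPsMutex.Unlock()

	oldMaxEgressIPs, exists := s.nodeToMaxEgressIPs[nodeName]
	if !exists {
		if maxEgressIPs == s.maxEgressIPsPerNode {
			return false
		}
		s.nodeToMaxEgressIPs[nodeName] = maxEgressIPs
		return true
	}
	if oldMaxEgressIPs == maxEgressIPs {
		return false
	}
	s.nodeToMaxEgressIPs[nodeName] = maxEgressIPs
	return true
}
```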
@@ -67,18 +71,23 @@ type egressIPScheduler struct {
// eventHandlers is the registered callbacks.
eventHandlers []scheduleEventHandler

-// The maximum number of Egress IPs a Node can accommodate.
+// The global maximum number of Egress IPs a Node can accommodate.
global -> default?
updated
}
}
if s.deleteMaxEgressIPsByNode(node.Name) {
s.queue.Add(workItem)
If we need to trigger rescheduling at Node deletion, shouldn't it have been done even before this commit, when there was no per-Node annotation?
Removed it; there is no need to trigger it because if the Node is selected by any pool, ClusterEventHandler will trigger it.
/test-all
/skip-e2e which failed on a known flaky case
PR #4593 introduced maxEgressIPsPerNode to limit the number of Egress IPs that can be assigned to a Node. However, it used the EgressInformer cache to check whether a Node can accommodate new Egress IPs and did the calculation for different Egresses concurrently, which may cause inconsistent schedule results among agents. For instance:
When the Nodes' capacity is 1 and two Egresses, e1 and e2, are created concurrently, different agents may process them in different orders, with different contexts:
- agent a1 may process Egress e1 first and assign it to Node n1; it then processes Egress e2 and thinks it should be assigned to Node n2 by agent a2 because n1 is out of space.
- agent a2 may process Egresses e1 and e2 faster, before either of their statuses is updated in the Egress API, and would think both Egresses should be assigned to Node n1 by agent a1.
As a result, Egress e2 will be left unassigned.
To fix the problem, the Egress IP scheduling should be deterministic across agents and time. This patch adds an egressIPScheduler, which takes the spec of Egress and ExternalIPPool and the state of the memberlist cluster as inputs, and generates scheduling results deterministically.
According to the benchmark test, scheduling 1,000 Egresses among 1,000 Nodes once takes less than 3ms.
The PR also includes the following improvement:
A global max-egress-ips may not work for a cluster that consists of different instance types of Nodes. The PR adds support for a per-Node max-egress-ips annotation, with which Nodes can be configured with different capacities via their annotations. It also makes it possible to dynamically adjust a Node's capacity at runtime and to configure Node capacity post-deployment.
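As an illustration of how the per-Node annotation could be consumed, here is a small Go sketch that resolves a Node's capacity from its annotation and falls back to the cluster-wide default. The annotation key string used below is an assumption made for the example (the real key is defined by NodeMaxEgressIPsAnnotationKey in pkg/agent/types/annotations.go and may differ), and the helper itself is hypothetical, not Antrea's code.

```go
package scheduler

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// nodeMaxEgressIPsAnnotation is assumed here purely for illustration; the
// actual key string is defined by NodeMaxEgressIPsAnnotationKey.
const nodeMaxEgressIPsAnnotation = "node.antrea.io/max-egress-ips"

// getMaxEgressIPs resolves a Node's Egress IP capacity: the per-Node
// annotation wins when present and valid, otherwise the cluster-wide default
// applies.
func getMaxEgressIPs(node *corev1.Node, defaultMax int) int {
	value, ok := node.Annotations[nodeMaxEgressIPsAnnotation]
	if !ok {
		return defaultMax
	}
	maxEgressIPs, err := strconv.Atoi(value)
	if err != nil || maxEgressIPs < 0 {
		// Ignore malformed or negative values and keep the default.
		return defaultMax
	}
	return maxEgressIPs
}
```

With this shape, changing a Node's annotation at runtime simply changes the value returned for that Node on the next scheduling pass, which is what makes post-deployment capacity tuning possible.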