fix: apply large ipv4 neigh GC settings to nodes of all sizes #2732
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #2732      +/-   ##
==========================================
+ Coverage   72.33%   72.33%    +<.01%
==========================================
  Files         137      137
  Lines       25317    25317
==========================================
+ Hits        18313    18314        +1
+ Misses       5948     5947        -1
  Partials     1056     1056
Makes sense to me based on your description, but I don't know why this tuning was restricted to big boxes only. Let's get more eyes on this to be safe.
These configurations, as far as I understand, set the allowable number of ARP entries to cache before garbage collection kicks in. I assume that ARP cache garbage collection operates according to a single criterion: "if there are more entries in the ARP cache than are allowed, delete the n oldest entries until the cache entry count drops below the maximum allowed number." In practice, does this mean that the cache entry count is the effective ARP entry count in general? I.e., assuming a VM is continually thrashing at the maximum count, does this mean additional latency in practice for layer 2 operations (e.g., establishing a MAC <--> IP mapping), which bubbles up to higher abstraction layers and affects workloads?

Also, was the original change to increase the ARP cache entries only for "8 core VMs" meant to ensure that we don't impose memory requirements on VMs that don't have a lot of memory headroom? Otherwise I don't know why we'd set this value based on VM size, as the amount of ARP activity should have nothing to do with the VM size and everything to do with the scale and behaviors of the cluster the node is participating in.

Hope that makes sense; I just want to clearly document this change and proceed from a purposeful understanding of why we're making it.
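For reference (not part of this PR or its review), here is a minimal Go sketch of how one could inspect these thresholds on a node; it simply reads the procfs files that back sysctl net.ipv4.neigh.default.gc_thresh1/2/3, and the comments paraphrase the standard kernel semantics of the three values.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// readThresh returns one of the kernel's IPv4 neighbor (ARP) table
// garbage-collection thresholds from procfs.
func readThresh(name string) (string, error) {
	b, err := os.ReadFile("/proc/sys/net/ipv4/neigh/default/" + name)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	// Rough semantics of the three thresholds:
	//   gc_thresh1 - below this many entries, the GC does not run at all
	//   gc_thresh2 - soft limit; entries above it are reclaimed aggressively
	//   gc_thresh3 - hard cap; past it, creating new entries can fail, which
	//                shows up as "neighbour table overflow" errors
	for _, t := range []string{"gc_thresh1", "gc_thresh2", "gc_thresh3"} {
		v, err := readThresh(t)
		if err != nil {
			fmt.Fprintf(os.Stderr, "could not read %s: %v\n", t, err)
			continue
		}
		fmt.Printf("net.ipv4.neigh.default.%s = %s\n", t, v)
	}
}
```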
@jackfrancis, regarding your questions: for #1, yes, you are correct. Hub-like pods (CoreDNS, for example) that communicate with MANY other pods need a much larger ARP cache on the node to store all of those MAC <-> IP mappings. When ARP cache thrashing happens, you will see intermittent DNS resolution failures due to ARP cache overflow errors.

For #2, yes, there will be some memory overhead with larger values for these settings. That is why I wanted to be safe and apply them only to large nodes initially. However, from CRI, we noticed that customers can hit this issue with 4-core nodes and 1500+ pods. After applying the settings, their issue was gone. So we think it is safe to apply the settings to all nodes now.
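To make the failure mode concrete: once the neighbor table reaches the hard cap (gc_thresh3), the kernel can refuse to create new entries and logs "neighbour table overflow" style messages, so traffic to peers whose MACs are not yet cached (including DNS queries involving CoreDNS) fails intermittently. Below is a hedged diagnostic sketch, assuming a Linux node with /proc mounted, that compares the live ARP entry count against the hard cap.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// countARPEntries counts the node's current IPv4 neighbor (ARP) entries by
// reading /proc/net/arp, which has one header line followed by one line per entry.
func countARPEntries() (int, error) {
	f, err := os.Open("/proc/net/arp")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	count := -1 // start at -1 to skip the header line
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		count++
	}
	return count, scanner.Err()
}

func main() {
	entries, err := countARPEntries()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	raw, err := os.ReadFile("/proc/sys/net/ipv4/neigh/default/gc_thresh3")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	hardMax, _ := strconv.Atoi(strings.TrimSpace(string(raw)))

	fmt.Printf("ARP entries: %d, gc_thresh3 (hard cap): %d\n", entries, hardMax)
	if hardMax > 0 && entries >= hardMax {
		fmt.Println("at the hard cap: expect neighbour table overflow and flaky lookups")
	}
}
```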
Can we determine the memory overhead? For clusters with hundreds of nodes, it's possible that 8k is still not enough. I wonder if we can set a follow-up item to ensure we aren't being unnecessarily conservative. It would be nice to document exactly how much memory overhead we're adding to the kubelet runtime.
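On the memory question, a back-of-the-envelope sketch may help frame the follow-up item. The per-entry cost below is an assumption (the real size of a kernel neighbour entry is an implementation detail that varies by kernel version and architecture), but even with a generous estimate the overhead of a larger cap stays small relative to node memory.

```go
package main

import "fmt"

func main() {
	// Assumed per-entry cost of a kernel neighbour entry. The real figure
	// depends on kernel version and architecture; it is on the order of a
	// few hundred bytes per entry, not kilobytes.
	const assumedBytesPerEntry = 400

	for _, entries := range []int{1024, 8192, 65536} {
		mib := float64(entries*assumedBytesPerEntry) / (1 << 20)
		fmt.Printf("%6d entries -> ~%.1f MiB\n", entries, mib)
	}
}
```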
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, mboersma, xizhamsft

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
@jackfrancis, yes, I will make sure to include a follow-up work item in the large-node work.
Reason for Change:
We should apply the large IPv4 neigh GC settings to all nodes (instead of only nodes with more than 8 cores). This is needed because we have seen many cases where CoreDNS works only intermittently due to this issue. CoreDNS may not be running on a large node, but it still has to handle connections from many other pods.
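For readers who want to see what "applying the settings" means on a node, here is a minimal sketch of writing the thresholds via procfs. This is not the PR's actual implementation (which presumably flows through the project's sysctl/cloud-init configuration), and the numeric values are illustrative placeholders, not necessarily the values this change ships.

```go
package main

import (
	"fmt"
	"os"
)

// Illustrative values only: the exact thresholds shipped by this change live
// in the repo's sysctl configuration, not in this snippet.
var neighGCThresholds = map[string]string{
	"gc_thresh1": "4096",
	"gc_thresh2": "8192",
	"gc_thresh3": "16384",
}

// applyNeighGCThresholds writes the IPv4 neighbor GC thresholds to procfs,
// which has the same effect as running
// sysctl -w net.ipv4.neigh.default.gc_threshN=<value> for each threshold.
func applyNeighGCThresholds() error {
	for name, value := range neighGCThresholds {
		path := "/proc/sys/net/ipv4/neigh/default/" + name
		if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
			return fmt.Errorf("setting %s: %w", name, err)
		}
	}
	return nil
}

func main() {
	if err := applyNeighGCThresholds(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("IPv4 neighbor GC thresholds applied")
}
```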
Issue Fixed:
Requirements:
Notes: