This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

fix: apply large ipv4 neigh GC settings to nodes of all sizes #2732

Merged: 1 commit into Azure:master on Feb 18, 2020

Conversation

@xizha162 (Contributor) commented Feb 15, 2020

Reason for Change:

We should apply the large IPv4 neigh GC settings to all nodes, not just nodes with more than 8 cores. We have seen many cases where CoreDNS works only intermittently because of this issue: CoreDNS may not be running on a large node, yet it still has to handle connections from many other pods. (An illustrative sketch of the sysctls involved follows the Notes section below.)

Issue Fixed:

Requirements:

Notes:
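
For context, the "IPv4 neigh GC settings" referenced above are the Linux kernel's ARP/neighbor table garbage-collection thresholds. The exact values applied by this PR are not visible in this conversation, so the snippet below is only an illustrative sketch with assumed values, in roughly the sysctl.d form a node-provisioning script would use:

# Illustrative sysctl.d snippet; the values are assumptions, not taken from this PR's diff
# gc_thresh1: below this many cached entries, the GC never prunes the table
# gc_thresh2: soft limit; the table may exceed this only briefly before pruning kicks in
# gc_thresh3: hard limit; once reached, new entries are dropped and resolution starts to fail
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh3 = 16384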

@acs-bot added the size/S label Feb 15, 2020
@codecov (bot) commented Feb 15, 2020

Codecov Report

Merging #2732 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2732      +/-   ##
==========================================
+ Coverage   72.33%   72.33%   +<.01%     
==========================================
  Files         137      137              
  Lines       25317    25317              
==========================================
+ Hits        18313    18314       +1     
+ Misses       5948     5947       -1     
  Partials     1056     1056

@mboersma (Member) left a comment


Makes sense to me based on your description, but I don't know why this tuning was restricted to big boxes only. Let's get more eyes on this to be safe.

@jackfrancis (Member) commented

These configurations, as far as I understand, set the allowable number of ARP entries to cache before garbage collection kicks in. I assume that ARP cache garbage collection operates according to a single criterion: "if there are more entries in the ARP cache than are allowed, delete the n oldest entries until the cache entry count drops below the maximum allowed number".

In practice, does this mean that the maximum cache size is the effective ARP entry count in general? I.e., assuming a vm is continually thrashing at the maximum count, does this in practice mean additional latency for layer 2 operations (e.g., establishing a MAC <--> IP mapping), which bubbles up to higher abstraction layers and affects workloads?

Also, about the original change to only increase the ARP cache entries for "8 core vms": was that to ensure we don't impose memory requirements on vms that don't have a lot of memory headroom? Otherwise I don't know why we'd set this value based on the vm size, since the amount of ARP activity should have nothing to do with the vm size, but instead everything to do with the scale and behavior of the cluster it is participating in.

Hope that makes sense; I just want to clearly document this change and proceed with a purposeful understanding of why we're making it.

@xizha162 (Contributor, Author) commented

@jackfrancis, regarding your questions: for #1, yes, you are correct. Hub-like pods (CoreDNS, for example) that communicate with MANY other pods need a much larger ARP cache on the node to store all those MAC <-> IP mappings. When ARP cache thrashing happens, you will see intermittent DNS resolution failures due to ARP cache overflow errors (a quick way to check for this on a node is sketched after this comment).

For #2, yes, there will be some memory overhead with larger values for these settings; that is why I wanted to be safe and apply them only to large nodes initially. However, from CRI we noticed that a customer can hit this issue with a 4-core node and 1,500+ pods. After applying the settings, their issue was gone, so we think it is safe to apply the settings to all nodes now.
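
As a hedged aside for anyone debugging the symptom described above (intermittent DNS failures from ARP cache overflow), these standard Linux commands can show neighbor-table pressure on a node; the exact kernel log wording varies by kernel version:

# Current thresholds in effect on the node
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
# Number of IPv4 neighbor entries currently cached
ip -4 neigh show | wc -l
# Kernel complaints logged when the hard limit (gc_thresh3) is hit
dmesg | grep -i "table overflow"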

@jackfrancis (Member) commented

Can we determine the memory overhead? For clusters with hundreds of nodes, it's possible that 8k entries is still not enough. I wonder if we can set a follow-up item to ensure we aren't being unnecessarily conservative. It would be nice to document exactly how much memory overhead we're adding to the kubelet runtime.
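
For a rough sense of scale (back-of-the-envelope only, not measured): a kernel neighbor entry costs on the order of a few hundred bytes, so even the larger thresholds should translate to single-digit megabytes per node, e.g.:

# Assuming roughly 400 bytes per neighbor entry (actual struct size varies by kernel build)
#  8192 entries x ~400 bytes ≈ 3.3 MB
# 16384 entries x ~400 bytes ≈ 6.6 MB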

@jackfrancis (Member) left a comment


/lgtm

@acs-bot added the lgtm label Feb 18, 2020
@jackfrancis merged commit e870f47 into Azure:master Feb 18, 2020
@acs-bot commented Feb 18, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, mboersma, xizhamsft

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [jackfrancis,mboersma]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xizha162 (Contributor, Author) commented

@jackfrancis, yes, I will make sure to include a follow-up work item in the large-node work.
