Use MPMC bounded queue for group freelist #1472
Merged
This addresses
by employing a smarter data structure for the group freelist (an MPMC bounded queue). It also moves the rebalancing housekeeping into `packet.free`, and keeps limits on the upper bound of work performed in any rebalance/reclaim step.
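For orientation, below is a minimal sketch of the kind of structure meant here: a Vyukov-style bounded MPMC queue built on per-cell sequence numbers and C11 atomics. This is an illustrative stand-in, not the group-freelist code in this branch; the names, the fixed `QUEUE_SIZE`, and the `void *` payload are all assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SIZE 1024          /* must be a power of two (hypothetical size) */

struct cell {
    _Atomic size_t sequence;     /* ticket saying whose turn this slot is */
    void *data;                  /* payload, e.g. a freed packet */
};

struct mpmc_queue {
    struct cell cells[QUEUE_SIZE];
    _Atomic size_t enqueue_pos;  /* next slot a producer will claim */
    _Atomic size_t dequeue_pos;  /* next slot a consumer will claim */
};

static void mpmc_init(struct mpmc_queue *q) {
    for (size_t i = 0; i < QUEUE_SIZE; i++)
        atomic_store_explicit(&q->cells[i].sequence, i, memory_order_relaxed);
    atomic_store_explicit(&q->enqueue_pos, 0, memory_order_relaxed);
    atomic_store_explicit(&q->dequeue_pos, 0, memory_order_relaxed);
}

/* Returns false when the queue is full (caller can fall back elsewhere). */
static bool mpmc_enqueue(struct mpmc_queue *q, void *data) {
    size_t pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
    for (;;) {
        struct cell *cell = &q->cells[pos & (QUEUE_SIZE - 1)];
        size_t seq = atomic_load_explicit(&cell->sequence, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)pos;
        if (diff == 0) {
            /* Slot is free for this ticket: try to claim it. */
            if (atomic_compare_exchange_weak_explicit(&q->enqueue_pos, &pos, pos + 1,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed)) {
                cell->data = data;
                atomic_store_explicit(&cell->sequence, pos + 1, memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;        /* queue is full */
        } else {
            pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
        }
    }
}

/* Returns false when the queue is empty. */
static bool mpmc_dequeue(struct mpmc_queue *q, void **data) {
    size_t pos = atomic_load_explicit(&q->dequeue_pos, memory_order_relaxed);
    for (;;) {
        struct cell *cell = &q->cells[pos & (QUEUE_SIZE - 1)];
        size_t seq = atomic_load_explicit(&cell->sequence, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)(pos + 1);
        if (diff == 0) {
            /* Slot holds data for this ticket: try to claim it. */
            if (atomic_compare_exchange_weak_explicit(&q->dequeue_pos, &pos, pos + 1,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed)) {
                *data = cell->data;
                atomic_store_explicit(&cell->sequence, pos + QUEUE_SIZE, memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;        /* queue is empty */
        } else {
            pos = atomic_load_explicit(&q->dequeue_pos, memory_order_relaxed);
        }
    }
}
```

A full (or empty) queue fails fast instead of blocking, which pairs naturally with keeping the per-call rebalance/reclaim work bounded.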
In the density plots below we have `master` in red and this branch in blue (green was an alternative WIP branch).

This certainly improves latency and thus performance of interlinks; however, it does not make interlinks scale without restriction. The benchmarking I did seems to show that sharing memory between cores still turns into a bottleneck, and depending on your CPU architecture you are going to run into that sooner or later. I compared results between Intel and EPYC machines and they are quite different, but for here and now I'm going to focus only on EPYC as an example.
In the plot above we compare latencies by number of receivers for a single transmitter, and we see a significant blowup of latencies once a single transmitter has more than 5 receivers. Now why is that?
If we look at the topology of our CPU as reported by AMD uProf we can get a hint:
Each CCD spans six CPU cores. So while our workload fits a single CCD we get OK perf (~60Mpps), but as soon as we add a receiver running on a distinct CCD, latency and perf tank (~10Mpps).
The above diagram of the CPU's architecture/topology gives some hints. Each CCD houses two CCXs, and the cores in a CCX share an L3 cache. I am assuming the CCXs in a CCD also have faster interconnects to each other than to a CCX in a remote CCD?
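As an aside, one way to keep such a workload inside a single CCD is to pin its processes to the cores that share that CCD's L3. A small hypothetical sketch using `sched_setaffinity`; the core IDs listed for "CCD 0" are placeholders, and the real IDs come from the machine's reported topology:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a given set of cores (e.g. the cores of one CCD). */
static int pin_to_cores(const int *cores, int ncores) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < ncores; i++)
        CPU_SET(cores[i], &set);
    /* pid 0 = the calling process */
    return sched_setaffinity(0, sizeof(set), &set);
}

int main(void) {
    int ccd0_cores[] = { 0, 1, 2, 3, 4, 5 };   /* assumed: the six cores of CCD 0 */
    if (pin_to_cores(ccd0_cores, 6) != 0)
        perror("sched_setaffinity");
    return 0;
}
```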
Anyways, if I look at some PMU counters using AMD uProf, we can maybe see why a workload distributed across CCDs fares worse (take this with some salt, this is me reading the tea leaves): the cross-CCD workload's `DCFillsFromL3orDiffL2` is higher, it has more `L2DtlbMiss`es, and it performs fewer `DCFillsFromLocalMemory`.