Is your feature request related to a problem or challenge? Please describe what you are trying to do.
During shuffle, we need to preallocate memory for all the reducers. If we set prefer spill to false (currently it's true, but it will change to false soon; see issue 840), we need to cache all the split partitions until we run out of memory, which can reach gigabytes. If we still use 4 KB pages, this leads to a huge number of page faults during allocation and very high DTLB misses during the split.
Describe the solution you'd like
Enable large page support:
- Allocate a large, 2 MB-aligned buffer for all the split batches.
- Call madvise(addr, length, MADV_HUGEPAGE) immediately so the pages are allocated as 2 MB huge pages (see the sketch after this list).
- Slice the buffer for each reducer and each column.
- Split into the dedicated column buffers.
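As a rough illustration of the first two steps, here is a minimal C++ sketch assuming Linux with transparent huge pages available; the helper names and sizing are illustrative, not the project's actual code:

```cpp
#include <sys/mman.h>   // madvise, MADV_HUGEPAGE
#include <cstdint>
#include <cstdlib>      // posix_memalign
#include <vector>

constexpr std::size_t kHugePageSize = 2u * 1024 * 1024;  // 2 MB

// Round the total reservation up to a 2 MB multiple so the madvise range covers whole huge pages.
static std::size_t RoundUpToHugePage(std::size_t n) {
  return (n + kHugePageSize - 1) / kHugePageSize * kHugePageSize;
}

// Allocate one large 2 MB-aligned buffer for all split batches and ask the kernel
// to back it with transparent huge pages.
static uint8_t* AllocateHugePageBacked(std::size_t total_size) {
  void* addr = nullptr;
  const std::size_t length = RoundUpToHugePage(total_size);
  if (posix_memalign(&addr, kHugePageSize, length) != 0) return nullptr;
  madvise(addr, length, MADV_HUGEPAGE);  // best-effort hint; failure is not fatal here
  return static_cast<uint8_t*>(addr);
}

// Slice the big buffer into equal per-reducer regions; each region is then further
// sliced per column by the split code.
static std::vector<uint8_t*> SlicePerReducer(uint8_t* base, std::size_t num_reducers,
                                             std::size_t bytes_per_reducer) {
  std::vector<uint8_t*> slices(num_reducers);
  for (std::size_t i = 0; i < num_reducers; ++i) slices[i] = base + i * bytes_per_reducer;
  return slices;
}
```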
The difficult part is variable-length buffer support. Currently we go through the first record batch to get the average length of each variable-length buffer, then pre-allocate memory as (rows * avg_length + 1 KB). During the split process we check whether the buffer is large enough and re-allocate if it isn't, which usually leads to a memcpy.
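For illustration only, a minimal sketch of that estimation policy using Arrow's C++ StringArray; the helper names (AverageValueLength, EstimateReservation) and the per-reducer row count are assumptions, not the project's actual code:

```cpp
#include <arrow/api.h>  // arrow::StringArray
#include <cstddef>

// Average value length observed in the first record batch's string column.
static std::size_t AverageValueLength(const arrow::StringArray& array) {
  if (array.length() == 0) return 0;
  // Total bytes used by the values buffer, divided by the number of rows.
  return static_cast<std::size_t>(array.total_values_length()) /
         static_cast<std::size_t>(array.length());
}

// Pre-allocate rows * avg_length + 1 KB; the split path later checks the remaining
// capacity and re-allocates (with a memcpy) when the estimate was too small.
static std::size_t EstimateReservation(std::size_t rows_per_reducer, std::size_t avg_length) {
  return rows_per_reducer * avg_length + 1024;
}
```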
With huge-page support we can still use this policy: if the buffer isn't large enough, allocate another 4 KB-page-based buffer for the overflow. Fortunately, most string buffers in real workloads have a fixed width.
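A minimal sketch of that overflow policy, with illustrative names (VarLenDest, AppendValue) that are not taken from the actual splitter code:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Destination for one reducer's variable-length value buffer.
struct VarLenDest {
  uint8_t* huge_slice;            // slice carved from the 2 MB-page-backed buffer
  std::size_t huge_capacity;
  std::size_t huge_used = 0;
  std::vector<uint8_t> fallback;  // ordinary (4 KB-page) heap buffer for overflow
};

// Append one value; fall back to the regular buffer instead of re-allocating and
// copying the huge-page slice when the pre-allocated estimate runs out.
static void AppendValue(VarLenDest& dest, const uint8_t* data, std::size_t size) {
  if (dest.huge_used + size <= dest.huge_capacity) {
    std::memcpy(dest.huge_slice + dest.huge_used, data, size);
    dest.huge_used += size;
  } else {
    dest.fallback.insert(dest.fallback.end(), data, data + size);
  }
}
```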
Another possible solution is to stop the split once any variable-length buffer is full, then allocate the next buffer and continue splitting. The drawback of this solution is that space in the other buffers is wasted.
Arrow doesn't support customized-alignment allocation. With jemalloc, 2 MB-aligned allocation doesn't show better performance either. Using the madvise call does reduce DTLB misses and page walks a lot, but the performance gain is small.