`--train-cover` performance on Windows experiences a critical bottleneck at `Constructing partial suffix array` #4185

akrieger · 2024-10-31T22:04:43Z

Describe the bug
Across multiple compilers (msvc, clang, gcc) and C runtime libraries (ucrt, cygwin, msvcrt), the `qsort` step of `--train-cover` takes an extremely long time on Windows. Profiling an msvc-built zstd suggests that `qsort` is not inlined, so every comparison is a call from `qsort` into zstd's comparator, which in turn calls a non-inlined `memcmp`. That said, I could be misinterpreting the data and would need to look at the actual asm to be sure. It may instead be pessimal pivot selection leading to worst-case performance.

Training on a dataset of 100MB takes almost 45 minutes even with full parallelism. A naive reimplementation in C++ which calls std::stable_sort(std::execution::par_unseq, ...) completes the sort step in seconds.
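
For illustration, here is a minimal sketch of that kind of reimplementation. It is not the code from the gist and not zstd's code: the function name `sortDmers` and the layout (one contiguous byte buffer, 32-bit offsets, keys of `d` bytes) are assumptions made for the example. It sorts the start offsets of every `d`-byte window by the bytes of that window, with the comparator written as a lambda so the compiler is free to inline it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <numeric>
#include <vector>

// Hypothetical helper (name and signature are assumptions for this sketch).
// Returns the start offsets of all d-byte windows in `data`, ordered by the
// bytes of each window. Windows with identical bytes keep ascending offsets
// because std::stable_sort preserves the initial 0, 1, 2, ... order.
std::vector<std::uint32_t> sortDmers(const std::vector<unsigned char>& data,
                                     std::size_t d) {
    assert(d > 0 && d <= data.size());

    // One entry per window start: 0 .. size - d.
    std::vector<std::uint32_t> offsets(data.size() - d + 1);
    std::iota(offsets.begin(), offsets.end(), 0u);

    const unsigned char* base = data.data();
    std::stable_sort(offsets.begin(), offsets.end(),
                     [base, d](std::uint32_t lhs, std::uint32_t rhs) {
                         // Compare the first d bytes at each offset.
                         return std::memcmp(base + lhs, base + rhs, d) < 0;
                     });
    return offsets;
}
```

The parallel variant would pass `std::execution::par_unseq` (from `<execution>`) as the first argument to `std::stable_sort`. It is left out of the sketch because some toolchains need an extra threading library to link it.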

To Reproduce
Steps to reproduce the behavior:

1. Get a 100+ MB training dataset.
2. Run `zstd.exe --train-cover=steps=512,d=10 -T0 -r data --maxdict=100KB -o data.dict --memory-limit 100MB -v`
3. Observe the apparent hang at `Constructing partial suffix array`.
4. Compare against the same training set on Linux.

Expected behavior

--train-cover can be used on larger datasets without issue on all operating systems.

Screenshots and charts
qsort implementation performance (training did not finish; this is a 3 minute 21 second snapshot of the `Constructing partial suffix array` step):

C++ implementation with par_unseq (completes in seconds with less than half the total CPU time; RtlUserThreadStart is the rollup of the threads created to do the parallel sort):

C++ single-threaded (~30 seconds wall time):

Desktop (please complete the following information):

OS: Windows
Version: 11
Compiler: msvc VS2022 (also reproduces with zstd prebuilts in msys2 environments built with clang, gcc against cygwin, msvc, and ucrt c runtimes https://www.msys2.org/docs/environments/)
Flags: "Release" visual studio configuration
Other relevant hardware specs: 32/64 threadripper
Build system: visual studio

Additional context
I don't think I botched the C++ implementation, but I may have broken something with all the void* casts happening. The biggest difference is I had to change the 'tiebreaker' because the pointer value is not passed to the comparator. https://gist.github.com/akrieger/c023dad6ffe5eac7d44a3a34a3dc7721
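
To make the tiebreaker point concrete, here is one possible way to break ties when the comparator only sees element values rather than their addresses. This is a sketch under assumptions, not the gist's code: `DmerLess` and `sortDmersStrict` are hypothetical names, and the data layout (contiguous bytes, 32-bit offsets, `d`-byte keys) is assumed. Ties on the key bytes are broken by the offset value itself, which gives a strict total order and so also works with a non-stable sort.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <numeric>
#include <vector>

// Hypothetical comparator: orders offsets by the d bytes they point at,
// then by the offset value when those bytes are identical.
struct DmerLess {
    const unsigned char* base;  // start of the sample buffer
    std::size_t d;              // key length in bytes

    bool operator()(std::uint32_t lhs, std::uint32_t rhs) const {
        const int c = std::memcmp(base + lhs, base + rhs, d);
        if (c != 0) return c < 0;
        return lhs < rhs;  // tiebreak on the value, not on an element address
    }
};

// Hypothetical helper: returns an empty vector when no d-byte window fits.
std::vector<std::uint32_t> sortDmersStrict(
        const std::vector<unsigned char>& data, std::size_t d) {
    if (d == 0 || d > data.size()) return {};

    std::vector<std::uint32_t> offsets(data.size() - d + 1);
    std::iota(offsets.begin(), offsets.end(), 0u);

    // No two distinct offsets compare equal, so std::sort yields one
    // well-defined order.
    std::sort(offsets.begin(), offsets.end(), DmerLess{data.data(), d});
    return offsets;
}
```

When the offsets start out as 0, 1, 2, ..., this produces the same order as a stable sort on the key bytes alone.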

Cyan4973 assigned terrelln Oct 31, 2024
