Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--train-cover performance on Windows experiences a critical bottleneck at Constructing partial suffix array #4185

Open
akrieger opened this issue Oct 31, 2024 · 0 comments
Assignees

Comments

@akrieger
Copy link

akrieger commented Oct 31, 2024

Describe the bug
For whatever reason, across multiple different compilers (msvc, clang, gcc) and C runtime libraries (ucrt, cygwin, msvcrt), the qsort step of --train-cover seems to take forever. Profiling of an msvc-built zstd shows that the qsort function body is not inline-able. This results in excessive numbers of forced function calls from qsort to zstd and then back to non-inlined memcmp. That said, I could be misinterpreting the data and need to look at the actual asm to be sure. Maybe just pessimal pivot selection and worst case performance?

Training on a dataset of 100MB takes almost 45 minutes even with full parallelism. A naive reimplementation in C++ which calls std::stable_sort(std::execution::par_unseq, ...) completes the sort step in seconds.

To Reproduce
Steps to reproduce the behavior:

  1. Get a 100+mb training dataset.
  2. Run zstd.exe --train-cover=steps=512,d=10 -T0 -r data --maxdict=100KB -o data.dict --memory-limit 100MB -v
  3. Observe functional hang at Constructing partial suffix array.
  4. Compare against same training set on linux.

Expected behavior

  • --train-cover can be used on larger datasets without issue on all operating systems.

Screenshots and charts
qsort implementation performance (training did not finish, this is a 3 minute 21 second snapshot of the Constructing partial suffix array step:
image

c++ implementation with par_unseq (completes in scant seconds with less than half the total cpu time. RtlUserThreadStart is the rollup of the threads created to do the parallel sort):
image

c++ singlethreaded (~30 seconds wall time):
image

Desktop (please complete the following information):

  • OS: Windows
  • Version: 11
  • Compiler: msvc VS2022 (also reproduces with zstd prebuilts in msys2 environments built with clang, gcc against cygwin, msvc, and ucrt c runtimes https://www.msys2.org/docs/environments/)
  • Flags: "Release" visual studio configuration
  • Other relevant hardware specs: 32/64 threadripper
  • Build system: visual studio

Additional context
I don't think I botched the C++ implementation, but I may have broken something with all the void* casts happening. The biggest difference is I had to change the 'tiebreaker' because the pointer value is not passed to the comparator. https://gist.github.com/akrieger/c023dad6ffe5eac7d44a3a34a3dc7721

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants