Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Native libraries use mimalloc as global allocator #1249

Merged
merged 14 commits into from
Mar 15, 2024
Merged

Conversation

idavis
Copy link
Collaborator

@idavis idavis commented Mar 10, 2024

This PR moves the global allocator to MS's mimalloc. The performance gains are significant, and it adds 98KB to the Python wheel and 215KB to the native dylib used in the python wheel. This dependency adds 2.5s to the build time.

We can investigate wasm usage in the future, but this would likely require having a wasi-libc installation available for the build process to give mimalloc a wasi compatible libc to link against.

Benchmarks on Apple M1 Max 64GB

Teleport evaluation     time:   [44.024 µs 44.043 µs 44.063 µs]
                        change: [-21.789% -21.595% -21.420%] (p = 0.00 < 0.05)
                        Performance has improved.

Deutsch-Jozsa evaluation
                        time:   [3.0302 ms 3.0336 ms 3.0375 ms]
                        change: [-23.878% -23.671% -23.493%] (p = 0.00 < 0.05)
                        Performance has improved.

Large file parity evaluation
                        time:   [28.956 ms 28.972 ms 28.989 ms]
                        change: [-1.9178% -1.8288% -1.7431%] (p = 0.00 < 0.05)
                        Performance has improved.

Array append evaluation time:   [306.58 µs 306.78 µs 307.02 µs]
                        change: [-14.312% -14.068% -13.881%] (p = 0.00 < 0.05)
                        Performance has improved.

Array update evaluation time:   [304.71 µs 304.84 µs 305.00 µs]
                        change: [-10.777% -10.651% -10.537%] (p = 0.00 < 0.05)
                        Performance has improved.

Array literal evaluation
                        time:   [144.03 µs 144.32 µs 144.68 µs]
                        change: [-39.800% -39.455% -39.171%] (p = 0.00 < 0.05)
                        Performance has improved.

Large nested iteration  time:   [30.418 ms 30.446 ms 30.476 ms]
                        change: [-11.368% -11.252% -11.131%] (p = 0.00 < 0.05)
                        Performance has improved.

Large input file        time:   [15.869 ms 15.883 ms 15.900 ms]
                        change: [-30.391% -30.267% -30.147%] (p = 0.00 < 0.05)
                        Performance has improved.

Standard library        time:   [9.5701 ms 9.5801 ms 9.5919 ms]
                        change: [-28.527% -28.371% -28.224%] (p = 0.00 < 0.05)
                        Performance has improved.

@idavis idavis self-assigned this Mar 10, 2024
Copy link

Benchmark for 01a59b2

Click to view benchmark
Test Base PR %
Array append evaluation 757.5±5.36µs 740.1±4.34µs -2.30%
Array literal evaluation 563.7±45.41µs 291.8±2.25µs -48.23%
Array update evaluation 752.7±2.70µs 748.3±3.94µs -0.58%
Deutsch-Jozsa evaluation 7.0±0.06ms 6.2±0.05ms -11.43%
Large file parity evaluation 35.1±0.08ms 35.0±0.44ms -0.28%
Large input file 41.2±1.00ms 31.5±0.85ms -23.54%
Large nested iteration 75.1±3.08ms 76.8±0.91ms +2.26%
Standard library 23.2±0.65ms 17.1±0.47ms -26.29%
Teleport evaluation 95.9±1.55µs 89.8±1.97µs -6.36%

Copy link
Collaborator

@swernli swernli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love it!

Copy link

Benchmark for 634f970

Click to view benchmark
Test Base PR %
Array append evaluation 758.1±4.86µs 749.9±3.88µs -1.08%
Array literal evaluation 539.2±29.12µs 332.8±2.60µs -38.28%
Array update evaluation 753.5±6.02µs 756.4±3.21µs +0.38%
Deutsch-Jozsa evaluation 7.0±0.05ms 6.6±0.10ms -5.71%
Large file parity evaluation 35.3±0.73ms 35.4±0.73ms +0.28%
Large input file 39.9±0.74ms 30.1±0.71ms -24.56%
Large nested iteration 76.7±1.21ms 77.0±0.89ms +0.39%
Standard library 22.8±0.56ms 16.9±0.36ms -25.88%
Teleport evaluation 96.2±1.50µs 95.1±1.68µs -1.14%

Copy link

Benchmark for 4acde1f

Click to view benchmark
Test Base PR %
Array append evaluation 760.0±6.16µs 741.8±3.87µs -2.39%
Array literal evaluation 513.6±9.50µs 303.5±10.63µs -40.91%
Array update evaluation 764.8±23.59µs 754.0±4.49µs -1.41%
Deutsch-Jozsa evaluation 7.0±0.15ms 6.4±0.05ms -8.57%
Large file parity evaluation 35.5±0.18ms 35.6±0.65ms +0.28%
Large input file 41.7±1.11ms 34.0±1.42ms -18.47%
Large nested iteration 75.4±1.21ms 77.8±0.83ms +3.18%
Standard library 23.4±0.60ms 17.6±0.41ms -24.79%
Teleport evaluation 95.8±2.51µs 94.6±1.44µs -1.25%

@billti
Copy link
Member

billti commented Mar 12, 2024

Does this mean we have a dependency on a C compiler in our build now also if users want to set up a Q# development environment? (Which I guess we kind of implicitly do anyway as Rust on Windows requires MSVC or C++ Build Tools are installed, Linux generally has a C compiler already... but I'm not sure if that's already a requirement on macOS).

@idavis
Copy link
Collaborator Author

idavis commented Mar 12, 2024

Does this mean we have a dependency on a C compiler in our build now also if users want to set up a Q# development environment? (Which I guess we kind of implicitly do anyway as Rust on Windows requires MSVC or C++ Build Tools are installed, Linux generally has a C compiler already... but I'm not sure if that's already a requirement on macOS).

Yes, you need to have a C compiler handy. I removed the dep on Ninja so that you don't need the generator. You also need cmake on your path. We can get rid of the cmake dep, but we'd need a submodule.

Copy link

Benchmark for 52c76e8

Click to view benchmark
Test Base PR %
Array append evaluation 758.1±11.58µs 742.7±7.77µs -2.03%
Array literal evaluation 505.8±5.76µs 300.2±3.20µs -40.65%
Array update evaluation 754.0±16.55µs 751.7±4.36µs -0.31%
Deutsch-Jozsa evaluation 7.0±0.05ms 6.5±0.06ms -7.14%
Large file parity evaluation 35.1±0.07ms 35.3±0.30ms +0.57%
Large input file 40.4±0.93ms 31.4±0.64ms -22.28%
Large nested iteration 77.5±1.14ms 77.1±1.68ms -0.52%
Standard library 22.8±0.58ms 17.1±0.57ms -25.00%
Teleport evaluation 95.7±2.38µs 94.2±1.63µs -1.57%

Copy link

Benchmark for 5c3e644

Click to view benchmark
Test Base PR %
Array append evaluation 755.6±4.55µs 758.4±5.45µs +0.37%
Array literal evaluation 513.0±25.59µs 330.5±3.80µs -35.58%
Array update evaluation 751.4±4.77µs 757.2±2.81µs +0.77%
Deutsch-Jozsa evaluation 7.0±0.06ms 6.5±0.07ms -7.14%
Large file parity evaluation 35.2±0.06ms 35.2±0.06ms 0.00%
Large input file 42.6±1.08ms 38.5±0.69ms -9.62%
Large nested iteration 76.1±0.39ms 74.9±0.55ms -1.58%
Standard library 23.6±0.58ms 19.7±0.55ms -16.53%
Teleport evaluation 96.2±3.41µs 94.1±1.76µs -2.18%

Copy link

Benchmark for 72e6dac

Click to view benchmark
Test Base PR %
Array append evaluation 756.3±8.54µs 743.0±2.91µs -1.76%
Array literal evaluation 508.3±12.07µs 304.7±10.57µs -40.06%
Array update evaluation 757.5±4.23µs 757.3±3.92µs -0.03%
Deutsch-Jozsa evaluation 7.1±0.07ms 6.5±0.10ms -8.45%
Large file parity evaluation 35.8±0.16ms 35.5±0.42ms -0.84%
Large input file 41.5±1.48ms 30.2±0.98ms -27.23%
Large nested iteration 76.3±0.42ms 78.1±0.57ms +2.36%
Standard library 23.3±0.50ms 16.8±0.10ms -27.90%
Teleport evaluation 95.8±1.72µs 95.6±2.39µs -0.21%

Copy link

Benchmark for 1f33ca3

Click to view benchmark
Test Base PR %
Array append evaluation 756.6±4.03µs 1066.5±6.96µs +40.96%
Array literal evaluation 512.1±11.12µs 1260.6±13.24µs +146.16%
Array update evaluation 761.3±45.57µs 891.5±4.91µs +17.10%
Deutsch-Jozsa evaluation 7.2±0.05ms 13.6±0.40ms +88.89%
Large file parity evaluation 35.2±0.13ms 39.2±0.16ms +11.36%
Large input file 39.7±0.69ms 74.6±0.54ms +87.91%
Large nested iteration 77.1±0.54ms 94.1±3.85ms +22.05%
Standard library 22.6±0.55ms 40.3±0.47ms +78.32%
Teleport evaluation 96.9±1.91µs 184.7±4.99µs +90.61%

Copy link

Benchmark for fc1ff57

Click to view benchmark
Test Base PR %
Array append evaluation 759.3±4.50µs 750.8±2.99µs -1.12%
Array literal evaluation 520.0±21.55µs 326.6±2.10µs -37.19%
Array update evaluation 755.6±8.69µs 754.1±2.46µs -0.20%
Deutsch-Jozsa evaluation 7.0±0.08ms 6.3±0.07ms -10.00%
Large file parity evaluation 35.2±0.14ms 35.3±0.78ms +0.28%
Large input file 40.3±0.89ms 33.4±0.72ms -17.12%
Large nested iteration 77.6±1.17ms 77.8±0.92ms +0.26%
Standard library 23.1±0.63ms 18.9±0.95ms -18.18%
Teleport evaluation 95.8±1.65µs 91.3±1.51µs -4.70%

@idavis idavis requested a review from swernli March 13, 2024 00:00
@idavis idavis force-pushed the iadavis/mimalloc branch from ed83a4a to db58fff Compare March 13, 2024 17:16
Copy link

Benchmark for 0579203

Click to view benchmark
Test Base PR %
Array append evaluation 755.5±2.93µs 751.1±11.03µs -0.58%
Array literal evaluation 511.4±15.14µs 296.0±3.26µs -42.12%
Array update evaluation 751.6±5.11µs 757.3±13.93µs +0.76%
Deutsch-Jozsa evaluation 7.0±0.11ms 6.4±0.33ms -8.57%
Large file parity evaluation 35.2±0.09ms 35.1±0.14ms -0.28%
Large input file 40.8±1.41ms 30.2±1.59ms -25.98%
Large nested iteration 76.2±0.61ms 77.8±1.08ms +2.10%
Standard library 22.8±0.50ms 16.7±0.62ms -26.75%
Teleport evaluation 95.9±3.34µs 92.6±1.58µs -3.44%

@idavis
Copy link
Collaborator Author

idavis commented Mar 13, 2024

Build warnings on macos when xcode isn't installed will be fixed by rust-lang/cc-rs#1007

Copy link

Benchmark for f40f724

Click to view benchmark
Test Base PR %
Array append evaluation 756.7±6.54µs 765.3±3.10µs +1.14%
Array literal evaluation 521.8±34.67µs 297.5±5.11µs -42.99%
Array update evaluation 756.8±23.61µs 771.0±3.46µs +1.88%
Deutsch-Jozsa evaluation 7.0±0.06ms 6.3±0.09ms -10.00%
Large file parity evaluation 35.3±0.15ms 35.3±1.08ms 0.00%
Large input file 42.9±1.46ms 33.3±0.64ms -22.38%
Large nested iteration 74.8±0.64ms 81.6±3.24ms +9.09%
Standard library 24.1±0.73ms 18.5±0.90ms -23.24%
Teleport evaluation 96.0±2.89µs 93.8±1.88µs -2.29%

build.py Outdated Show resolved Hide resolved
Cargo.toml Show resolved Hide resolved
Copy link

Benchmark for 7cba09a

Click to view benchmark
Test Base PR %
Array append evaluation 757.1±4.59µs 750.8±7.21µs -0.83%
Array literal evaluation 539.4±16.03µs 309.7±2.05µs -42.58%
Array update evaluation 751.6±2.71µs 758.6±13.36µs +0.93%
Deutsch-Jozsa evaluation 7.0±0.23ms 6.3±0.05ms -10.00%
Large file parity evaluation 35.0±0.11ms 34.9±0.14ms -0.29%
Large input file 39.7±1.02ms 30.5±2.21ms -23.17%
Large nested iteration 77.1±0.67ms 78.1±1.20ms +1.30%
Standard library 23.0±0.52ms 16.4±0.29ms -28.70%
Teleport evaluation 96.7±1.97µs 92.1±2.46µs -4.76%

Copy link

Benchmark for c2294fb

Click to view benchmark
Test Base PR %
Array append evaluation 755.2±4.93µs 752.6±47.80µs -0.34%
Array literal evaluation 517.6±24.20µs 297.9±1.94µs -42.45%
Array update evaluation 752.9±3.65µs 750.4±4.78µs -0.33%
Deutsch-Jozsa evaluation 7.0±0.05ms 6.2±0.03ms -11.43%
Large file parity evaluation 35.0±0.21ms 34.8±0.58ms -0.57%
Large input file 39.5±1.05ms 33.6±1.76ms -14.94%
Large nested iteration 76.3±0.57ms 77.8±1.89ms +1.97%
Standard library 22.8±0.46ms 18.7±1.59ms -17.98%
Teleport evaluation 95.7±1.77µs 91.4±2.38µs -4.49%

README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
Copy link
Member

@minestarks minestarks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm frankly on the fence about taking this. The perf gains are undeniable, but I have several reservations about the change overall.

  1. WASM is one our two main use cases and this change wouldn't improve it.
  2. I'd rather reassess impact after we look into optimizations that reduce allocations across the board, such as Optimize evaluator perf via Control Flow Graph #1261 . I still think we have a lot of low hanging fruit here.

If we do decide it's worth it, these would make the change more palatable for me:

  1. An easy way to opt out of the CMake/C dependency for local builds. We have external contributors and we love them! One more prereq before you can build at all can be a barrier to engagement.
  2. Can we use an external implementation via crates.io? This code is a perfect candidate for just using OSS, it's a very general implementation, and exactly the kind of code I wouldn't want to be stuck maintaining.

Copy link

Benchmark for c29e736

Click to view benchmark
Test Base PR %
Array append evaluation 753.7±7.46µs 757.7±4.38µs +0.53%
Array literal evaluation 507.8±21.43µs 328.0±1.76µs -35.41%
Array update evaluation 750.6±3.38µs 753.8±5.50µs +0.43%
Deutsch-Jozsa evaluation 7.0±0.04ms 6.3±0.05ms -10.00%
Large file parity evaluation 35.0±0.12ms 35.1±0.66ms +0.29%
Large input file 40.8±2.42ms 29.7±0.95ms -27.21%
Large nested iteration 74.5±0.88ms 75.4±0.72ms +1.21%
Standard library 22.8±0.53ms 16.4±0.25ms -28.07%
Teleport evaluation 95.4±1.80µs 91.4±2.16µs -4.19%

Copy link

Benchmark for 8eefe59

Click to view benchmark
Test Base PR %
Array append evaluation 753.3±5.40µs 741.6±3.01µs -1.55%
Array literal evaluation 516.7±40.50µs 300.1±6.07µs -41.92%
Array update evaluation 754.3±36.92µs 747.6±2.25µs -0.89%
Deutsch-Jozsa evaluation 7.0±0.06ms 6.2±0.05ms -11.43%
Large file parity evaluation 35.0±0.19ms 34.8±0.07ms -0.57%
Large input file 39.9±0.85ms 28.7±0.49ms -28.07%
Large nested iteration 78.6±5.23ms 77.5±2.09ms -1.40%
Standard library 22.8±0.45ms 16.4±0.21ms -28.07%
Teleport evaluation 96.0±1.50µs 92.2±3.99µs -3.96%

Copy link
Member

@minestarks minestarks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the concerns, excited about the perf improvement!

@idavis idavis added this pull request to the merge queue Mar 15, 2024
Merged via the queue into main with commit 93f01d0 Mar 15, 2024
16 checks passed
@idavis idavis deleted the iadavis/mimalloc branch March 15, 2024 16:18
billti added a commit that referenced this pull request Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants