Use `string::resize_and_overwrite` in `bitset` to avoid buffer initialization #3904

AlexGuteniev · 2023-07-25T12:06:09Z

Towards #3858.

@AlexGuteniev suggested using basic_string::resize_and_overwrite, creating an _Ugly version for unconditional use internally. (We generally avoid adding totally novel secret machinery to Standard classes as it can become a maintenance burden, but simply having _Ugly names to access Future Technology is easy and common.)

This is what was addressed

@AlexGuteniev also suggested investigating whether branchless codegen for assigning _Elem0 vs. _Elem1 would be superior. (Note: we cannot assume that their values are consecutive, they could be 'M' and 'E'.)

This is confirmed to be non applicable. Before the change it was cmovne to set '1'. After it became setne al to set to 0 or 1, then add al, '0'. Both are branchless. I think, the compiler will be able to make efficient branchless code for non-consecutive values by going back to cmovcc

Benchmark is already there - thanks @achabense!

Benchmark results before:

------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    922 ns          924 ns       896000
BM_bitset_to_string<64, char>                 125791 ns       125558 ns         5600
BM_bitset_to_string_large_single<char>          2449 ns         2431 ns       263529
BM_bitset_to_string<7, wchar_t>                  539 ns          547 ns      1000000
BM_bitset_to_string<64, wchar_t>                4399 ns         4325 ns       144516
BM_bitset_to_string_large_single<wchar_t>       4792 ns         4708 ns       149333

after

------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    235 ns          234 ns      3200000
BM_bitset_to_string<64, char>                 119085 ns       119978 ns         5600
BM_bitset_to_string_large_single<char>          2111 ns         2100 ns       320000
BM_bitset_to_string<7, wchar_t>                  153 ns          153 ns      4480000
BM_bitset_to_string<64, wchar_t>                4106 ns         4081 ns       172308
BM_bitset_to_string_large_single<wchar_t>       2137 ns         2148 ns       320000

The anomality of BM_bitset_to_string<64, char> is caused by too angry loop unrolling, and, IMO, should be fixed in compiler, if needs fixing.

stl/inc/xstring

StephanTLavavej · 2023-07-25T19:13:37Z

stl/inc/bitset

-            if (_Subscript(_Bits - 1 - _Pos)) {
-                _Str[_Pos] = _Elem1;
+        basic_string<_Elem, _Tr, _Alloc> _Str;
+        _Str._Resize_and_overwrite(_Bits, [this, _Elem0, _Elem1](_Elem* _Buf, size_t _Len) {


No change requested: I observed that we could ignore _Len and take advantage of our knowledge that it will always be the compile-time constant _Bits, which is available without capturing. I checked the codegen, and this indeed allows the compiler to bake in the constant. However, while benchmarking showed some improvement for BM_bitset_to_string<64, char>, it showed significant pessimization for BM_bitset_to_string<15, char> .

BEFORE USING _Bits ------------------------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------------------------ BM_bitset_to_string<15, char> 140 ns 144 ns 4977778 BM_bitset_to_string<64, char> 3336 ns 3348 ns 224000 BM_bitset_to_string_large_single<char> 2008 ns 2040 ns 344615 BM_bitset_to_string<7, wchar_t> 88.3 ns 87.9 ns 7466667 BM_bitset_to_string<64, wchar_t> 2211 ns 2197 ns 298667 BM_bitset_to_string_large_single<wchar_t> 1451 ns 1430 ns 448000 AFTER USING _Bits ------------------------------------------------------------------------------------ Benchmark Time CPU Iterations ------------------------------------------------------------------------------------ BM_bitset_to_string<15, char> 640 ns 642 ns 1120000 BM_bitset_to_string<64, char> 2646 ns 2609 ns 263529 BM_bitset_to_string_large_single<char> 2010 ns 2040 ns 344615 BM_bitset_to_string<7, wchar_t> 82.3 ns 82.0 ns 8960000 BM_bitset_to_string<64, wchar_t> 2407 ns 2441 ns 320000 BM_bitset_to_string_large_single<wchar_t> 1372 ns 1360 ns 448000

I think despite some pessimization of the certain case, it is still the right thing to do

Oh, actually the pessimization is too much. I think, angry loop unrolling strikes here too.

Let's keep it as a parameter for now then

As _Bits is the natural decision but we have to use _Len for performance reasons, I think we need to add a comment for it.

Hmm, how to compare benchmark results before and after change in the same run in a convenient way? I'm doing this in a very awkward way.

Benchmark

#include <array> #include <benchmark/benchmark.h> #include <bit> #include <bitset> #include <climits> #include <cstddef> #include <cstdint> #include <random> #include <iostream> using namespace std; struct impl_old { template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>> static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs, const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) { basic_string<_Elem, _Tr, _Alloc> _Str(_Bits, _Elem0); for (size_t _Pos = 0; _Pos < _Bits; ++_Pos) { if (bs[_Bits - 1 - _Pos]) { _Str[_Pos] = _Elem1; } } return _Str; } }; struct impl_new_Bits { template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>> static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs, const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) { basic_string<_Elem, _Tr, _Alloc> _Str; _Str.resize_and_overwrite(_Bits, [&bs, _Elem0, _Elem1](_Elem* _Buf, size_t) { for (size_t _Pos = 0; _Pos < _Bits; ++_Pos) { _Buf[_Pos] = bs[_Bits - 1 - _Pos] ? _Elem1 : _Elem0; } return _Bits; }); return _Str; } }; struct impl_new_Len { template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>> static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs, const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) { basic_string<_Elem, _Tr, _Alloc> _Str; _Str.resize_and_overwrite(_Bits, [&bs, _Elem0, _Elem1](_Elem* _Buf, size_t _Len) { for (size_t _Pos = 0; _Pos < _Len; ++_Pos) { _Buf[_Pos] = bs[_Len - 1 - _Pos] ? _Elem1 : _Elem0; } return _Len; }); return _Str; } }; namespace { const auto random_bits = [] { mt19937_64 rnd{}; array<uint64_t, 32> arr; for (auto& d : arr) { d = rnd(); } return arr; }(); template <class op, size_t N, class charT> void BM_bitset_to_string(benchmark::State& state) { for (auto _ : state) { for (const auto& bits : random_bits) { bitset<N> bs{ bits }; benchmark::DoNotOptimize(op::template to_string<charT>(bs)); } } } template <class op, class charT> void BM_bitset_to_string_large_single(benchmark::State& state) { const auto large_bitset = bit_cast<bitset<CHAR_BIT * sizeof(random_bits)>>(random_bits); for (auto _ : state) { benchmark::DoNotOptimize(op::template to_string<charT>(large_bitset)); } } } // namespace #define MAKE_BENCHMARKS(impl)\ BENCHMARK(BM_bitset_to_string<impl, 15, char>)->Name("\nBM_bitset_to_string<" #impl ", 15, char>");\ BENCHMARK(BM_bitset_to_string<impl, 64, char>);\ BENCHMARK(BM_bitset_to_string_large_single<impl, char>);\ BENCHMARK(BM_bitset_to_string<impl, 7, wchar_t>);\ BENCHMARK(BM_bitset_to_string<impl, 64, wchar_t>);\ BENCHMARK(BM_bitset_to_string_large_single<impl, wchar_t>); MAKE_BENCHMARKS(impl_old); MAKE_BENCHMARKS(impl_new_Bits); MAKE_BENCHMARKS(impl_new_Len); BENCHMARK_MAIN();

(And I'm feeling nervious that the effect is very complex and can even vary between different runs)

Results

--------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------------------------------- BM_bitset_to_string<impl_old, 15, char> 1562 ns 1535 ns 448000 BM_bitset_to_string<impl_old, 64, char> 11188 ns 11230 ns 64000 BM_bitset_to_string_large_single<impl_old, char> 9014 ns 9050 ns 127758 BM_bitset_to_string<impl_old, 7, wchar_t> 915 ns 907 ns 896000 BM_bitset_to_string<impl_old, 64, wchar_t> 6158 ns 6166 ns 83627 BM_bitset_to_string_large_single<impl_old, wchar_t> 5732 ns 5755 ns 149333 BM_bitset_to_string<impl_new_Bits, 15, char> 898 ns 914 ns 768981 BM_bitset_to_string<impl_new_Bits, 64, char> 8399 ns 8438 ns 100000 BM_bitset_to_string_large_single<impl_new_Bits, char> 3976 ns 3962 ns 228761 BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 198 ns 197 ns 3733333 BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 11847 ns 11858 ns 44800 BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 7289 ns 7357 ns 148670 BM_bitset_to_string<impl_new_Len, 15, char> 550 ns 562 ns 1000000 BM_bitset_to_string<impl_new_Len, 64, char> 6000 ns 6099 ns 128085 BM_bitset_to_string_large_single<impl_new_Len, char> 3976 ns 4004 ns 320000 BM_bitset_to_string<impl_new_Len, 7, wchar_t> 135 ns 137 ns 5120000 BM_bitset_to_string<impl_new_Len, 64, wchar_t> 12374 ns 12556 ns 44800 BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 15409 ns 15346 ns 56000

Another run

--------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------------------------------- BM_bitset_to_string<impl_old, 15, char> 1731 ns 1726 ns 298667 BM_bitset_to_string<impl_old, 64, char> 7147 ns 7150 ns 89600 BM_bitset_to_string_large_single<impl_old, char> 4201 ns 4148 ns 139378 BM_bitset_to_string<impl_old, 7, wchar_t> 734 ns 732 ns 746667 BM_bitset_to_string<impl_old, 64, wchar_t> 10492 ns 10463 ns 74667 BM_bitset_to_string_large_single<impl_old, wchar_t> 5644 ns 5620 ns 122334 BM_bitset_to_string<impl_new_Bits, 15, char> 1435 ns 1416 ns 640000 BM_bitset_to_string<impl_new_Bits, 64, char> 5507 ns 5418 ns 115348 BM_bitset_to_string_large_single<impl_new_Bits, char> 3620 ns 3594 ns 247782 BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 188 ns 190 ns 2635294 BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 8363 ns 8231 ns 112000 BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 5042 ns 4933 ns 104533 BM_bitset_to_string<impl_new_Len, 15, char> 232 ns 235 ns 2986667 BM_bitset_to_string<impl_new_Len, 64, char> 6508 ns 6562 ns 100000 BM_bitset_to_string_large_single<impl_new_Len, char> 3964 ns 3941 ns 281493 BM_bitset_to_string<impl_new_Len, 7, wchar_t> 174 ns 174 ns 4480000 BM_bitset_to_string<impl_new_Len, 64, wchar_t> 15721 ns 15625 ns 56000 BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 7529 ns 7533 ns 112000

Are there problems with my comparing benchmark? I'm finding resize_and_overwrite&_Len approach stably slow for large wchar_t cases in my testing.

It may be CPU-dependent.
For _Bits the compiler is more likely to advantage of knowing _Bits and unroll loops or vectorize something.
For _Len the compiler may be restricted from doing that.

Whether the optimizations like loop unrolling are useful or harmful -- dependent on CPU type.

Also non-unrolled loop (_Len) is subject to JCC errata pessimization on some CPUs, /QIntel-jcc-erratum might help in this case https://learn.microsoft.com/ru-ru/cpp/build/reference/qintel-jcc-erratum?view=msvc-170

I'm neutral with regards to _Bits vs _Len if none of them are absolutely advantage.

I think we need to test for _Bits vs _Len in more environments to do the decision. I tend to believe that by using _Bits we can get general improvements over previous impl, though not significant in some cases like small bitset.to_string<char>; by using _Len we risk introducing some regressions.

Here is some results after turning on that switch:
.......I find my benchmark result highly unstable between separate runs...

Result

--------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------------------------------- BM_bitset_to_string<impl_old, 15, char> 1718 ns 1726 ns 298667 BM_bitset_to_string<impl_old, 64, char> 6191 ns 6147 ns 119467 BM_bitset_to_string_large_single<impl_old, char> 5596 ns 5580 ns 89600 BM_bitset_to_string<impl_old, 7, wchar_t> 842 ns 825 ns 1003520 BM_bitset_to_string<impl_old, 64, wchar_t> 8230 ns 8196 ns 89600 BM_bitset_to_string_large_single<impl_old, wchar_t> 5406 ns 5312 ns 100000 BM_bitset_to_string<impl_new_Bits, 15, char> 993 ns 993 ns 802816 BM_bitset_to_string<impl_new_Bits, 64, char> 7076 ns 7063 ns 139378 BM_bitset_to_string_large_single<impl_new_Bits, char> 4549 ns 4552 ns 281493 BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 132 ns 133 ns 4480000 BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 5765 ns 5755 ns 89600 BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 7747 ns 7782 ns 136533 BM_bitset_to_string<impl_new_Len, 15, char> 446 ns 445 ns 2986667 BM_bitset_to_string<impl_new_Len, 64, char> 5924 ns 5929 ns 89600 BM_bitset_to_string_large_single<impl_new_Len, char> 2957 ns 2979 ns 320000 BM_bitset_to_string<impl_new_Len, 7, wchar_t> 242 ns 239 ns 3200000 BM_bitset_to_string<impl_new_Len, 64, wchar_t> 11633 ns 11509 ns 44800 BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 5270 ns 5336 ns 149333

Another run

--------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------------------------------- BM_bitset_to_string<impl_old, 15, char> 1415 ns 1445 ns 432552 BM_bitset_to_string<impl_old, 64, char> 6740 ns 6627 ns 89600 BM_bitset_to_string_large_single<impl_old, char> 5079 ns 5000 ns 100000 BM_bitset_to_string<impl_old, 7, wchar_t> 632 ns 628 ns 1120000 BM_bitset_to_string<impl_old, 64, wchar_t> 8644 ns 8545 ns 89600 BM_bitset_to_string_large_single<impl_old, wchar_t> 7400 ns 7324 ns 74667 BM_bitset_to_string<impl_new_Bits, 15, char> 1096 ns 1108 ns 719369 BM_bitset_to_string<impl_new_Bits, 64, char> 6058 ns 6104 ns 143360 BM_bitset_to_string_large_single<impl_new_Bits, char> 4695 ns 4708 ns 298667 BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 247 ns 248 ns 5600000 BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 5696 ns 5755 ns 89600 BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 6840 ns 6875 ns 100000 BM_bitset_to_string<impl_new_Len, 15, char> 231 ns 234 ns 2800000 BM_bitset_to_string<impl_new_Len, 64, char> 5700 ns 5622 ns 144516 BM_bitset_to_string_large_single<impl_new_Len, char> 2904 ns 2916 ns 203636 BM_bitset_to_string<impl_new_Len, 7, wchar_t> 133 ns 132 ns 5439529 BM_bitset_to_string<impl_new_Len, 64, wchar_t> 13153 ns 13253 ns 44800 BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 5030 ns 5017 ns 102784

I suggest don't think too much on _Bits vs _Len. The next step would be vectorization, it would be obviously faster than this, and obviously will use _Bits.

StephanTLavavej · 2023-08-10T01:09:18Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2023-08-11T02:29:42Z

Thanks for further optimizing this function! 🐇 🐇 🐇

Towards microsoft#3858 : resize_and_overwrite

f30a937

AlexGuteniev requested a review from a team as a code owner July 25, 2023 12:06

formating

2696c9d

StephanTLavavej added the performance Must go faster label Jul 25, 2023

StephanTLavavej approved these changes Jul 25, 2023

View reviewed changes

no void return

fbe38a0

StephanTLavavej approved these changes Jul 26, 2023

View reviewed changes

StephanTLavavej assigned strega-nil-ms Jul 26, 2023

strega-nil-ms approved these changes Aug 3, 2023

View reviewed changes

StephanTLavavej assigned StephanTLavavej and unassigned strega-nil-ms Aug 3, 2023

StephanTLavavej merged commit 5df7834 into microsoft:main Aug 11, 2023
35 checks passed

AlexGuteniev deleted the bits branch August 11, 2023 04:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use `string::resize_and_overwrite` in `bitset` to avoid buffer initialization #3904

Use `string::resize_and_overwrite` in `bitset` to avoid buffer initialization #3904

AlexGuteniev commented Jul 25, 2023

StephanTLavavej Jul 25, 2023

AlexGuteniev Jul 25, 2023

AlexGuteniev Jul 25, 2023

AlexGuteniev Jul 25, 2023

achabense Jul 26, 2023

achabense Jul 26, 2023 •

edited

Loading

achabense Jul 26, 2023 •

edited

Loading

AlexGuteniev Jul 26, 2023

achabense Jul 26, 2023 •

edited

Loading

AlexGuteniev Jul 28, 2023

StephanTLavavej commented Aug 10, 2023

StephanTLavavej commented Aug 11, 2023

Use string::resize_and_overwrite in bitset to avoid buffer initialization #3904

Use string::resize_and_overwrite in bitset to avoid buffer initialization #3904

Conversation

AlexGuteniev commented Jul 25, 2023

StephanTLavavej Jul 25, 2023

Choose a reason for hiding this comment

AlexGuteniev Jul 25, 2023

Choose a reason for hiding this comment

AlexGuteniev Jul 25, 2023

Choose a reason for hiding this comment

AlexGuteniev Jul 25, 2023

Choose a reason for hiding this comment

achabense Jul 26, 2023

Choose a reason for hiding this comment

achabense Jul 26, 2023 • edited Loading

Choose a reason for hiding this comment

achabense Jul 26, 2023 • edited Loading

Choose a reason for hiding this comment

AlexGuteniev Jul 26, 2023

Choose a reason for hiding this comment

achabense Jul 26, 2023 • edited Loading

Choose a reason for hiding this comment

AlexGuteniev Jul 28, 2023

Choose a reason for hiding this comment

StephanTLavavej commented Aug 10, 2023

StephanTLavavej commented Aug 11, 2023

Use `string::resize_and_overwrite` in `bitset` to avoid buffer initialization #3904

Use `string::resize_and_overwrite` in `bitset` to avoid buffer initialization #3904

achabense Jul 26, 2023 •

edited

Loading

achabense Jul 26, 2023 •

edited

Loading

achabense Jul 26, 2023 •

edited

Loading