-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use string::resize_and_overwrite
in bitset
to avoid buffer initialization
#3904
Conversation
if (_Subscript(_Bits - 1 - _Pos)) { | ||
_Str[_Pos] = _Elem1; | ||
basic_string<_Elem, _Tr, _Alloc> _Str; | ||
_Str._Resize_and_overwrite(_Bits, [this, _Elem0, _Elem1](_Elem* _Buf, size_t _Len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No change requested: I observed that we could ignore _Len
and take advantage of our knowledge that it will always be the compile-time constant _Bits
, which is available without capturing. I checked the codegen, and this indeed allows the compiler to bake in the constant. However, while benchmarking showed some improvement for BM_bitset_to_string<64, char>
, it showed significant pessimization for BM_bitset_to_string<15, char>
.
BEFORE USING _Bits
------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char> 140 ns 144 ns 4977778
BM_bitset_to_string<64, char> 3336 ns 3348 ns 224000
BM_bitset_to_string_large_single<char> 2008 ns 2040 ns 344615
BM_bitset_to_string<7, wchar_t> 88.3 ns 87.9 ns 7466667
BM_bitset_to_string<64, wchar_t> 2211 ns 2197 ns 298667
BM_bitset_to_string_large_single<wchar_t> 1451 ns 1430 ns 448000
AFTER USING _Bits
------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char> 640 ns 642 ns 1120000
BM_bitset_to_string<64, char> 2646 ns 2609 ns 263529
BM_bitset_to_string_large_single<char> 2010 ns 2040 ns 344615
BM_bitset_to_string<7, wchar_t> 82.3 ns 82.0 ns 8960000
BM_bitset_to_string<64, wchar_t> 2407 ns 2441 ns 320000
BM_bitset_to_string_large_single<wchar_t> 1372 ns 1360 ns 448000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think despite some pessimization of the certain case, it is still the right thing to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, actually the pessimization is too much. I think, angry loop unrolling strikes here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep it as a parameter for now then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As _Bits
is the natural decision but we have to use _Len
for performance reasons, I think we need to add a comment for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, how to compare benchmark results before and after change in the same run in a convenient way? I'm doing this in a very awkward way.
Benchmark
#include <array>
#include <benchmark/benchmark.h>
#include <bit>
#include <bitset>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <random>
#include <iostream>
using namespace std;
struct impl_old {
template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>>
static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs,
const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) {
basic_string<_Elem, _Tr, _Alloc> _Str(_Bits, _Elem0);
for (size_t _Pos = 0; _Pos < _Bits; ++_Pos) {
if (bs[_Bits - 1 - _Pos]) {
_Str[_Pos] = _Elem1;
}
}
return _Str;
}
};
struct impl_new_Bits {
template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>>
static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs,
const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) {
basic_string<_Elem, _Tr, _Alloc> _Str;
_Str.resize_and_overwrite(_Bits, [&bs, _Elem0, _Elem1](_Elem* _Buf, size_t) {
for (size_t _Pos = 0; _Pos < _Bits; ++_Pos) {
_Buf[_Pos] = bs[_Bits - 1 - _Pos] ? _Elem1 : _Elem0;
}
return _Bits;
});
return _Str;
}
};
struct impl_new_Len {
template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>>
static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs,
const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) {
basic_string<_Elem, _Tr, _Alloc> _Str;
_Str.resize_and_overwrite(_Bits, [&bs, _Elem0, _Elem1](_Elem* _Buf, size_t _Len) {
for (size_t _Pos = 0; _Pos < _Len; ++_Pos) {
_Buf[_Pos] = bs[_Len - 1 - _Pos] ? _Elem1 : _Elem0;
}
return _Len;
});
return _Str;
}
};
namespace {
const auto random_bits = [] {
mt19937_64 rnd{};
array<uint64_t, 32> arr;
for (auto& d : arr) {
d = rnd();
}
return arr;
}();
template <class op, size_t N, class charT>
void BM_bitset_to_string(benchmark::State& state) {
for (auto _ : state) {
for (const auto& bits : random_bits) {
bitset<N> bs{ bits };
benchmark::DoNotOptimize(op::template to_string<charT>(bs));
}
}
}
template <class op, class charT>
void BM_bitset_to_string_large_single(benchmark::State& state) {
const auto large_bitset = bit_cast<bitset<CHAR_BIT * sizeof(random_bits)>>(random_bits);
for (auto _ : state) {
benchmark::DoNotOptimize(op::template to_string<charT>(large_bitset));
}
}
} // namespace
#define MAKE_BENCHMARKS(impl)\
BENCHMARK(BM_bitset_to_string<impl, 15, char>)->Name("\nBM_bitset_to_string<" #impl ", 15, char>");\
BENCHMARK(BM_bitset_to_string<impl, 64, char>);\
BENCHMARK(BM_bitset_to_string_large_single<impl, char>);\
BENCHMARK(BM_bitset_to_string<impl, 7, wchar_t>);\
BENCHMARK(BM_bitset_to_string<impl, 64, wchar_t>);\
BENCHMARK(BM_bitset_to_string_large_single<impl, wchar_t>);
MAKE_BENCHMARKS(impl_old);
MAKE_BENCHMARKS(impl_new_Bits);
MAKE_BENCHMARKS(impl_new_Len);
BENCHMARK_MAIN();
(And I'm feeling nervious that the effect is very complex and can even vary between different runs)
Results
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------
BM_bitset_to_string<impl_old, 15, char> 1562 ns 1535 ns 448000
BM_bitset_to_string<impl_old, 64, char> 11188 ns 11230 ns 64000
BM_bitset_to_string_large_single<impl_old, char> 9014 ns 9050 ns 127758
BM_bitset_to_string<impl_old, 7, wchar_t> 915 ns 907 ns 896000
BM_bitset_to_string<impl_old, 64, wchar_t> 6158 ns 6166 ns 83627
BM_bitset_to_string_large_single<impl_old, wchar_t> 5732 ns 5755 ns 149333
BM_bitset_to_string<impl_new_Bits, 15, char> 898 ns 914 ns 768981
BM_bitset_to_string<impl_new_Bits, 64, char> 8399 ns 8438 ns 100000
BM_bitset_to_string_large_single<impl_new_Bits, char> 3976 ns 3962 ns 228761
BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 198 ns 197 ns 3733333
BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 11847 ns 11858 ns 44800
BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 7289 ns 7357 ns 148670
BM_bitset_to_string<impl_new_Len, 15, char> 550 ns 562 ns 1000000
BM_bitset_to_string<impl_new_Len, 64, char> 6000 ns 6099 ns 128085
BM_bitset_to_string_large_single<impl_new_Len, char> 3976 ns 4004 ns 320000
BM_bitset_to_string<impl_new_Len, 7, wchar_t> 135 ns 137 ns 5120000
BM_bitset_to_string<impl_new_Len, 64, wchar_t> 12374 ns 12556 ns 44800
BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 15409 ns 15346 ns 56000
Another run
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------
BM_bitset_to_string<impl_old, 15, char> 1731 ns 1726 ns 298667
BM_bitset_to_string<impl_old, 64, char> 7147 ns 7150 ns 89600
BM_bitset_to_string_large_single<impl_old, char> 4201 ns 4148 ns 139378
BM_bitset_to_string<impl_old, 7, wchar_t> 734 ns 732 ns 746667
BM_bitset_to_string<impl_old, 64, wchar_t> 10492 ns 10463 ns 74667
BM_bitset_to_string_large_single<impl_old, wchar_t> 5644 ns 5620 ns 122334
BM_bitset_to_string<impl_new_Bits, 15, char> 1435 ns 1416 ns 640000
BM_bitset_to_string<impl_new_Bits, 64, char> 5507 ns 5418 ns 115348
BM_bitset_to_string_large_single<impl_new_Bits, char> 3620 ns 3594 ns 247782
BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 188 ns 190 ns 2635294
BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 8363 ns 8231 ns 112000
BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 5042 ns 4933 ns 104533
BM_bitset_to_string<impl_new_Len, 15, char> 232 ns 235 ns 2986667
BM_bitset_to_string<impl_new_Len, 64, char> 6508 ns 6562 ns 100000
BM_bitset_to_string_large_single<impl_new_Len, char> 3964 ns 3941 ns 281493
BM_bitset_to_string<impl_new_Len, 7, wchar_t> 174 ns 174 ns 4480000
BM_bitset_to_string<impl_new_Len, 64, wchar_t> 15721 ns 15625 ns 56000
BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 7529 ns 7533 ns 112000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there problems with my comparing benchmark? I'm finding resize_and_overwrite
&_Len
approach stably slow for large wchar_t
cases in my testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be CPU-dependent.
For _Bits
the compiler is more likely to advantage of knowing _Bits
and unroll loops or vectorize something.
For _Len
the compiler may be restricted from doing that.
Whether the optimizations like loop unrolling are useful or harmful -- dependent on CPU type.
Also non-unrolled loop (_Len
) is subject to JCC errata pessimization on some CPUs, /QIntel-jcc-erratum
might help in this case https://learn.microsoft.com/ru-ru/cpp/build/reference/qintel-jcc-erratum?view=msvc-170
I'm neutral with regards to _Bits
vs _Len
if none of them are absolutely advantage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to test for _Bits
vs _Len
in more environments to do the decision. I tend to believe that by using _Bits
we can get general improvements over previous impl, though not significant in some cases like small bitset.to_string<char>
; by using _Len
we risk introducing some regressions.
Here is some results after turning on that switch:
.......I find my benchmark result highly unstable between separate runs...
Result
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------
BM_bitset_to_string<impl_old, 15, char> 1718 ns 1726 ns 298667
BM_bitset_to_string<impl_old, 64, char> 6191 ns 6147 ns 119467
BM_bitset_to_string_large_single<impl_old, char> 5596 ns 5580 ns 89600
BM_bitset_to_string<impl_old, 7, wchar_t> 842 ns 825 ns 1003520
BM_bitset_to_string<impl_old, 64, wchar_t> 8230 ns 8196 ns 89600
BM_bitset_to_string_large_single<impl_old, wchar_t> 5406 ns 5312 ns 100000
BM_bitset_to_string<impl_new_Bits, 15, char> 993 ns 993 ns 802816
BM_bitset_to_string<impl_new_Bits, 64, char> 7076 ns 7063 ns 139378
BM_bitset_to_string_large_single<impl_new_Bits, char> 4549 ns 4552 ns 281493
BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 132 ns 133 ns 4480000
BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 5765 ns 5755 ns 89600
BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 7747 ns 7782 ns 136533
BM_bitset_to_string<impl_new_Len, 15, char> 446 ns 445 ns 2986667
BM_bitset_to_string<impl_new_Len, 64, char> 5924 ns 5929 ns 89600
BM_bitset_to_string_large_single<impl_new_Len, char> 2957 ns 2979 ns 320000
BM_bitset_to_string<impl_new_Len, 7, wchar_t> 242 ns 239 ns 3200000
BM_bitset_to_string<impl_new_Len, 64, wchar_t> 11633 ns 11509 ns 44800
BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 5270 ns 5336 ns 149333
Another run
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------
BM_bitset_to_string<impl_old, 15, char> 1415 ns 1445 ns 432552
BM_bitset_to_string<impl_old, 64, char> 6740 ns 6627 ns 89600
BM_bitset_to_string_large_single<impl_old, char> 5079 ns 5000 ns 100000
BM_bitset_to_string<impl_old, 7, wchar_t> 632 ns 628 ns 1120000
BM_bitset_to_string<impl_old, 64, wchar_t> 8644 ns 8545 ns 89600
BM_bitset_to_string_large_single<impl_old, wchar_t> 7400 ns 7324 ns 74667
BM_bitset_to_string<impl_new_Bits, 15, char> 1096 ns 1108 ns 719369
BM_bitset_to_string<impl_new_Bits, 64, char> 6058 ns 6104 ns 143360
BM_bitset_to_string_large_single<impl_new_Bits, char> 4695 ns 4708 ns 298667
BM_bitset_to_string<impl_new_Bits, 7, wchar_t> 247 ns 248 ns 5600000
BM_bitset_to_string<impl_new_Bits, 64, wchar_t> 5696 ns 5755 ns 89600
BM_bitset_to_string_large_single<impl_new_Bits, wchar_t> 6840 ns 6875 ns 100000
BM_bitset_to_string<impl_new_Len, 15, char> 231 ns 234 ns 2800000
BM_bitset_to_string<impl_new_Len, 64, char> 5700 ns 5622 ns 144516
BM_bitset_to_string_large_single<impl_new_Len, char> 2904 ns 2916 ns 203636
BM_bitset_to_string<impl_new_Len, 7, wchar_t> 133 ns 132 ns 5439529
BM_bitset_to_string<impl_new_Len, 64, wchar_t> 13153 ns 13253 ns 44800
BM_bitset_to_string_large_single<impl_new_Len, wchar_t> 5030 ns 5017 ns 102784
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest don't think too much on _Bits
vs _Len
. The next step would be vectorization, it would be obviously faster than this, and obviously will use _Bits
.
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
Thanks for further optimizing this function! 🐇 🐇 🐇 |
Towards #3858.
This is what was addressed
This is confirmed to be non applicable. Before the change it was
cmovne
to set'1'
. After it becamesetne al
to set to 0 or 1, thenadd al, '0'
. Both are branchless. I think, the compiler will be able to make efficient branchless code for non-consecutive values by going back tocmovcc
Benchmark is already there - thanks @achabense!
Benchmark results before:
after
The anomality of
BM_bitset_to_string<64, char>
is caused by too angry loop unrolling, and, IMO, should be fixed in compiler, if needs fixing.