You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Use bitwise ops instead of modulus & division, e.g.: index % 8 -> index & 7 index / 8 -> index >> 3
This is an oversimplification of course, as TMBitset uses the ubits datum. Making a couple new constexprs is simple enough.
Of course, all this probably doesn't matter in practice; the compiler almost certainly already optimizes all this.
count() or size()
A count() or size() function akin to unordered_set::size() isn't implemented, as I just haven't needed one yet.
I've thought about iterating byte-by-byte and using a lookup table, similar to this, but there are better methods available.
Low priority though; no need to worry about this until a need arises in production.
Until then, if I just want a bit count for debug code (which I'm not worried about optimizing), I can just use std::distance and call it a day. No need for a new member function.
x86 has POPCNT; I believe there's an intrinsic for that.
iteration
There'ssomeprettyinterestingstuffhere. Incrementing TMBitset::iterator involves more iteration under the hood.
In the very first draft, it iterated through all the bits in each unit of the data, and when it hit a 1, iteration "paused" and our iterator became visible to the outside world.
Later I improved on this by iterating only as far as the leftmost 1. Bit shifting right as we go along, once this 1 disappears, the value of our bits hits 0, and we can move along to the next unit.
There's still room for improvement though. Consider the worst (32-bit) case, a unit of 0xC0000000. Only one 1, all the way left. We don't get to skip ahead early, and must iterate through 31 0s to get there. That's pretty inefficient. It'd be great to know where the next 1 is, and skip straight to it.
This is where "Count the consecutive zero bits (trailing) on the right" comes in.
During development I tried a couple methods involving lookup tables, but didn't hit on the idea of reducing table size via modulus or multiplication. No better performance. That, and a 32 KB table for uint16_t units is starting to get a bit big. 4GB for uint32_t? Forget about it! In contrast, the lookup table used for uint32_ts by the multiplication technique can fit in half a cache line.
What I'm doing now is comparable to the linear search method.
There are a few possibilities to play around with, though as this video with a variation on the parallel technique notes, there's an x86 instruction for this. Clang and gcc have the __builtin_ctz intrinsic. It appears that XCode uses Clang, so this intrinsic should still work, just doing whatever it has to do for @jteresco's M1 chip. If we're worried about supporting other compilers, conditional compilation is a possibility.
All that said, the improvement this stuff provides will be tiny.
Especially with subgraph generation, there are much bigger things effecting efficiency.
The text was updated successfully, but these errors were encountered:
https://graphics.stanford.edu/~seander/bithacks.html
insert, lookup, etc.
Use bitwise ops instead of modulus & division, e.g.:
index % 8
->index & 7
index / 8
->index >> 3
This is an oversimplification of course, as TMBitset uses the ubits datum. Making a couple new constexprs is simple enough.
Of course, all this probably doesn't matter in practice; the compiler almost certainly already optimizes all this.
count() or size()
A
count()
orsize()
function akin tounordered_set::size()
isn't implemented, as I just haven't needed one yet.I've thought about iterating byte-by-byte and using a lookup table, similar to this, but there are better methods available.
Low priority though; no need to worry about this until a need arises in production.
Until then, if I just want a bit count for debug code (which I'm not worried about optimizing), I can just use
std::distance
and call it a day. No need for a new member function.x86 has
POPCNT
; I believe there's an intrinsic for that.iteration
There's some pretty interesting stuff here.
Incrementing
TMBitset::iterator
involves more iteration under the hood.In the very first draft, it iterated through all the bits in each unit of the data, and when it hit a 1, iteration "paused" and our iterator became visible to the outside world.
Later I improved on this by iterating only as far as the leftmost 1. Bit shifting right as we go along, once this 1 disappears, the value of our bits hits 0, and we can move along to the next
unit
.There's still room for improvement though. Consider the worst (32-bit) case, a
unit
of0xC0000000
. Only one 1, all the way left. We don't get to skip ahead early, and must iterate through 31 0s to get there. That's pretty inefficient. It'd be great to know where the next 1 is, and skip straight to it.This is where "Count the consecutive zero bits (trailing) on the right" comes in.
During development I tried a couple methods involving lookup tables, but didn't hit on the idea of reducing table size via modulus or multiplication. No better performance. That, and a 32 KB table for
uint16_t
units is starting to get a bit big. 4GB foruint32_t
? Forget about it! In contrast, the lookup table used foruint32_t
s by the multiplication technique can fit in half a cache line.What I'm doing now is comparable to the linear search method.
There are a few possibilities to play around with, though as this video with a variation on the parallel technique notes, there's an x86 instruction for this. Clang and gcc have the
__builtin_ctz
intrinsic. It appears that XCode uses Clang, so this intrinsic should still work, just doing whatever it has to do for @jteresco's M1 chip. If we're worried about supporting other compilers, conditional compilation is a possibility.All that said, the improvement this stuff provides will be tiny.
Especially with subgraph generation, there are much bigger things effecting efficiency.
The text was updated successfully, but these errors were encountered: