-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Expose Bit Manipulation functions #27382
Comments
cc @mburbea, @tannergooding , @CarolEidt |
Please drop the |
Additional Comments/Thoughts:
|
I went back and forth on that one. The bool makes the api tighter. But it means that ultimately the implementation has branching instructions, which means perf suffers. So I am happy to refactor.
I designed this based on a previous suggestion of yours, but I am happy to revert. Perhaps you meant
I will make that change |
@grant-d Will JIT be able to optimize it properly if the value is an enregistered local? int M(uint value)
{
return InsertBit(in value, 3) ? 5 :6;
} |
It needs to be a private static int M(uint value) => InsertBit(ref value, 3, true) ? 5 : 6; I don't have enough knowledge of what the optimizer would do, someone else may be able to advise us. |
Um, |
These need to be passed by value. Passing things around by |
Sounds good, I have made the change. @tannergooding, let me know if you think the |
It will now - I have changed all |
API review probably won't like the contraction |
I like |
Done - both variants exist. |
Note to self:
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86; |
This is pointless unless you also the attribute for AggressiveInlining as the function call overhead will eat your gains otherwise. Therefore, AggressiveInlining should be contractual on these. |
The POC already does so. I will add a note to the spec |
Many of these functions will be treated as intrinsic and will be specially handled by the JIT to emit a single instruction, regardless of what the actual method body contains for the software fallback. The JIT is also very good at inlining small functions, regardless of whether or not the |
|
@GSPP can this not be handled from the callsite. |
@grant-d: InsertBit can be implemented branchlessly on any reasonably CPU. |
InsertBit is already implemented without branching, see POC code. |
The question was whether there could be a mutator that accepts a bool, then conditionally inserts or extracts. My preference is to externalize branching (at the callsite). |
And I'm saying the dynamic set-clear version can be implemented w/o branching.
|
Got you. Thanks for the gist. I still am curious whether the 'dynamic' use case is compelling enough to have a somewhat overlapping API? |
@jhudsoncedaron, just because you can implement that in software, doesn't necessarily mean it is (necessarily) a good API to expose here or that it will be efficient. Many of these functions will be treated as intrinsic, which means that they will will compile down to a single CPU instruction (rather than 5+ instructions). As an example, we can use the Also, I don't think you can convert a |
I only ever use the dynamic version of InsertBit. That might not be valid C# but the algorithm is correct anyway. bools are stored as either a 1 or a 0 so it's just a matter of convincing the compiler to emit the right code. The harder this is to do, the stronger the argument for putting it in the library. |
I tend to agree with @jhudsoncedaron that the dynamic version is useful. For this particular case if we really wanted to minimize API surface area I might vote for just the dynamic case, since I would expect the value to be either clearly a constant or not. And it would be easier for the jit to make that optimization than to recognize a sequence with branches. |
[Edited] I assumed the gist was pseudocode; in C# we can do one of the following. return condition ? 37 : 0; // 1.0x (idiomatic)
return (*(byte*)&condition) * 37; // 0.95x (cannot be inlined)
return Unsafe.As<bool, byte>(condition) * 37; // 0.60x
return new UnionStruct { Bool = condition }.Value * 37; // 0.76x (ExplicitLayout, etc) This relies on internal knowledge of That being said. I started the spec out with the |
@grant-d I would scope this issue down to just what we have in CoreLib and that is clearly useful. It will make the API review on this easy and it will allow us to make the most useful APIs public in .NET Core 3.0. I would split the bit manipulation APIs into a separate issue that will be looked at and discussed separately, on its own schedule. |
That's a great idea Jan. I had already split this into 2 asks in the summary, but I will create two separate issues instead |
Lets get the subset from dotnet/corefx#35419 through first. |
Tagging subscribers to this area: @tannergooding |
I'd like to move forward and mark this as It's worth noting that of the variants being exposed, The native intrinsics return
|
Meant to link to the native intrinsics above: https://docs.microsoft.com/en-us/cpp/intrinsics/bittest-bittest64?view=vs-2019 |
The original proposal had methods like PopCount that were already approved and that we had no intrinsic patterns for. I do not see value in having method for bit tests, etc. I believe that the existing pattern for bit tests like The fact that MSVC C++ introduced an intrinsic for bit tests sounds like over-engineering to me. They could have picked that up as pattern match just fine. |
And I believe they do, but its the same fundamental problem as with On the other hand, a method provides an explicit way to opt into the functionality, know that it will work (there probably isn't some edge case they are forgetting around bit manipulation), and that it will most likely be efficient for the given platform. It also solves the problem of matching a more complex pattern where you want to simultaneously branch based off the existing value of a bit while simultaneously setting it to a new value. |
RotateLeft and RotateRight are fine with me. The pattern to write that using language operators is not concise, and there are multiple ways to write that.
The most common ones like RotateLeft / RotateRight, PopCount, etc. are addressed already. The remaining ones are the bit tests: InsertBit, ClearBit, etc. They show a lot less occurences. I have done a few searches for patterns like |
Thanks for the ping Tanner, updated the spec's name to |
FYI, here's my previous research on duplicate functions for each operation. eg, in [Pure]
private static uint InsertBit(int position, uint bits)
{
Debug.Assert(0 == (bits & (1u << position)));
return bits | (1u << position);
} |
This is a good example: It asserts and sets the bit. We would not be able to replace it with the proposed method, without losing the debug validation. |
A question: could be possible to include optimized array/span based versions? Use case is bitmask of arbitrary length backed by array of uints (likely, like If theres no extra gains (like even more vectorization) then forget my question, it will be better to simply implement |
IMO the best API for these cases is a value tuple. However, admittedly I haven't yet had the time to get my PR #37928 working, which will enable the best codegen for these cases. If the ref API is preferred, the JIT can still take one of two avenues to achieve the same result:
I would think that would be doable fully in C#, if it was considered desirable. |
Lack of Effectively, for a void ClearBit(unsigned int index)
{
if (index % 2 == 0)
{
printf("You're setting even bits, which is sub-optimal.\n");
return;
}
index = index / 2;
rawbits[index / 8] &= ~(1 << (index % 8));
} vs C#: private static void ClearBit(ref byte rawbits, uint index)
{
if ((index % 2) == 0)
{
Console.WriteLine("You are setting even bits, which is sub-optimal");
}
else
{
index /= 2;
Unsafe.Add(ref rawbits, (nint)(index / 8)) &= (byte)~(1u << (int)(index % 8));
}
} You get the following codegen for cpp: this$ = 8
index$ = 16
void prime_sieve::ClearBit(unsigned int) PROC ; prime_sieve::ClearBit, COMDAT
mov r8d, edx
test dl, 1
jne SHORT $LN2@ClearBit
lea rcx, OFFSET FLAT:`string'
jmp printf
$LN2@ClearBit:
shr r8d, 1
mov edx, r8d
and r8d, 7
shr rdx, 3
add rdx, QWORD PTR [rcx+8]
movzx eax, BYTE PTR [rdx]
btr eax, r8d
mov BYTE PTR [rdx], al
ret 0 as compared to C#: L0000: test dl, 1
L0003: jne short L0017
L0005: mov rcx, 0x2489d6296b8
L000f: mov rcx, [rcx]
L0012: jmp System.Console.WriteLine(System.String)
L0017: shr edx, 1
L0019: mov eax, edx
L001b: shr eax, 3
L001e: add rax, rcx
L0021: mov ecx, edx
L0023: and ecx, 7
L0026: mov edx, 1
L002b: shl edx, cl
L002d: not edx
L002f: movzx edx, dl
L0032: and [rax], dl
L0034: ret This results in a much bigger inner loop and in worse overall codegen. |
This looks more like a missing optimization in the JIT, similar to #6815 |
Merged in #83400: runtime/src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs Lines 947 to 959 in c96c15c
|
Bit manipulation routines are common enough that we should expose a subset as platform primitives.
While some of them may be simple to write, it's much harder to achieve the requisite performance desired, especially since they tend to be used within tight loops.
The aim of this proposal is to scope & design a minimal set of functions, to be implemented with a bias towards performance. Per @tannergooding
Note that even though some of the formula may be simple, relevant callsites are more self-documenting when using the intrinsics (should the dev choose to use them).
Scope
CoreCLR
(Consolidate implementation of Rotate and PopCount coreclr#22584)CoreFX
(Units for BitOps.TrailingZeroCount corefx#35193)System.Numerics.BitOperations
as apublic
ref assembly inCoreFX
(https://github.com/dotnet/corefx/issues/35419)CoreFX
([NO MERGE] BitOps analysis CoreFX (WIP) corefx#34917)Rationale and Usage
The proposed functions are already implemented throughout the stack, often with different algorithms, performance characteristics and test coverage.
Existing callsites below: https://github.com/dotnet/corefx/issues/32269#issuecomment-457689128
(There is likely to be more; the initial search was timeboxed to ~1 hour)
Some of the implementation have suboptimal performance or bugs. Something like
ExtractBit
is trivial to implement, butPopCount
is more complex and thus prone to logic and performance issues. Hiding these complex formulae behind friendly signatures makes using them more approachable.Here's an example of a function (
BTC
) whose signature is simple but the algebra is easy to get wrong.However making a call to it meets our goal of abstraction and performance:
Proposed API
The proposed API is purposefully kept lean. We can add more methods in later design iterations. We should view this as an opportunity to get simple, base functionality out the door and not stray into the dangerous territory of adding every bit twiddling hack that exists.
Assume all methods are decorated with
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Details
InsertBit
/ClearBit
andWriteBit
where the latter conditionally executes the equivalent of either former (using twiddling to do so without branching). But there's enough twiddling inWrite
to avoid any branching that it's maybe worth keeping both variants.exception
s. The current design tries to dispense with the requirement by sharpening input types, contractual assumptions (eg out-of-boundsoffset
will usemod n
in some functions or result in ano-op
in others) and specific design choices.Questions
int
,uint
andbyte
are commonly used. Anything else?Decisions
namespace
this should be in. Decision:namespace System.Numerics
.BitOps
is self-documenting, terse (it might be specified frequently in a callsite, if not aliased withusing
) and both terms are well-known names or abbreviations. Decision:BitOperations
PopCount
could be calledCountSetBits
but that's not the common lingo used by twiddlers. Furthermore, intrinsics already expose the well-known names, egPopcnt.PopCount
.TrailingZeroCount
may be more concisely described asTrailingZeros
. Decision:TrailingZeroCount
already chosen by a previous PR.PopCount
return (idiomatic)int
, notuint
offset
orposition
parameters beint
oruint
. Latter is preferred since negatives not permitted regardless. Decision: int is an idiomatic input/output type in C#.Log(0)
is mathematically undefined. Should it return0
or-1
? Decision: Returns 0LeadingZeros
might need a different algorithm on BE/LE platforms. Decision: Endianess only matters if we are reinterpreting integers.Sample call sites
The following samples are taken from the linked units, from the method
BitOps_Samples
.The code chooses values that are easy to eyeball for correctness. The real units cover many more boundaries & conditions.
Updates
InsertBit
accepted a bool that determined whether it set or cleared the bit in question. Such methods have now been refactored in two,InsertBit
andClearBit
.FlipBit
becameComplementBit
).int
instead ofbyte
. All POC units pass.WriteBit(value, offset, bool on)
overloads that conditionally set/clear the specified bit.ExtractByte
and friendsEvaluate(bool)
(used internally, so may as well expose it)Span<T>
overloads per @tannergooding advice. Will maybe submit in separate proposalTrailingOnes
andLeadingOnes
into to do later proposal in commentsLog2
RotateByte
, etc into to do later proposal in commentsThe text was updated successfully, but these errors were encountered: