-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
bpo-29882: _Py_popcount32() doesn't need 64x64 multiply #30774
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
32x32 bits multiply is enough for _Py_popcount32().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - looks good!
Thanks for the review. I don't understand well how int multiplication works in C. Does uint32_t x uint32_t produces an uint64_t number or an uint32_t number? For _Py_popcount32(), the result fits into uint32_t anyway ;-) |
C's rules are defined on the platform's unsigned char, short, int, long, and long long types. so can't be fully spelled out without knowing which of those map to uint32_t on the platform. But it doesn't matter 😉 For an NxN bit unsigned int multiply, C "in general" only returns the lowermost N bits of the product, throwing the uppermost N bits away. So the important part here is your "the result fits into uint32_t anyway". Or, really more importantly, that the 8 bits holding the final result were the topmost byte of the 32-bit result returned by the 32x32-bit multiply. Casting a multiplicand to uint64_t first (as the code used to do) caused a 64x64->64 bit multiply, and the code had to clear the top 32 bits of the result, by casting to uint32_t, before shifting right. The new code is just as correct, but clearer and possibly a bit quicker (depending on HW quirks). |
I don't think that's true, unfortunately. The old code was strictly portable, but the new code assumes that the width of The problem is that if the width of More likely would be a 64-bit I'd except any compiler to already recognise that in the old code there's no need for a 64-bit-by-64-bit multiply. Please could we revert this change, and possibly add a comment explaining why the code is delicate? |
Here's godbolt, showing that GCC at least does not produce a 64-by-64-bit multiply instruction: https://godbolt.org/z/shbY7j6cW |
@vstinner Please see also my explanation on the original PR: #20518 (comment)
|
The idea of file-by-file porting is feasible considering by-design good ABI and API cooperation between Rust code and C code. However, are there enough developers ready to invest their time into studying Rust? PRs are already left unclosed for weeks so losing experts and core devs is luxury. |
My change replaces Mark:
Do you mean that it's important to keep the uint32_t cast? Do you suggest Well, it seems like there is heavy engineering on this exact line, IMO if it's modified one more time, a comment must explain why it's written exactly like that :-) The result is in the range [0; 32], so uint32_t is enough ;-) |
That would help a bit, but still leaves the potential issue of undefined behaviour. The new code is only valid in the case that |
Here's a demonstration in Python that the implicit mask-out-everything-except-the-least-significant-32-bits operation matters. >>> x = 2**28 + 1 # clearly, population count is 2
>>> x -= ((x >> 1) & 0x55555555) # step through the algorithm as written ...
>>> x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
>>> x = (x + (x >> 4)) & 0x0F0F0F0F
>>> (x * 0x01010101) >> 24 # this is the calculation that would happen on a machine with 64-bit `int`
16843010
>>> ((x * 0x01010101) & 0xFFFFFFFF) >> 24 # fixed to keep only the least significant 32 bits before shifting
2 |
I suggest replacing the last line with:
(which incidentally is the code that was originally introduced for this, in GH-771). That avoids any suggestion of a 64-bit multiply, and also remains portable: in the weird cases where I'll make a PR to fix this, and make sure to include an explanatory comment. |
Done in #30794. |
Mark, I don't agree. A C (or C++) environment isn't required to support The ambiguities you refer to could apply to ints declared as |
Yes of course; no disagreement with that. The problem arises if the C On such a machine, C's integer promotions kick in, and any multiplication of |
Mark, ya, on third thought I agree your:
is the best that can be done - although, also ya, there's no guarantee it won't do a 512x512 bit multiply 😉. |
In exactly the same way, the following code invokes undefined behaviour on a typical modern machine: #include <stdint.h>
#include <stdio.h>
uint16_t mul_short(uint16_t x, uint16_t y) {
return x * y;
}
int main(void) {
uint16_t result = mul_short(60000, 60000);
printf("result = %d\n", result);
return 0;
} Here are the results on my machine of compiling the above with Clang's
|
Arithmetic in C is so hard :-( |
Only if you care about getting the right answer efficiently and transparently - not |
32x32 bits multiply is enough for _Py_popcount32().
https://bugs.python.org/issue29882