bpo-29882: _Py_popcount32() doesn't need 64x64 multiply #30774

Merged
merged 1 commit into from
Jan 21, 2022

Conversation

vstinner
Member

@vstinner vstinner commented Jan 21, 2022

32x32 bits multiply is enough for _Py_popcount32().

https://bugs.python.org/issue29882

@vstinner
Member Author

cc @mdickinson @tim-one

Member

@tim-one tim-one left a comment

I agree - looks good!

@vstinner vstinner changed the title bpo-29882: _Py_popcount32() doesn't need 64x64 multiply bpo-29882: _Py_popcount32() doesn't need 64x64 multiply Jan 21, 2022
@vstinner vstinner merged commit cd8de40 into python:main Jan 21, 2022
@vstinner vstinner deleted the popcount_mul branch January 21, 2022 23:54
@vstinner
Member Author

I agree - looks good!

Thanks for the review.

I don't fully understand how integer multiplication works in C. Does uint32_t x uint32_t produce a uint64_t number or a uint32_t number? For _Py_popcount32(), the result fits into uint32_t anyway ;-)

@tim-one
Member

tim-one commented Jan 22, 2022

C's rules are defined on the platform's unsigned char, short, int, long, and long long types, so they can't be fully spelled out without knowing which of those maps to uint32_t on the platform. But it doesn't matter 😉 For an NxN bit unsigned int multiply, C "in general" only returns the lowermost N bits of the product, throwing the uppermost N bits away. So the important part here is your "the result fits into uint32_t anyway". Or, really more importantly, that the 8 bits holding the final result are the topmost byte of the 32-bit result returned by the 32x32-bit multiply. Casting a multiplicand to uint64_t first (as the code used to do) caused a 64x64->64 bit multiply, and the code had to clear the top 32 bits of the result, by casting to uint32_t, before shifting right.

The new code is just as correct, but clearer and possibly a bit quicker (depending on HW quirks).
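
To illustrate the point about the topmost byte, here is a minimal C sketch. It is not CPython's code: the names byte_sum_top and check_byte_sum_top are hypothetical, and the input is assumed to already hold four per-byte bit counts (each at most 8), as it does after the earlier steps of the algorithm. The product is computed in uint64_t only to make the discarded upper half explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not CPython's code. x holds four per-byte counts,
   each at most 8. Multiplying by 0x01010101 adds x, x << 8, x << 16 and
   x << 24, so byte 3 of the product is the sum of all four counts. */
static inline uint32_t byte_sum_top(uint32_t x)
{
    uint64_t full = (uint64_t)x * UINT64_C(0x01010101); /* 64-bit product */
    uint32_t low = (uint32_t)full;                       /* drop bits 32..63 */
    return low >> 24;                                    /* top byte of low half */
}

/* Self-check for the sketch; call it from main() to run it. */
static inline void check_byte_sum_top(void)
{
    assert(byte_sum_top(UINT32_C(0x01000001)) == 2);   /* counts 1,0,0,1 */
    assert(byte_sum_top(UINT32_C(0x08080808)) == 32);  /* counts 8,8,8,8 */
}
```

Because each lower byte of the product holds a partial sum of at most 24, no carry reaches the top byte, which is why the answer survives when the upper 32 bits are thrown away.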

@mdickinson
Member

mdickinson commented Jan 22, 2022

The new code is just as correct

I don't think that's true, unfortunately. The old code was strictly portable, but the new code assumes that the width of int is no more than 32 bits, and will give undefined behaviour or incorrect results if that's not true.

The problem is that if the width of int exceeds that of uint32_t, then both arguments to the multiplication are promoted to int (C99 §6.3.1.1p2). The multiplication then ends up as a multiplication of ints, which, if it overflows (e.g., with a 34-bit int, unlikely though that is) triggers undefined behaviour.

More likely would be a 64-bit int, in which case we're not going to get undefined behaviour due to overflow, but we are going to get incorrect results - in the previous code, we were relying on the uint32_t cast to chop off all but the least significant 32 bits of the result.

I'd expect any compiler to already recognise that in the old code there's no need for a 64-bit-by-64-bit multiply.

Please could we revert this change, and possibly add a comment explaining why the code is delicate?
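
The same effect can be reproduced on ordinary hardware by going down a size. In the sketch below, uint8_t stands in for uint32_t and the platform's int stands in for the hypothetical wider int. The function names are made up for illustration, and the code assumes int is wider than 16 bits so that the promoted product cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Both operands are promoted to int before the multiply, so the result is
   not reduced modulo 256 unless we ask for it. Assumes int is wider than
   16 bits, so that 255 * 255 cannot overflow. */
static inline int product_as_promoted(uint8_t a, uint8_t b)
{
    return a * b;
}

/* The explicit cast keeps only the low 8 bits: the 8-bit analogue of the
   (uint32_t) cast in the original popcount code. */
static inline uint8_t product_truncated(uint8_t a, uint8_t b)
{
    return (uint8_t)(a * b);
}

/* Self-check for the sketch; call it from main() to run it. */
static inline void check_promotion(void)
{
    assert(product_as_promoted(200, 2) == 400);  /* no wraparound at 8 bits */
    assert(product_truncated(200, 2) == 144);    /* 400 mod 256 */
}
```

Without the cast, the bits above the operand width stay in the result, which is the incorrect-results case described for a 64-bit int.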

@mdickinson
Member

Here's godbolt, showing that GCC at least does not produce a 64-by-64-bit multiply instruction: https://godbolt.org/z/shbY7j6cW

@mdickinson
Member

mdickinson commented Jan 22, 2022

@vstinner Please see also my explanation on the original PR: #20518 (comment)

My original version is actually safer here. :-) This version can still invoke undefined behaviour (albeit only on unusual machines). Realistically, it's unlikely that we'd ever hit the undefined behaviour in practice, but given that there's an easy way to write this that avoids that undefined behaviour, I'd prefer to use that way.

If your int type has width larger than 32 bits, then C's integer promotions (C99 §6.3.1.1) ensure that both u and UINT32_C(0x01010101) are treated as (signed) int before the multiply, and then there's potential for the multiply to overflow, which gives undefined behaviour.

In contrast, if in the same situation you do u * 0x01010101U, then after those same integer promotions you're multiplying an int by an unsigned int, and C's "usual arithmetic conversions" (C99 §6.3.1.8) kick in to ensure that the multiplication is actually performed as unsigned int by unsigned int, which is safe from undefined behaviour and simply discards high bits in the expected manner if necessary.

Performance isn't likely to be a problem with the (uint32_t)(u * 0x01010101U) version: the cast ensures that a compiler doesn't have to be all that clever to realise that it's enough to do a 32x32->32 unsigned multiply here.

And yes, it's crazy that it's this hard to write a portable uint32_t * uint32_t -> uint32_t multiply in standard C. Maybe it's time to rewrite Python's core in Rust.
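
The difference between the two forms can be observed at compile time with C11's _Generic, again using uint16_t as a stand-in for uint32_t on a machine with a wider int. This is only a sketch: TYPE_TAG and the function names are hypothetical, and the results noted in the comments assume int is wider than 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* 1 if expr has type int, 2 if unsigned int, 0 otherwise. _Generic (C11)
   inspects the type only; the expression is not evaluated. */
#define TYPE_TAG(expr) _Generic((expr), int: 1, unsigned int: 2, default: 0)

static inline int tag_small_times_small(void)
{
    uint16_t a = 3, b = 5;
    return TYPE_TAG(a * b);   /* both promoted: int * int */
}

static inline int tag_small_times_unsigned_literal(void)
{
    uint16_t a = 3;
    return TYPE_TAG(a * 5U);  /* int * unsigned int, done as unsigned int */
}

/* Self-check for the sketch; assumes int is wider than 16 bits. */
static inline void check_type_tags(void)
{
    assert(tag_small_times_small() == 1);
    assert(tag_small_times_unsigned_literal() == 2);
}
```

The second form has an unsigned result type, so an overflowing product wraps around instead of triggering undefined behaviour.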

@arhadthedev
Member

Maybe it's time to rewrite Python's core in Rust.

File-by-file porting is feasible, given that Rust and C are designed to interoperate well at the ABI and API level.

However, are there enough developers ready to invest their time in learning Rust? PRs already sit open for weeks, so losing experts and core devs is a luxury we can't afford.

@vstinner
Member Author

My change replaces return (uint32_t)((uint64_t)x * (uint64_t)SUM) >> 24; with return (x * SUM) >> 24; where x and SUM both have type uint32_t. My intent was to avoid uint64_t, or worse, uint128_t.

Mark:

the cast ensures that a compiler doesn't have to be all that clever to realise that it's enough to do a 32x32->32 unsigned multiply here.

Do you mean that it's important to keep the uint32_t cast? Do you suggest return (uint32_t)(x * SUM) >> 24;?

Well, it seems there has been heavy engineering on this exact line. IMO, if it's modified one more time, a comment must explain why it's written exactly like that :-)

The result is in the range [0; 32], so uint32_t is enough ;-)

@mdickinson
Member

Do you mean that it's important to keep the uint32_t cast? Do you suggest return (uint32_t)(x * SUM) >> 24;?

That would help a bit, but still leaves the potential issue of undefined behaviour. The new code is only valid in the case that int has width <= 32. If we're going to keep the new code, there should at least be a check for that condition and a clear indication in the source that we're depending on that assumption. But really, I'd much prefer to have code that's valid regardless of the size of int, which is what the old code did.
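
For reference, a check of that kind could look roughly like the sketch below. It is only an illustration of making the assumption explicit, not what was adopted: the function name is hypothetical and the message text is invented.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Compilation stops here on a platform whose int is wider than 32 bits,
   where x * SUM below would be performed in (signed) int. */
_Static_assert(INT_MAX <= 0x7FFFFFFF,
               "byte_sum_narrow_int() assumes int is at most 32 bits wide");

/* Hypothetical name; valid only under the assumption asserted above. */
static inline uint32_t byte_sum_narrow_int(uint32_t x)
{
    const uint32_t SUM = 0x01010101;
    return (x * SUM) >> 24;
}

/* Self-check for the sketch; call it from main() to run it. */
static inline void check_byte_sum_narrow_int(void)
{
    assert(byte_sum_narrow_int(UINT32_C(0x01000001)) == 2);
    assert(byte_sum_narrow_int(UINT32_C(0x08080808)) == 32);
}
```

This turns a silent portability hazard into a build failure, but it still rejects platforms that a promotion-safe expression would handle.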

@mdickinson
Member

Do you mean that it's important to keep the uint32_t cast?

Here's a demonstration in Python that the implicit mask-out-everything-except-the-least-significant-32-bits operation matters.

>>> x = 2**28 + 1                                 # clearly, population count is 2
>>> x -= ((x >> 1) & 0x55555555)                  # step through the algorithm as written ...
>>> x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
>>> x = (x + (x >> 4)) & 0x0F0F0F0F
>>> (x * 0x01010101) >> 24     # this is the calculation that would happen on a machine with 64-bit `int`
16843010
>>> ((x * 0x01010101) & 0xFFFFFFFF) >> 24  # fixed to keep only the least significant 32 bits before shifting
2

@mdickinson
Member

mdickinson commented Jan 22, 2022

I suggest replacing the last line with:

return (uint32_t)(x * 0x01010101U) >> 24;

(which incidentally is the code that was originally introduced for this, in GH-771). That avoids any suggestion of a 64-bit multiply, and also remains portable: in the weird cases where x gets promoted to int, the multiplication is of the form int by unsigned int; C's rules then imply that it's performed as a multiplication of two unsigned ints, and so we're safe from any undefined behaviour due to overflow. In the non-weird case, we're doing an unsigned-by-unsigned multiply of non-promoted operands, so again we're safe. In both cases, the explicit (uint32_t) cast ensures that we retain only the low 32 bits of the product.

I'll make a PR to fix this, and make sure to include an explanatory comment.
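
Put together with the earlier steps, the suggested line fits into a complete function along these lines. This is a sketch rather than the CPython source: the name popcount32_sketch is hypothetical, and the three masking steps are taken from the Python session earlier in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a 32-bit population count ending in the suggested return line.
   Hypothetical name; not copied from CPython. */
static inline int popcount32_sketch(uint32_t x)
{
    x -= (x >> 1) & 0x55555555U;                       /* 2-bit counts */
    x = (x & 0x33333333U) + ((x >> 2) & 0x33333333U);  /* 4-bit counts */
    x = (x + (x >> 4)) & 0x0F0F0F0FU;                  /* per-byte counts */
    /* Unsigned literal: the multiply is unsigned even if x is promoted to
       int. The cast keeps the low 32 bits before the shift. */
    return (int)((uint32_t)(x * 0x01010101U) >> 24);
}

/* Self-check for the sketch; call it from main() to run it. */
static inline void check_popcount32_sketch(void)
{
    assert(popcount32_sketch(UINT32_C(0x10000001)) == 2);  /* 2**28 + 1 */
    assert(popcount32_sketch(UINT32_C(0xFFFFFFFF)) == 32);
}
```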

@mdickinson
Member

I'll make a PR to fix this, and make sure to include an explanatory comment.

Done in #30794.

@tim-one
Member

tim-one commented Jan 22, 2022

Mark, I don't agree. A C (or C++) environment isn't required to support uint32_t (etc) at all, but if it does, the type it resolves to must have exactly 32 bits. It's Python that requires our C environments to support that (the "C dialect" section of PEP 7).

The ambiguities you refer to could apply to ints declared as uint_least32_t or uint_fast32_t. All C environments must support those - but we aren't using them.

@mdickinson
Member

the type it resolves to must have exactly 32 bits

Yes of course; no disagreement with that.

The problem arises if the C int type has width greater than 32. You agree that that's possible, and allowed by the standards, right? ILP64 was once a thing, even if it's really hard today to find a machine that has a 64-bit int type.

On such a machine, C's integer promotions kick in, and any multiplication of uint32_t operands becomes a multiplication of int operands.

@tim-one
Member

tim-one commented Jan 22, 2022

Mark, ya, on third thought I agree your:

return (uint32_t)(x * 0x01010101U) >> 24;

is the best that can be done - although, also ya, there's no guarantee it won't do a 512x512 bit multiply 😉.

@mdickinson
Member

mdickinson commented Jan 22, 2022

In exactly the same way, the following code invokes undefined behaviour on a typical modern machine:

#include <stdint.h>
#include <stdio.h>

uint16_t mul_short(uint16_t x, uint16_t y) {
    return x * y;
}

int main(void) {
    uint16_t result = mul_short(60000, 60000);
    printf("result = %d\n", result);
    return 0;
}

Here are the results on my machine of compiling the above with Clang's -fsanitize=undefined option, and running the resulting executable.

lovelace:cpython mdickinson$ clang -fsanitize=undefined ~/Desktop/mul_short.c 
lovelace:cpython mdickinson$ ./a.out
/Users/mdickinson/Desktop/mul_short.c:5:14: runtime error: signed integer overflow: 60000 * 60000 cannot be represented in type 'int'
result = 41984
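
One way to write the 16-bit multiply without the signed overflow is sketched below. The name mul_short_wrapping is hypothetical, and the leading 1U is just one common way to force unsigned arithmetic; it plays the same role as the unsigned literal in the popcount expression.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed variant of mul_short() above. Starting the product
   with 1U makes every multiplication unsigned, so overflow wraps instead
   of being undefined; the cast keeps the low 16 bits. */
static inline uint16_t mul_short_wrapping(uint16_t x, uint16_t y)
{
    return (uint16_t)(1U * x * y);
}

/* Self-check for the sketch; call it from main() to run it. */
static inline void check_mul_short_wrapping(void)
{
    /* 3600000000 mod 65536, the same value printed above */
    assert(mul_short_wrapping(60000, 60000) == 41984);
}
```

The value is the same as in the sanitizer run above, but it is now guaranteed by the language rather than an accident of the platform.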

@vstinner
Member Author

Arithmetic in C is so hard :-(

@tim-one
Member

tim-one commented Jan 22, 2022

Arithmetic in C is so hard :-(

Only if you care about getting the right answer efficiently and transparently - not traditional concerns of OS developers, who invented the language 😉.
