bpo-29882: _Py_popcount32() doesn't need 64x64 multiply #30774

vstinner · 2022-01-21T22:56:10Z

32x32 bits multiply is enough for _Py_popcount32().

https://bugs.python.org/issue29882


vstinner · 2022-01-21T22:56:22Z

cc @mdickinson @tim-one

tim-one

I agree - looks good!

vstinner · 2022-01-22T01:16:17Z

I agree - looks good!

Thanks for the review.

I don't understand well how int multiplication works in C. Does uint32_t x uint32_t produce a uint64_t number or a uint32_t number? For _Py_popcount32(), the result fits into uint32_t anyway ;-)

tim-one · 2022-01-22T01:39:19Z

C's rules are defined on the platform's unsigned char, short, int, long, and long long types, so they can't be fully spelled out without knowing which of those maps to uint32_t on the platform. But it doesn't matter 😉 For an NxN bit unsigned int multiply, C "in general" only returns the lowermost N bits of the product, throwing the uppermost N bits away. So the important part here is your "the result fits into uint32_t anyway". Or, really more importantly, that the 8 bits holding the final result are the topmost byte of the 32-bit result returned by the 32x32-bit multiply. Casting a multiplicand to uint64_t first (as the code used to do) caused a 64x64->64 bit multiply, and the code had to clear the top 32 bits of the result, by casting to uint32_t, before shifting right.

The new code is just as correct, but clearer and possibly a bit quicker (depending on HW quirks).
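The "only the lowermost N bits" behaviour can be checked directly for operands of type unsigned int, where no integer promotion is involved. This is a minimal sketch; wrap_mul is a hypothetical helper name used only for illustration, not CPython code.

```c
#include <limits.h>   /* UINT_MAX, used in the usage note below */

/* Unsigned multiplication is performed modulo 2^N, where N is the width
 * of unsigned int, so only the low N bits of the product survive.
 * wrap_mul is a hypothetical name for illustration. */
unsigned int wrap_mul(unsigned int a, unsigned int b) {
    return a * b;
}
```

Since (2^N - 1)^2 = 2^(2N) - 2^(N+1) + 1, the call wrap_mul(UINT_MAX, UINT_MAX) yields 1 once the high bits are discarded, whatever the width of unsigned int is.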

mdickinson · 2022-01-22T07:16:44Z

The new code is just as correct

I don't think that's true, unfortunately. The old code was strictly portable, but the new code assumes that the width of int is no more than 32 bits, and will give undefined behaviour or incorrect results if that's not true.

The problem is that if the width of int exceeds that of uint32_t, then both arguments to the multiplication are promoted to int (C99 §6.3.1.1p2). The multiplication then ends up as a multiplication of ints, which, if it overflows (e.g., with a 34-bit int, unlikely though that is) triggers undefined behaviour.

More likely would be a 64-bit int, in which case we're not going to get undefined behaviour due to overflow, but we are going to get incorrect results - in the previous code, we were relying on the uint32_t cast to chop off all but the least significant 32 bits of the result.

I'd expect any compiler to already recognise that in the old code there's no need for a 64-bit-by-64-bit multiply.

Please could we revert this change, and possibly add a comment explaining why the code is delicate?
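The promotion rule can be observed on an ordinary machine by using a type that is narrower than int there. The sketch below assumes a C11 compiler (for _Generic); the macro and function names are hypothetical. uint16_t stands in for the "uint32_t on a machine with a wider int" case from the discussion.

```c
#include <stdint.h>

/* Evaluates to 1 if the expression has type (signed) int, else 0.
 * Hypothetical helper built on C11 _Generic. */
#define IS_SIGNED_INT(expr) _Generic((expr), int: 1, default: 0)

/* uint16_t is narrower than int on typical platforms, so both operands
 * are promoted to int and the product has type int. */
int product_of_u16_is_int(void) {
    uint16_t a = 1, b = 1;
    return IS_SIGNED_INT(a * b);
}

/* Where int is 32 bits wide, uint32_t is not promoted and the product
 * stays unsigned; where int is wider, this would report int as well. */
int product_of_u32_is_int(void) {
    uint32_t a = 1, b = 1;
    return IS_SIGNED_INT(a * b);
}
```

On a platform with a 32-bit int, the first function should report 1 and the second 0; the second result is exactly what would change on a machine with a wider int.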

mdickinson · 2022-01-22T07:20:26Z

Here's godbolt, showing that GCC at least does not produce a 64-by-64-bit multiply instruction: https://godbolt.org/z/shbY7j6cW

mdickinson · 2022-01-22T07:51:31Z

@vstinner Please see also my explanation on the original PR: #20518 (comment)

My original version is actually safer here. :-) This version can still invoke undefined behaviour (albeit only on unusual machines). Realistically, it's unlikely that we'd ever hit the undefined behaviour in practice, but given that there's an easy way to write this that avoids that undefined behaviour, I'd prefer to use that way.

If your int type has width larger than 32 bits, then C's integer promotions (C99 §6.3.1.1) ensure that both u and UINT32_C(0x01010101) are treated as (signed) int before the multiply, and then there's potential for the multiply to overflow, which gives undefined behaviour.

In contrast, if in the same situation you do u * 0x01010101U, then after those same integer promotions you're multiplying an int by an unsigned int, and C's "usual arithmetic conversions" (C99 §6.3.1.8) kick in to ensure that the multiplication is actually performed as unsigned int by unsigned int, which is safe from undefined behaviour and simply discards high bits in the expected manner if necessary.
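The effect of the U suffix can be sketched the same way, again with uint16_t standing in for a type narrower than int and assuming a C11 compiler. The names here are hypothetical.

```c
#include <stdint.h>

/* Evaluates to 1 if the expression has type unsigned int, else 0.
 * Hypothetical helper built on C11 _Generic. */
#define IS_UNSIGNED_INT(expr) _Generic((expr), unsigned int: 1, default: 0)

/* u is promoted to int, but the other operand is unsigned int, so the
 * usual arithmetic conversions convert u to unsigned int as well and
 * the multiplication is carried out in unsigned int. */
int mixed_product_is_unsigned(void) {
    uint16_t u = 1;
    return IS_UNSIGNED_INT(u * 0x0101U);
}
```

Because the product has type unsigned int, overflow wraps around instead of being undefined.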

Performance isn't likely to be a problem with the (uint32_t)(u * 0x01010101U) version: the cast ensures that a compiler doesn't have to be all that clever to realise that it's enough to do a 32x32->32 unsigned multiply here.

And yes, it's crazy that it's this hard to write a portable uint32_t * uint32_t -> uint32_t multiply in standard C. Maybe it's time to rewrite Python's core in Rust.

arhadthedev · 2022-01-22T08:24:58Z

Maybe it's time to rewrite Python's core in Rust.

The idea of file-by-file porting is feasible considering by-design good ABI and API cooperation between Rust code and C code.

However, are there enough developers ready to invest their time into studying Rust? PRs are already left open for weeks, so losing experts and core devs is a luxury we can't afford.

vstinner · 2022-01-22T13:50:02Z

My change replaces return (uint32_t)((uint64_t)x * (uint64_t)SUM) >> 24; with return (x * SUM) >> 24; where x and SUM both have type uint32_t. My intent was to avoid uint64_t, or worse, uint128_t.

Mark:

the cast ensures that a compiler doesn't have to be all that clever to realise that it's enough to do a 32x32->32 unsigned multiply here.

Do you mean that it's important to keep the uint32_t cast? Do you suggest return (uint32_t)(x * SUM) >> 24;?

Well, it seems like there is heavy engineering on this exact line. IMO, if it's modified one more time, a comment must explain why it's written exactly like that :-)

The result is in the range [0; 32], so uint32_t is enough ;-)

mdickinson · 2022-01-22T13:59:29Z

Do you mean that it's important to keep the uint32_t cast? Do you suggest return (uint32_t)(x * SUM) >> 24;?

That would help a bit, but still leaves the potential issue of undefined behaviour. The new code is only valid in the case that int has width <= 32. If we're going to keep the new code, there should at least be a check for that condition and a clear indication in the source that we're depending on that assumption. But really, I'd much prefer to have code that's valid regardless of the size of int, which is what the old code did.
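One way such a check could look is a compile-time assertion on INT_MAX. This is only a sketch of the idea, assuming a C11 compiler; it is not what CPython ended up doing, and top_byte_of_product is a hypothetical name.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical guard: refuse to compile where int is wider than 32 bits,
 * because there uint32_t operands would be promoted to (signed) int. */
_Static_assert(INT_MAX <= 0x7FFFFFFF,
               "top_byte_of_product assumes int is at most 32 bits wide");

/* Only valid under the assumption asserted above: the multiply is then
 * unsigned and wraps modulo 2^32, so the shift sees just 32 bits. */
uint32_t top_byte_of_product(uint32_t x) {
    const uint32_t SUM = 0x01010101;
    return (x * SUM) >> 24;
}
```

The guard documents the assumption and turns a silent miscalculation into a build failure, but the function still does not work on a wide-int platform, which is why code that is valid for any int width is preferable.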

mdickinson · 2022-01-22T14:15:11Z

Do you mean that it's important to keep the uint32_t cast?

Here's a demonstration in Python that the implicit mask-out-everything-except-the-least-significant-32-bits operation matters.

>>> x = 2**28 + 1                                 # clearly, population count is 2
>>> x -= ((x >> 1) & 0x55555555)                  # step through the algorithm as written ...
>>> x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
>>> x = (x + (x >> 4)) & 0x0F0F0F0F
>>> (x * 0x01010101) >> 24     # this is the calculation that would happen on a machine with 64-bit `int`
16843010
>>> ((x * 0x01010101) & 0xFFFFFFFF) >> 24  # fixed to keep only the least significant 32 bits before shifting
2
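The same comparison can be sketched in C by doing the multiply in a 64-bit type, which emulates what a 64-bit int would compute. The function names are hypothetical, and uint64_t is used instead of a signed type purely to keep the emulation free of signed overflow concerns.

```c
#include <stdint.h>

/* Emulates the final step on a machine with a 64-bit int: the product is
 * NOT truncated to 32 bits before the shift, so high bits leak in. */
uint64_t final_step_untruncated(uint32_t x) {
    return ((uint64_t)x * 0x01010101U) >> 24;
}

/* Same multiply, but truncated to the low 32 bits before shifting
 * (the cast applies to the product, then the result is shifted). */
uint32_t final_step_truncated(uint32_t x) {
    return (uint32_t)((uint64_t)x * 0x01010101U) >> 24;
}
```

For the intermediate value reached in the session above (x == 0x01000001 after the third step), the untruncated version gives 16843010 and the truncated version gives 2, matching the Python output.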

mdickinson · 2022-01-22T15:10:58Z

I suggest replacing the last line with:

return (uint32_t)(x * 0x01010101U) >> 24;

(which incidentally is the code that was originally introduced for this, in GH-771). That avoids any suggestion of a 64-bit multiply, and also remains portable: in the weird cases where x gets promoted to int, the multiplication is of the form int by unsigned int; C's rules then imply that it's performed as a multiplication of two unsigned ints, and so we're safe from any undefined behaviour due to overflow. In the non-weird case, we're doing an unsigned-by-unsigned multiply of non-promoted operands, so again we're safe. In both cases, the explicit (uint32_t) cast ensures that we retain only the low 32 bits of the product.
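Putting the suggested last line together with the earlier steps of the algorithm (as walked through in the Python session in this thread) gives roughly the following. This is a sketch under a hypothetical name, popcount32_sketch, and is not a copy of CPython's source; the U suffixes on the masks are an addition to keep every step in unsigned arithmetic.

```c
#include <stdint.h>

/* Sketch of a 32-bit population count. popcount32_sketch is a
 * hypothetical name for illustration. */
int popcount32_sketch(uint32_t x) {
    /* Each 2-bit field now holds the number of set bits in that pair. */
    x -= (x >> 1) & 0x55555555U;
    /* Each 4-bit field holds the count for its nibble. */
    x = (x & 0x33333333U) + ((x >> 2) & 0x33333333U);
    /* Each byte holds the count for that byte (0..8). */
    x = (x + (x >> 4)) & 0x0F0F0F0FU;
    /* The multiply sums the four bytes into the top byte. The unsigned
     * constant keeps the multiply unsigned even if x is promoted, and
     * the cast keeps only the low 32 bits before the shift. */
    return (int)((uint32_t)(x * 0x01010101U) >> 24);
}
```

For example, popcount32_sketch(0x10000001) returns 2 and popcount32_sketch(0xFFFFFFFF) returns 32.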

I'll make a PR to fix this, and make sure to include an explanatory comment.

mdickinson · 2022-01-22T15:30:58Z

I'll make a PR to fix this, and make sure to include an explanatory comment.

Done in #30794.

tim-one · 2022-01-22T16:41:35Z

Mark, I don't agree. A C (or C++) environment isn't required to support uint32_t (etc) at all, but if it does, the type it resolves to must have exactly 32 bits. It's Python that requires our C environments to support that (the "C dialect" section of PEP 7).

The ambiguities you refer to could apply to ints declared as uint_least32_t or uint_fast32_t. All C environments must support those - but we aren't using them.

mdickinson · 2022-01-22T16:51:17Z

the type it resolves to must have exactly 32 bits

Yes of course; no disagreement with that.

The problem arises if the C int type has width greater than 32. You agree that that's possible, and allowed by the standards, right? ILP64 was once a thing, even if it's really hard today to find a machine that has a 64-bit int type.

On such a machine, C's integer promotions kick in, and any multiplication of uint32_t operands becomes a multiplication of int operands.

tim-one · 2022-01-22T16:57:52Z

Mark, ya, on third thought I agree your:

return (uint32_t)(x * 0x01010101U) >> 24;

is the best that can be done - although, also ya, there's no guarantee it won't do a 512x512 bit multiply 😉.

mdickinson · 2022-01-22T16:57:55Z

In exactly the same way, the following code invokes undefined behaviour on a typical modern machine:

#include <stdint.h>
#include <stdio.h>

uint16_t mul_short(uint16_t x, uint16_t y) {
    return x * y;
}

int main(void) {
    uint16_t result = mul_short(60000, 60000);
    printf("result = %d\n", result);
    return 0;
}

Here are the results on my machine of compiling the above with Clang's -fsanitize=undefined option, and running the resulting executable.

lovelace:cpython mdickinson$ clang -fsanitize=undefined ~/Desktop/mul_short.c 
lovelace:cpython mdickinson$ ./a.out
/Users/mdickinson/Desktop/mul_short.c:5:14: runtime error: signed integer overflow: 60000 * 60000 cannot be represented in type 'int'
result = 41984
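For comparison, one possible way to write the same function without the undefined behaviour is to force the multiplication into unsigned int, mirroring the u * 0x01010101U idea discussed earlier. This is a sketch; mul_short_defined is a hypothetical name.

```c
#include <stdint.h>

/* Multiplying by 1U first converts the promoted operand to unsigned int,
 * so the whole product is computed in unsigned arithmetic and wraps
 * instead of overflowing. The cast keeps the low 16 bits. */
uint16_t mul_short_defined(uint16_t x, uint16_t y) {
    return (uint16_t)(1U * x * y);
}
```

With this version, mul_short_defined(60000, 60000) returns 41984, the same value the sanitizer run printed, but without relying on signed overflow.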

vstinner · 2022-01-22T18:14:38Z

Arithmetic in C is so hard :-(

tim-one · 2022-01-22T18:22:45Z

Arithmetic in C is so hard :-(

Only if you care about getting the right answer efficiently and transparently - not traditional concerns of OS developers, who invented the language 😉.

bpo-29882: _Py_popcount32() doesn't need 64x64 multiply

2ff3eb4

32x32 bits multiply is enough for _Py_popcount32().

vstinner added the skip news label Jan 21, 2022

the-knights-who-say-ni added the CLA signed label Jan 21, 2022

tim-one approved these changes Jan 21, 2022


bedevere-bot added the awaiting merge label Jan 21, 2022

vstinner changed the title ~~bpo-29882: _Py_popcount32() doesn't need 64x64 multiply~~ bpo-29882: _Py_popcount32() doesn't need 64x64 multiply Jan 21, 2022

vstinner merged commit cd8de40 into python:main Jan 21, 2022

bedevere-bot removed the awaiting merge label Jan 21, 2022

vstinner deleted the popcount_mul branch January 21, 2022 23:54

mdickinson added a commit to mdickinson/cpython that referenced this pull request Jan 22, 2022

bpo-29882: Fix portability bug introduced in pythonGH-30774

ff5ba30

mdickinson mentioned this pull request Jan 22, 2022

bpo-29882: Fix portability bug introduced in GH-30774 #30794

Merged

mdickinson added a commit that referenced this pull request Jan 23, 2022

bpo-29882: Fix portability bug introduced in GH-30774 (#30794)

83a0ef2

niklasf mannequin mentioned this pull request Nov 13, 2022

Add an efficient popcount method for integers #74068

Closed
