-
-
Notifications
You must be signed in to change notification settings - Fork 31.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-29882: Add an efficient popcount method for integers #771
Conversation
@niklasf, thanks for your PR! By analyzing the history of the files in this pull request, we identified @tim-one, @birkenfeld and @ncoghlan to be potential reviewers. |
Some compilers (gcc) provide hardware optimized |
@ilevkivskyi: Yes, using hardware popcnt makes this ever so slightly faster.
Previously I have submitted #594, but so far there is no consensus on wether or how to use these intrinsics. So I think it's best to consider this seperately. Edit: I should also mention that there is a faster software implementation, using complete lookup tables for 16bit integers. This one is commonly used in chess engines if hardware popcnt is not available, but seems like a rather large overhead for a general purpose interpreter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks OK.
But it seems no consensus about adding this method or not.
1c6e9fa
to
b6e46da
Compare
b6e46da
to
74e3ea6
Compare
Thanks for reviewing. All done, in case we do decide to add this. |
@niklasf I'm interested in getting this updated and merged, for Python 3.10. May I merge master into this branch and make the appropriate fixes (changing the "whats new" entry from 3.8 to 3.10)? Or would you prefer to do that yourself? |
@mdickinson Of course, feel free to modify as needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. (Admittedly that's not entirely an independent judgement at this point, but I haven't changed any of the core code or tests.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
LGTM: the implementation is correct. I'm not strongly excited by having this method since I never needed it, but it seems like some other people need it. So well, it looks reasonable to add it. Also, in general, I completely trust @mdickinson :-)
|
||
/* Each digit has up to PyLong_SHIFT ones, so the accumulated bit count | ||
from the first PY_SSIZE_T_MAX/PyLong_SHIFT digits can't overflow a | ||
Py_ssize_t. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice comment, thanks! Before the comment, I was surprised by the two loops.
/* Each digit has up to PyLong_SHIFT ones, so the accumulated bit count | ||
from the first PY_SSIZE_T_MAX/PyLong_SHIFT digits can't overflow a | ||
Py_ssize_t. */ | ||
Py_ssize_t ndigits_fast = Py_MIN(ndigits, PY_SSIZE_T_MAX/PyLong_SHIFT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I proposed to work use size_t for bit_count, but then I realzied that i is a Py_ssize_t, so it's better to leave the code as it is: work on Py_ssize_t for i and bit_count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation of popcount_digit
isn't quite as portable as it should be.
The fix should be trivial. I'll look into it.
Objects/longobject.c
Outdated
u -= (u >> 1) & 0x55555555; | ||
u = (u & 0x33333333) + ((u >> 2) & 0x33333333); | ||
u = (u + (u >> 4)) & 0x0f0f0f0f; | ||
return (u * 0x01010101) >> 24; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realised that this line isn't quite portable. The code is relying on the multiplication being a multiplication of 32-bit integers, so that the result is implicitly reduced modulo 2^32
.
However, if we're on a platform with a 64-bit int
then 0x01010101
will have type int
(C99 6.4.4.1p5), u
will be promoted to int
(following the "usual arithmetic conversions" in C99 6.3.1.8p1), the multiplication will be a multiplication of 64-bit integers and we'll end up with the wrong result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix implemented, in the shape of an extra (uint32_t)
cast on the multiplication result along with some U
suffixes on the integer constants.
Note that the U
suffix on 0x01010101U
is actually necessary to avoid undefined behaviour: consider a (hypothetical, but permissible under the C standard) machine with a 48-bit int
. If we do u * 0x01010101
on that machine, then first the integer promotions are applied to both arguments, and they both end up being of type int
. Then we do an int
-by-int
multiplication. With the right (wrong?) value of d
, this could overflow, giving undefined behaviour.
But with the U
suffix, we end up after the integer promotions multiplying an int
by an unsigned int
, and then the usual arithmetic conversions kick in and the multiplication is performed as type unsigned int
, which has well-defined wraparound behaviour for the result.
The other U
suffixes aren't strictly necessary; they're just there to satisfy my OCD and make the code easier to think about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python/hamt.c has basically a copy of this function, it's just written a little bit differently, but its return type is uint32_t. It exists since Python 3.8 and we didn't get any bug report :-)
I added pycore_byteswap.h. Maybe we could have a pycore_bitutils.h for such function and declare it as static inline, to reuse it here and in hamt.c. Or it can be done later ;-)
@pitrou also asked if "pycore_byteswap.h" filename was not too specific, maybe we could pick a more generic name to put more functions there. Like "bits and bytes utilities" :-)
Note: _testinternalcapi has tests for the byte swap functions ;-) We could have tests on this popcount function as well to detect if a compiler "miscompiles" this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I'm volunteer to move this function as static inline function and add the new test ;-) As I wrote, it can be done later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It exists since Python 3.8 and we didn't get any bug report :-)
That's probably because it was never run on a machine with a 64-bit C int
. Which is not surprising, because it's very hard to find such a machine these days. But who knows what could happen in the future. :-)
The version of the code in hamt.c
has the same issues: it'll return the wrong result with a 64-bit C int
, and it can technically invoke undefined behaviour. The return type of uint32_t
on that function doesn't help, because you already have the wrong number / undefined behaviour before you hit the return.
Realistically, the undefined behaviour is unlikely to ever be an issue in practice. But the wrong result with a 64-bit int
isn't entirely implausible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll merge this, then we can look at combining the implementations or fixing the hamt.c
code as necessary.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-approving after fixing my own portability objection.
Merging. @niklasf: Thank you! |
@mdickinson: Please replace |
OMG, I was merging the PR (I was editing the commit message) while @mdickinson merged it :-D Thanks @mdickinson ;-) |
Thanks for getting this over the finish line! |
PR created in 2017! It has an identifier lower than 1000 rather than the latest PR got the identifier 20517! Thanks @niklasf! |
I wrote PR #20518 which adds a _Py_popcount32() static inline function based on popcount_digit(). |
* 'master' of github.com:python/cpython: (497 commits) bpo-40061: Fix a possible refleak in _asynciomodule.c (pythonGH-19748) bpo-40798: Generate a different message for already removed elements (pythonGH-20483) closes bpo-29017: Update the bindings for Qt information with PySide2 (pythonGH-20149) bpo-39885: Make IDLE context menu cut and copy work again (pythonGH-18951) bpo-29882: Add an efficient popcount method for integers (python#771) Further de-linting of zoneinfo module (python#20499) bpo-40780: Fix failure of _Py_dg_dtoa to remove trailing zeros (pythonGH-20435) Indicate that abs() method accept argument that implement __abs__(), just like call() method in the docs (pythonGH-20509) bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (pythongh-17620) bpo-40784: Fix sqlite3 deterministic test (pythonGH-20448) bpo-30064: Properly skip unstable loop.sock_connect() racing test (pythonGH-20494) Note the output ordering of combinatoric functions (pythonGH-19732) bpo-40474: Updated coverage.yml to better report coverage stats (python#19851) bpo-40806: Clarify that itertools.product immediately consumes its inpt (pythonGH-20492) bpo-1294959: Try to clarify the meaning of platlibdir (pythonGH-20332) bpo-37878: PyThreadState_DeleteCurrent() was not removed (pythonGH-20489) bpo-40777: Initialize PyDateTime_IsoCalendarDateType.tp_base at run-time (pythonGH-20493) bpo-40755: Add missing multiset operations to Counter() (pythonGH-20339) bpo-25920: Remove socket.getaddrinfo() lock on macOS (pythonGH-20177) bpo-40275: Fix test.support.threading_helper (pythonGH-20488) ...
codeimport random
import timeit
lens = range(1, 10000)
maxs = [2**i for i in lens]
# array with random numbers in [0, maxs[i]-1]
randos = [max + random.randrange(max) for max in maxs]
# time bin().count('1') vs bit_count() for each random number
bin_times = [timeit.timeit(lambda: bin(rando).count('1'), number=100) for rando in randos]
bit_count_times = [timeit.timeit(lambda: rando.bit_count(), number=100) for rando in randos]
# create plot
from matplotlib import pyplot as plt
plt.loglog(lens, bin_times, label='bin().count(\'1\')')
plt.loglog(lens, bit_count_times, label='bit_count()')
plt.legend()
plt.show()
|
@HannesGitH Why the comment? Is there some action that needs to be taken? |
nope, all fine, sorry just wanted to add the speedup graph for anyone who stumbles upon this and is curious |
https://bugs.python.org/issue29882
https://bugs.python.org/issue29882