Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-29882: Add an efficient popcount method for integers #771

Merged
merged 13 commits into from
May 29, 2020

Conversation

niklasf
Copy link
Contributor

@niklasf niklasf commented Mar 22, 2017

@mention-bot
Copy link

@niklasf, thanks for your PR! By analyzing the history of the files in this pull request, we identified @tim-one, @birkenfeld and @ncoghlan to be potential reviewers.

@ilevkivskyi
Copy link
Member

Some compilers (gcc) provide hardware optimized __builtin_popcount function(s). Could these speed-up the implementation?

@niklasf
Copy link
Contributor Author

niklasf commented Mar 22, 2017

@ilevkivskyi: Yes, using hardware popcnt makes this ever so slightly faster.

$ ./python -m timeit "bin(123456789).count('1')"  # equivalent
1000000 loops, best of 5: 286 nsec per loop
$ ./python -m timeit "(123456789).bit_count()"  # gnuc builtin
5000000 loops, best of 5: 44.8 nsec per loop
$ ./python -m timeit "(123456789).bit_count()"  # swar32
5000000 loops, best of 5: 46.3 nsec per loop

Previously I have submitted #594, but so far there is no consensus on wether or how to use these intrinsics. So I think it's best to consider this seperately.

Edit: I should also mention that there is a faster software implementation, using complete lookup tables for 16bit integers. This one is commonly used in chess engines if hardware popcnt is not available, but seems like a rather large overhead for a general purpose interpreter.

Copy link
Member

@methane methane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks OK.
But it seems no consensus about adding this method or not.

@niklasf niklasf force-pushed the bpo-29882-popcount branch from 1c6e9fa to b6e46da Compare June 2, 2019 09:21
@niklasf niklasf force-pushed the bpo-29882-popcount branch from b6e46da to 74e3ea6 Compare June 2, 2019 09:33
@niklasf
Copy link
Contributor Author

niklasf commented Jun 2, 2019

Thanks for reviewing. All done, in case we do decide to add this.

@mdickinson
Copy link
Member

@niklasf I'm interested in getting this updated and merged, for Python 3.10. May I merge master into this branch and make the appropriate fixes (changing the "whats new" entry from 3.8 to 3.10)? Or would you prefer to do that yourself?

@niklasf
Copy link
Contributor Author

niklasf commented May 25, 2020

@mdickinson Of course, feel free to modify as needed.

Copy link
Member

@mdickinson mdickinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. (Admittedly that's not entirely an independent judgement at this point, but I haven't changed any of the core code or tests.)

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

LGTM: the implementation is correct. I'm not strongly excited by having this method since I never needed it, but it seems like some other people need it. So well, it looks reasonable to add it. Also, in general, I completely trust @mdickinson :-)


/* Each digit has up to PyLong_SHIFT ones, so the accumulated bit count
from the first PY_SSIZE_T_MAX/PyLong_SHIFT digits can't overflow a
Py_ssize_t. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice comment, thanks! Before the comment, I was surprised by the two loops.

/* Each digit has up to PyLong_SHIFT ones, so the accumulated bit count
from the first PY_SSIZE_T_MAX/PyLong_SHIFT digits can't overflow a
Py_ssize_t. */
Py_ssize_t ndigits_fast = Py_MIN(ndigits, PY_SSIZE_T_MAX/PyLong_SHIFT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I proposed to work use size_t for bit_count, but then I realzied that i is a Py_ssize_t, so it's better to leave the code as it is: work on Py_ssize_t for i and bit_count.

Copy link
Member

@mdickinson mdickinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation of popcount_digit isn't quite as portable as it should be.

The fix should be trivial. I'll look into it.

u -= (u >> 1) & 0x55555555;
u = (u & 0x33333333) + ((u >> 2) & 0x33333333);
u = (u + (u >> 4)) & 0x0f0f0f0f;
return (u * 0x01010101) >> 24;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised that this line isn't quite portable. The code is relying on the multiplication being a multiplication of 32-bit integers, so that the result is implicitly reduced modulo 2^32.

However, if we're on a platform with a 64-bit int then 0x01010101 will have type int (C99 6.4.4.1p5), u will be promoted to int (following the "usual arithmetic conversions" in C99 6.3.1.8p1), the multiplication will be a multiplication of 64-bit integers and we'll end up with the wrong result.

Copy link
Member

@mdickinson mdickinson May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix implemented, in the shape of an extra (uint32_t) cast on the multiplication result along with some U suffixes on the integer constants.

Note that the U suffix on 0x01010101U is actually necessary to avoid undefined behaviour: consider a (hypothetical, but permissible under the C standard) machine with a 48-bit int. If we do u * 0x01010101 on that machine, then first the integer promotions are applied to both arguments, and they both end up being of type int. Then we do an int-by-int multiplication. With the right (wrong?) value of d, this could overflow, giving undefined behaviour.

But with the U suffix, we end up after the integer promotions multiplying an int by an unsigned int, and then the usual arithmetic conversions kick in and the multiplication is performed as type unsigned int, which has well-defined wraparound behaviour for the result.

The other U suffixes aren't strictly necessary; they're just there to satisfy my OCD and make the code easier to think about.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python/hamt.c has basically a copy of this function, it's just written a little bit differently, but its return type is uint32_t. It exists since Python 3.8 and we didn't get any bug report :-)

I added pycore_byteswap.h. Maybe we could have a pycore_bitutils.h for such function and declare it as static inline, to reuse it here and in hamt.c. Or it can be done later ;-)

@pitrou also asked if "pycore_byteswap.h" filename was not too specific, maybe we could pick a more generic name to put more functions there. Like "bits and bytes utilities" :-)

Note: _testinternalcapi has tests for the byte swap functions ;-) We could have tests on this popcount function as well to detect if a compiler "miscompiles" this function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'm volunteer to move this function as static inline function and add the new test ;-) As I wrote, it can be done later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It exists since Python 3.8 and we didn't get any bug report :-)

That's probably because it was never run on a machine with a 64-bit C int. Which is not surprising, because it's very hard to find such a machine these days. But who knows what could happen in the future. :-)

The version of the code in hamt.c has the same issues: it'll return the wrong result with a 64-bit C int, and it can technically invoke undefined behaviour. The return type of uint32_t on that function doesn't help, because you already have the wrong number / undefined behaviour before you hit the return.

Realistically, the undefined behaviour is unlikely to ever be an issue in practice. But the wrong result with a 64-bit int isn't entirely implausible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll merge this, then we can look at combining the implementations or fixing the hamt.c code as necessary.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Copy link
Member

@mdickinson mdickinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving after fixing my own portability objection.

@mdickinson
Copy link
Member

Merging. @niklasf: Thank you!

@mdickinson mdickinson merged commit 8bd216d into python:master May 29, 2020
@bedevere-bot
Copy link

@mdickinson: Please replace # with GH- in the commit message next time. Thanks!

@vstinner
Copy link
Member

OMG, I was merging the PR (I was editing the commit message) while @mdickinson merged it :-D Thanks @mdickinson ;-)

@niklasf
Copy link
Contributor Author

niklasf commented May 29, 2020

Thanks for getting this over the finish line!

@vstinner
Copy link
Member

PR created in 2017! It has an identifier lower than 1000 rather than the latest PR got the identifier 20517! Thanks @niklasf!

@vstinner
Copy link
Member

I wrote PR #20518 which adds a _Py_popcount32() static inline function based on popcount_digit().

CuriousLearner added a commit to CuriousLearner/cpython that referenced this pull request May 30, 2020
* 'master' of github.com:python/cpython: (497 commits)
  bpo-40061: Fix a possible refleak in _asynciomodule.c (pythonGH-19748)
  bpo-40798: Generate a different message for already removed elements (pythonGH-20483)
  closes bpo-29017: Update the bindings for Qt information with PySide2 (pythonGH-20149)
  bpo-39885: Make IDLE context menu cut and copy work again (pythonGH-18951)
  bpo-29882: Add an efficient popcount method for integers (python#771)
  Further de-linting of zoneinfo module (python#20499)
  bpo-40780: Fix failure of _Py_dg_dtoa to remove trailing zeros (pythonGH-20435)
  Indicate that abs() method accept argument that implement __abs__(), just like call() method in the docs (pythonGH-20509)
  bpo-39040: Fix parsing of email mime headers with whitespace between encoded-words. (pythongh-17620)
  bpo-40784: Fix sqlite3 deterministic test (pythonGH-20448)
  bpo-30064: Properly skip unstable loop.sock_connect() racing test (pythonGH-20494)
  Note the output ordering of combinatoric functions (pythonGH-19732)
  bpo-40474: Updated coverage.yml to better report coverage stats (python#19851)
  bpo-40806: Clarify that itertools.product immediately consumes its inpt (pythonGH-20492)
  bpo-1294959: Try to clarify the meaning of platlibdir (pythonGH-20332)
  bpo-37878: PyThreadState_DeleteCurrent() was not removed (pythonGH-20489)
  bpo-40777: Initialize PyDateTime_IsoCalendarDateType.tp_base at run-time (pythonGH-20493)
  bpo-40755: Add missing multiset operations to Counter() (pythonGH-20339)
  bpo-25920: Remove socket.getaddrinfo() lock on macOS (pythonGH-20177)
  bpo-40275: Fix test.support.threading_helper (pythonGH-20488)
  ...
@HannesGitH
Copy link

HannesGitH commented Jan 26, 2023

bit_count() vs bin() count('1')

code
import random
import timeit

lens = range(1, 10000)
maxs = [2**i for i in lens]

# array with random numbers in [0, maxs[i]-1]
randos = [max + random.randrange(max) for max in maxs]

# time bin().count('1') vs bit_count() for each random number
bin_times = [timeit.timeit(lambda: bin(rando).count('1'), number=100) for rando in randos]
bit_count_times = [timeit.timeit(lambda: rando.bit_count(), number=100) for rando in randos]

# create plot
from matplotlib import pyplot as plt

plt.loglog(lens, bin_times, label='bin().count(\'1\')')
plt.loglog(lens, bit_count_times, label='bit_count()')
plt.legend()
plt.show()

@mdickinson
Copy link
Member

@HannesGitH Why the comment? Is there some action that needs to be taken?

@HannesGitH
Copy link

HannesGitH commented Jan 26, 2023

@HannesGitH Why the comment? Is there some action that needs to be taken?

nope, all fine, sorry

just wanted to add the speedup graph for anyone who stumbles upon this and is curious

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.