bpo-46055: Streamline inner loop for right shifts #30243
Conversation
New timings with the latest commit (f92f3f8) give me a ~75% speedup for the […]
```c
for (i = 0, j = wordshift; i < newsize; i++, j++) {
    z->ob_digit[i] = (a->ob_digit[j] >> remshift) & lomask;
    if (i+1 < newsize)
        z->ob_digit[i] |= (a->ob_digit[j+1] << hishift) & himask;
```
The new code also fixes a really subtle portability issue in this code. Here the result of `a->ob_digit[j+1] * 2**hishift` may not be representable in the target type. Normally that wouldn't matter, because `a->ob_digit[j+1]` has type `digit`, which is an unsigned type, so the C standards tell us that any out-of-range value wraps in the normal way. But integer promotions could result in the left-hand operand of the shift actually being of type `int` (a signed type), and then an out-of-range shift result gives undefined behaviour according to the standard (C99 §6.5.7p4). We don't run into this in practice because under any likely combination of integer type bit widths (e.g., 16-bit `digit`, 32-bit `int`), if `digit` is small enough to be promoted to `int`, then `int` is likely big enough to hold the shift result. But the C standard does allow potentially problematic bit widths (e.g., `digit` could be 16 bits and `int` 24 bits).

Not a real issue, since it's unlikely we'd ever meet this in practice, but it's nice not to have to worry about it. With the new code, the result of the shift is guaranteed to be representable in the target type (that type being either `twodigits`, or something larger in the case that there are integer promotions going on).
LGTM.

I am wondering how it works on a 32-bit platform with a 32-bit digit, but I think that the memory-access optimization will compensate for it.
```c
accum = a->ob_digit[j++] >> remshift;
for (i = 0; j < Py_SIZE(a); i++, j++) {
    accum |= (twodigits)a->ob_digit[j] << hishift;
    z->ob_digit[i] = (digit)(accum & PyLong_MASK);
```
Wouldn't it be better to operate on a single digit? `(digit)accum & PyLong_MASK`
Maybe. I find this code clearer as a statement of intent: we're only changing the value once, not twice (all other things being equal, I like my casts not to change values).
I'll run some timings and look at the generated code. If the cast-first variant is faster, I'll change this.
Yes, me too. I'll see if I can find a way to check this.
At least on my machine, I'm not seeing any difference. The assembly generated for the inner loop is identical both ways.

Whoops; sorry - that assembly excerpt was from changing the […]
@mdickinson: Please replace […]
While reviewing #30044, I noticed that the inner loop for the right shift operation could be more efficient. Here's a PR that streamlines that loop. The main changes are:

- […] `& lomask`), and replace `& himask` with `& PyLong_MASK`
- […] `a` and `z`
On my machine (macOS 10.14.6 / Intel MacBook Pro), in informal timings I get approximately a 35% speedup for a shift of the form `huge >> small`. Some sample timings:

On main (commit cf15419):

On this branch (commit 056495d):

Small shift operations are not significantly affected. More sample timings - on master:

On this branch:

(but a second run on this branch gave 22.9 nsec per loop, so any difference is being lost in the variation between runs)
https://bugs.python.org/issue46055