Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-46055: Streamline inner loop for right shifts #30243

Merged
merged 6 commits into from
Dec 27, 2021

Conversation

mdickinson
Copy link
Member

@mdickinson mdickinson commented Dec 23, 2021

While reviewing #30044, I noticed that the inner loop for the right shift operation could be more efficient. Here's a PR that streamlines that loop. The main changes are:

  • remove an unnecessary extra masking operation (& lomask), and replace & himask with & PyLong_MASK
  • rewrite the loop to remove multiple accesses to the digits of a and z
  • remove an inner branch

On my machine (macOS 10.14.6 / Intel MacBook Pro), in informal timings I get approximately a 35% speedup for a shift of the form huge >> small. Some sample timings:

On main (commit cf15419):

lovelace:cpython mdickinson$ ./python.exe -m timeit -s "a, b = 7**100000, 53" "a >> b"
20000 loops, best of 5: 11.5 usec per loop

On this branch (commit 056495d):

lovelace:cpython mdickinson$ ./python.exe -m timeit -s "a, b = 7**100000, 53" "a >> b"
50000 loops, best of 5: 8.51 usec per loop

Small shift operations are not significantly affected. More sample timings - on master:

lovelace:cpython mdickinson$ ./python.exe -m timeit -s "a, b = 7**10, 53" "a >> b"
10000000 loops, best of 5: 23.3 nsec per loop

On this branch:

lovelace:cpython mdickinson$ ./python.exe -m timeit -s "a, b = 7**10, 53" "a >> b"
10000000 loops, best of 5: 24 nsec per loop

(but a second run on this branch gave 22.9 nsec per loop, so any difference is being lost in the variation between runs)

https://bugs.python.org/issue46055

@mdickinson
Copy link
Member Author

mdickinson commented Dec 23, 2021

New timings with the latest commit (f92f3f8) give me a ~75% speedup for the big >> small case.

lovelace:cpython mdickinson$ git checkout main && make >> /dev/null && ./python.exe -m timeit -s "a, b = 7**100000, 53" "a >> b"
[ ... build output snipped ...]
20000 loops, best of 5: 11.6 usec per loop
lovelace:cpython mdickinson$ git checkout faster-rshift && make >> /dev/null && ./python.exe -m timeit -s "a, b = 7**100000, 53" "a >> b"
[ ... build output snipped ...]
50000 loops, best of 5: 6.53 usec per loop

for (i = 0, j = wordshift; i < newsize; i++, j++) {
z->ob_digit[i] = (a->ob_digit[j] >> remshift) & lomask;
if (i+1 < newsize)
z->ob_digit[i] |= (a->ob_digit[j+1] << hishift) & himask;
Copy link
Member Author

@mdickinson mdickinson Dec 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new code also fixes a really subtle portability issue in this code. Here the result of a->ob_digit[j+1] * 2**hishift may not be representable in the target type. Normally that wouldn't matter, because a->ob_digit[j+1] has type digit, which is an unsigned type, so the C standards tell us that any out-of-range value wraps in the normal way. But integer promotions could result in the left-hand operand to the shift actually being of type int (a signed type), and then an out-of-range shift result gives undefined behaviour according to the standard (C99 §6.5.7p4). We don't run into this in practice because under any likely combination of integer type bit widths (e.g., 16-bit digit, 32-bit int), if digit is small enough to be promoted to int, then int is likely big enough to hold the shift result. But the C standard does allow potentially problematic bit widths (e.g., digit could be 16 bits and int 24 bits).

Not a real issue, since it's unlikely we'd ever meet this in practice, but it's nice not to have to worry about it. With the new code, the result of the shift is guaranteed to be representable in the target type (that type being either twodigits, or something larger in the case that there are integer promotions going on).

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I am wondering how it works on 32-bit platform with 32-bit digit, but I think that memory access optimization will compensate it.

accum = a->ob_digit[j++] >> remshift;
for (i = 0; j < Py_SIZE(a); i++, j++) {
accum |= (twodigits)a->ob_digit[j] << hishift;
z->ob_digit[i] = (digit)(accum & PyLong_MASK);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not be better to operate on a single digit? (digit)accum & PyLong_MASK

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. I find this code clearer as a statement of intent: we're only changing the value once, not twice (all other things being equal, I like my casts not to change values).

I'll run some timings and look at the generated code. If the cast-first variant is faster, I'll change this.

@mdickinson
Copy link
Member Author

I am wondering how it works on 32-bit platform with 32-bit digit

Yes, me too. I'll see if I can find a way to check this.

@mdickinson
Copy link
Member Author

mdickinson commented Dec 27, 2021

I'll run some timings and look at the generated code. If the cast-first variant is faster, I'll change this.

At least on my machine, I'm not seeing any difference. The assembly generated for the inner loop is identical both ways (z->ob_digit[i] = (digit)accum & PyLong_MASK; versus z->ob_digit[i] = (digit)(accum & PyLong_MASK);, except for a difference in column numbers in the .loc statements.

Ltmp3675:
LBB51_18:
	##DEBUG_VALUE: long_lshift1:remshift <- [DW_OP_constu 44, DW_OP_minus] [$rbp+0]
	##DEBUG_VALUE: long_lshift1:j <- $rsi
	##DEBUG_VALUE: long_lshift1:accum <- $rax
	##DEBUG_VALUE: long_lshift1:z <- $r12
	##DEBUG_VALUE: long_lshift1:i <- $r15
	##DEBUG_VALUE: long_lshift1:a <- $rbx
	##DEBUG_VALUE: long_lshift1:newsize <- $r13
	.loc	1 4589 29 is_stmt 1     ## Objects/longobject.c:4589:29
	movl	24(%rbx,%rsi,4), %edx
                                        ## kill: def $cl killed $cl killed $rcx
	.loc	1 4589 44 is_stmt 0     ## Objects/longobject.c:4589:44
	shlq	%cl, %rdx
	.loc	1 4589 15               ## Objects/longobject.c:4589:15
	orq	%rax, %rdx
Ltmp3676:
	##DEBUG_VALUE: long_lshift1:accum <- $rdx
	.loc	1 4590 26 is_stmt 1     ## Objects/longobject.c:4590:26
	movl	%edx, %eax
	andl	$1073741823, %eax       ## imm = 0x3FFFFFFF
	.loc	1 4590 24 is_stmt 0     ## Objects/longobject.c:4590:24
	movl	%eax, 24(%r12,%r15,4)
	.loc	1 4591 15 is_stmt 1     ## Objects/longobject.c:4591:15
	shrq	$30, %rdx
Ltmp3677:
	##DEBUG_VALUE: long_lshift1:accum <- $rdx
	movq	%rdx, %rax

@mdickinson
Copy link
Member Author

Whoops; sorry - that assembly excerpt was from changing the long_lshift1 inner loop, not the long_rshift1 one. But I see the same with long_rshift1: no non-trivial change to the assembly code. (The details are a bit messier: it looks as though clang has decided to unroll the inner loop for long_rshift1.)

@mdickinson mdickinson closed this Dec 27, 2021
@mdickinson mdickinson reopened this Dec 27, 2021
@mdickinson mdickinson merged commit 360fedc into python:main Dec 27, 2021
@bedevere-bot
Copy link

@mdickinson: Please replace # with GH- in the commit message next time. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants