pp.h: replace erroneous bitwise AND with logical AND #22937
base: blead
Conversation
Using '&' rather than '&&' in the TARGi/u/n macros is intentional: it reduces the number of branches in hot code.
It's a misconception that branchless code is automatically better. In fact, "branchless" often means "unconditionally does more work". Modern branch predictors are really good. That said, I don't see why using bitwise AND would result in branchless code.

Anyway, let's see what GCC outputs. For example, let's have a look at pp.c:1354, which is the hottest branch of pp_multiply; this line saves the result of the multiplication of two non-overflowing IVs:

```c
TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
```

This is what it compiles to on blead (i.e. with bitwise AND):

```asm
# pp.c:1354: TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
        mov     edx, DWORD PTR 12[r12]  # _29, iftmp.314_142->sv_flags
        imul    rsi, r8                 # TARGi_iv, ir
        mov     ecx, edx                # tmp251, _29
        and     ecx, -1719596801        # tmp251,
        cmp     ecx, 1                  # tmp251,
        je      .L1251                  #,
.L1284:
# pp.c:1475: TARGi(NEGATE_2IV(product_low), 1);
        mov     rdi, r12                #, iftmp.314_142
        call    Perl_sv_setiv_mg@PLT    #
# inline.h:881: *--PL_stack_sp = sv;
        mov     rax, QWORD PTR PL_stack_sp[rip] # prephitmp_153, PL_stack_sp
        jmp     .L1252                  #
#
# [ ~200 lines later... ]
#
.L1251:
# pp.c:1354: TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
        or      dh, 17                  # tmp252,
        mov     QWORD PTR 16[r12], rsi  # iftmp.314_142->sv_u.svu_iv, TARGi_iv
        mov     DWORD PTR 12[r12], edx  # iftmp.314_142->sv_flags, tmp252
        jmp     .L1252                  #
```

And this is how it compiles on my branch (logical AND):

```asm
# pp.c:1354: TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
        mov     edx, DWORD PTR 12[r12]  # _29, iftmp.314_139->sv_flags
        imul    rsi, r8                 # TARGi_iv, ir
        mov     ecx, edx                # tmp232, _29
        and     ecx, -1719596801        # tmp232,
        cmp     ecx, 1                  # tmp232,
        jne     .L1249                  #,
# pp.c:1354: TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
        or      dh, 17                  # tmp233,
        mov     QWORD PTR 16[r12], rsi  # iftmp.314_139->sv_u.svu_iv, TARGi_iv
        mov     DWORD PTR 12[r12], edx  # iftmp.314_139->sv_flags, tmp233
        jmp     .L1218                  #
#
# [ ~300 lines later... ]
#
.L1249:
# pp.c:1475: TARGi(NEGATE_2IV(product_low), 1);
        mov     rdi, r12                #, iftmp.314_139
        call    Perl_sv_setiv_mg@PLT    #
# inline.h:881: *--PL_stack_sp = sv;
        mov     rax, QWORD PTR PL_stack_sp[rip] # prephitmp_125, PL_stack_sp
        jmp     .L1218                  #
```

The instructions (and the number of branches) are exactly the same. However, the code layout is different: with bitwise AND, the fast path is the out-of-line block. If anything, this seems like a pessimisation, not an optimisation. And even if it were a real optimisation, such unusual code deserves a comment (and ideally some proof!)
I remember looking at the original change here and thinking it was ugly, but based on the optimization guidance from Intel (avoid branches!!!) it seemed like it would be a performance improvement. But I've recently been working on a tool to measure performance, and with the artificial benchmarks from expr::arith::mult_lex_ni:
(the values are (roughly*) medians from 10 runs of 100 iterations each)

The numbers come from Linux libperf, which uses the CPU-reported values.

ARM (Ampere here) is similar: expr::arith::mult_lex_ni:

(negative cache-misses are due to statistical variation; see the expandable section below for info on how this stuff is calculated)

Note: the x86_64 results are from a non-threaded build and the aarch64 results from a threaded build.

\* this code is very WIP

Raw measurements for expr::arith::mult_lex_ni

The code measures 100 iterations with the initialization and pre/post code in the loop, and 100 iterations with the pre/benchmark/post code in the loop. e.g.

vs:

The 0 case measures for branch-misses:

    1804 1802 1804 2053 1796 1803 2014 1806 1805 1803

The run with the benchmark code for branch-misses:

    1826 1828 2042 1824 1829 2077 1828 1829 2130 1824

So there's a fair amount of variability in these raw numbers, but it's minor in comparison to the number of branches, which was:

zero case:

    38260 38260 38260 39393 38260 38260 38260 38260 38260 38260

benchmark case:

    64306 64306 65318 64306 64306 65439 64306 64306 65723 64306
On Mon, Jan 20, 2025 at 03:46:02AM -0800, Tomasz Konojacki wrote:
> It's a misconception that branchless code is automatically better. In fact, "branchless" often means "unconditionally does more work". Modern branch predictors are really good. That said, I don't see why using bitwise AND would result in branchless code.
>
> Anyway, let's see what GCC outputs. For example, let's have a look at [pp.c:1354](https://github.com/Perl/perl5/blob/b7db3ff4abeee359d5a78c227250e26a41ad1105/pp.c#L1354), which is the hottest branch of pp_multiply; this line saves the result of the multiplication of two non-overflowing IVs:
>
> ```c
> TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
> ```
Note that this line contains one of the rare-ish usages of TARG[iun] which
has a constant 0 as the second arg. This is designed to skip taint
checking. In fact the whole second arm of the 'and' (bitwise or logical)
should be optimised away when do_taint is compile-time set to 0.
So in this particular case, the generated code should be equivalent
regardless of & or &&. Which branch of the main conditional is inline
will be down to the vagaries of the particular compiler and version.
So you really need to look at code generated from one of the TARGi(...,1)
entries.
--
Indomitable in retreat, invincible in advance, insufferable in victory
-- Churchill on Montgomery
This set of changes does not require a perldelta entry.