
pp.h: replace erroneous bitwise AND with logical AND #22937

Open · wants to merge 1 commit into base: blead
Conversation

@xenu (Member) commented Jan 20, 2025

This set of changes does not require a perldelta entry.

@iabyn (Contributor) commented Jan 20, 2025

Using '&' rather than '&&' in the TARGi/u/n macros is intentional: it reduces the number of branches in hot code.

@xenu (Member, Author) commented Jan 20, 2025

It's a misconception that branchless code is automatically better. In fact, "branchless" often means "unconditionally does more work". Modern branch predictors are really good. That said, I don't see why using bitwise AND would result in branchless code.

Anyway, let's see what GCC outputs. For example, take pp.c:1354, the hottest branch of pp_multiply; this line saves the result of multiplying two non-overflowing IVs:

                TARGi(il * ir, 0); /* args not GMG, so can't be tainted */

This is what it compiles to on blead (i.e. with bitwise AND):

# pp.c:1354:                 TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
	mov	edx, DWORD PTR 12[r12]	# _29, iftmp.314_142->sv_flags
	imul	rsi, r8	# TARGi_iv, ir
	mov	ecx, edx	# tmp251, _29
	and	ecx, -1719596801	# tmp251,
	cmp	ecx, 1	# tmp251,
	je	.L1251	#,
.L1284:
# pp.c:1475:                             TARGi(NEGATE_2IV(product_low), 1);
	mov	rdi, r12	#, iftmp.314_142
	call	Perl_sv_setiv_mg@PLT	#
# inline.h:881:     *--PL_stack_sp = sv;
	mov	rax, QWORD PTR PL_stack_sp[rip]	# prephitmp_153, PL_stack_sp
	jmp	.L1252	#

#
# [ ~200 lines later... ]
#

.L1251:
# pp.c:1354:                 TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
	or	dh, 17	# tmp252,
	mov	QWORD PTR 16[r12], rsi	# iftmp.314_142->sv_u.svu_iv, TARGi_iv
	mov	DWORD PTR 12[r12], edx	# iftmp.314_142->sv_flags, tmp252
	jmp	.L1252	#

And this is how it compiles on my branch (logical AND):

# pp.c:1354:                 TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
	mov	edx, DWORD PTR 12[r12]	# _29, iftmp.314_139->sv_flags
	imul	rsi, r8	# TARGi_iv, ir
	mov	ecx, edx	# tmp232, _29
	and	ecx, -1719596801	# tmp232,
	cmp	ecx, 1	# tmp232,
	jne	.L1249	#,
# pp.c:1354:                 TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
	or	dh, 17	# tmp233,
	mov	QWORD PTR 16[r12], rsi	# iftmp.314_139->sv_u.svu_iv, TARGi_iv
	mov	DWORD PTR 12[r12], edx	# iftmp.314_139->sv_flags, tmp233
	jmp	.L1218	#

#
# [ ~300 lines later... ]
#

.L1249:
# pp.c:1475:                             TARGi(NEGATE_2IV(product_low), 1);
	mov	rdi, r12	#, iftmp.314_139
	call	Perl_sv_setiv_mg@PLT	#
# inline.h:881:     *--PL_stack_sp = sv;
	mov	rax, QWORD PTR PL_stack_sp[rip]	# prephitmp_125, PL_stack_sp
	jmp	.L1218	#

The instructions (and the number of branches) are exactly the same. However, the code layout is different: apparently LIKELY() was ignored in the "bitwise AND" version, so the unlikely/slow path (the call to Perl_sv_setiv_mg) was prioritized, and the fast path (the plain assignment) was moved far away from the test. The same pattern repeats in e.g. pp_subtract and pp_add.

If anything, this seems like a pessimisation, not an optimisation. And even if it were a real optimisation, such unusual code deserves a comment (and ideally some supporting measurements!).

@tonycoz (Contributor) commented Jan 21, 2025

I remember looking at the original change here and thinking it was ugly, but based on the optimization guidance from Intel (avoid branches!!!) it seemed like it would be a performance improvement.

But I've recently been working on a tool to measure performance, and with the artificial benchmarks from t/perf/benchmarks I've been seeing remarkably few branch misses (AMD EPYC 7713 64-Core Processor):

expr::arith::mult_lex_ni:

  • branch-misses: 25
  • branches: 26046
  • cache-misses: 57
  • cache-references: 131
  • context-switches: 0
  • cpu-migrations: 0
  • cycles: 49473
  • instructions: 148164
  • page-faults: 0
  • task-clock: 16481

(the values are (roughly*) medians from 10 runs of 100 iterations each)

The numbers come from Linux libperf which uses the CPU reported values.

ARM (Ampere here) is similar:

expr::arith::mult_lex_ni:

  • branch-misses: 27
  • branches: 24024
  • cache-misses: -3
  • cache-references: 98204
  • context-switches: 0
  • cpu-migrations: 0
  • cycles: 51661
  • instructions: 176559
  • page-faults: 0
  • task-clock: 30800

(negative cache-misses due to statistical variation, see the expandable section below for info on how this stuff is calculated)

Note: the x86_64 results are from a non-threaded build and the aarch64 from a threaded build.

* this code is very WIP

Raw measurements for expr::arith::mult_lex_ni

The code measures 100 iterations with the initialization and pre/post code in the loop, and 100 iterations with the pre/benchmark/post code in the loop.

e.g.

no warnings;
no strict;
sub {
  srand(0);
  my $__count = shift;
  my $__perf = Linux::libperf::Simple->new;
  my ($x,$y,$z) = (1, 2.2, 3.3);;
$__perf->enable;
  for my $__iter (1 .. $__count) {
;
  }
$__perf->disable;
return $__perf->results;
}

vs:

no warnings;
no strict;
sub {
  srand(0);
  my $__count = shift;
  my $__perf = Linux::libperf::Simple->new;
  my ($x,$y,$z) = (1, 2.2, 3.3);;
$__perf->enable;
  for my $__iter (1 .. $__count) {
;$z = $x * $y;
  }
$__perf->disable;
return $__perf->results;
}

The 0 case measures for branch-misses:

1804 1802 1804 2053 1796 1803 2014 1806 1805 1803

The run with the benchmark code for branch-misses:

1826 1828 2042 1824 1829 2077 1828 1829 2130 1824

So there's a fair amount of variability in these raw numbers, but it's minor in comparison to the number of branches which was:

zero case:

38260 38260 38260 39393 38260 38260 38260 38260 38260 38260

benchmark case:

64306 64306 65318 64306 64306 65439 64306 64306 65723 64306

@iabyn (Contributor) commented Jan 22, 2025 via email
