
pp.h: replace erroneous bitwise AND with logical AND #22937

Open · wants to merge 1 commit into base: blead
Conversation

@xenu (Member) commented Jan 20, 2025

This set of changes does not require a perldelta entry.

@iabyn (Contributor) commented Jan 20, 2025

Using '&' rather than '&&' in the TARGi/u/n macros is intentional: it reduces the number of branches in hot code.

@xenu (Member, Author) commented Jan 20, 2025

It's a misconception that branchless code is automatically better. In fact, "branchless" often means "unconditionally does more work". Modern branch predictors are really good. That said, I don't see why using bitwise AND would result in branchless code.

Anyway, let's see what GCC outputs. For example, take pp.c:1354, the hottest branch of pp_multiply; this line saves the result of multiplying two non-overflowing IVs:

                TARGi(il * ir, 0); /* args not GMG, so can't be tainted */

This is what it compiles to on blead (i.e. with bitwise AND):

# pp.c:1354:                 TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
	mov	edx, DWORD PTR 12[r12]	# _29, iftmp.314_142->sv_flags
	imul	rsi, r8	# TARGi_iv, ir
	mov	ecx, edx	# tmp251, _29
	and	ecx, -1719596801	# tmp251,
	cmp	ecx, 1	# tmp251,
	je	.L1251	#,
.L1284:
# pp.c:1475:                             TARGi(NEGATE_2IV(product_low), 1);
	mov	rdi, r12	#, iftmp.314_142
	call	Perl_sv_setiv_mg@PLT	#
# inline.h:881:     *--PL_stack_sp = sv;
	mov	rax, QWORD PTR PL_stack_sp[rip]	# prephitmp_153, PL_stack_sp
	jmp	.L1252	#

#
# [ ~200 lines later... ]
#

.L1251:
# pp.c:1354:                 TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
	or	dh, 17	# tmp252,
	mov	QWORD PTR 16[r12], rsi	# iftmp.314_142->sv_u.svu_iv, TARGi_iv
	mov	DWORD PTR 12[r12], edx	# iftmp.314_142->sv_flags, tmp252
	jmp	.L1252	#

And this is how it compiles on my branch (logical AND):

# pp.c:1354:                 TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
	mov	edx, DWORD PTR 12[r12]	# _29, iftmp.314_139->sv_flags
	imul	rsi, r8	# TARGi_iv, ir
	mov	ecx, edx	# tmp232, _29
	and	ecx, -1719596801	# tmp232,
	cmp	ecx, 1	# tmp232,
	jne	.L1249	#,
# pp.c:1354:                 TARGi(il * ir, 0); /* args not GMG, so can't be tainted */
	or	dh, 17	# tmp233,
	mov	QWORD PTR 16[r12], rsi	# iftmp.314_139->sv_u.svu_iv, TARGi_iv
	mov	DWORD PTR 12[r12], edx	# iftmp.314_139->sv_flags, tmp233
	jmp	.L1218	#

#
# [ ~300 lines later... ]
#

.L1249:
# pp.c:1475:                             TARGi(NEGATE_2IV(product_low), 1);
	mov	rdi, r12	#, iftmp.314_139
	call	Perl_sv_setiv_mg@PLT	#
# inline.h:881:     *--PL_stack_sp = sv;
	mov	rax, QWORD PTR PL_stack_sp[rip]	# prephitmp_125, PL_stack_sp
	jmp	.L1218	#

The instructions (and the number of branches) are exactly the same. However, the code layout is different: apparently LIKELY() was ignored in the "bitwise AND" version, so the unlikely/slow path (the call to Perl_sv_setiv_mg) was prioritized, and the fast path (the plain assignment) was moved far away from the test. The same pattern repeats in e.g. pp_subtract and pp_add.

If anything, this seems like a pessimisation, not an optimisation. And even if it were a real optimisation, such unusual code deserves a comment (and ideally some supporting measurements!).

@tonycoz (Contributor) commented Jan 21, 2025

I remember looking at the original change here and thinking it was ugly, but based on the optimization guidance from Intel (avoid branches!!!) it seemed like it would be a performance improvement.

But I've recently been working on a tool to measure performance, and with the artificial benchmarks from t/perf/benchmarks I've been seeing remarkably few branch misses (AMD EPYC 7713 64-Core Processor):

expr::arith::mult_lex_ni:

  • branch-misses: 25
  • branches: 26046
  • cache-misses: 57
  • cache-references: 131
  • context-switches: 0
  • cpu-migrations: 0
  • cycles: 49473
  • instructions: 148164
  • page-faults: 0
  • task-clock: 16481

(the values are (roughly*) medians from 10 runs of 100 iterations each)

The numbers come from Linux libperf which uses the CPU reported values.

ARM (Ampere here) is similar:

expr::arith::mult_lex_ni:

  • branch-misses: 27
  • branches: 24024
  • cache-misses: -3
  • cache-references: 98204
  • context-switches: 0
  • cpu-migrations: 0
  • cycles: 51661
  • instructions: 176559
  • page-faults: 0
  • task-clock: 30800

(negative cache-misses due to statistical variation, see the expandable section below for info on how this stuff is calculated)

Note: the x86_64 results are from a non-threaded build and the aarch64 from a threaded build.

* this code is very WIP

Raw measurements for expr::arith::mult_lex_ni

The code measures 100 iterations with the initialization and pre/post code in the loop, and 100 iterations with the pre/benchmark/post code in the loop.

e.g.

no warnings;
no strict;
sub {
  srand(0);
  my $__count = shift;
  my $__perf = Linux::libperf::Simple->new;
  my ($x,$y,$z) = (1, 2.2, 3.3);;
$__perf->enable;
  for my $__iter (1 .. $__count) {
;
  }
$__perf->disable;
return $__perf->results;
}

vs:

no warnings;
no strict;
sub {
  srand(0);
  my $__count = shift;
  my $__perf = Linux::libperf::Simple->new;
  my ($x,$y,$z) = (1, 2.2, 3.3);;
$__perf->enable;
  for my $__iter (1 .. $__count) {
;$z = $x * $y;
  }
$__perf->disable;
return $__perf->results;
}

The 0 case measures for branch-misses:

1804 1802 1804 2053 1796 1803 2014 1806 1805 1803

The run with the benchmark code for branch-misses:

1826 1828 2042 1824 1829 2077 1828 1829 2130 1824

So there's a fair amount of variability in these raw numbers, but it's minor in comparison to the number of branches which was:

zero case:

38260 38260 38260 39393 38260 38260 38260 38260 38260 38260

benchmark case:

64306 64306 65318 64306 64306 65439 64306 64306 65723 64306

@iabyn (Contributor) commented Jan 22, 2025 via email
