Unprofitable ctpop vectorization #57476

nikic · 2022-08-31T13:01:12Z

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i64 @test(ptr %arr) {
entry:
  br label %loop

loop:
  %accum = phi i64 [ %accum.next, %loop ], [ 0, %entry ]
  %iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
  %iv.next = add nuw i64 %iv, 1
  %gep = getelementptr inbounds i64, ptr %arr, i64 %iv
  %value = load i64, ptr %gep, align 8
  %ctpop = tail call i64 @llvm.ctpop.i64(i64 %value)
  %accum.next = add i64 %accum, %ctpop
  %exitcond = icmp eq i64 %iv.next, 2
  br i1 %exitcond, label %exit, label %loop

exit:
  %lcssa = phi i64 [ %accum.next, %loop ]
  ret i64 %lcssa
}

declare i64 @llvm.ctpop.i64(i64)

This two-iteration loop gets vectorized by opt -loop-vectorize -mcpu=znver2 (https://llvm.godbolt.org/z/M8qTdTbfE), because we assign cost 1 to scalar ctpop and cost 3 to the vector ctpop, so it's nominally "profitable". At least for low iteration count, this is not actually the case.

The text was updated successfully, but these errors were encountered:

RKSimon · 2022-08-31T13:30:15Z

Vectorization only cares about throughput costs.

The v2i64 vector ctpop rthroughput cost is pretty accurate for the pshufb/psadbw expansion.

The i64 scalar cost is far too high - znver2 can technically run 4 popcnt ops at a time!

nikic · 2022-09-29T15:26:43Z

The i64 scalar cost is far too high - znver2 can technically run 4 popcnt ops at a time!

And because we treat rthroughput as an integer, we can't represent that, right?

RKSimon · 2022-09-29T16:07:59Z

yes - I've wondered about whether we could include the VF in the cost functions to return the cost of VF * op? But we'd still need to override znver costs even then as most Intel arch only has ctpop on Pipe1 (so I'd expect the vectorization to be a gain on most Intel targets).

We're not ready to use scheduler model specific costs in most cases yet - I began researching this back on D46276 and hit so many concerns (often due to the quality of the scheduler models) that I ended up creating the fuzz script for D103695 instead, which has ended up being a massive triage investigation into trying to get cost tables and scheduler models to sort of agree.

If its really critical, we could add a hack such as 'TuningFastPOPCNT' but I really don't like that idea.

nikic added the vectorizers label Aug 31, 2022

nikic mentioned this issue Aug 31, 2022

Codegen weirdness for sum of count_ones over an array rust-lang/rust#101060

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unprofitable ctpop vectorization #57476

Unprofitable ctpop vectorization #57476

nikic commented Aug 31, 2022 •

edited

Loading

RKSimon commented Aug 31, 2022

nikic commented Sep 29, 2022

RKSimon commented Sep 29, 2022

Unprofitable ctpop vectorization #57476

Unprofitable ctpop vectorization #57476

Comments

nikic commented Aug 31, 2022 • edited Loading

RKSimon commented Aug 31, 2022

nikic commented Sep 29, 2022

RKSimon commented Sep 29, 2022

nikic commented Aug 31, 2022 •

edited

Loading