Skip to content

Unprofitable ctpop vectorization #57476

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
nikic opened this issue Aug 31, 2022 · 3 comments
Open

Unprofitable ctpop vectorization #57476

nikic opened this issue Aug 31, 2022 · 3 comments

Comments

@nikic
Copy link
Contributor

nikic commented Aug 31, 2022

Example from rust-lang/rust#101060:

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i64 @test(ptr %arr) {
entry:
  br label %loop

loop:
  %accum = phi i64 [ %accum.next, %loop ], [ 0, %entry ]
  %iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
  %iv.next = add nuw i64 %iv, 1
  %gep = getelementptr inbounds i64, ptr %arr, i64 %iv
  %value = load i64, ptr %gep, align 8
  %ctpop = tail call i64 @llvm.ctpop.i64(i64 %value)
  %accum.next = add i64 %accum, %ctpop
  %exitcond = icmp eq i64 %iv.next, 2
  br i1 %exitcond, label %exit, label %loop

exit:
  %lcssa = phi i64 [ %accum.next, %loop ]
  ret i64 %lcssa
}

declare i64 @llvm.ctpop.i64(i64)

This two-iteration loop gets vectorized by opt -loop-vectorize -mcpu=znver2 (https://llvm.godbolt.org/z/M8qTdTbfE), because we assign cost 1 to scalar ctpop and cost 3 to the vector ctpop, so it's nominally "profitable". At least for low iteration count, this is not actually the case.

@RKSimon
Copy link
Collaborator

RKSimon commented Aug 31, 2022

Vectorization only cares about throughput costs.

The v2i64 vector ctpop rthroughput cost is pretty accurate for the pshufb/psadbw expansion.

The i64 scalar cost is far too high - znver2 can technically run 4 popcnt ops at a time!

@nikic
Copy link
Contributor Author

nikic commented Sep 29, 2022

The i64 scalar cost is far too high - znver2 can technically run 4 popcnt ops at a time!

And because we treat rthroughput as an integer, we can't represent that, right?

@RKSimon
Copy link
Collaborator

RKSimon commented Sep 29, 2022

yes - I've wondered about whether we could include the VF in the cost functions to return the cost of VF * op? But we'd still need to override znver costs even then as most Intel arch only has ctpop on Pipe1 (so I'd expect the vectorization to be a gain on most Intel targets).

We're not ready to use scheduler model specific costs in most cases yet - I began researching this back on D46276 and hit so many concerns (often due to the quality of the scheduler models) that I ended up creating the fuzz script for D103695 instead, which has ended up being a massive triage investigation into trying to get cost tables and scheduler models to sort of agree.

If its really critical, we could add a hack such as 'TuningFastPOPCNT' but I really don't like that idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants