Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

freeze heavily pessimizes SIMD code. #42316

Open
chriselrod opened this issue Sep 20, 2021 · 8 comments
Open

freeze heavily pessimizes SIMD code. #42316

chriselrod opened this issue Sep 20, 2021 · 8 comments
Labels
compiler:simd instruction-level vectorization

Comments

@chriselrod
Copy link
Contributor

chriselrod commented Sep 20, 2021

This is autovectorized code.
The pessimization is because the SIMD vectors are scalarized (via a series of extractelements), the scalars frozen, and then reassembled (via insertelement). LLVM is not able to clean up/remove this round trip, resulting in a heavy performance penalty vs earlier versions that did not require freeze.

With LLVM12:

function branch_test_1!(x)
  @inbounds for i  eachindex(x)
    xᵢ = x[i]
    Nd = round(xᵢ * inv(π))
    if (mod(Base.unsafe_trunc(Int32, Nd), Int32(2)) == one(Int32))
      xᵢ = -xᵢ
    end
    x[i] = xᵢ
  end
end

@code_llvm debuginfo=:none branch_test_1!(x)

Julia 1.5:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = add i64 1, %index
  %13 = add i64 %offset.idx, 0
  %14 = add nsw i64 %13, -1
  %15 = getelementptr inbounds double, double* %12, i64 %14
  %16 = getelementptr inbounds double, double* %15, i32 0
  %17 = bitcast double* %16 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %17, align 8
  %18 = fmul <8 x double> %wide.load, <double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883>
  %19 = call <8 x double> @llvm.rint.v8f64(<8 x double> %18)
  %20 = fptosi <8 x double> %19 to <8 x i32>
  %21 = sdiv <8 x i32> %20, <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2>
  %22 = shl nsw <8 x i32> %21, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %23 = icmp ne <8 x i32> %22, %20
  %24 = icmp slt <8 x i32> %20, zeroinitializer
  %25 = and <8 x i1> %24, %23
  %26 = zext <8 x i1> %25 to <8 x i32>
  %27 = sub nsw <8 x i32> %21, %26
  %28 = shl <8 x i32> %27, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %29 = sub <8 x i32> %20, %28
  %30 = icmp eq <8 x i32> %29, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %31 = fsub <8 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %wide.load
  %32 = select <8 x i1> %30, <8 x double> %31, <8 x double> %wide.load
  %33 = bitcast double* %16 to <8 x double>*
  store <8 x double> %32, <8 x double>* %33, align 8
  %index.next = add i64 %index, 8
  %34 = icmp eq i64 %index.next, %n.vec
  br i1 %34, label %middle.block, label %vector.body

With LLVM12:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %13 = getelementptr inbounds double, double addrspace(13)* %12, i64 %index, !dbg !40
  %14 = bitcast double addrspace(13)* %13 to <8 x double> addrspace(13)*, !dbg !40
  %wide.load = load <8 x double>, <8 x double> addrspace(13)* %14, align 8, !dbg !40, !tbaa !43
  %15 = fmul <8 x double> %wide.load, <double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883>, !dbg !46
  %16 = call <8 x double> @llvm.rint.v8f64(<8 x double> %15), !dbg !50
  %17 = fptosi <8 x double> %16 to <8 x i32>, !dbg !55
  %18 = extractelement <8 x i32> %17, i32 0, !dbg !55
  %19 = freeze i32 %18, !dbg !55
  %20 = extractelement <8 x i32> %17, i32 1, !dbg !55
  %21 = freeze i32 %20, !dbg !55
  %22 = extractelement <8 x i32> %17, i32 2, !dbg !55
  %23 = freeze i32 %22, !dbg !55
  %24 = extractelement <8 x i32> %17, i32 3, !dbg !55
  %25 = freeze i32 %24, !dbg !55
  %26 = extractelement <8 x i32> %17, i32 4, !dbg !55
  %27 = freeze i32 %26, !dbg !55
  %28 = extractelement <8 x i32> %17, i32 5, !dbg !55
  %29 = freeze i32 %28, !dbg !55
  %30 = extractelement <8 x i32> %17, i32 6, !dbg !55
  %31 = freeze i32 %30, !dbg !55
  %32 = extractelement <8 x i32> %17, i32 7, !dbg !55
  %33 = freeze i32 %32, !dbg !55
  %34 = insertelement <8 x i32> poison, i32 %19, i32 0
  %35 = insertelement <8 x i32> %34, i32 %21, i32 1
  %36 = insertelement <8 x i32> %35, i32 %23, i32 2
  %37 = insertelement <8 x i32> %36, i32 %25, i32 3
  %38 = insertelement <8 x i32> %37, i32 %27, i32 4
  %39 = insertelement <8 x i32> %38, i32 %29, i32 5
  %40 = insertelement <8 x i32> %39, i32 %31, i32 6
  %41 = insertelement <8 x i32> %40, i32 %33, i32 7
  %42 = sdiv <8 x i32> %41, <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2>
  %43 = shl nsw <8 x i32> %42, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>, !dbg !58
  %44 = icmp ne <8 x i32> %43, %41, !dbg !67
  %45 = icmp slt <8 x i32> %41, zeroinitializer, !dbg !72
  %46 = and <8 x i1> %45, %44, !dbg !72
  %47 = zext <8 x i1> %46 to <8 x i32>, !dbg !75
  %48 = sub nsw <8 x i32> %47, %42, !dbg !87
  %49 = shl <8 x i32> %48, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>, !dbg !89
  %50 = add <8 x i32> %49, %41, !dbg !90
  %51 = icmp eq <8 x i32> %50, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>, !dbg !91
  %52 = fneg <8 x double> %wide.load, !dbg !57
  %53 = select <8 x i1> %51, <8 x double> %52, <8 x double> %wide.load, !dbg !57
  store <8 x double> %53, <8 x double> addrspace(13)* %14, align 8, !dbg !92, !tbaa !43
  %index.next = add i64 %index, 8
  %54 = icmp eq i64 %index.next, %n.vec
  br i1 %54, label %middle.block, label %vector.body, !llvm.loop !95

Benchmarks:

julia> @btime branch_test_1!($x) # LLVM  9, Julia 1.5
  366.833 ns (0 allocations: 0 bytes)

julia> @btime branch_test_1!($x) # LLVM 12, Julia master
  1.166 μs (0 allocations: 0 bytes)

The assembly matches the LLVM: the vectors are decomposed and reassembled, without actually doing anything to the scalars.

@chriselrod chriselrod changed the title freezing poison prevents vectorization freeze heavily pessimizes SIMD code. Sep 20, 2021
@oscardssmith
Copy link
Member

Is this a julia bug or LLVM bug?

@chriselrod
Copy link
Contributor Author

chriselrod commented Sep 20, 2021

Is this a julia bug or LLVM bug?

I'm not sure, but would guess LLVM, as Julia is producing scalar code, which LLVM then vectorizes -- except for the "noop" freeze instruction.

But it could also be a pass ordering/missing pass type problem. I'm too ignorant of LLVM to really say much here.
I find it unlikely that C or C++ have this problem.

If folks want, I could file an issue on the LLVM bugzilla.

@vchuravy
Copy link
Member

We insert the freeze instructions. But yes maybe we need to file an LLVM bug about the impact of freeze on the autovectorizer.

@chriselrod
Copy link
Contributor Author

We insert the freeze instructions. But yes maybe we need to file an LLVM bug about the impact of freeze on the autovectorizer.

Does Clang simply not freeze care about the possibility of poison?
https://godbolt.org/z/51r61qzPc
What's the motivation for inserting freeze in Julia?

@oscardssmith
Copy link
Member

@chriselrod the reason to insert freeze here is that it makes program behavior more predictable #42031 currently has a good example. Without it, unsafe code is less useful because the optimizer is able to totally break your code.

@chriselrod
Copy link
Contributor Author

chriselrod commented Sep 20, 2021

@oscardssmith On the one hand, I'll point out that C get's away with it. On the other hand, there are plenty of complaints about undefined behavior in C.
freeze turns poison into undef, so it's also not immediately clear to me how that helps.
I think the compiler will produce poison in the IR given constants, but not otherwise. Although (to repeat) I don't know LLVM.

The problem seems to have been fixed with LLVM 13.
Using LLVM 13 directly to vectorize the loop using opt:
https://godbolt.org/z/nW88KjMoc
Note the vectorized freeze instruction:

  %.fr = freeze <4 x i32> %18

We can confirm that the above LLVM produces good code by using llc (I had to delete the %2 argument for it to compile):
https://godbolt.org/z/r9qqv8aGj

This is all vectorized (note that the xmm registers are because <4 x i32> are 128 bits; these are still SIMD).

So running just the default opt -O3 passes gives us SIMD freeze that is then removed in codegen, but Julia's optimization passes result in a scalarized freeze where the scalarization persists into codegen.

I got the original "unoptimized" IR by starting Julia with -O1. This prevented the initial IR from being vectorized at all, but still cleaned it up significantly compared to the unoptimized IR.

Switching to LLVM 12 shows the same codegen we currently get in Julia.
So seems like we'll have this problem in Julia 1.7 (unless we backport the commit that fixed this), but hopefully Julia 1.8 can use LLVM 13 once it is released.

@KristofferC
Copy link
Member

KristofferC commented Sep 20, 2021

freeze turns poison into undef, so it's also not immediately clear to me how that helps.

I think freeze is stronger. undef can give an arbitrary value at every access. So x + x is semantically different from 2x when x is undef. After a freeze` the value is arbitrary but is the same for all accesses.

Does Clang simply not freeze care about the possibility of poison?

6.3.1.4 Real floating and integer

1 When a finite value of real floating type is converted to an integer type other than _Bool the fractional part is discarded (i.e. the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type the behavior is undefined.

@chriselrod
Copy link
Contributor Author

undef can give an arbitrary value at every access. So x + x is semantically different from 2x when x is undef. After a freeze` the value is arbitrary but is the same for all accesses.

Ah, that makes sense. The former allows optimizing away the program, the latter does not.

@brenhinkeller brenhinkeller added the compiler:simd instruction-level vectorization label Nov 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:simd instruction-level vectorization
Projects
None yet
Development

No branches or pull requests

5 participants