freeze heavily pessimizes SIMD code. #42316

chriselrod · 2021-09-20T00:02:59Z

This is autovectorized code.
The pessimization is because the SIMD vectors are scalarized (via a series of extractelements), the scalars frozen, and then reassembled (via insertelement). LLVM is not able to clean up/remove this round trip, resulting in a heavy performance penalty vs earlier versions that did not require freeze.

With LLVM12:

function branch_test_1!(x)
  @inbounds for i ∈ eachindex(x)
    xᵢ = x[i]
    Nd = round(xᵢ * inv(π))
    if (mod(Base.unsafe_trunc(Int32, Nd), Int32(2)) == one(Int32))
      xᵢ = -xᵢ
    end
    x[i] = xᵢ
  end
end

@code_llvm debuginfo=:none branch_test_1!(x)

Julia 1.5:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = add i64 1, %index
  %13 = add i64 %offset.idx, 0
  %14 = add nsw i64 %13, -1
  %15 = getelementptr inbounds double, double* %12, i64 %14
  %16 = getelementptr inbounds double, double* %15, i32 0
  %17 = bitcast double* %16 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %17, align 8
  %18 = fmul <8 x double> %wide.load, <double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883>
  %19 = call <8 x double> @llvm.rint.v8f64(<8 x double> %18)
  %20 = fptosi <8 x double> %19 to <8 x i32>
  %21 = sdiv <8 x i32> %20, <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2>
  %22 = shl nsw <8 x i32> %21, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %23 = icmp ne <8 x i32> %22, %20
  %24 = icmp slt <8 x i32> %20, zeroinitializer
  %25 = and <8 x i1> %24, %23
  %26 = zext <8 x i1> %25 to <8 x i32>
  %27 = sub nsw <8 x i32> %21, %26
  %28 = shl <8 x i32> %27, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %29 = sub <8 x i32> %20, %28
  %30 = icmp eq <8 x i32> %29, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %31 = fsub <8 x double> <double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00, double -0.000000e+00>, %wide.load
  %32 = select <8 x i1> %30, <8 x double> %31, <8 x double> %wide.load
  %33 = bitcast double* %16 to <8 x double>*
  store <8 x double> %32, <8 x double>* %33, align 8
  %index.next = add i64 %index, 8
  %34 = icmp eq i64 %index.next, %n.vec
  br i1 %34, label %middle.block, label %vector.body

With LLVM12:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %13 = getelementptr inbounds double, double addrspace(13)* %12, i64 %index, !dbg !40
  %14 = bitcast double addrspace(13)* %13 to <8 x double> addrspace(13)*, !dbg !40
  %wide.load = load <8 x double>, <8 x double> addrspace(13)* %14, align 8, !dbg !40, !tbaa !43
  %15 = fmul <8 x double> %wide.load, <double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883, double 0x3FD45F306DC9C883>, !dbg !46
  %16 = call <8 x double> @llvm.rint.v8f64(<8 x double> %15), !dbg !50
  %17 = fptosi <8 x double> %16 to <8 x i32>, !dbg !55
  %18 = extractelement <8 x i32> %17, i32 0, !dbg !55
  %19 = freeze i32 %18, !dbg !55
  %20 = extractelement <8 x i32> %17, i32 1, !dbg !55
  %21 = freeze i32 %20, !dbg !55
  %22 = extractelement <8 x i32> %17, i32 2, !dbg !55
  %23 = freeze i32 %22, !dbg !55
  %24 = extractelement <8 x i32> %17, i32 3, !dbg !55
  %25 = freeze i32 %24, !dbg !55
  %26 = extractelement <8 x i32> %17, i32 4, !dbg !55
  %27 = freeze i32 %26, !dbg !55
  %28 = extractelement <8 x i32> %17, i32 5, !dbg !55
  %29 = freeze i32 %28, !dbg !55
  %30 = extractelement <8 x i32> %17, i32 6, !dbg !55
  %31 = freeze i32 %30, !dbg !55
  %32 = extractelement <8 x i32> %17, i32 7, !dbg !55
  %33 = freeze i32 %32, !dbg !55
  %34 = insertelement <8 x i32> poison, i32 %19, i32 0
  %35 = insertelement <8 x i32> %34, i32 %21, i32 1
  %36 = insertelement <8 x i32> %35, i32 %23, i32 2
  %37 = insertelement <8 x i32> %36, i32 %25, i32 3
  %38 = insertelement <8 x i32> %37, i32 %27, i32 4
  %39 = insertelement <8 x i32> %38, i32 %29, i32 5
  %40 = insertelement <8 x i32> %39, i32 %31, i32 6
  %41 = insertelement <8 x i32> %40, i32 %33, i32 7
  %42 = sdiv <8 x i32> %41, <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2>
  %43 = shl nsw <8 x i32> %42, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>, !dbg !58
  %44 = icmp ne <8 x i32> %43, %41, !dbg !67
  %45 = icmp slt <8 x i32> %41, zeroinitializer, !dbg !72
  %46 = and <8 x i1> %45, %44, !dbg !72
  %47 = zext <8 x i1> %46 to <8 x i32>, !dbg !75
  %48 = sub nsw <8 x i32> %47, %42, !dbg !87
  %49 = shl <8 x i32> %48, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>, !dbg !89
  %50 = add <8 x i32> %49, %41, !dbg !90
  %51 = icmp eq <8 x i32> %50, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>, !dbg !91
  %52 = fneg <8 x double> %wide.load, !dbg !57
  %53 = select <8 x i1> %51, <8 x double> %52, <8 x double> %wide.load, !dbg !57
  store <8 x double> %53, <8 x double> addrspace(13)* %14, align 8, !dbg !92, !tbaa !43
  %index.next = add i64 %index, 8
  %54 = icmp eq i64 %index.next, %n.vec
  br i1 %54, label %middle.block, label %vector.body, !llvm.loop !95

Benchmarks:

julia> @btime branch_test_1!($x) # LLVM  9, Julia 1.5
  366.833 ns (0 allocations: 0 bytes)

julia> @btime branch_test_1!($x) # LLVM 12, Julia master
  1.166 μs (0 allocations: 0 bytes)

The assembly matches the LLVM: the vectors are decomposed and reassembled, without actually doing anything to the scalars.

The text was updated successfully, but these errors were encountered:

oscardssmith · 2021-09-20T00:32:33Z

Is this a julia bug or LLVM bug?

chriselrod · 2021-09-20T00:37:57Z

Is this a julia bug or LLVM bug?

I'm not sure, but would guess LLVM, as Julia is producing scalar code, which LLVM then vectorizes -- except for the "noop" freeze instruction.

But it could also be a pass ordering/missing pass type problem. I'm too ignorant of LLVM to really say much here.
I find it unlikely that C or C++ have this problem.

If folks want, I could file an issue on the LLVM bugzilla.

vchuravy · 2021-09-20T02:45:05Z

We insert the freeze instructions. But yes maybe we need to file an LLVM bug about the impact of freeze on the autovectorizer.

chriselrod · 2021-09-20T04:32:24Z

We insert the freeze instructions. But yes maybe we need to file an LLVM bug about the impact of freeze on the autovectorizer.

Does Clang simply not freeze care about the possibility of poison?
https://godbolt.org/z/51r61qzPc
What's the motivation for inserting freeze in Julia?

oscardssmith · 2021-09-20T05:15:30Z

@chriselrod the reason to insert freeze here is that it makes program behavior more predictable #42031 currently has a good example. Without it, unsafe code is less useful because the optimizer is able to totally break your code.

chriselrod · 2021-09-20T06:03:00Z

@oscardssmith On the one hand, I'll point out that C get's away with it. On the other hand, there are plenty of complaints about undefined behavior in C.
freeze turns poison into undef, so it's also not immediately clear to me how that helps.
I think the compiler will produce poison in the IR given constants, but not otherwise. Although (to repeat) I don't know LLVM.

The problem seems to have been fixed with LLVM 13.
Using LLVM 13 directly to vectorize the loop using opt:
https://godbolt.org/z/nW88KjMoc
Note the vectorized freeze instruction:

  %.fr = freeze <4 x i32> %18

We can confirm that the above LLVM produces good code by using llc (I had to delete the %2 argument for it to compile):
https://godbolt.org/z/r9qqv8aGj

This is all vectorized (note that the xmm registers are because <4 x i32> are 128 bits; these are still SIMD).

So running just the default opt -O3 passes gives us SIMD freeze that is then removed in codegen, but Julia's optimization passes result in a scalarized freeze where the scalarization persists into codegen.

I got the original "unoptimized" IR by starting Julia with -O1. This prevented the initial IR from being vectorized at all, but still cleaned it up significantly compared to the unoptimized IR.

Switching to LLVM 12 shows the same codegen we currently get in Julia.
So seems like we'll have this problem in Julia 1.7 (unless we backport the commit that fixed this), but hopefully Julia 1.8 can use LLVM 13 once it is released.

KristofferC · 2021-09-20T10:42:33Z

freeze turns poison into undef, so it's also not immediately clear to me how that helps.

I think freeze is stronger. undef can give an arbitrary value at every access. So x + x is semantically different from 2x when x is undef. After a freeze` the value is arbitrary but is the same for all accesses.

Does Clang simply not freeze care about the possibility of poison?

6.3.1.4 Real floating and integer

1 When a finite value of real ﬂoating type is converted to an integer type other than _Bool the fractional part is discarded (i.e. the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type the behavior is undefined.

chriselrod · 2021-09-23T16:05:03Z

undef can give an arbitrary value at every access. So x + x is semantically different from 2x when x is undef. After a freeze` the value is arbitrary but is the same for all accesses.

Ah, that makes sense. The former allows optimizing away the program, the latter does not.

chriselrod changed the title ~~freezing poison prevents vectorization~~ freeze heavily pessimizes SIMD code. Sep 20, 2021

brenhinkeller added the compiler:simd instruction-level vectorization label Nov 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

freeze heavily pessimizes SIMD code. #42316

freeze heavily pessimizes SIMD code. #42316

chriselrod commented Sep 20, 2021 •

edited

Loading

oscardssmith commented Sep 20, 2021

chriselrod commented Sep 20, 2021 •

edited

Loading

vchuravy commented Sep 20, 2021

chriselrod commented Sep 20, 2021

oscardssmith commented Sep 20, 2021

chriselrod commented Sep 20, 2021 •

edited

Loading

KristofferC commented Sep 20, 2021 •

edited

Loading

chriselrod commented Sep 23, 2021

freeze heavily pessimizes SIMD code. #42316

freeze heavily pessimizes SIMD code. #42316

Comments

chriselrod commented Sep 20, 2021 • edited Loading

oscardssmith commented Sep 20, 2021

chriselrod commented Sep 20, 2021 • edited Loading

vchuravy commented Sep 20, 2021

chriselrod commented Sep 20, 2021

oscardssmith commented Sep 20, 2021

chriselrod commented Sep 20, 2021 • edited Loading

KristofferC commented Sep 20, 2021 • edited Loading

chriselrod commented Sep 23, 2021

chriselrod commented Sep 20, 2021 •

edited

Loading

chriselrod commented Sep 20, 2021 •

edited

Loading

chriselrod commented Sep 20, 2021 •

edited

Loading

KristofferC commented Sep 20, 2021 •

edited

Loading