
Hoist object allocation before inner field initialization #45093

Closed · wants to merge 1 commit

Conversation

Keno (Member) commented Apr 26, 2022

Consider the following pattern for building up nested objects:

```
%obj = Expr(:new, obj, ...)
%obj_wrapper = Expr(:new, obj_wrapper, ..., %obj)
%obj_wrapper2 = Expr(:new, obj_wrapper2, ..., %obj_wrapper)
%outer = Expr(:new, outer, %obj_wrapper2)
```

Assuming everything except `struct outer` is struct-inlineable,
the LLVM IR we emit looks something like the following:

```
%obj = alloca
%obj_wrapper = alloca
%obj_wrapper2 = alloca
%outer = alloca

init(%obj, <obj>)
init(%obj_wrapper, <obj_wrapper>); memcpy(%obj_wrapper, %obj)
init(%obj_wrapper2, <obj_wrapper2>); memcpy(%obj_wrapper2, %obj_wrapper)
init(%outer, <outer>); memcpy(%outer, %obj_wrapper2)

%outer_boxed = julia.gc_alloc
memcpy(%outer_boxed, %outer)
```

While LLVM is capable of removing all the allocas and memcpys, it's
taking an unreasonable amount of time to do so.

This PR introduces a small optimization into the frontend lowering
for `:new`: if all the `:new` calls are in the same LLVM basic block,
we delete the allocas and hoist the allocation of the object
to the earliest point before the initialization of the fields.

This gives essentially the same result as LLVM would have given us
post-optimization, but is much cheaper to do because we don't have
to perform any analysis to tell us that it is a legal optimization.

In the above example, we would end up with something like:

```
%outer_boxed = julia.gc_alloc
init(%outer_boxed, <obj>)
init(%outer_boxed, <obj_wrapper>)
init(%outer_boxed, <obj_wrapper2>)
init(%outer_boxed, <outer>)
```

Of course this does extend the lifetime of the outer object, but I
don't think that's a particular problem as long as we're careful
not to hoist any boxings out of error paths. In the current
implementation, I only allow this optimization within the
same LLVM basic block, but I think it should be fine to extend
it to the same Julia basic block or, more generally, to any
allocation that post-dominates the relevant promotion points.

Timings

On the benchmark from #44998, this does quite well, essentially fixing the issue,
modulo the separate problem where SLPVectorizer spends a significant chunk of time
on this function without actually doing anything. The timings are as follows
(ignore the memory allocations; those depend on whether inference had already
run when I benchmarked, which I wasn't careful about because inference
only takes ~2 seconds):

master:

```
julia> @time code_llvm(devnull, torture)
365.975286 seconds (12.67 M allocations: 1.344 GiB, 0.38% gc time, 96.59% compilation time)
```

master - SLPVectorizer:

```
julia> @time code_llvm(devnull, torture)
106.107186 seconds (5.62 M allocations: 613.950 MiB, 0.55% gc time, 94.83% compilation time)
```

This PR:

```
julia> @time code_llvm(devnull, torture)
134.521880 seconds (12.24 M allocations: 644.399 MiB, 0.38% gc time, 99.18% compilation time)
```

This PR - SLPVectorizer:

```
julia> @time code_llvm(devnull, torture)
6.975649 seconds (12.24 M allocations: 644.399 MiB, 7.49% gc time, 83.60% compilation time)
```

@oscardssmith oscardssmith added performance Must go faster compiler:latency Compiler latency labels Apr 26, 2022
@JeffBezanson JeffBezanson added compiler:codegen Generation of LLVM IR and native code and removed performance Must go faster labels Apr 26, 2022
Keno (Member, Author) commented Apr 26, 2022

@gbaraldi will look at the CI regressions here.

gbaraldi (Member) commented Apr 27, 2022

The abstractarrays failure minimized to:

```julia
function foo()
    for sz in ((5, 3), (7, 11))
        for idxs in ((1:sz[1], 1:sz[2]), (1:sz[1], 2:2:sz[2]),)
        end
    end
end
foo()
```

Here we generate bad IR: %51 uses %55, and %55 uses %61, both before those values are defined:

```llvm
  %51 = getelementptr inbounds { [2 x i64], [3 x i64] }, { [2 x i64], [3 x i64] }* %55, i32 0, i32 1
  %52 = getelementptr inbounds [3 x i64], [3 x i64]* %51, i32 0, i32 0
  store i64 2, i64* %52, align 8
  %53 = getelementptr inbounds [3 x i64], [3 x i64]* %51, i32 0, i32 1
  store i64 2, i64* %53, align 8
  %54 = getelementptr inbounds [3 x i64], [3 x i64]* %51, i32 0, i32 2
  store i64 %50, i64* %54, align 8
; └└└
  %55 = getelementptr inbounds { [2 x [2 x i64]], { [2 x i64], [3 x i64] } }, { [2 x [2 x i64]], { [2 x i64], [3 x i64] } }* %61, i32 0, i32 1
  %56 = getelementptr inbounds { [2 x i64], [3 x i64] }, { [2 x i64], [3 x i64] }* %55, i32 0, i32 0
  %57 = bitcast [2 x i64]* %56 to i8*
  %58 = bitcast [2 x i64]* %5 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %57, i8* %58, i64 16, i1 false)
  %59 = bitcast {}*** %8 to {}**
  %current_task6 = getelementptr inbounds {}*, {}** %59, i64 -12
  %60 = call noalias nonnull {}* @julia.gc_alloc_obj({}** %current_task6, i64 72, {}* inttoptr (i64 140644222720112 to {}*)) #5
  %61 = bitcast {}* %60 to { [2 x [2 x i64]], { [2 x i64], [3 x i64] } }*
```

The iterators error is very similar.
