
Hoist object allocation before inner field initialization #45093

Closed · wants to merge 1 commit

Conversation

Keno (Member) commented Apr 26, 2022

Consider the following pattern for building up nested objects:

```
%obj = Expr(:new, obj, ...)
%obj_wrapper = Expr(:new, obj_wrapper, ..., %obj)
%obj_wrapper2 = Expr(:new, obj_wrapper2, ..., %obj_wrapper)
%outer = Expr(:new, outer, %obj_wrapper2)
```

Assuming everything except `struct outer` is struct-inlineable,
the LLVM IR we emit looks something like the following:

```
%obj = alloca
%obj_wrapper = alloca
%obj_wrapper2 = alloca
%outer = alloca

init(%obj, <obj>)
init(%obj_wrapper, <obj_wrapper>); memcpy(%obj_wrapper, %obj)
init(%obj_wrapper2, <obj_wrapper2>); memcpy(%obj_wrapper2, %obj_wrapper)
init(%outer, <outer>); memcpy(%outer, %obj_wrapper2)

%outer_boxed = julia.gc_alloc
memcpy(%outer_boxed, %outer)
```

While LLVM is capable of removing all the allocas and memcpys, it's
taking an unreasonable amount of time to do so.

This PR introduces a small optimization into the frontend lowering
for `:new`: if all the `:new` calls are in the same LLVM basic block,
we delete the allocas and hoist the allocation of the object
to the earliest point before the initialization of the fields.

This gives essentially the same result as LLVM would have given us
post-optimization, but is much cheaper to do because we don't have
to perform any analysis to tell us that it is a legal optimization.

In the above example, we would end up with something like:

```
%outer_boxed = julia.gc_alloc
init(%outer_boxed, <obj>)
init(%outer_boxed, <obj_wrapper>)
init(%outer_boxed, <obj_wrapper2>)
init(%outer_boxed, <outer>)
```

Of course this does extend the lifetime of the outer object, but I
don't think that's a particular problem as long as we're careful
not to hoist any boxings out of error paths. In the current
implementation, I only allow this optimization within the
same LLVM basic block, but I think it should be fine to extend
it to the same Julia basic block or, more generally, to any
allocation that post-dominates the relevant promotion points.

Timings

On the benchmark from #44998, this does quite well, essentially fixing the issue,
modulo the separate problem where SLPVectorizer spends a significant chunk of time
on this function without actually doing anything. The timings are as follows
(ignore the memory allocations; those depend on whether inference had already
run when I benchmarked, which I wasn't careful about because inference
only takes ~2 seconds):

master:

```
julia> @time code_llvm(devnull, torture)
365.975286 seconds (12.67 M allocations: 1.344 GiB, 0.38% gc time, 96.59% compilation time)
```

master - SLPVectorizer:

```
julia> @time code_llvm(devnull, torture)
106.107186 seconds (5.62 M allocations: 613.950 MiB, 0.55% gc time, 94.83% compilation time)
```

This PR:

```
julia> @time code_llvm(devnull, torture)
134.521880 seconds (12.24 M allocations: 644.399 MiB, 0.38% gc time, 99.18% compilation time)
```

This PR - SLPVectorizer:

```
julia> @time code_llvm(devnull, torture)
6.975649 seconds (12.24 M allocations: 644.399 MiB, 7.49% gc time, 83.60% compilation time)
```

@oscardssmith oscardssmith added performance Must go faster compiler:latency Compiler latency labels Apr 26, 2022
@JeffBezanson JeffBezanson added compiler:codegen Generation of LLVM IR and native code and removed performance Must go faster labels Apr 26, 2022
Keno (Member, Author) commented Apr 26, 2022

@gbaraldi will look at the CI regressions here.

gbaraldi (Member) commented Apr 27, 2022

The abstractarrays failure minimized to:

```julia
function foo()
    for sz in ((5, 3), (7, 11))
        for idxs in ((1:sz[1], 1:sz[2]), (1:sz[1], 2:2:sz[2]),)
        end
    end
end
foo()
```

Here we generate bad IR: %51 uses %55, and %55 uses %61, both before those values are defined:

```llvm
  %51 = getelementptr inbounds { [2 x i64], [3 x i64] }, { [2 x i64], [3 x i64] }* %55, i32 0, i32 1
  %52 = getelementptr inbounds [3 x i64], [3 x i64]* %51, i32 0, i32 0
  store i64 2, i64* %52, align 8
  %53 = getelementptr inbounds [3 x i64], [3 x i64]* %51, i32 0, i32 1
  store i64 2, i64* %53, align 8
  %54 = getelementptr inbounds [3 x i64], [3 x i64]* %51, i32 0, i32 2
  store i64 %50, i64* %54, align 8
; └└└
  %55 = getelementptr inbounds { [2 x [2 x i64]], { [2 x i64], [3 x i64] } }, { [2 x [2 x i64]], { [2 x i64], [3 x i64] } }* %61, i32 0, i32 1
  %56 = getelementptr inbounds { [2 x i64], [3 x i64] }, { [2 x i64], [3 x i64] }* %55, i32 0, i32 0
  %57 = bitcast [2 x i64]* %56 to i8*
  %58 = bitcast [2 x i64]* %5 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %57, i8* %58, i64 16, i1 false)
  %59 = bitcast {}*** %8 to {}**
  %current_task6 = getelementptr inbounds {}*, {}** %59, i64 -12
  %60 = call noalias nonnull {}* @julia.gc_alloc_obj({}** %current_task6, i64 72, {}* inttoptr (i64 140644222720112 to {}*)) #5
  %61 = bitcast {}* %60 to { [2 x [2 x i64]], { [2 x i64], [3 x i64] } }*
```

The iterators error is very similar.
