-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Create local gc frame for throw
#11508
Conversation
As for the sysimg size, this PR increases it by 176 bytes compare to current master after stripping. Probably a combination of exceptions with arguments rarely be used and that gc frame does not need a lot of instructions. |
Merging GC frames should be working now. As shown below, the initialization of the sub GC frame elements are not removed yet (first three lines in the I'm pretty sure the debugging info for the GC root of the subframes will be broken after the merge (or maybe llvm does some magic to preserve the debugging info when a instruction is replaced?) @Keno So far the logic are all quite simple. But before I go much further, what's the plan for using llvm's gc support feature? Is that going to deprecate small optimizations like this? @carnaval julia> @noinline g(a::ANY, b::ANY, c::ANY) = BoundsError(a, b)
g (generic function with 1 method)
julia> function f(a, b, c)
if (a == 0)
throw(g(a, b, c))
end
a
end
f (generic function with 1 method)
julia> function k(a, b, c)
if (a == 0)
throw(g(a, b, c))
end
g(a, a, a)
end
k (generic function with 1 method)
julia> @code_llvm f(1, 2, 3)
define i64 @julia_f_20959(i64, i64, i64) {
top:
%3 = icmp eq i64 %0, 0
br i1 %3, label %if, label %L
if: ; preds = %top
%4 = alloca [5 x %jl_value_t*], align 8
%.sub = getelementptr inbounds [5 x %jl_value_t*]* %4, i64 0, i64 0
%5 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 2
%6 = bitcast [5 x %jl_value_t*]* %4 to i64*
store i64 6, i64* %6, align 8
%7 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 1
%8 = bitcast %jl_value_t** %7 to %jl_value_t***
%9 = load %jl_value_t*** @jl_pgcstack, align 8
store %jl_value_t** %9, %jl_value_t*** %8, align 8
store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
store %jl_value_t* null, %jl_value_t** %5, align 8
%10 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 3
store %jl_value_t* null, %jl_value_t** %10, align 8
%11 = getelementptr [5 x %jl_value_t*]* %4, i64 0, i64 4
store %jl_value_t* null, %jl_value_t** %11, align 8
%12 = call %jl_value_t* @jl_box_int64(i64 signext 0)
store %jl_value_t* %12, %jl_value_t** %5, align 8
%13 = call %jl_value_t* @jl_box_int64(i64 signext %1)
store %jl_value_t* %13, %jl_value_t** %10, align 8
%14 = call %jl_value_t* @jl_box_int64(i64 signext %2)
store %jl_value_t* %14, %jl_value_t** %11, align 8
%15 = call %jl_value_t* @julia_g_20960(%jl_value_t* inttoptr (i64 140728708276624 to %jl_value_t*), %jl_value_t** %5, i32 3)
call void @jl_throw_with_superfluous_argument(%jl_value_t* %15, i32 3)
%16 = load %jl_value_t*** %8, align 8
store %jl_value_t** %16, %jl_value_t*** @jl_pgcstack, align 8
br label %L
L: ; preds = %if, %top
ret i64 %0
}
julia> @code_llvm k(1, 2, 3)
define %jl_value_t* @julia_k_20967(i64, i64, i64) {
top:
%3 = alloca [5 x %jl_value_t*], align 8
%.sub = getelementptr inbounds [5 x %jl_value_t*]* %3, i64 0, i64 0
%4 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 2
%5 = bitcast [5 x %jl_value_t*]* %3 to i64*
store i64 6, i64* %5, align 8
%6 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 1
%7 = bitcast %jl_value_t** %6 to %jl_value_t***
%8 = load %jl_value_t*** @jl_pgcstack, align 8
store %jl_value_t** %8, %jl_value_t*** %7, align 8
store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
store %jl_value_t* null, %jl_value_t** %4, align 8
%9 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 3
store %jl_value_t* null, %jl_value_t** %9, align 8
%10 = getelementptr [5 x %jl_value_t*]* %3, i64 0, i64 4
store %jl_value_t* null, %jl_value_t** %10, align 8
%11 = icmp eq i64 %0, 0
br i1 %11, label %if, label %L
if: ; preds = %top
store %jl_value_t* null, %jl_value_t** %4, align 8
store %jl_value_t* null, %jl_value_t** %9, align 8
store %jl_value_t* null, %jl_value_t** %10, align 8
%12 = call %jl_value_t* @jl_box_int64(i64 signext 0)
store %jl_value_t* %12, %jl_value_t** %4, align 8
%13 = call %jl_value_t* @jl_box_int64(i64 signext %1)
store %jl_value_t* %13, %jl_value_t** %9, align 8
%14 = call %jl_value_t* @jl_box_int64(i64 signext %2)
store %jl_value_t* %14, %jl_value_t** %10, align 8
%15 = call %jl_value_t* @julia_g_20960(%jl_value_t* inttoptr (i64 140728708276624 to %jl_value_t*), %jl_value_t** %4, i32 3)
call void @jl_throw_with_superfluous_argument(%jl_value_t* %15, i32 3)
br label %L
L: ; preds = %if, %top
%16 = call %jl_value_t* @jl_box_int64(i64 signext %0)
store %jl_value_t* %16, %jl_value_t** %4, align 8
%17 = call %jl_value_t* @jl_box_int64(i64 signext %0)
store %jl_value_t* %17, %jl_value_t** %9, align 8
%18 = call %jl_value_t* @jl_box_int64(i64 signext %0)
store %jl_value_t* %18, %jl_value_t** %10, align 8
%19 = call %jl_value_t* @julia_g_20960(%jl_value_t* inttoptr (i64 140728708276624 to %jl_value_t*), %jl_value_t** %4, i32 3)
%20 = load %jl_value_t*** %7, align 8
store %jl_value_t** %20, %jl_value_t*** @jl_pgcstack, align 8
ret %jl_value_t* %19
} |
The logic used here should be the same with adjusting/restoring Example: Can the following happen? V # somehow not a local variable and not rooted
L:
# Some code that uses `V` and we restore `argDepth` after codgen this part.
V = .... # Assigning to V that need boxing and we root V here since argDepth < maxDepth
# the slot we assign can be override by the line above. cause a missing GC root.
goto L OK I know in this case |
The unneeded This change actually causes a
|
So about your question, no, only local variables (including gensyms) survive control flow edges (in fact it is stronger than that : it is the only way for a value to survive statement boundary). I like the fact that your implementation is "minimal disturbance" but as you said it's not very general since it only handles what happens inside the argument list of throw. It would eventually go away if we switch to the statepoint based infrastructure but we are nowhere near that. I'm on the fence about merging this since it clutters the code a bit (but not much) and we do not desperately need it either. I don't think it would be very hard to make this construct happen on the minimal dominator of the throw call, just a bit more bookkeeping. It would make it more resilient to code transforms such as putting a throw argument in a local. |
Good. That makes this kind of optimization much easier....
By code cluttering do you mean sth like these 1 2 3. There are some cleanup commits (including the ones above). I didn't squash all commits this time because I think it might be better to split out those commits in another PR. Should I do that to make what this PR is doing more clear?
I'm experimenting with that. |
Actually I've tried to create a local gc frame for all function with |
@carnaval The clean up part of this PR is split out to #11569 (which is still included in this PR) and only the last 3 commits in this PR is about setting up local GC frames. See the links below for the real diff. (The commit should also be relatively clear on what they are doing, I hope.) yuyichao/julia@gcframe-cleanup...yuyichao:throw-local-gc |
This PR is becoming not limited to codegen of the local GC frame anymore. In the last two commits, I've added two additional optimizations.
Benchmark follow soon. |
It turns out that my old benchmark where mainly benchmarking the jlcall wrapper and other overhead like that. I've updated my code to do c call directly. #!/usr/bin/julia -f
macro throw_interp(typ, arg1, arg2)
:(throw(typ(string("Random format string, ", $arg1, ", ", $arg2))))
end
@noinline throw1(typ::ANY, arg1, arg2) = throw(typ(arg1, arg2))
@noinline throw_interp(typ::Any, arg1, arg2) =
@throw_interp(typ, arg1, arg2)
function f0(a, b)
if b === 0
a
end
a
end
function f1(a, b)
if b === 0
throw1(BoundsError, a, b)
end
a
end
function f2(a, b)
if b === 0
throw(BoundsError(a, b))
end
a
end
function f3(a, b)
if b === 0
throw_interp(ArgumentError, a, b)
end
a
end
function f4(a, b)
if b === 0
@throw_interp(ArgumentError, a, b)
end
a
end
macro time_func(f, args...)
args = eval(current_module(), Expr(:tuple, args...))::Tuple
argnames = Symbol[gensym() for i in 1:length(args)]
types = map(typeof, args)
quote
function wrapper($(argnames...))
$(Expr(:meta, :noinline))
$f($(argnames...))
end
function timing_wrapper()
println($f, $types)
wrapper($(args...))
gc()
@time for i in 1:1000000000
wrapper($(args...))
end
gc()
end
timing_wrapper()
end
end
@code_llvm f0(0, 1)
@code_llvm f1(0, 1)
@code_llvm f2(0, 1)
@code_llvm f3(0, 1)
@code_llvm f4(0, 1)
v = Base.svec(1, 2)
@code_llvm getindex(v, 1)
@time_func(f0, 0, 1)
@time_func(f1, 0, 1)
@time_func(f2, 0, 1)
@time_func(f3, 0, 1)
@time_func(f4, 0, 1)
@time_func(getindex, v, 1) This now covers the majority (if not all) of the usecase in Benchmark result in next comment. Edit: Script update |
The benchmarks are done on
Result (Only numbers are pasted here, for full output including the LLVM IR, see links to the corresponding GIST):
This PR might not have a big impact on the performance in current |
It turns out that I was still benchmarking the wrapper (althought not dynamic dispatch) for the last function. Benchmark script and result updated. |
👍 💯 This would have saved me a LOT of grief, if it had already been in Julia... please get this merged ASAP 😀 I love the thorough micro benchmarking! |
The commit that turns off julia level inlining for "throw expression" doesn't have a very big impact on the string interpolation case but brings the number for simple exception constructor case down to the noinline-thrower case. (Edit: and was comparable to the string interpolation case (~10% slower) without noinline) The drawbacks of this change are:
|
👍 I'm not qualified to review this code, but I really like the outcomes here. It will be very nice to not need to worry about |
Interestingly this time AppVeyor didn't freeze ..... |
…alid instructions in finalize_gc_frame", which seems to cause windows CI hanging." This reverts commit 7b1895a.
Update for 190afa6 |
this allows the lazy emission & addition of static gc roots plus, this eliminates the need to delete the gc frame (since llvm can trivially do so) and makes emit_gcpop implicit at all ret instructions, rather than explicit
Close in favor of @vtjnash 's jn/codegen_rewrite2 branch. The approach there is more general and should also fit better with other changes we need in codegen. There are so many changed on master so this need to be rewriten anyway... |
This PR inplements the idea proposed by @carnaval in #11284 (comment) to create a local gc frame for throwing errors.
TL;DR
I think this idea can be slightly generalized and combine with #11284 to get the best performance for (not) throwing errors.
Benchmark:
The script used to benchmark can be found here.
Timing output on current master (Full output with llvm ir)
Timing output with this PR (Full output with llvm ir)
(Relaxant number is the timing for
f2
andgetindex
)As shown in the benchmark, creating a local gc frame (if there isn't a global one yet) can boost the performance for complex functions but a #11284 style transformation can still be useful especally for simple functions.
Currently this PR only has a special case for throw, which is easy because the control flow terminates here and no rooted variables can be leaked. It is probably possible to generalize this to other blocks that doesn't have too complicated control flow (or at least when calling functions that does not return). In that case, it can probably be used together with #11284 for the best performance.