[WIP] Stack allocate some genericmemory #52382
base: master
Conversation
Example 2 (after some pipeline changes that may go behind O3)

julia> function xinc(x)
return [x, x+1, x+2]
end;
julia> function loopinc()
y = 0
for i = 1:10^7
ret = xinc(i)
y += ret[2]
end
return y
end;
julia> @code_llvm loopinc()

Before:
; Function Signature: loopinc()
; @ REPL[2]:1 within `loopinc`
define i64 @julia_loopinc_855() #0 {
top:
br label %L44
L44: ; preds = %L44, %top
%value_phi = phi i64 [ 1, %top ], [ %0, %L44 ]
; @ REPL[2]:4 within `loopinc`
; ┌ @ REPL[1]:2 within `xinc`
; │┌ @ int.jl:87 within `+`
%0 = add nuw nsw i64 %value_phi, 1
%1 = add nuw nsw i64 %value_phi, 2
; │└
; │┌ @ array.jl:161 within `vect`
; ││┌ @ boot.jl:573 within `Array`
; │││┌ @ boot.jl:513 within `GenericMemory`
%"Memory{Int64}[]" = call {}* @jl_alloc_genericmemory({}* @"+Core.GenericMemory#859.jit", i64 3)
; │││└
; │││ @ boot.jl:574 within `Array`
%2 = bitcast {}* %"Memory{Int64}[]" to { i64, {}** }*
%.data_ptr = getelementptr inbounds { i64, {}** }, { i64, {}** }* %2, i64 0, i32 1
%3 = load {}**, {}*** %.data_ptr, align 8
; ││└
; ││ @ array.jl:163 within `vect`
; ││┌ @ array.jl:983 within `__safe_setindex!`
%4 = bitcast {}** %3 to i64*
store i64 %value_phi, i64* %4, align 8
%"new::Tuple.sroa.2.0..sroa_idx4144" = getelementptr inbounds {}*, {}** %3, i64 1
%5 = bitcast {}** %"new::Tuple.sroa.2.0..sroa_idx4144" to i64*
store i64 %0, i64* %5, align 8
%"new::Tuple.sroa.3.0..sroa_idx4245" = getelementptr inbounds {}*, {}** %3, i64 2
%6 = bitcast {}** %"new::Tuple.sroa.3.0..sroa_idx4245" to i64*
store i64 %1, i64* %6, align 8
; └└└
; @ REPL[2]:6 within `loopinc`
; ┌ @ range.jl:902 within `iterate`
; │┌ @ promotion.jl:620 within `==`
%.not = icmp eq i64 %value_phi, 10000000
; └└
br i1 %.not, label %L60, label %L44
L60: ; preds = %L44
; @ REPL[2]:7 within `loopinc`
ret i64 50000015000000
}

After:
; Function Signature: loopinc()
; @ REPL[2]:1 within `loopinc`
define i64 @julia_loopinc_692() #0 {
top:
; @ REPL[2]:7 within `loopinc`
ret i64 50000015000000
}
Exciting work!
IRBuilder<> builder(&*F.getEntryBlock().getFirstInsertionPt());
StringRef origName = orig->getName();
auto T_size = pass.DL->getIntPtrType(builder.getContext());
auto data = builder.CreateAlloca(Type::getInt8Ty(builder.getContext()), ConstantInt::get(T_size, maxStackAlloc));
Can we use a dynamically sized alloca here instead of maxStackAlloc? Not sure if it makes a difference.
So there are a few reasons I don't want a dynamic alloca:
- SROA won't look at any alloca that's outside of the entry block. To be fair, it's unlikely that SROA can get rid of the array anyway, but it's still an opportunity I'd like to leave open.
- We can't just stick the dynamically sized alloca in the entry block, since we don't know the chain of computations leading to the length of the array and whether or not that can be entirely hoisted.
- On a related note, alloca isn't freed until the end of the function. If you don't put your allocas in the entry block, you are very liable to put one in a loop, which will almost certainly cause a stack overflow in a nigh undebuggable manner.
- The above related note is actually why gcc and clang will not inline functions that have dynamically sized allocas, or perhaps even any explicit allocas at all (I forget the details).
Is LLVM able to reduce the size of the alloca if it sees we don't use the whole thing? Not sure if that's an optimization that may happen.
My concern is paying the cost of the big alloca for something that may be only 3 or 4 elements long. Though I guess if it's dynamic then there isn't much we can do.
The fallback size is probably up for debate (I chose 4096 to make it consistent with sized allocations, and I wanted most sized allocations to be stack allocated), but yeah, there's not much else we can do about it. Incidentally, this basically turns every genericmemory into a SmallVector, so maybe we can take some size hints from that? I would at least like an 8x8 array of 8-byte elements to be stack allocated, though, so my personal minimum is 512 bytes for unsized stack allocations.
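For reference, that minimum works out as follows (a quick REPL check, nothing specific to this PR):

julia> sizeof(zeros(8, 8))  # 8x8 Float64 array holds 8*8*8 bytes of data
512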
I think the size we're willing to do this for should vary depending on whether we know the size. When we know the size, doing this for moderately sized arrays makes sense, but fragmenting the stack has a cost, so IMO we shouldn't be as willing to alloca 4k for a list that might have 2 elements.
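A minimal sketch of that kind of policy (the function name and limits here are illustrative assumptions, not the pass's actual code):

# Illustrative only: pick a stack-allocation budget depending on whether the
# number of bytes needed is known at compile time.
function should_stack_allocate(size_known::Bool, bytes_needed::Integer)
    known_size_limit   = 4096  # generous when the exact size is known
    unknown_size_limit = 512   # conservative fallback for unsized allocations
    return bytes_needed <= (size_known ? known_size_limit : unknown_size_limit)
end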
@vtjnash I'm doing a few experiments with partial escape analysis, to stack allocate even when we haven't eliminated the bounds error, and I'm running into the issue of the data pointer being untrackable in address space 0 (so I can't replace it with a sunken allocation's data pointer in just the error blocks). What was the reason we made that pointer be in address space 0, and is there any way for me to track the pointer?
What are you trying to track about it? Unless there is a load from it (which will cast it to addrspace 13 first), the only valid uses of it are typically as a sort of token.
So my example is here: https://godbolt.org/z/ffnxhe5TG. Essentially, the important part is that the genericmemory allocation has its data pointer loaded in %10 as an address space 0 pointer. The problem right now is that I'm doing partial escape analysis, so my transformation essentially needs to turn any stack pointers in the error block into heap pointers, and I can't do that without full knowledge of what is and isn't a pointer that's going to be on the stack. That being said, for now I think I can get around the problem by rerunning escape analysis on every load of the data pointer and bailing out if it ever becomes untrackable outside of an error block, but that's kind of expensive and conservative.
So with the latest changes I've added partial escape analysis, to be able to stack allocate arrays even when bounds errors exist. That, plus the addition of an extra instcombine and loopfullunroll pass (vs. the standard pipeline), has optimized the examples I've tried to ~optimal as far as I can tell, so I don't think I have much left to do on the bits-array allocation front. I also don't personally plan to extend this to pointer arrays (there's no reason it can't be done; I just don't know enough about late-gc-lowering/garbage collection to do it myself). In terms of results, I saw ~10 arrays affected by escape analysis while building the system image (and another ~10 during Pkg precompilation), which isn't super great. I would like to see if it has an impact on more numerical code, if anyone wants to try that out; a quick way to check is sketched below.
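One quick way to check a candidate function (the helper names are illustrative, not from the PR; whether the allocation actually disappears depends on this branch):

# Compare the allocation count of a function whose temporary array should now
# be stack allocated; second_of_xinc is just an example wrapper.
xinc(x) = [x, x + 1, x + 2]
second_of_xinc(x) = xinc(x)[2]

second_of_xinc(1)             # compile first
@allocated second_of_xinc(1)  # should report 0 if the temporary array was elided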
@chriselrod you have some examples to test here, right?
Yeah, it would be good to compare against the task-local storage version here: https://spmd.org/posts/multithreadedallocations/
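For context, that post's approach boils down to a reusable per-task workspace, roughly like this sketch (names are illustrative, not the post's exact code):

# Reuse one buffer per task instead of allocating inside the hot loop; this is
# the hand-written pattern that stack allocation of genericmemory could replace.
function task_workspace(n)
    buf = get!(task_local_storage(), :workspace) do
        Vector{Float64}(undef, n)
    end::Vector{Float64}
    length(buf) < n && resize!(buf, n)
    return buf
end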
I just tested this and saw no difference from Julia's master branch, except that while building Julia I saw the occasional printing of the sized/unsized allocation messages (e.g. "allocation was unsized").
Mind pinging me when I should expect to see a difference?

Master:
julia> GC.gc(); @time do_multithreaded_work!(gccexpm!, Bs, As, testrange);
0.860287 seconds (8.41 k allocations: 4.532 MiB)
julia> GC.gc(); @time do_multithreaded_work!(gccexpm!, Bs, As, testrange);
0.854966 seconds (8.41 k allocations: 4.532 MiB)
julia> GC.gc(); @time do_multithreaded_work!(clangexpm!, Bs, As, testrange);
0.776054 seconds (8.41 k allocations: 4.532 MiB)
julia> GC.gc(); @time do_multithreaded_work!(clangexpm!, Bs, As, testrange);
0.761762 seconds (8.41 k allocations: 4.532 MiB)
julia> GC.gc(); @time do_multithreaded_work!(expm_tls!, Bs, As, testrange);
1.223061 seconds (21.29 M allocations: 728.865 MiB, 4.70% gc time)
julia> GC.gc(); @time do_multithreaded_work!(expm_tls!, Bs, As, testrange);
1.255723 seconds (21.29 M allocations: 728.853 MiB, 4.51% gc time)
julia> GC.gc(); @time do_multithreaded_work!(expm_custommul!, Bs, As, testrange);
4.376751 seconds (84.66 M allocations: 36.832 GiB, 63.57% gc time)
julia> GC.gc(); @time do_multithreaded_work!(expm_custommul!, Bs, As, testrange);
4.446788 seconds (84.66 M allocations: 36.832 GiB, 64.02% gc time)

PR:
julia> GC.gc(); @time do_multithreaded_work!(gccexpm!, Bs, As, testrange);
0.892432 seconds (8.41 k allocations: 4.532 MiB)
julia> GC.gc(); @time do_multithreaded_work!(gccexpm!, Bs, As, testrange);
0.854607 seconds (8.41 k allocations: 4.532 MiB)
julia> GC.gc(); @time do_multithreaded_work!(clangexpm!, Bs, As, testrange);
0.766006 seconds (8.41 k allocations: 4.532 MiB)
julia> GC.gc(); @time do_multithreaded_work!(clangexpm!, Bs, As, testrange);
0.762248 seconds (8.41 k allocations: 4.532 MiB)
julia> GC.gc(); @time do_multithreaded_work!(expm_tls!, Bs, As, testrange);
1.248361 seconds (21.29 M allocations: 728.869 MiB, 5.36% gc time)
julia> GC.gc(); @time do_multithreaded_work!(expm_tls!, Bs, As, testrange);
1.250094 seconds (21.29 M allocations: 728.877 MiB, 5.42% gc time)
julia> GC.gc(); @time do_multithreaded_work!(expm_custommul!, Bs, As, testrange);
4.388915 seconds (84.71 M allocations: 36.950 GiB, 63.50% gc time)
julia> GC.gc(); @time do_multithreaded_work!(expm_custommul!, Bs, As, testrange);
4.421541 seconds (84.71 M allocations: 36.950 GiB, 63.47% gc time)
julia> rmap(sizeof, Bs)
(((32, 72, 128, 200), (64, 144, 256, 400), (96, 216, 384, 600), (128, 288, 512, 800), (160, 360, 640, 1000), (192, 432, 768, 1200), (224, 504, 896, 1400), (256, 576, 1024, 1600), (288, 648, 1152, 1800)), ((32, 72, 128, 200), (128, 288, 512, 800), (192, 432, 768, 1200), (256, 576, 1024, 1600), (320, 720, 1280, 2000), (384, 864, 1536, 2400), (448, 1008, 1792, 2800), (512, 1152, 2048, 3200), (576, 1296, 2304, 3600)), ((32, 72, 128, 200), (192, 432, 768, 1200), (288, 648, 1152, 1800), (384, 864, 1536, 2400), (480, 1080, 1920, 3000), (576, 1296, 2304, 3600), (672, 1512, 2688, 4200), (768, 1728, 3072, 4800), (864, 1944, 3456, 5400)))

This was done using the SIMD.jl ForwardDiff PR: JuliaDiff/ForwardDiff.jl#570

Also, on this PR, I get

julia> using StaticArrays
allocation was unsized
[ Info: Precompiling StaticArrays [90137ffa-7385-5640-81b9-e52037218182]
[166847] signal 11 (1): Segmentation fault
in expression starting at none:0
Allocations: 9910646 (Pool: 9910136; Big: 510); GC: 9
ERROR: Failed to precompile StaticArrays [90137ffa-7385-5640-81b9-e52037218182] to "/home/chriselrod/.julia/compiled/v1.11/StaticArrays/jl_u4booz".
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
@ Base ./loading.jl:2601
[3] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
@ Base ./loading.jl:2473 [inlined]
[4] (::Base.var"#1034#1035"{Base.PkgId})()
@ Base ./loading.jl:2075
[5] mkpidlock(f::Base.var"#1034#1035"{Base.PkgId}, at::String, pid::Int32; kwopts::@Kwargs{stale_age::Int64, wait::Bool})
@ FileWatching.Pidfile ~/Documents/languages/juliadev/usr/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:95
[6] #mkpidlock#6
@ FileWatching.Pidfile ~/Documents/languages/juliadev/usr/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:90 [inlined]
[7] #trymkpidlock#11
@ ~/Documents/languages/juliadev/usr/share/julia/stdlib/v1.11/FileWatching/src/pidfile.jl:116
[8] #invokelatest#2
@ Base ./essentials.jl:957 [inlined]
[9] invokelatest
@ Base ./essentials.jl:952 [inlined]
[10] maybe_cachefile_lock(f::Base.var"#1034#1035"{Base.PkgId}, pkg::Base.PkgId, srcpath::String; stale_age::Int64)
@ Base ./loading.jl:3181
[11] maybe_cachefile_lock
@ Base ./loading.jl:3178 [inlined]
[12] _require(pkg::Base.PkgId, env::String)
@ Base ./loading.jl:2071
[13] __require_prelocked(uuidkey::Base.PkgId, env::String)
@ Base ./loading.jl:1916
[14] #invoke_in_world#3
@ Base ./essentials.jl:989 [inlined]
[15] invoke_in_world
@ Base ./essentials.jl:986 [inlined]
[16] _require_prelocked(uuidkey::Base.PkgId, env::String)
@ Base ./loading.jl:1907
[17] macro expansion
@ Base ./loading.jl:1845 [inlined]
[18] macro expansion
@ Base ./lock.jl:267 [inlined]
[19] __require(into::Module, mod::Symbol)
@ Base ./loading.jl:1806
[20] #invoke_in_world#3
@ Base ./essentials.jl:989 [inlined]
[21] invoke_in_world
@ Base ./essentials.jl:986 [inlined]
[22] require(into::Module, mod::Symbol)
@ Base ./loading.jl:1799

Which is odd, because ForwardDiff depends on StaticArrays.
So those sized/unsized messages are arrays being stack allocated. It's possible that the arrays being stack allocated are not the ones that contribute most to GC time (which is unfortunate). I can probably look into why those are escaping sometime next weekend.
Just trying this out.
Examples:
Example 1
julia> function my_cumsum(arr)
           length(arr) < 1 && return zero(eltype(arr))
           temp = zeros(eltype(arr), length(arr))
           @inbounds temp[1] = arr[1]
           for i in 2:length(temp)
               @inbounds temp[i] = temp[i-1] + arr[i]
           end
           @inbounds temp[length(arr)]
       end
my_cumsum (generic function with 1 method)

julia> @code_llvm my_cumsum([0, 1, 2, 3])
Before:
After: