
WIP: on-stack gc allocation #8134

Closed
vtjnash wants to merge 9 commits into master from jn/gc_stack

Conversation

@vtjnash (Member) commented on Aug 26, 2014

This is an improvement to the GC that pre-computes the GC escape analysis and avoids malloc entirely when it detects that a value cannot leave the stack frame. In its current form this appears to have no impact on the perf tests (as expected), since it doesn't trace values across function calls; that can easily be improved, however, by adding a gc-analysis pass to inference.jl. Where this is most helpful is in allocating boxes for ccall very cheaply, allowing the & special syntax to be eliminated without a performance penalty:

Some initial definitions to be merged into Base:

julia> type Ref{T}
        x::T
       end

julia> Base.getindex(x::Ref) = x.x
getindex (generic function with 171 methods)

julia> Base.convert{T}(::Type{Ptr{T}}, b::Ref{T}) = ccall(:jl_value_ptr,Ptr{T},(Ptr{Ref},),&b)
convert (generic function with 452 methods)

And an example:

julia> function call_fortran()
    x = Ref{Int}(0)
    ccall(:jl_, Void, (Ptr{Int}, Ptr{Int},), x, Ref{Int}(1))
    return x[]
end
call_fortran (generic function with 1 method)

julia> code_llvm(call_fortran,())

; Function Attrs: sspreq
define i64 @"julia_call_fortran;40340"() #2 {
top:
  %0 = alloca [5 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [5 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 2, !dbg !1136
  store %jl_value_t* inttoptr (i64 6 to %jl_value_t*), %jl_value_t** %.sub, align 8
  %2 = load %jl_value_t*** @jl_pgcstack, align 8, !dbg !1136
  %3 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 1, !dbg !1136
  %.c = bitcast %jl_value_t** %2 to %jl_value_t*, !dbg !1136
  store %jl_value_t* %.c, %jl_value_t** %3, align 8, !dbg !1136
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8, !dbg !1136
  store %jl_value_t* null, %jl_value_t** %1, align 8, !dbg !1136
  %4 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 3, !dbg !1136
  store %jl_value_t* null, %jl_value_t** %4, align 8, !dbg !1136
  %5 = load i8** @jl_alloca_stack, align 8, !dbg !1136
  %6 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %7 = alloca [3 x i8*], align 8, !dbg !1137
  %.sub1 = getelementptr inbounds [3 x i8*]* %7, i64 0, i64 0
  %8 = load i8** @jl_alloca_stack, align 8, !dbg !1137
  store i8* %8, i8** %.sub1, align 8, !dbg !1137
  %9 = bitcast [3 x i8*]* %7 to i8*, !dbg !1137
  store i8* %9, i8** @jl_alloca_stack, align 8, !dbg !1137
  %10 = getelementptr [3 x i8*]* %7, i64 0, i64 1, !dbg !1137
  %11 = bitcast i8** %10 to %jl_value_t*, !dbg !1137, !isAlloca !1139
  store i8* inttoptr (i64 140638926508672 to i8*), i8** %10, align 8, !dbg !1137
  store %jl_value_t* %11, %jl_value_t** %6, align 8, !dbg !1137
  %12 = getelementptr [3 x i8*]* %7, i64 0, i64 2, !dbg !1137
  %13 = load i64* inttoptr (i64 140638888978504 to i64*), align 8, !dbg !1137
  %.c2 = inttoptr i64 %13 to i8*, !dbg !1137
  store i8* %.c2, i8** %12, align 8, !dbg !1137, !tbaa %jtbaa_user
  store %jl_value_t* %11, %jl_value_t** %1, align 8, !dbg !1137
  %14 = alloca [3 x i8*], align 8, !dbg !1142
  %.sub3 = getelementptr inbounds [3 x i8*]* %14, i64 0, i64 0
  %15 = load i8** @jl_alloca_stack, align 8, !dbg !1142
  store i8* %15, i8** %.sub3, align 8, !dbg !1142
  %16 = bitcast [3 x i8*]* %14 to i8*, !dbg !1142
  store i8* %16, i8** @jl_alloca_stack, align 8, !dbg !1142
  %17 = getelementptr [3 x i8*]* %14, i64 0, i64 1, !dbg !1142
  %18 = bitcast i8** %17 to %jl_value_t*, !dbg !1142, !isAlloca !1139
  store i8* inttoptr (i64 140638926508672 to i8*), i8** %17, align 8, !dbg !1142
  store %jl_value_t* %18, %jl_value_t** %6, align 8, !dbg !1142
  %19 = getelementptr [3 x i8*]* %14, i64 0, i64 2, !dbg !1142
  %20 = load i64* inttoptr (i64 140638888978536 to i64*), align 8, !dbg !1142
  %.c4 = inttoptr i64 %20 to i8*, !dbg !1142
  store i8* %.c4, i8** %19, align 8, !dbg !1142, !tbaa %jtbaa_user
  store %jl_value_t* %18, %jl_value_t** %4, align 8, !dbg !1142
  %21 = bitcast i8** %12 to i64*, !dbg !1142
  %22 = bitcast i8** %19 to i64*, !dbg !1142
  call void inttoptr (i64 4318594960 to void (i64*, i64*)*)(i64* %21, i64* %22), !dbg !1142
  %23 = load i64* %21, align 8, !dbg !1143, !tbaa %jtbaa_user
  %24 = load %jl_value_t** %3, align 8, !dbg !1143
  %25 = getelementptr inbounds %jl_value_t* %24, i64 0, i32 0, !dbg !1143
  store %jl_value_t** %25, %jl_value_t*** @jl_pgcstack, align 8, !dbg !1143
  store i8* %5, i8** @jl_alloca_stack, align 8, !dbg !1143
  ret i64 %23, !dbg !1143
}

No allocobj!

TODO items:

  • remove the alloca stack frame when it isn't used
  • frob the stack less
  • future work: implement a more complete gc analysis pass in Julia (store the info in Expr objects)
  • correctly handle x = new(T, getfield(x,2), getfield(x,1)) (see the sketch below)
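
For reference, the last item refers to lowered code of roughly the following shape. This is a hedged sketch (the type and function below are hypothetical, not part of the PR): the new object is constructed from fields of the old binding of x, so allocating it in place over the old object could clobber a field before it is read.

type Pair_
    a::Int
    b::Int
end

function swap_loop(x::Pair_, n)
    for i = 1:n
        # After inlining the default constructor this is roughly
        #   x = new(Pair_, getfield(x, 2), getfield(x, 1))
        # -- both fields of the old x are read while its replacement is built
        x = Pair_(x.b, x.a)
    end
    return x
end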

@timholy (Member) commented on Aug 26, 2014

👏. Very exciting to see big steps in this direction!

@vtjnash force-pushed the jn/gc_stack branch 2 times, most recently from 079423d to cebf733 on August 28, 2014 at 02:21
@JeffBezanson (Member) commented:

This is very good progress. But I would greatly prefer to do this in a way that does not require the jl_alloca_stack global variable. Needing to update global state and modify the GC code nearly (though of course not entirely) defeats the purpose of stack allocation. It would be better, at first, to only stack allocate objects that don't contain heap references.
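
A minimal illustration of the distinction being drawn here (both types below are hypothetical, not from this PR): a value whose fields are all plain bits carries no heap references, so once it lives on the stack the GC never needs to see it, whereas a value holding a heap reference must still be visible to the collector as a root even if the cell itself is on the stack.

type Flat            # only isbits fields: no heap references, so a
    x::Int           # stack-allocated Flat never needs GC scanning
    y::Float64
end

type Boxed           # holds a reference to heap memory: even if the
    v::Vector{Int}   # Boxed cell is on the stack, `v` must remain
end                  # reachable as a GC root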

@vtjnash (Member, Author) commented on Aug 31, 2014

That's not relevant to why I needed jl_alloca_stack; I needed it so that I can unmark the stack-allocated objects after the GC pass finishes. I think it can go away (in most cases?) with the new incremental GC.

@JeffBezanson (Member) commented:

The best form of stack allocation removes objects from the GC's reach entirely.

@vtjnash (Member, Author) commented on Aug 31, 2014

That seems like it would require a better escape analysis pass than what is currently implemented. My thought is that this information can be part of every Expr (like .typ) and thus computed in inference.jl, and that all levels of stack allocation can be used together. However, with this PR my intent was to add just enough to make the example above more efficient in most cases, so that the & operator can be eliminated without penalty (gaining cheap ccall out-args in the bargain).

Also, my hope is that even though the GC can still see these objects, this saves a lot of effort on managing the allocation and free lists, especially where you are making a lot of small objects in a loop and then quickly discarding them (the GC also doesn't count these allocations in its totals).
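
(As a rough illustration of that last point, assuming the Ref type and getindex method from the description above: a loop that creates and immediately discards many small objects is exactly the pattern where skipping the allocator and free-list bookkeeping should pay off.)

function sum_refs(n)
    s = 0
    for i = 1:n
        r = Ref{Int}(i)   # r never escapes the loop body, so under this
        s += r[]          # PR it can live on the stack instead of going
    end                   # through the GC allocator
    return s
end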

The main benefit of the incremental GC that is relevant here is the ability to have additional flag bits, so that reflection can detect whether it has a stack- or heap-allocated value.

@JeffBezanson (Member) commented:

jl_alloca_stack is just not ok. If something is stack allocated the GC doesn't need to touch it at all; that's the whole point. If this were just a bit sub-optimal it might be ok, but adding an extra piece of global runtime state is really bad.

@vtjnash (Member, Author) commented on Sep 2, 2014

You make it sound like this global variable, which helps us allocate memory more efficiently and faster, would make everything slower because in certain cases it could be even faster, and that having the general case would make implementing the special case harder rather than easier. That just doesn't make sense to me.

@JeffBezanson (Member) commented:

Could you explain more about why stack-allocated objects in this scheme get marked by the GC, requiring them to be unmarked eventually? After all, we do technically stack allocate lots of objects already without this issue.

Maybe it would help to have the GC identify stack-allocated objects by address range, and not mark them?
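
A sketch of that address-range idea, assuming the collector knows the current task's stack bounds (the real check would live in the C GC code; the names below are hypothetical):

# During marking, skip any value whose address falls inside the current
# task's stack; everything else is treated as a normal heap object.
is_on_task_stack(p::Ptr{Void}, stack_lo::Ptr{Void}, stack_hi::Ptr{Void}) =
    stack_lo <= p < stack_hi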

@vtjnash (Member, Author) commented on Sep 10, 2014

@JeffBezanson is this more in line with what you were wanting? (I know that many more optimizations are possible; the question is whether this represents a sufficient improvement to merge.)

@vtjnash changed the title from "WIP: gc on-stack" to "on-stack gc allocation" on Sep 10, 2014
@vtjnash changed the title from "on-stack gc allocation" to "WIP: on-stack gc allocation" on Sep 12, 2014
@jiahao force-pushed the master branch 3 times, most recently from 6c7c7e3 to 1a4c02f on October 11, 2014 at 22:06
@simonster mentioned this pull request on Oct 23, 2014
@carnaval (Contributor) commented:

@vtjnash forgot about this. About the unmarking: do we still have to scan through those stack-allocated objects if they contain pointers to non-stack memory? Or do you only stack-allocate things where no internal pointer escaped either?

Anyway, if we don't want to unmark them we could keep them marked all the time, as long as they are not the sole root of some other objects. If they are, it's a bit more tricky; we could have the write barrier correct this, but it requires a few modifications.

@vtjnash (Member, Author) commented on Jun 12, 2015

Probably both, or whatever you want to suggest.

@JeffBezanson (Member) commented:

I still think doing a conservative stack scan is likely a better approach to this, since it simplifies everything greatly (except in the scanning code itself, where the complexity belongs).

@StefanKarpinski (Member) commented:

+1 to conservative stack scanning. It seems to me that the chances of a random pointer-like value on the stack pointing exactly at an object large enough to care about freeing are pretty slim. @carnaval, didn't you have the conservative stack scanning approach largely implemented? Could we maybe try it out?

@yuyichao (Contributor) commented:

Does this have any special handling of finalizers, or does it treat finalizer registration as a leak of the object pointer?

carnaval added commits that referenced this pull request on Jul 18, 2015 and Mar 2, 2016, with the message:

Only works with copy_stacks for now. The code is not used anywhere in
codegen right now, it can be tested by using (:stknew typ args...)
instead of :new. It is essentially the GC side of #8134 but without the
global alloca stack & unmarking.

Stack objects are kept marked (+young) permanently and are only scanned
when coming from the owning task.
@vtjnash closed this on Aug 18, 2017
@DilumAluthge deleted the jn/gc_stack branch on January 12, 2021 at 21:26