
WIP: on-stack gc allocation #8134

Closed
vtjnash wants to merge 9 commits into master from jn/gc_stack

Conversation

@vtjnash (Member) commented on Aug 26, 2014

This is an improvement to the GC that pre-computes the GC escape analysis and avoids malloc entirely when it detects that a value cannot leave the stack frame. In its current form this appears to have no impact on the perf tests (as expected), since it doesn't trace values across function calls; that can easily be improved, however, by adding a gc-analysis pass to inference.jl. Where this is most helpful is in allocating boxes for ccall very cheaply, allowing the & special syntax to be eliminated without a performance penalty:

Some initial definitions to be merged into Base:

julia> type Ref{T}
        x::T
       end

julia> Base.getindex(x::Ref) = x.x
getindex (generic function with 171 methods)

julia> Base.convert{T}(::Type{Ptr{T}}, b::Ref{T}) = ccall(:jl_value_ptr,Ptr{T},(Ptr{Ref},),&b)
convert (generic function with 452 methods)

And an example:

julia> function call_fortran()
    x = Ref{Int}(0)
    ccall(:jl_, Void, (Ptr{Int}, Ptr{Int},), x, Ref{Int}(1))
    return x[]
end
call_fortran (generic function with 1 method)

julia> code_llvm(call_fortran,())

; Function Attrs: sspreq
define i64 @"julia_call_fortran;40340"() #2 {
top:
  %0 = alloca [5 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [5 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 2, !dbg !1136
  store %jl_value_t* inttoptr (i64 6 to %jl_value_t*), %jl_value_t** %.sub, align 8
  %2 = load %jl_value_t*** @jl_pgcstack, align 8, !dbg !1136
  %3 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 1, !dbg !1136
  %.c = bitcast %jl_value_t** %2 to %jl_value_t*, !dbg !1136
  store %jl_value_t* %.c, %jl_value_t** %3, align 8, !dbg !1136
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8, !dbg !1136
  store %jl_value_t* null, %jl_value_t** %1, align 8, !dbg !1136
  %4 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 3, !dbg !1136
  store %jl_value_t* null, %jl_value_t** %4, align 8, !dbg !1136
  %5 = load i8** @jl_alloca_stack, align 8, !dbg !1136
  %6 = getelementptr [5 x %jl_value_t*]* %0, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %7 = alloca [3 x i8*], align 8, !dbg !1137
  %.sub1 = getelementptr inbounds [3 x i8*]* %7, i64 0, i64 0
  %8 = load i8** @jl_alloca_stack, align 8, !dbg !1137
  store i8* %8, i8** %.sub1, align 8, !dbg !1137
  %9 = bitcast [3 x i8*]* %7 to i8*, !dbg !1137
  store i8* %9, i8** @jl_alloca_stack, align 8, !dbg !1137
  %10 = getelementptr [3 x i8*]* %7, i64 0, i64 1, !dbg !1137
  %11 = bitcast i8** %10 to %jl_value_t*, !dbg !1137, !isAlloca !1139
  store i8* inttoptr (i64 140638926508672 to i8*), i8** %10, align 8, !dbg !1137
  store %jl_value_t* %11, %jl_value_t** %6, align 8, !dbg !1137
  %12 = getelementptr [3 x i8*]* %7, i64 0, i64 2, !dbg !1137
  %13 = load i64* inttoptr (i64 140638888978504 to i64*), align 8, !dbg !1137
  %.c2 = inttoptr i64 %13 to i8*, !dbg !1137
  store i8* %.c2, i8** %12, align 8, !dbg !1137, !tbaa %jtbaa_user
  store %jl_value_t* %11, %jl_value_t** %1, align 8, !dbg !1137
  %14 = alloca [3 x i8*], align 8, !dbg !1142
  %.sub3 = getelementptr inbounds [3 x i8*]* %14, i64 0, i64 0
  %15 = load i8** @jl_alloca_stack, align 8, !dbg !1142
  store i8* %15, i8** %.sub3, align 8, !dbg !1142
  %16 = bitcast [3 x i8*]* %14 to i8*, !dbg !1142
  store i8* %16, i8** @jl_alloca_stack, align 8, !dbg !1142
  %17 = getelementptr [3 x i8*]* %14, i64 0, i64 1, !dbg !1142
  %18 = bitcast i8** %17 to %jl_value_t*, !dbg !1142, !isAlloca !1139
  store i8* inttoptr (i64 140638926508672 to i8*), i8** %17, align 8, !dbg !1142
  store %jl_value_t* %18, %jl_value_t** %6, align 8, !dbg !1142
  %19 = getelementptr [3 x i8*]* %14, i64 0, i64 2, !dbg !1142
  %20 = load i64* inttoptr (i64 140638888978536 to i64*), align 8, !dbg !1142
  %.c4 = inttoptr i64 %20 to i8*, !dbg !1142
  store i8* %.c4, i8** %19, align 8, !dbg !1142, !tbaa %jtbaa_user
  store %jl_value_t* %18, %jl_value_t** %4, align 8, !dbg !1142
  %21 = bitcast i8** %12 to i64*, !dbg !1142
  %22 = bitcast i8** %19 to i64*, !dbg !1142
  call void inttoptr (i64 4318594960 to void (i64*, i64*)*)(i64* %21, i64* %22), !dbg !1142
  %23 = load i64* %21, align 8, !dbg !1143, !tbaa %jtbaa_user
  %24 = load %jl_value_t** %3, align 8, !dbg !1143
  %25 = getelementptr inbounds %jl_value_t* %24, i64 0, i32 0, !dbg !1143
  store %jl_value_t** %25, %jl_value_t*** @jl_pgcstack, align 8, !dbg !1143
  store i8* %5, i8** @jl_alloca_stack, align 8, !dbg !1143
  ret i64 %23, !dbg !1143
}

No allocobj!

TODO items:

  • remove the alloca stack frame when it isn't used
  • frob the stack less
  • future work: implement a more complete gc analysis pass in Julia (store the info in Expr objects)
  • correctly handle x = new(T, getfield(x,2), getfield(x,1)) (see the sketch below)
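
For reference, the last item refers to lowered code of roughly the following shape. This is a hedged sketch (the type and function below are hypothetical, not part of the PR): the new object is constructed from fields of the old binding of x, so allocating it in place over the old object could clobber a field before it is read.

type Pair_
    a::Int
    b::Int
end

function swap_loop(x::Pair_, n)
    for i = 1:n
        # After inlining the default constructor this is roughly
        #   x = new(Pair_, getfield(x, 2), getfield(x, 1))
        # -- both fields of the old x are read while its replacement is built
        x = Pair_(x.b, x.a)
    end
    return x
end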

@timholy (Member) commented on Aug 26, 2014

👏. Very exciting to see big steps in this direction!

@vtjnash force-pushed the jn/gc_stack branch 2 times, most recently from 079423d to cebf733 on August 28, 2014 at 02:21
@JeffBezanson (Member) commented:

This is very good progress. But I would greatly prefer to do this in a way that does not require the jl_alloca_stack global variable. Needing to update global state and modify the GC code nearly (though of course not entirely) defeats the purpose of stack allocation. It would be better, at first, to only stack allocate objects that don't contain heap references.
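
A minimal illustration of the distinction being drawn here (both types below are hypothetical, not from this PR): a value whose fields are all plain bits carries no heap references, so once it lives on the stack the GC never needs to see it, whereas a value holding a heap reference must still be visible to the collector as a root even if the cell itself is on the stack.

type Flat            # only isbits fields: no heap references, so a
    x::Int           # stack-allocated Flat never needs GC scanning
    y::Float64
end

type Boxed           # holds a reference to heap memory: even if the
    v::Vector{Int}   # Boxed cell is on the stack, `v` must remain
end                  # reachable as a GC root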

@vtjnash (Member, Author) commented on Aug 31, 2014

That's not relevant to why I needed jl_alloca_stack; I needed it so that I can unmark the stack-allocated objects after the GC pass finishes. I think it can go away (in most cases?) with the new incremental GC.

@JeffBezanson (Member) commented:

The best form of stack allocation removes objects from the GC's reach entirely.

@vtjnash (Member, Author) commented on Aug 31, 2014

That seems like it would require a better escape analysis pass than what is currently implemented. My thought is that this information can be part of every Expr (like .typ) and thus computed in inference.jl, and that all levels of stack allocation can be used together. However, with this PR my intent was to add just enough to make the example above more efficient in most cases, so that the & operator can be eliminated without penalty (gaining cheap ccall out-args in the bargain).

Also, my hope is that even though the GC can still see these objects, this saves a lot of effort on managing the allocation and free lists, especially where you are making a lot of small objects in a loop and then quickly discarding them (the GC also doesn't count these allocations in its totals).
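
(As a rough illustration of that last point, assuming the Ref type and getindex method from the description above: a loop that creates and immediately discards many small objects is exactly the pattern where skipping the allocator and free-list bookkeeping should pay off.)

function sum_refs(n)
    s = 0
    for i = 1:n
        r = Ref{Int}(i)   # r never escapes the loop body, so under this
        s += r[]          # PR it can live on the stack instead of going
    end                   # through the GC allocator
    return s
end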

The main benefit of the incremental GC that is relevant here is the ability to have additional flag bits, so that reflection can detect whether it has a stack- or heap-allocated value.

@JeffBezanson (Member) commented:

jl_alloca_stack is just not ok. If something is stack allocated the GC doesn't need to touch it at all; that's the whole point. If this were just a bit sub-optimal it might be ok, but adding an extra piece of global runtime state is really bad.

@vtjnash (Member, Author) commented on Sep 2, 2014

You make it sound like this global variable, which helps us allocate memory more efficiently and faster, would make everything slower because in certain cases it could be even faster, and that having the general case would make implementing the special case harder rather than easier. That just doesn't make sense to me.

@JeffBezanson (Member) commented:

Could you explain more about why stack-allocated objects in this scheme get marked by the GC, requiring them to be unmarked eventually? After all, we do technically stack allocate lots of objects already without this issue.

Maybe it would help to have the GC identify stack-allocated objects by address range, and not mark them?
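
A sketch of that address-range idea, assuming the collector knows the current task's stack bounds (the real check would live in the C GC code; the names below are hypothetical):

# During marking, skip any value whose address falls inside the current
# task's stack; everything else is treated as a normal heap object.
is_on_task_stack(p::Ptr{Void}, stack_lo::Ptr{Void}, stack_hi::Ptr{Void}) =
    stack_lo <= p < stack_hi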

@vtjnash (Member, Author) commented on Sep 10, 2014

@JeffBezanson is this more in line with what you were wanting? (I know that many more optimizations are possible; the question is whether this represents a sufficient improvement to merge.)

@vtjnash changed the title from "WIP: gc on-stack" to "on-stack gc allocation" on Sep 10, 2014
@vtjnash changed the title from "on-stack gc allocation" to "WIP: on-stack gc allocation" on Sep 12, 2014
@jiahao force-pushed the master branch 3 times, most recently from 6c7c7e3 to 1a4c02f on October 11, 2014 at 22:06
@simonster mentioned this pull request on Oct 23, 2014
@carnaval (Contributor) commented:

@vtjnash forgot about this. About the unmarking: do we still have to scan through those stack-allocated objects if they contain pointers to non-stack memory? Or do you only stack-allocate things where no internal pointer escaped either?

Anyway, if we don't want to unmark them we could keep them marked all the time, as long as they are not the sole root of some other objects. If they are, it's a bit more tricky; we could have the write barrier correct this, but it requires a few modifications.

@vtjnash (Member, Author) commented on Jun 12, 2015

Probably both, or whatever you want to suggest.

@JeffBezanson (Member) commented:

I still think doing a conservative stack scan is likely a better approach to this, since it simplifies everything greatly (except in the scanning code itself, where the complexity belongs).

@StefanKarpinski (Member) commented:

+1 to conservative stack scanning. It seems to me that the chances of a random pointer-like value on the stack pointing exactly at an object large enough to care about freeing are pretty slim. @carnaval, didn't you have the conservative stack scanning approach largely implemented? Could we maybe try it out?

@yuyichao (Contributor) commented:

Does this have any special handling of finalizers, or does it treat finalizer registration as a leak of the object pointer?

carnaval added commits that referenced this pull request on Jul 18, 2015 and Mar 2, 2016, with the message:

Only works with copy_stacks for now. The code is not used anywhere in
codegen right now, it can be tested by using (:stknew typ args...)
instead of :new. It is essentially the GC side of #8134 but without the
global alloca stack & unmarking.

Stack objects are kept marked (+young) permanently and are only scanned
when coming from the owning task.
@vtjnash closed this on Aug 18, 2017
@DilumAluthge deleted the jn/gc_stack branch on January 12, 2021 at 21:26