Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnecessary GC root for getfield of SSA immutable object #15402

Closed
JeffBezanson opened this issue Mar 8, 2016 · 20 comments
Closed

Unnecessary GC root for getfield of SSA immutable object #15402

JeffBezanson opened this issue Mar 8, 2016 · 20 comments
Labels
compiler:codegen Generation of LLVM IR and native code performance Must go faster regression Regression in behavior compared to a previous version

Comments

@JeffBezanson
Copy link
Member

Test case:

immutable Wrap{T}
  x::T
end

function f(a::Wrap)
  s = 0.0
  for i = 1:length(a.x)
    @inbounds s += a.x[i]
  end
  s
end

@code_llvm f(Wrap(rand(2,2)))

Inner loop code in v0.4:

  %"#s1.0" = phi i64 [ %13, %L ], [ 1, %L.preheader ]
  %s.0 = phi double [ %17, %L ], [ 0.000000e+00, %L.preheader ]
  %13 = add i64 %"#s1.0", 1
  %14 = add i64 %"#s1.0", -1
  %15 = getelementptr double* %12, i64 %14
  %16 = load double* %15, align 8
  %17 = fadd double %s.0, %16
  %18 = icmp eq i64 %"#s1.0", %7
  br i1 %18, label %L3, label %L

Inner loop code on master:

  %s.02 = phi double [ %22, %if ], [ 0.000000e+00, %if.preheader ]
  %"#s1.01" = phi i64 [ %15, %if ], [ 1, %if.preheader ]
  %15 = add i64 %"#s1.01", 1
  %16 = load %jl_value_t*, %jl_value_t** %8, align 8
  store %jl_value_t* %16, %jl_value_t** %3, align 8
  %17 = add i64 %"#s1.01", -1
  %18 = bitcast %jl_value_t* %16 to double**
  %19 = load double*, double** %18, align 8
  %20 = getelementptr double, double* %19, i64 %17
  %21 = load double, double* %20, align 8
  %22 = fadd double %s.02, %21
  %23 = icmp eq i64 %"#s1.01", %12
  br i1 %23, label %L2.loopexit, label %if

It seems to be rooting a.x on every iteration where we didn't before. This is one of the issues affecting #15146.

@JeffBezanson JeffBezanson added performance Must go faster regression Regression in behavior compared to a previous version compiler:codegen Generation of LLVM IR and native code labels Mar 8, 2016
@yuyichao
Copy link
Contributor

yuyichao commented Mar 8, 2016

Yeah, I've also just seen this for array stores.

@yuyichao
Copy link
Contributor

On master the IR becames

if:                                               ; preds = %if.preheader, %if
  %s.04 = phi double [ %24, %if ], [ 0.000000e+00, %if.preheader ]
  %"#temp#.03" = phi i64 [ %17, %if ], [ 1, %if.preheader ]
  %17 = add i64 %"#temp#.03", 1
  store %jl_value_t* %0, %jl_value_t** %4, align 8
  %18 = load %jl_value_t*, %jl_value_t** %10, align 8
  store %jl_value_t* %18, %jl_value_t** %5, align 8
  %19 = add i64 %"#temp#.03", -1
  %20 = bitcast %jl_value_t* %18 to double**
  %21 = load double*, double** %20, align 8
  %22 = getelementptr double, double* %21, i64 %19
  %23 = load double, double* %22, align 8
  %24 = fadd double %s.04, %23
  %25 = icmp eq i64 %"#temp#.03", %14
  br i1 %25, label %L2.loopexit, label %if

So we are rooting a at every iteration too and this starts to affect containers of bitstype too. Ref #15717

@yuyichao
Copy link
Contributor

yuyichao commented May 5, 2016

I think the inner loop is free of gc frame store now.

if:                                               ; preds = %if.lr.ph, %if
  %s.04 = phi double [ 0.000000e+00, %if.lr.ph ], [ %24, %if ]
  %"#temp#.03" = phi i64 [ 1, %if.lr.ph ], [ %20, %if ]
  %20 = add i64 %"#temp#.03", 1
  %21 = add i64 %"#temp#.03", -1
  %22 = getelementptr double, double* %18, i64 %21
  %23 = load double, double* %22, align 8
  %24 = fadd double %s.04, %23
  %25 = icmp eq i64 %"#temp#.03", %14
  br i1 %25, label %L.L2_crit_edge, label %if

However, the function still has two gc slot and it shouldn't need any.

@JeffBezanson
Copy link
Member Author

Oh, very nice. I wonder what fixed that. Anyway this might be enough to get #7258 off the ground.

@yuyichao
Copy link
Contributor

yuyichao commented May 5, 2016

I believe it's #13463. Commenting the tbaa decoration out results in

if:                                               ; preds = %if.preheader, %if
  %s.04 = phi double [ %24, %if ], [ 0.000000e+00, %if.preheader ]
  %"#temp#.03" = phi i64 [ %17, %if ], [ 1, %if.preheader ]
  %17 = add i64 %"#temp#.03", 1
  %18 = load %jl_value_t*, %jl_value_t** %10, align 8
  store %jl_value_t* %18, %jl_value_t** %4, align 8
  %19 = add i64 %"#temp#.03", -1
  %20 = bitcast %jl_value_t* %18 to double**
  %21 = load double*, double** %20, align 8
  %22 = getelementptr double, double* %21, i64 %19
  %23 = load double, double* %22, align 8
  %24 = fadd double %s.04, %23
  %25 = icmp eq i64 %"#temp#.03", %14
  br i1 %25, label %L2.loopexit, label %if

This is also why the GC slots are still there, the metadata merely allows llvm to move it out of the loop, not deleting it...

@yuyichao yuyichao changed the title excess GC frame store when iterating over a wrapped object Unnecessary GC root for getfield of SSA immutable object May 5, 2016
@JeffBezanson
Copy link
Member Author

JeffBezanson commented May 5, 2016

Ok, I've tried #7258 again. Here's where we are for this test case:

g(n) = [ i/10 for i=1:n ]

The inner loop:

if:                                               ; preds = %if.lr.ph, %if
  %i.09 = phi i64 [ %2, %if.lr.ph ], [ %15, %if ]
  %st.08 = phi i64 [ %3, %if.lr.ph ], [ %9, %if ]
  %9 = add i64 %st.08, 1
  %10 = sitofp i64 %st.08 to double
  %11 = fdiv double %10, 1.000000e+01
  %12 = add i64 %i.09, -1
  %13 = load double*, double** %8, align 8
  %14 = getelementptr double, double* %13, i64 %12
  store double %11, double* %14, align 8
  %15 = add i64 %i.09, 1
  %16 = load i64, i64* %4, align 8
  %17 = icmp eq i64 %st.08, %16
  br i1 %17, label %L5.loopexit, label %if

There are two extra loads, one for the array pointer and one for the range endpoint. So there is still a ~2x performance gap.

@yuyichao
Copy link
Contributor

yuyichao commented May 5, 2016

Does the store load gets moved out of the loop with -O3?

@JeffBezanson
Copy link
Member Author

Interesting; with -O3 the array pointer load is hoisted, but the extra i64 load is still there.

@yuyichao
Copy link
Contributor

yuyichao commented May 5, 2016

I noticed a similar effect in #15717 (#13463 (comment)) The load of the end point is interesting. Maybe check Base.code_llvm(STDOUT, g, Tuple{Int}, false, true) to see why does LLVM think it is not ok to move it out? (Unless LLVM does this iteratively and don't want to move too many loads out at once....)

@JeffBezanson
Copy link
Member Author

JeffBezanson commented May 5, 2016

Here's the full output:

; ModuleID = 'collect_to!'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

%jl_value_t = type { %jl_value_t* }
%Generator.62 = type { [0 x i1], %UnitRange }
%UnitRange = type { i64, i64 }

define %jl_value_t* @"julia_collect_to!_55312"(%jl_value_t*, %Generator.62*, i64, i64) #0 {
top:
  call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !22, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !31, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !32, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %jl_value_t* %0, i64 0, metadata !22, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata %Generator.62* %1, i64 0, metadata !23, metadata !39), !dbg !38
  call void @llvm.dbg.value(metadata i64 %2, i64 0, metadata !24, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %3, i64 0, metadata !25, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %2, i64 0, metadata !26, metadata !37), !dbg !38
  %4 = getelementptr inbounds %Generator.62, %Generator.62* %1, i64 0, i32 1, i32 1, !dbg !40
  %5 = load i64, i64* %4, align 8, !dbg !40
  %6 = add i64 %5, 1, !dbg !40
  %7 = icmp eq i64 %6, %3, !dbg !40
  br i1 %7, label %L5, label %if.lr.ph, !dbg !40

if.lr.ph:                                         ; preds = %top
  %8 = bitcast %jl_value_t* %0 to double**, !dbg !41
  %.pre = load double*, double** %8, align 8, !dbg !41, !tbaa !42
  br label %if, !dbg !40

L5.loopexit:                                      ; preds = %if
  br label %L5, !dbg !46

L5:                                               ; preds = %L5.loopexit, %top
  ret %jl_value_t* %0, !dbg !46

if:                                               ; preds = %if.lr.ph, %if
  %i.09 = phi i64 [ %2, %if.lr.ph ], [ %14, %if ]
  %st.08 = phi i64 [ %3, %if.lr.ph ], [ %9, %if ]
  %9 = add i64 %st.08, 1, !dbg !47
  call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !35, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !36, metadata !37), !dbg !38
  %10 = sitofp i64 %st.08 to double, !dbg !51
  %11 = fdiv double %10, 1.000000e+01, !dbg !51
  call void @llvm.dbg.value(metadata double %11, i64 0, metadata !27, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !25, metadata !37), !dbg !38
  call void @llvm.dbg.value(metadata i1 true, i64 0, metadata !33, metadata !37), !dbg !38
  %12 = add i64 %i.09, -1, !dbg !41
  %13 = getelementptr double, double* %.pre, i64 %12, !dbg !41
  store double %11, double* %13, align 8, !dbg !41, !tbaa !52
  %14 = add i64 %i.09, 1, !dbg !53
  call void @llvm.dbg.value(metadata i64 %14, i64 0, metadata !26, metadata !37), !dbg !38
  %15 = load i64, i64* %4, align 8, !dbg !40
  %16 = icmp eq i64 %st.08, %15, !dbg !40
  br i1 %16, label %L5.loopexit, label %if, !dbg !40
}

define %jl_value_t* @"jlcall_collect_to!_55312"(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = call %jl_value_t*** @jl_get_ptls_states()
  %4 = alloca [3 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 0
  %5 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 2
  store %jl_value_t* null, %jl_value_t** %5, align 8, !tbaa !54
  %6 = bitcast [3 x %jl_value_t*]* %4 to i64*
  store i64 2, i64* %6, align 8, !tbaa !54
  %7 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 1
  %8 = bitcast %jl_value_t*** %3 to i64*
  %9 = load i64, i64* %8, align 8
  %10 = bitcast %jl_value_t** %7 to i64*
  store i64 %9, i64* %10, align 8, !tbaa !54
  store %jl_value_t** %.sub, %jl_value_t*** %3, align 8
  %11 = load %jl_value_t*, %jl_value_t** %1, align 8
  %12 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  %13 = bitcast %jl_value_t** %12 to %Generator.62**
  %14 = load %Generator.62*, %Generator.62** %13, align 8
  %15 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 2
  %16 = bitcast %jl_value_t** %15 to i64**
  %17 = load i64*, i64** %16, align 8
  %18 = load i64, i64* %17, align 8
  %19 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 3
  %20 = bitcast %jl_value_t** %19 to i64**
  %21 = load i64*, i64** %20, align 8
  %22 = load i64, i64* %21, align 8
  %23 = call %jl_value_t* @"julia_collect_to!_55312"(%jl_value_t* %11, %Generator.62* %14, i64 %18, i64 %22) #0
  store %jl_value_t* %23, %jl_value_t** %5, align 8, !tbaa !54
  %24 = load i64, i64* %10, align 8, !tbaa !54
  store i64 %24, i64* %8, align 8, !tbaa !54
  ret %jl_value_t* %23
}

; Function Attrs: nounwind readnone
declare %jl_value_t*** @jl_get_ptls_states() #1

; Function Attrs: nounwind readnone
declare void @llvm.dbg.declare(metadata, metadata, metadata) #1

; Function Attrs: nounwind readnone
declare void @llvm.dbg.value(metadata, i64, metadata, metadata) #1

attributes #0 = { "no-frame-pointer-elim"="true" }
attributes #1 = { nounwind readnone }

!llvm.module.flags = !{!0, !1}
!llvm.dbg.cu = !{!2}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 1, !"Debug Info Version", i32 3}
!2 = distinct !DICompileUnit(language: DW_LANG_C89, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: 1, enums: !4, subprograms: !5)
!3 = !DIFile(filename: "array.jl", directory: ".")
!4 = !{}
!5 = !{!6}
!6 = !DISubprogram(name: "collect_to!", linkageName: "julia_collect_to!_55328", scope: null, file: !3, type: !7, isLocal: false, isDefinition: true, isOptimized: true, function: %jl_value_t* (%jl_value_t*, %Generator.62*, i64, i64)* @"julia_collect_to!_55312", variables: !19)
!7 = !DISubroutineType(types: !8)
!8 = !{!9, !13, !18, !18}
!9 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !10, size: 64, align: 64)
!10 = !DICompositeType(tag: DW_TAG_structure_type, name: "jl_value_t", file: !11, line: 71, align: 64, elements: !12)
!11 = !DIFile(filename: "julia.h", directory: "")
!12 = !{!9}
!13 = !DICompositeType(tag: DW_TAG_structure_type, name: "Generator", size: 128, align: 64, elements: !14, runtimeLang: DW_LANG_Julia)
!14 = !{!15, !16}
!15 = !DICompositeType(tag: DW_TAG_structure_type, name: "#a", align: 8, elements: !4, runtimeLang: DW_LANG_Julia)
!16 = !DICompositeType(tag: DW_TAG_structure_type, name: "UnitRange", size: 128, align: 64, elements: !17, runtimeLang: DW_LANG_Julia)
!17 = !{!18, !18}
!18 = !DIBasicType(name: "Int64", size: 64, align: 64, encoding: DW_ATE_unsigned)
!19 = !{!20, !22, !23, !24, !25, !26, !27, !29, !30, !31, !32, !33, !35, !36, !29}
!20 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "#self#", arg: 1, scope: !6, file: !3, line: 267, type: !21)
!21 = !DICompositeType(tag: DW_TAG_structure_type, name: "#collect_to!", elements: !4, runtimeLang: DW_LANG_Julia)
!22 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "dest", arg: 2, scope: !6, file: !3, line: 267, type: !9)
!23 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "itr", arg: 3, scope: !6, file: !3, line: 267, type: !13)
!24 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "offs", arg: 4, scope: !6, file: !3, line: 267, type: !18)
!25 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "st", arg: 5, scope: !6, file: !3, line: 267, type: !18)
!26 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "i", scope: !6, file: !3, line: 267, type: !18)
!27 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "el", scope: !6, file: !3, line: 267, type: !28)
!28 = !DIBasicType(name: "Float64", size: 64, align: 64, encoding: DW_ATE_unsigned)
!29 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "#temp#", scope: !6, file: !3, line: 267, type: !18)
!30 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "S", scope: !6, file: !3, line: 267, type: !9)
!31 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "R", scope: !6, file: !3, line: 267, type: !9)
!32 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "new", scope: !6, file: !3, line: 267, type: !9)
!33 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "#temp#", scope: !6, file: !3, line: 267, type: !34)
!34 = !DIBasicType(name: "Bool", size: 1, align: 8, encoding: DW_ATE_unsigned)
!35 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "v", scope: !6, file: !3, line: 267, type: !18)
!36 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "s2", scope: !6, file: !3, line: 267, type: !18)
!37 = !DIExpression()
!38 = !DILocation(line: 267, scope: !6)
!39 = !DIExpression(DW_OP_deref)
!40 = !DILocation(line: 268, column: 1, scope: !6)
!41 = !DILocation(line: 272, column: 1, scope: !6)
!42 = !{!"jtbaa_arrayptr", !43}
!43 = !{!"jtbaa_array", !44}
!44 = !{!"jtbaa_value", !45}
!45 = !{!"jtbaa"}
!46 = !DILocation(line: 282, column: 1, scope: !6)
!47 = !DILocation(line: 21, column: 1, scope: !48, inlinedAt: !50)
!48 = !DILexicalBlockFile(scope: !6, file: !49, discriminator: 0)
!49 = !DIFile(filename: "generator.jl", directory: ".")
!50 = !DILocation(line: 269, column: 1, scope: !6)
!51 = !DILocation(line: 22, column: 1, scope: !48, inlinedAt: !50)
!52 = !{!"jtbaa_user", !45}
!53 = !DILocation(line: 273, column: 1, scope: !6)
!54 = !{!"jtbaa_gcframe", !45}

@yuyichao
Copy link
Contributor

yuyichao commented May 5, 2016

What type is %Generator.62? Is it mutable? It looks like we are somehow missing a tbaa_user (if it's mutable) or tbaa_immut (if it's immutable) for the load from %4.

@yuyichao
Copy link
Contributor

yuyichao commented May 6, 2016

Reproduced with

f(i) = i / 10
immutable Gen
    iter::UnitRange{Int}
end
Base.start(g::Gen) = start(g.iter)
Base.done(g::Gen, s) = done(g.iter, s)
function Base.next(g::Gen, s)
    v, s2 = next(g.iter, s)
    f(v), s2
end
gen = Gen(1:10)
function collect_to2!{T}(dest::AbstractArray{T}, itr)
    i = 1
    st = start(itr)
    while !done(itr, st)
        el, st = next(itr, st)
        @inbounds dest[i] = el
        i += 1
    end
    return dest
end
@code_llvm collect_to2!(Vector{Float64}(10), gen)
@show collect_to2!(Vector{Float64}(10), gen)

seems to be unrelated to GC root (the above function doesn't have one) Might be related to #13104

@tkelman
Copy link
Contributor

tkelman commented May 12, 2016

@yuyichao is this completely fixed by #16230 or is more work needed?

@yuyichao
Copy link
Contributor

yuyichao commented May 12, 2016

It still need work

@JeffBezanson
Copy link
Member Author

We are now at performance parity for [ i/10 for i=1:n ]. Excellent progress. 🎉

@JeffBezanson
Copy link
Member Author

JeffBezanson commented May 13, 2016

The next case is 2d, which is much harder as it uses a Product iterator. Here's a test case:

G = ((i+j)/1000 for i=1:2000, j=1:2000)

@code_llvm Base.collect_to!(zeros(2000,2000), G, 1, start(G))

Here's the inner loop:

if:                                               ; preds = %if.lr.ph, %if
  %i.07 = phi i64 [ %2, %if.lr.ph ], [ %35, %if ]
  call void @julia_next_54517({ double, { i64, i64, %Nullable.27, i8 } }* nonnull sret %4, %Generator.58* %1, { i64, i64, %Nullable.27, i8 }* nonnull %st) #0
  %26 = load i64, i64* %20, align 8
  %27 = load i64, i64* %21, align 8
  %28 = load i64, i64* %22, align 8
  %29 = load i8, i8* %23, align 8
  %30 = load i64, i64* %24, align 8
  %31 = load i8, i8* %25, align 8
  store i64 %27, i64* %6, align 8
  store i64 %28, i64* %8, align 8
  store i8 %29, i8* %10, align 8
  store i64 %30, i64* %12, align 8
  store i8 %31, i8* %14, align 8
  %32 = add i64 %i.07, -1
  %33 = getelementptr double, double* %19, i64 %32
  %34 = bitcast double* %33 to i64*
  store i64 %26, i64* %34, align 8
  %35 = add i64 %i.07, 1
  %36 = and i8 %31, 1
  %37 = icmp eq i8 %36, 0
  br i1 %37, label %if, label %L4.loopexit

Somewhat amazingly, this is only 30% slower than 0.4. Adding @_inline_meta for Generator iteration removes the function call and gets us within ~10%. However, another interesting feature of the above code is that it gets much slower (almost 2x) with -O3, since it gets "vectorized" but the vectors are constantly stored to the stack (at least I think that's the problem). In the better, fully-inlined version -O3 doesn't cause a problem.

@yuyichao
Copy link
Contributor

Somewhat amazingly, this is only 30% slower than 0.4. Adding @_inline_meta for Generator iteration removes the function call and gets us within ~10%.

I assume the current version is basically fully inlined so if the performance is only different by 10% with inlining, this seems to be a inlining problem?

@JeffBezanson
Copy link
Member Author

Well, I'm happy to add the extra @inline declaration, but from there a 10% slowdown is still not good and is probably due to the messy product iterator state.

@yuyichao
Copy link
Contributor

yuyichao commented Jun 9, 2016

The consensus right now seems to be track these in a single issue. So move to #15369 (comment)

@yuyichao yuyichao closed this as completed Jun 9, 2016
@yuyichao
Copy link
Contributor

yuyichao commented Jun 9, 2016

(Given that the performance issue for the inner loop isn't there anymore)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:codegen Generation of LLVM IR and native code performance Must go faster regression Regression in behavior compared to a previous version
Projects
None yet
Development

No branches or pull requests

3 participants