-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unnecessary GC root for getfield of SSA immutable object #15402
Comments
Yeah, I've also just seen this for array stores. |
On master the IR becames if: ; preds = %if.preheader, %if
%s.04 = phi double [ %24, %if ], [ 0.000000e+00, %if.preheader ]
%"#temp#.03" = phi i64 [ %17, %if ], [ 1, %if.preheader ]
%17 = add i64 %"#temp#.03", 1
store %jl_value_t* %0, %jl_value_t** %4, align 8
%18 = load %jl_value_t*, %jl_value_t** %10, align 8
store %jl_value_t* %18, %jl_value_t** %5, align 8
%19 = add i64 %"#temp#.03", -1
%20 = bitcast %jl_value_t* %18 to double**
%21 = load double*, double** %20, align 8
%22 = getelementptr double, double* %21, i64 %19
%23 = load double, double* %22, align 8
%24 = fadd double %s.04, %23
%25 = icmp eq i64 %"#temp#.03", %14
br i1 %25, label %L2.loopexit, label %if So we are rooting |
I think the inner loop is free of gc frame store now. if: ; preds = %if.lr.ph, %if
%s.04 = phi double [ 0.000000e+00, %if.lr.ph ], [ %24, %if ]
%"#temp#.03" = phi i64 [ 1, %if.lr.ph ], [ %20, %if ]
%20 = add i64 %"#temp#.03", 1
%21 = add i64 %"#temp#.03", -1
%22 = getelementptr double, double* %18, i64 %21
%23 = load double, double* %22, align 8
%24 = fadd double %s.04, %23
%25 = icmp eq i64 %"#temp#.03", %14
br i1 %25, label %L.L2_crit_edge, label %if However, the function still has two gc slot and it shouldn't need any. |
Oh, very nice. I wonder what fixed that. Anyway this might be enough to get #7258 off the ground. |
I believe it's #13463. Commenting the tbaa decoration out results in if: ; preds = %if.preheader, %if
%s.04 = phi double [ %24, %if ], [ 0.000000e+00, %if.preheader ]
%"#temp#.03" = phi i64 [ %17, %if ], [ 1, %if.preheader ]
%17 = add i64 %"#temp#.03", 1
%18 = load %jl_value_t*, %jl_value_t** %10, align 8
store %jl_value_t* %18, %jl_value_t** %4, align 8
%19 = add i64 %"#temp#.03", -1
%20 = bitcast %jl_value_t* %18 to double**
%21 = load double*, double** %20, align 8
%22 = getelementptr double, double* %21, i64 %19
%23 = load double, double* %22, align 8
%24 = fadd double %s.04, %23
%25 = icmp eq i64 %"#temp#.03", %14
br i1 %25, label %L2.loopexit, label %if This is also why the GC slots are still there, the metadata merely allows llvm to move it out of the loop, not deleting it... |
Ok, I've tried #7258 again. Here's where we are for this test case: g(n) = [ i/10 for i=1:n ] The inner loop: if: ; preds = %if.lr.ph, %if
%i.09 = phi i64 [ %2, %if.lr.ph ], [ %15, %if ]
%st.08 = phi i64 [ %3, %if.lr.ph ], [ %9, %if ]
%9 = add i64 %st.08, 1
%10 = sitofp i64 %st.08 to double
%11 = fdiv double %10, 1.000000e+01
%12 = add i64 %i.09, -1
%13 = load double*, double** %8, align 8
%14 = getelementptr double, double* %13, i64 %12
store double %11, double* %14, align 8
%15 = add i64 %i.09, 1
%16 = load i64, i64* %4, align 8
%17 = icmp eq i64 %st.08, %16
br i1 %17, label %L5.loopexit, label %if There are two extra loads, one for the array pointer and one for the range endpoint. So there is still a ~2x performance gap. |
Does the |
Interesting; with |
I noticed a similar effect in #15717 (#13463 (comment)) The load of the end point is interesting. Maybe check |
Here's the full output: ; ModuleID = 'collect_to!'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"
%jl_value_t = type { %jl_value_t* }
%Generator.62 = type { [0 x i1], %UnitRange }
%UnitRange = type { i64, i64 }
define %jl_value_t* @"julia_collect_to!_55312"(%jl_value_t*, %Generator.62*, i64, i64) #0 {
top:
call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !22, metadata !37), !dbg !38
call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !31, metadata !37), !dbg !38
call void @llvm.dbg.value(metadata %jl_value_t* null, i64 0, metadata !32, metadata !37), !dbg !38
call void @llvm.dbg.value(metadata %jl_value_t* %0, i64 0, metadata !22, metadata !37), !dbg !38
call void @llvm.dbg.value(metadata %Generator.62* %1, i64 0, metadata !23, metadata !39), !dbg !38
call void @llvm.dbg.value(metadata i64 %2, i64 0, metadata !24, metadata !37), !dbg !38
call void @llvm.dbg.value(metadata i64 %3, i64 0, metadata !25, metadata !37), !dbg !38
call void @llvm.dbg.value(metadata i64 %2, i64 0, metadata !26, metadata !37), !dbg !38
%4 = getelementptr inbounds %Generator.62, %Generator.62* %1, i64 0, i32 1, i32 1, !dbg !40
%5 = load i64, i64* %4, align 8, !dbg !40
%6 = add i64 %5, 1, !dbg !40
%7 = icmp eq i64 %6, %3, !dbg !40
br i1 %7, label %L5, label %if.lr.ph, !dbg !40
if.lr.ph: ; preds = %top
%8 = bitcast %jl_value_t* %0 to double**, !dbg !41
%.pre = load double*, double** %8, align 8, !dbg !41, !tbaa !42
br label %if, !dbg !40
L5.loopexit: ; preds = %if
br label %L5, !dbg !46
L5: ; preds = %L5.loopexit, %top
ret %jl_value_t* %0, !dbg !46
if: ; preds = %if.lr.ph, %if
%i.09 = phi i64 [ %2, %if.lr.ph ], [ %14, %if ]
%st.08 = phi i64 [ %3, %if.lr.ph ], [ %9, %if ]
%9 = add i64 %st.08, 1, !dbg !47
call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !35, metadata !37), !dbg !38
call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !36, metadata !37), !dbg !38
%10 = sitofp i64 %st.08 to double, !dbg !51
%11 = fdiv double %10, 1.000000e+01, !dbg !51
call void @llvm.dbg.value(metadata double %11, i64 0, metadata !27, metadata !37), !dbg !38
call void @llvm.dbg.value(metadata i64 %9, i64 0, metadata !25, metadata !37), !dbg !38
call void @llvm.dbg.value(metadata i1 true, i64 0, metadata !33, metadata !37), !dbg !38
%12 = add i64 %i.09, -1, !dbg !41
%13 = getelementptr double, double* %.pre, i64 %12, !dbg !41
store double %11, double* %13, align 8, !dbg !41, !tbaa !52
%14 = add i64 %i.09, 1, !dbg !53
call void @llvm.dbg.value(metadata i64 %14, i64 0, metadata !26, metadata !37), !dbg !38
%15 = load i64, i64* %4, align 8, !dbg !40
%16 = icmp eq i64 %st.08, %15, !dbg !40
br i1 %16, label %L5.loopexit, label %if, !dbg !40
}
define %jl_value_t* @"jlcall_collect_to!_55312"(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
%3 = call %jl_value_t*** @jl_get_ptls_states()
%4 = alloca [3 x %jl_value_t*], align 8
%.sub = getelementptr inbounds [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 0
%5 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 2
store %jl_value_t* null, %jl_value_t** %5, align 8, !tbaa !54
%6 = bitcast [3 x %jl_value_t*]* %4 to i64*
store i64 2, i64* %6, align 8, !tbaa !54
%7 = getelementptr [3 x %jl_value_t*], [3 x %jl_value_t*]* %4, i64 0, i64 1
%8 = bitcast %jl_value_t*** %3 to i64*
%9 = load i64, i64* %8, align 8
%10 = bitcast %jl_value_t** %7 to i64*
store i64 %9, i64* %10, align 8, !tbaa !54
store %jl_value_t** %.sub, %jl_value_t*** %3, align 8
%11 = load %jl_value_t*, %jl_value_t** %1, align 8
%12 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
%13 = bitcast %jl_value_t** %12 to %Generator.62**
%14 = load %Generator.62*, %Generator.62** %13, align 8
%15 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 2
%16 = bitcast %jl_value_t** %15 to i64**
%17 = load i64*, i64** %16, align 8
%18 = load i64, i64* %17, align 8
%19 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 3
%20 = bitcast %jl_value_t** %19 to i64**
%21 = load i64*, i64** %20, align 8
%22 = load i64, i64* %21, align 8
%23 = call %jl_value_t* @"julia_collect_to!_55312"(%jl_value_t* %11, %Generator.62* %14, i64 %18, i64 %22) #0
store %jl_value_t* %23, %jl_value_t** %5, align 8, !tbaa !54
%24 = load i64, i64* %10, align 8, !tbaa !54
store i64 %24, i64* %8, align 8, !tbaa !54
ret %jl_value_t* %23
}
; Function Attrs: nounwind readnone
declare %jl_value_t*** @jl_get_ptls_states() #1
; Function Attrs: nounwind readnone
declare void @llvm.dbg.declare(metadata, metadata, metadata) #1
; Function Attrs: nounwind readnone
declare void @llvm.dbg.value(metadata, i64, metadata, metadata) #1
attributes #0 = { "no-frame-pointer-elim"="true" }
attributes #1 = { nounwind readnone }
!llvm.module.flags = !{!0, !1}
!llvm.dbg.cu = !{!2}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 1, !"Debug Info Version", i32 3}
!2 = distinct !DICompileUnit(language: DW_LANG_C89, file: !3, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: 1, enums: !4, subprograms: !5)
!3 = !DIFile(filename: "array.jl", directory: ".")
!4 = !{}
!5 = !{!6}
!6 = !DISubprogram(name: "collect_to!", linkageName: "julia_collect_to!_55328", scope: null, file: !3, type: !7, isLocal: false, isDefinition: true, isOptimized: true, function: %jl_value_t* (%jl_value_t*, %Generator.62*, i64, i64)* @"julia_collect_to!_55312", variables: !19)
!7 = !DISubroutineType(types: !8)
!8 = !{!9, !13, !18, !18}
!9 = !DIDerivedType(tag: DW_TAG_pointer_type, baseType: !10, size: 64, align: 64)
!10 = !DICompositeType(tag: DW_TAG_structure_type, name: "jl_value_t", file: !11, line: 71, align: 64, elements: !12)
!11 = !DIFile(filename: "julia.h", directory: "")
!12 = !{!9}
!13 = !DICompositeType(tag: DW_TAG_structure_type, name: "Generator", size: 128, align: 64, elements: !14, runtimeLang: DW_LANG_Julia)
!14 = !{!15, !16}
!15 = !DICompositeType(tag: DW_TAG_structure_type, name: "#a", align: 8, elements: !4, runtimeLang: DW_LANG_Julia)
!16 = !DICompositeType(tag: DW_TAG_structure_type, name: "UnitRange", size: 128, align: 64, elements: !17, runtimeLang: DW_LANG_Julia)
!17 = !{!18, !18}
!18 = !DIBasicType(name: "Int64", size: 64, align: 64, encoding: DW_ATE_unsigned)
!19 = !{!20, !22, !23, !24, !25, !26, !27, !29, !30, !31, !32, !33, !35, !36, !29}
!20 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "#self#", arg: 1, scope: !6, file: !3, line: 267, type: !21)
!21 = !DICompositeType(tag: DW_TAG_structure_type, name: "#collect_to!", elements: !4, runtimeLang: DW_LANG_Julia)
!22 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "dest", arg: 2, scope: !6, file: !3, line: 267, type: !9)
!23 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "itr", arg: 3, scope: !6, file: !3, line: 267, type: !13)
!24 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "offs", arg: 4, scope: !6, file: !3, line: 267, type: !18)
!25 = !DILocalVariable(tag: DW_TAG_arg_variable, name: "st", arg: 5, scope: !6, file: !3, line: 267, type: !18)
!26 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "i", scope: !6, file: !3, line: 267, type: !18)
!27 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "el", scope: !6, file: !3, line: 267, type: !28)
!28 = !DIBasicType(name: "Float64", size: 64, align: 64, encoding: DW_ATE_unsigned)
!29 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "#temp#", scope: !6, file: !3, line: 267, type: !18)
!30 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "S", scope: !6, file: !3, line: 267, type: !9)
!31 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "R", scope: !6, file: !3, line: 267, type: !9)
!32 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "new", scope: !6, file: !3, line: 267, type: !9)
!33 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "#temp#", scope: !6, file: !3, line: 267, type: !34)
!34 = !DIBasicType(name: "Bool", size: 1, align: 8, encoding: DW_ATE_unsigned)
!35 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "v", scope: !6, file: !3, line: 267, type: !18)
!36 = !DILocalVariable(tag: DW_TAG_auto_variable, name: "s2", scope: !6, file: !3, line: 267, type: !18)
!37 = !DIExpression()
!38 = !DILocation(line: 267, scope: !6)
!39 = !DIExpression(DW_OP_deref)
!40 = !DILocation(line: 268, column: 1, scope: !6)
!41 = !DILocation(line: 272, column: 1, scope: !6)
!42 = !{!"jtbaa_arrayptr", !43}
!43 = !{!"jtbaa_array", !44}
!44 = !{!"jtbaa_value", !45}
!45 = !{!"jtbaa"}
!46 = !DILocation(line: 282, column: 1, scope: !6)
!47 = !DILocation(line: 21, column: 1, scope: !48, inlinedAt: !50)
!48 = !DILexicalBlockFile(scope: !6, file: !49, discriminator: 0)
!49 = !DIFile(filename: "generator.jl", directory: ".")
!50 = !DILocation(line: 269, column: 1, scope: !6)
!51 = !DILocation(line: 22, column: 1, scope: !48, inlinedAt: !50)
!52 = !{!"jtbaa_user", !45}
!53 = !DILocation(line: 273, column: 1, scope: !6)
!54 = !{!"jtbaa_gcframe", !45} |
What type is |
Reproduced with f(i) = i / 10
immutable Gen
iter::UnitRange{Int}
end
Base.start(g::Gen) = start(g.iter)
Base.done(g::Gen, s) = done(g.iter, s)
function Base.next(g::Gen, s)
v, s2 = next(g.iter, s)
f(v), s2
end
gen = Gen(1:10)
function collect_to2!{T}(dest::AbstractArray{T}, itr)
i = 1
st = start(itr)
while !done(itr, st)
el, st = next(itr, st)
@inbounds dest[i] = el
i += 1
end
return dest
end
@code_llvm collect_to2!(Vector{Float64}(10), gen)
@show collect_to2!(Vector{Float64}(10), gen) seems to be unrelated to GC root (the above function doesn't have one) Might be related to #13104 |
It still need work |
We are now at performance parity for |
The next case is 2d, which is much harder as it uses a G = ((i+j)/1000 for i=1:2000, j=1:2000)
@code_llvm Base.collect_to!(zeros(2000,2000), G, 1, start(G)) Here's the inner loop: if: ; preds = %if.lr.ph, %if
%i.07 = phi i64 [ %2, %if.lr.ph ], [ %35, %if ]
call void @julia_next_54517({ double, { i64, i64, %Nullable.27, i8 } }* nonnull sret %4, %Generator.58* %1, { i64, i64, %Nullable.27, i8 }* nonnull %st) #0
%26 = load i64, i64* %20, align 8
%27 = load i64, i64* %21, align 8
%28 = load i64, i64* %22, align 8
%29 = load i8, i8* %23, align 8
%30 = load i64, i64* %24, align 8
%31 = load i8, i8* %25, align 8
store i64 %27, i64* %6, align 8
store i64 %28, i64* %8, align 8
store i8 %29, i8* %10, align 8
store i64 %30, i64* %12, align 8
store i8 %31, i8* %14, align 8
%32 = add i64 %i.07, -1
%33 = getelementptr double, double* %19, i64 %32
%34 = bitcast double* %33 to i64*
store i64 %26, i64* %34, align 8
%35 = add i64 %i.07, 1
%36 = and i8 %31, 1
%37 = icmp eq i8 %36, 0
br i1 %37, label %if, label %L4.loopexit Somewhat amazingly, this is only 30% slower than 0.4. Adding |
I assume the current version is basically fully inlined so if the performance is only different by 10% with inlining, this seems to be a inlining problem? |
Well, I'm happy to add the extra |
The consensus right now seems to be track these in a single issue. So move to #15369 (comment) |
(Given that the performance issue for the inner loop isn't there anymore) |
Test case:
Inner loop code in v0.4:
Inner loop code on master:
It seems to be rooting
a.x
on every iteration where we didn't before. This is one of the issues affecting #15146.The text was updated successfully, but these errors were encountered: