Inline callsite of invoke when possible (invoke improvement No. 1) #10964

Closed
yuyichao wants to merge 1 commit from the inline-invoke1 branch

Conversation

yuyichao (Contributor)

This is a refactor of my previous attempt to improve the performance of invoke (#9642).

As I mentioned on the mailing list, the previous PR needed to be refactored, and I decided to start with something relatively self-contained and (hopefully) decoupled from other parts. Hopefully this also makes it easier to review.

The patch only deals with the new signature of invoke (i.e. the second parameter is a tuple type rather than a tuple of types), mainly because it's much easier to handle. It does not change anything for the old signature.
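
For reference, the two signatures look like this (a minimal illustration, not taken from the patch itself; the tuple-of-types form is the old style):

f2(a, b) = a + b

# Old signature: a tuple of types (left untouched by this patch).
invoke(f2, (Any, Any), 1, 2)

# New signature: a tuple type (the case this patch optimizes).
invoke(f2, Tuple{Any, Any}, 1, 2)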

Thanks @vtjnash for very helpful suggestions.

Tested with the following script:

@noinline f1(a, b) = a + b

g1_1() = f1(1, 2)
g1_2() = invoke(f1, Tuple{Any, Any}, 1, 2)

@assert g1_1() == g1_2()

@code_llvm g1_1()
@code_llvm g1_2()

Output on current master:

define i64 @julia_g1_1_43908() {
top:
  %0 = call i64 @julia_f1_43896(i64 1, i64 2)
  ret i64 %0
}

define i64 @julia_g1_2_43911() {
top:
  %0 = alloca [6 x %jl_value_t*], align 8
  %.sub = getelementptr inbounds [6 x %jl_value_t*]* %0, i64 0, i64 0
  %1 = getelementptr [6 x %jl_value_t*]* %0, i64 0, i64 2
  %2 = bitcast [6 x %jl_value_t*]* %0 to i64*
  store i64 8, i64* %2, align 8
  %3 = getelementptr [6 x %jl_value_t*]* %0, i64 0, i64 1
  %4 = bitcast %jl_value_t** %3 to %jl_value_t***
  %5 = load %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t** %5, %jl_value_t*** %4, align 8
  store %jl_value_t** %.sub, %jl_value_t*** @jl_pgcstack, align 8
  store %jl_value_t* null, %jl_value_t** %1, align 8
  %6 = getelementptr [6 x %jl_value_t*]* %0, i64 0, i64 3
  store %jl_value_t* null, %jl_value_t** %6, align 8
  %7 = getelementptr [6 x %jl_value_t*]* %0, i64 0, i64 4
  store %jl_value_t* null, %jl_value_t** %7, align 8
  %8 = getelementptr [6 x %jl_value_t*]* %0, i64 0, i64 5
  store %jl_value_t* null, %jl_value_t** %8, align 8
  %9 = load %jl_value_t** inttoptr (i64 140525719411568 to %jl_value_t**), align 16
  store %jl_value_t* %9, %jl_value_t** %1, align 8
  %10 = load %jl_value_t** inttoptr (i64 140525686087088 to %jl_value_t**), align 16
  store %jl_value_t* %10, %jl_value_t** %6, align 8
  %11 = load %jl_value_t** inttoptr (i64 140525686091216 to %jl_value_t**), align 16
  store %jl_value_t* %11, %jl_value_t** %7, align 8
  %12 = load %jl_value_t** inttoptr (i64 140525686091216 to %jl_value_t**), align 16
  store %jl_value_t* %12, %jl_value_t** %8, align 8
  %13 = call %jl_value_t* @jl_f_instantiate_type(%jl_value_t* null, %jl_value_t** %6, i32 3)
  store %jl_value_t* %13, %jl_value_t** %6, align 8
  store %jl_value_t* inttoptr (i64 140525686161872 to %jl_value_t*), %jl_value_t** %7, align 8
  store %jl_value_t* inttoptr (i64 140525686161968 to %jl_value_t*), %jl_value_t** %8, align 8
  %14 = call %jl_value_t* @jl_f_invoke(%jl_value_t* null, %jl_value_t** %1, i32 4)
  %15 = bitcast %jl_value_t* %14 to i64*
  %16 = load i64* %15, align 8
  %17 = load %jl_value_t*** %4, align 8
  store %jl_value_t** %17, %jl_value_t*** @jl_pgcstack, align 8
  ret i64 %16
}

With this patch:

define i64 @julia_g1_1_43911() {
top:
  %0 = call i64 @julia_f1_43899(i64 1, i64 2)
  ret i64 %0
}

define i64 @julia_g1_2_43914() {
top:
  %0 = call i64 @julia_f1_43899(i64 1, i64 2)
  ret i64 %0
}

@yuyichao mentioned this pull request on Apr 23, 2015
@yuyichao (Contributor, Author)

A slightly more complete test (the @code_llvm output of the master version is not included, since it's way too long):

@noinline f1(a, b) = a + b

g1_1() = f1(1, 2)
g1_2() = invoke(f1, Tuple{Any, Any}, 1, 2)

@inline t1() = Tuple{Any, Any}
@noinline t2() = Tuple{Any, Any}

g1_3() = invoke(f1, t1(), 1, 2)
g1_4() = invoke(f1, t2(), 1, 2)

@assert g1_1() == g1_2() == g1_3() == g1_4()

@code_llvm g1_1()
@code_llvm g1_2()
@code_llvm g1_3()
@code_llvm g1_4()

function timing(f, args...)
    println(f, args)
    f(args...)  # warm up: make sure f is compiled before timing
    gc()        # start the measurement from a clean heap
    @time for i in 1:10000000
        f(args...)
    end
    gc()
end

timing(g1_1)
timing(g1_2)
timing(g1_3)
timing(g1_4)

Master, without the @code_llvm output:

g1_1()
elapsed time: 0.308610973 seconds (0 bytes allocated)
g1_2()
elapsed time: 1.89693006 seconds (1449 MB allocated, 1.36% gc time in 66 pauses with 0 full sweep)
g1_3()
elapsed time: 1.892066647 seconds (1449 MB allocated, 1.41% gc time in 66 pauses with 0 full sweep)
g1_4()
elapsed time: 1.883954768 seconds (1449 MB allocated, 1.41% gc time in 66 pauses with 0 full sweep)

With this PR, including the @code_llvm output:

define i64 @julia_g1_1_43914() {
top:
  %0 = call i64 @julia_f1_43903(i64 1, i64 2)
  ret i64 %0
}

define i64 @julia_g1_2_43917() {
top:
  %0 = call i64 @julia_f1_43903(i64 1, i64 2)
  ret i64 %0
}

define i64 @julia_g1_3_43918() {
top:
  %0 = call i64 @julia_f1_43903(i64 1, i64 2)
  ret i64 %0
}

define i64 @julia_g1_4_43919() {
top:
  %0 = call %jl_value_t* @julia_t2_43907()
  %1 = call i64 @julia_f1_43903(i64 1, i64 2)
  ret i64 %1
}
g1_1()
elapsed time: 0.319919904 seconds (0 bytes allocated)
g1_2()
elapsed time: 0.372261468 seconds (0 bytes allocated)
g1_3()
elapsed time: 0.323118056 seconds (0 bytes allocated)
g1_4()
elapsed time: 1.10516864 seconds (1449 MB allocated, 2.79% gc time in 66 pauses with 0 full sweep)
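
As a quick cross-check of the "0 bytes allocated" numbers above, one could also assert on allocations directly. This is a hypothetical addition, not part of the original test script, and assumes a Julia version with Base's @allocated macro:

# Hypothetical cross-check: with this patch, the statically resolved
# invoke callsites should not allocate once compiled.
g1_2(); g1_3()   # warm up so compilation is not measured
@assert (@allocated g1_2()) == 0
@assert (@allocated g1_3()) == 0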

@yuyichao (Contributor, Author)

I don't know what's up with the error on Travis CI (which seems to be popping up randomly on almost all commits), and is AppVeyor having some connection issue?

@pao (Member) commented Apr 24, 2015

The 32-bit error on Travis is indeed tupocalypse fallout and it's being worked on.

@tkelman (Contributor) commented Apr 24, 2015

I restarted the builds, though https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.4069/job/wwdnwkpxha87m2me looked like a timeout when trying to execute a trivial command with win64 Julia without using sys.dll. We've had subtle issues there before; we'll see if it happens a second time.

@yuyichao (Contributor, Author)

Was the issue repeatable before?
I asked whether it's a connection issue because both builds 1 and 2 stop responding at ~11 minutes, at different places...

@tkelman (Contributor) commented Apr 24, 2015

> Was the issue repeatable before?

Yes, and that's why we added a new test for it.

This second link https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.4066/job/o3777iciq624y3jd looks like something intermittent that we have seen before (possibly #7942), where the freeze occurs during one of the first few linalg tests. The first link (build 4069) froze during the newly added test for executing a trivial command without sys.dll present. If you didn't change any substantial code between the two builds, maybe build 4069 was a fluke.

@yuyichao (Contributor, Author)

@tkelman Well, I rebased onto the latest master between the two builds I referred to. I haven't changed anything since the second one, and it seems the build has already passed the point where it got stuck before.

@tkelman (Contributor) commented Apr 24, 2015

I'm fine with calling it a probably-#7942-related fluke, then.

@@ -1730,6 +1730,78 @@ DLLEXPORT jl_value_t *jl_gf_invoke_lookup(jl_function_t *gf, jl_datatype_t *type
return (jl_value_t*)m;
}

jl_function_t*
jl_gf_invoke_get_specialization(jl_function_t *gf, jl_tupletype_t *types,
Review comment (Member):
There seems to be a lot of code duplication between this method and jl_gf_invoke. Can you refactor the two so that the common bits are shared?

yuyichao (Contributor, Author):

I've thought about just converting the argument to jl_gf_invoke to a tupletype and using this function to get the specialized function. I didn't bother too much (not that I like this kind of code duplication either, though) because IMHO this is basically the same duplication as between jl_method_table_assoc_exact and jl_method_table_assoc_exact_by_type.

Another way I've thought about is to just use a (more C-friendly) pointer to an array of types, but that would again mean rewriting/merging jl_method_table_assoc_exact and jl_method_table_assoc_exact_by_type, and I hoped to keep this patch as small as possible.

Which one do you prefer?

yuyichao (Contributor, Author):

Done.

@yuyichao (Contributor, Author)

@vtjnash So I fixed two of the three issues you brought up.

For the third one, it seems that emit_call makes a lot of assumptions about what emit_known_call might do, and I feel like I would have to pass back too much information (nargs/args, f) without gaining much.

I've tried replacing the emit_nthptr_recast with just mfunc->linfo->functionObject and got the same segfault as before.

@yuyichao (Contributor, Author)

And somehow the x86_64 test froze again... :(

@yuyichao force-pushed the inline-invoke1 branch 6 times, most recently from c8c79a9 to cb78497 on April 28, 2015
@yuyichao changed the title from "Inline callsite of invoke when possible" to "Inline callsite of invoke when possible (invoke improvement No. 1)" on May 6, 2015
@yuyichao (Contributor, Author)

I think I've finally understood how this code works (thanks to #11439) and why it was crashing before.

As @carnaval guessed, it was related to GC: as I noted in commit 8dce033#diff-6d4d21428a67320600faf5a1a9f3a16aR2437, the function object can be collected while it is being passed to the jlcall function. This shouldn't be an issue now, not even with #11439, since a non-wrapper jlcall specialization never accesses the function object (first) argument. I'm still not entirely happy that there's a useless and possibly invalid pointer in the generated code, though...

And I have changed it to use f->linfo->functionObject instead of emit_nthptr_recast, as @vtjnash suggested; I don't really understand why it didn't work before...

@yuyichao added the "performance (Must go faster)" label on Jun 30, 2015
@yuyichao (Contributor, Author)

Replaced by #18444

@yuyichao closed this on Sep 11, 2016
@yuyichao deleted the inline-invoke1 branch on September 11, 2016