Binary search for pkgimage gc #48940
Conversation
Seems to improve GC times in the OmniPackage benchmark:
I didn't profile it yet to see what fraction of the GC time is spent on the lookup algorithm itself, though.
I get some slightly larger numbers for OmniPackage, with:
jl_object_in_image: eyt_obj_in_img:
Thanks, this is great! Ideally we would have
An alternative would be to use an extra GC bit to encode whether the object is in the sysimg. Though it would be preferable to save GC bits in case we want to extend the number of generations in the future.
We could special-case the sysimg in particular, since we expect it to be large and have many objects, but I doubt that's extensible to all package images without using another GC bit.
The profile is quite massive, but the GC time seems to be mark-dominated, with some part of it spent calculating the tree.
How much is used to calculate the tree? Given that it's just a sort-and-iterate over all of the packages once per GC, I can't really imagine that being most of the time spent.
Just a second, because I reduced the number of threads and now profiling deadlocks :).
Attached profile.zip. Most GCs look like this.
I don't think we'll be able to push the binary search approach much farther than it is right now: the inner assembly loop is roughly 7 instructions and runs for 10 iterations with 500 packages. The log(n) growth is helpful in reducing the actual runtime as package count rises, but it's also hard to supplement with a more complicated filtering setup, as the filter needs to eliminate ~50% of the search space with just 7 instructions.
Yeah, the loop is already as fast as it can be, I imagine. It's already a nice improvement. Though it would be nice to have some way of it being
FWIW, we search through tens of thousands of pools currently to find the applicable metadata bits, and it does not show up in that profile at all. We should be able to search through hundreds of package images without much performance cost too.
Would it be possible to move this into staticdata.c, so that all object_in_image queries could be accelerated, not just those for GC? |
Seconding Jameson's point. The
You can use pprof directly from the
Note that while
It would be great if we could have all the image data in a certain address range. |
That should be very possible, right? |
But the precompilation happens in a separate process; can you capture its
I'd hazard a guess that making
Sounds feasible if we allocate a giant block at the beginning. But I'm guessing there are downsides to that? I don't know many memory tricks; maybe others know other ways.
Oh, my pprof data was from using; for precompiling it's quite a bit more annoying.
As a note, right now we permalloc pkgimages; there is a discussion (x-ref: #48215) about wanting to keep the data section. My original thinking was that we want to ... We have three options IIRC:
If we permalloc, it might be feasible to reserve a large range of address space. It's not clear to me what the best solution is.
Theoretically, 32-bit has a lot more free GC bits (36) than 64-bit has (at least 4, and up to 20).
Looking forward to seeing how much this helps...

julia> @time GC.gc()
 17.412473 seconds (100.00% gc time)

julia> @time GC.gc()
 20.713867 seconds (99.92% gc time)

julia> @time GC.gc()
 17.479863 seconds (100.00% gc time)
I can't reproduce this on my machine:
Timings from a gc heavy benchmark on Julia master:

40.410133 seconds (50.76 M allocations: 7.422 GiB, 62.57% gc time, 88.83% compilation time)
78.515218 seconds (379.61 M allocations: 95.621 GiB, 61.52% gc time, 97.92% compilation time)
14.417356 seconds (31.38 M allocations: 6.467 GiB, 66.60% gc time, 61.01% compilation time)
1.062358 seconds (18.91 M allocations: 5.552 GiB, 37.68% gc time)
17.594696 seconds (334.57 M allocations: 93.050 GiB, 41.82% gc time)
17.100889 seconds (334.57 M allocations: 93.050 GiB, 42.83% gc time)

On this PR:

32.199591 seconds (47.61 M allocations: 7.376 GiB, 49.19% gc time, 92.05% compilation time)
57.183759 seconds (374.03 M allocations: 95.429 GiB, 53.48% gc time, 534.88% compilation time)
15.197605 seconds (39.56 M allocations: 6.859 GiB, 53.85% gc time, 68.14% compilation time)
0.828262 seconds (18.92 M allocations: 5.553 GiB, 59.38% gc time)
11.451439 seconds (333.75 M allocations: 92.927 GiB, 65.75% gc time)
12.137801 seconds (333.75 M allocations: 92.927 GiB, 67.98% gc time)

This PR does seem to help a lot.
🤔 |
It's pretty common for ... I think this may be most common with multithreaded code. This has been the case for a long time, and is unrelated to this PR.
Can this also be run on 1.8? |
Since we're migrating to running type inference on multiple threads, it's hard to define what the notion of compilation time percentage means (% of time per thread? % of total time where any thread was compiling anything?). When I did the timer rework, I defined compilation time as the total time spent among all threads within the compiler; since multiple threads can be within the compiler simultaneously, that can result in compilation times larger than 100%. It definitely shouldn't exceed threads * 100%, though, so if that's being observed, that's an issue.
So this PR makes % GC time significantly worse, but absolute runtime significantly better, working out to a net-zero change in absolute GC time??
Timings look quite different:

21.184783 seconds (73.82 M allocations: 8.159 GiB, 9.07% gc time, 88.86% compilation time)
77.685086 seconds (418.40 M allocations: 96.752 GiB, 45.92% gc time, 36.74% compilation time)
9.935663 seconds (39.48 M allocations: 6.901 GiB, 14.77% gc time, 57.80% compilation time)
1.467435 seconds (18.92 M allocations: 5.553 GiB, 51.89% gc time)
40.684633 seconds (333.70 M allocations: 92.923 GiB, 78.39% gc time)
39.284163 seconds (333.70 M allocations: 92.923 GiB, 77.99% gc time)

That first compile time improved, but the second matches master (and is worse than on this PR), while runtimes are much worse.
I have 36 threads, so that is not being observed.

EDIT: Reducing the number of allocations in the last two examples (but also optimizing the runtime performance):

4.896281 seconds (200.20 M allocations: 39.717 GiB, 61.95% gc time)
4.844229 seconds (200.20 M allocations: 39.717 GiB, 62.22% gc time)

These correspond to:

11.451439 seconds (333.75 M allocations: 92.927 GiB, 65.75% gc time)
12.137801 seconds (333.75 M allocations: 92.927 GiB, 67.98% gc time)
This seems highly suspect; I can't imagine a case where we do so much better in absolute time while GC time is unaffected, given that this is a fairly large algorithmic change that primarily affects the GC mark loop and shows up in Gabriel's profile.
This seems fairly concerning, coming from the worker which had just loaded a bunch of these jll package images.
I can reproduce that failure just by trying to precompile CSV, once with the error above, and another time with the following:

Failed to precompile CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b] to "/var/folders/6f/7ndzy5jj18s0z4954jsb9kbr0000gn/T/jl_VOVn5a/compiled/v1.10/CSV/jl_5rkxii".
Assertion failed: (((i % 2 == 1) || eytzinger_image_tree.items[eyt_idx] == (void*)eyt_val) && "Eytzinger tree failed to find object!"), function rebuild_image_blob_tree, file gc.c, line 275.

I will check if ASAN does something on Linux here.

EDIT: ASAN found a failure at the same place.
There's an off-by-one error here somewhere.
I get

when I try to backport this to 1.9. Maybe I made some mistake... Could you try it @pchintalapudi?

Edit: But not on my Linux machine... Guess I will have to try again on the Mac later.
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
(cherry picked from commit bc33c81)
(cherry picked from commit 40692cca1619a06991bebabc162e531255d99ddd)
This addresses the binary search TODO above `jl_object_in_image` by creating an Eytzinger binary search tree to make that function much faster in GC marking. Might help with #48923.