Binary search for pkgimage gc #48940

Merged: 5 commits merged into master from pc/gc-bsearch3 on Mar 29, 2023
Conversation

pchintalapudi
Member

This addresses the binary search TODO above jl_object_in_image by creating an Eytzinger binary search tree to make that function much faster in GC marking. Might help with #48923.
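
As a refresher, here is a minimal sketch of the membership test in plain C (this uses an ordinary sorted-array binary search rather than the Eytzinger layout the PR introduces, and the names and sizes are illustrative, not the gc.c code):

#include <stddef.h>
#include <stdint.h>

/* Sorted, interleaved image boundaries: start0, end0, start1, end1, ... */
static uintptr_t image_bounds[1024];
static size_t n_bounds;

/* A pointer lies inside some [start, end) image range iff the number of
 * boundaries <= p is odd, so one binary search answers "is this object in
 * any image?" in O(log n) instead of a linear scan over every image. */
static int ptr_in_any_image(uintptr_t p)
{
    size_t lo = 0, hi = n_bounds;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (image_bounds[mid] <= p)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo & 1;
}

The Eytzinger layout stores the same boundaries in breadth-first order, which makes the memory accesses of the search more predictable; the query logic itself is unchanged.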

@d-netto
Member

d-netto commented Mar 8, 2023

Seems to improve GC times in the OmniPackage benchmark:

julia> @time include("src/OmniPackage.jl")
 69.209397 seconds (85.27 M allocations: 5.660 GiB, 43.88% gc time, 21.21% compilation time: 89% of which was recompilation)
Julia Version 1.10.0-DEV.749
Commit 0f5f62cf5a* (2023-03-06 19:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.2.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
julia> @time include("src/OmniPackage.jl")
 48.207285 seconds (85.58 M allocations: 5.680 GiB, 21.36% gc time, 23.49% compilation time: 85% of which was recompilation)
Julia Version 1.10.0-DEV.750
Commit 77d47d52ee (2023-03-08 00:42 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.2.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1

I didn't profile it yet to see what fraction of the GC time is spent on the lookup algorithm itself, though.

@pchintalapudi
Member Author

I get some slightly larger numbers for OmniPackage, with:

rm -rf ~/.julia; make -j80 -C src && ./julia -E 'import Pkg; Pkg.add(url="https://github.com/JuliaComputing/OmniPackage.jl"); Pkg.precompile(); @time using OmniPackage;'

jl_object_in_image:
119.805792 seconds (80.69 M allocations: 5.354 GiB, 48.71% gc time, 16.33% compilation time: 80% of which was recompilation)

eyt_obj_in_img:
81.350924 seconds (80.78 M allocations: 5.360 GiB, 25.09% gc time, 19.05% compilation time: 78% of which was recompilation)

@vchuravy
Member

vchuravy commented Mar 8, 2023

Thanks, this is great! Ideally we would have O(1) rejection, e.g. for quickly determining that an object is not in the sysimg.

@d-netto
Member

d-netto commented Mar 8, 2023

Ideally we would have O(1) rejection

An alternative would be to use an extra GC bit to encode whether the object is in the sysimg. Though it would be preferable to save GC bits in case we want to extend the number of generations in the future.
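
A minimal sketch of that alternative, with a purely hypothetical bit position and header layout (this is not Julia's actual object tag encoding):

#include <stdint.h>

/* HYPOTHETICAL: assume one spare low bit of the object header tag is set
 * when an object is serialized into the sysimg or a pkgimage. */
#define IN_IMAGE_BIT ((uintptr_t)0x4)

static inline int obj_in_image_bit(uintptr_t header_word)
{
    return (header_word & IN_IMAGE_BIT) != 0; /* one load + mask: O(1) */
}

The cost is permanently spending a tag bit that could otherwise go toward more GC generations, as noted above.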

@pchintalapudi
Member Author

We could special-case the sysimg in particular since we expect that to be large and have many objects, but I doubt that's extensible to all package images without using another GC bit.

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

The profile is quite massive, but the GC time seems to be mark-dominated, with some part of it spent calculating the tree.

@pchintalapudi
Member Author

How much of it is spent calculating the tree? Given that it's just a sort-and-iterate over all of the packages once per GC, I can't really imagine that being most of the time spent.
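
For reference, the rebuild really is just a sort plus one linear pass; a hedged sketch of turning a sorted boundary array into the Eytzinger (breadth-first) layout, with illustrative names rather than the actual gc.c helpers:

/* In-order walk of the implicit heap: the children of 1-based node k live at
 * 2k and 2k+1, so filling nodes in-order visits the sorted values in
 * ascending order. O(n) after the sort; eyt must have n + 1 slots (index 0
 * is unused). */
static size_t eytzinger_fill(const uintptr_t *sorted, size_t n,
                             uintptr_t *eyt, size_t i, size_t k)
{
    if (k <= n) {
        i = eytzinger_fill(sorted, n, eyt, i, 2 * k);     /* left subtree  */
        eyt[k] = sorted[i++];                             /* current node  */
        i = eytzinger_fill(sorted, n, eyt, i, 2 * k + 1); /* right subtree */
    }
    return i;
}

/* usage: eytzinger_fill(sorted, n, eyt, 0, 1); */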

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

Just a second, because I reduced the number of threads and now profiling deadlocks :).

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

profile.zip
This is a pprof profile of it, though it might be lacking source information; I'm not sure exactly how that will work.

[profile screenshot]

Most GCs look like this.

And this is how we spend the time in the new code:
[profile screenshot]

@pchintalapudi
Member Author

I don't think we'll be able to push the binary search approach much farther than it is right now: the inner assembly loop is roughly 7 instructions and runs for 10 iterations with 500 packages. The log(n) growth is helpful in reducing the actual runtime as package count rises, but it's also hard to supplement with a more complicated filtering setup, as the filter needs to eliminate ~50% of the search space with just 7 instructions.
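
To make the instruction count concrete, the hot loop is essentially the standard branchless Eytzinger descent, sketched here with illustrative names (not the literal gc.c code):

/* One load, one compare, and one index update per level, repeated about
 * ceil(log2(n)) times, which lines up with the roughly 10 iterations
 * mentioned above for ~500 packages. eyt is 1-based with n entries. */
static size_t eyt_upper_bound(const uintptr_t *eyt, size_t n, uintptr_t p)
{
    size_t k = 1;
    while (k <= n)
        k = 2 * k + (eyt[k] <= p);   /* go right iff this boundary <= p */
    /* undo the trailing right turns to recover the answer node */
    while (k & 1)
        k >>= 1;
    k >>= 1;
    return k; /* 1-based node of the smallest boundary > p, or 0 if none */
}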

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

Yeah, I imagine the loop is already as fast as it can be. It's already a nice improvement, though it would be nice to have some way of making it O(1).

@vtjnash
Member

vtjnash commented Mar 8, 2023

FWIW, we search through tens of thousands of pools currently to find the applicable metadata bits, and it does not show up in that profile at all. We should be able to search through hundreds of package images without much performance cost too.

@vchuravy added the performance (Must go faster), GC (Garbage collector), backport 1.9 (Change should be backported to release-1.9), and pkgimage labels on Mar 8, 2023
@vtjnash
Member

vtjnash commented Mar 8, 2023

Would it be possible to move this into staticdata.c, so that all object_in_image queries could be accelerated, not just those for GC?

@timholy
Member

timholy commented Mar 8, 2023

Seconding Jameson's point. The object_in_image queries are presumably a big cost in precompilation (I'm not yet a pprof user so this is harder to profile due to the separate process), so it might speed up precompilation of "late-in-the-dependency-chain" packages quite a bit.

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

You can use pprof directly from the @profile data. That's what I did.

@pchintalapudi
Member Author

Note that while jl_object_in_image is now faster, external_blob_index still retains the linear search behavior and will continue to be a cost. We'd need to store even more metadata in order to make that method fast (basically a second Eytzinger tree).
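
A hedged sketch of what that extra metadata could look like, continuing the illustrative sketches above (an assumption about one possible layout, not the actual staticdata.c structures): a parallel array that maps each tree node back to its blob, so the same descent that answers "in an image?" can also answer "which image?".

static uintptr_t eyt_bounds[1024]; /* Eytzinger-ordered start/end boundaries (1-based) */
static int32_t   eyt_blob[1024];   /* blob index if the node is an end boundary, else -1 */

static int32_t blob_index_for(uintptr_t p, size_t n)
{
    /* If p lies inside blob i, the smallest boundary above p is blob i's end. */
    size_t k = eyt_upper_bound(eyt_bounds, n, p); /* descent from the sketch above */
    return k ? eyt_blob[k] : -1;                  /* -1: not in any blob */
}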

@JeffBezanson
Member

It would be great if we could have all the image data in a certain address range.

@oscardssmith
Member

That should be very possible, right?

@timholy
Member

timholy commented Mar 8, 2023

You can use pprof directly from the @profile data. That's what I did.

But the precompilation happens in a separate process; can you capture its @profile data so easily?

Note that while jl_object_in_image is now faster, external_blob_index still retains the linear search behavior and will still continue to be a cost.

I'd hazard a guess that making jl_object_in_image fast is probably the biggest win.

It would be great if we could have all the image data in a certain address range.

Sounds feasible if we allocate a giant block at the beginning. But I'm guessing there are downsides to that? I don't know many memory tricks, maybe others know other ways.

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

Oh, my pprof data was from package loading (using); for precompilation it's quite a bit more annoying.

@vchuravy
Member

vchuravy commented Mar 8, 2023

As a note, right now we permalloc pkgimages; there is a discussion (x-ref: #48215) about where we want to keep the data section. My original thinking was that we want to mmap it (through dlopen), but that means the allocation is out of our hands.

We have three options IIRC:

  • dlopen the so and copy+permalloc the data section
  • move the data section back to the .ji and permalloc it
  • dlopen the so and don't do permalloc.

If we permalloc, it might be feasible to reserve a large range of address space.
Alternatively, on 64-bit we could use an extra bit for a permanent generation.

It's not clear to me what the best solution is.
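
A minimal sketch of the "reserve a large range" option, assuming Linux-style mmap flags and made-up sizes (not something the runtime currently does): reserve address space up front with PROT_NONE, permalloc image data only inside that region, and the in-image test collapses to a single range check, i.e. the O(1) rejection asked for above.

#include <stdint.h>
#include <sys/mman.h>

#define IMAGE_RESERVE_SIZE (64ull << 30) /* hypothetical 64 GiB of address space */

static uintptr_t image_region_base;

static int reserve_image_region(void)
{
    /* PROT_NONE + MAP_NORESERVE reserves address space without committing
     * memory; pages would be committed (mprotect / mmap MAP_FIXED) as image
     * data is actually loaded into the region. */
    void *p = mmap(NULL, IMAGE_RESERVE_SIZE, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    image_region_base = (uintptr_t)p;
    return 0;
}

static inline int obj_in_image_range(uintptr_t p)
{
    return p - image_region_base < IMAGE_RESERVE_SIZE; /* one subtract + compare */
}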

@vtjnash
Member

vtjnash commented Mar 8, 2023

Theoretically, 32-bit has a lot more free GC bits (36) than 64-bit has (at least 4, and up to 20)

@chriselrod
Contributor

Looking forward to seeing how much this helps...

julia> @time GC.gc()
 17.412473 seconds (100.00% gc time)

julia> @time GC.gc()
 20.713867 seconds (99.92% gc time)

julia> @time GC.gc()
 17.479863 seconds (100.00% gc time)

@d-netto
Member

d-netto commented Mar 9, 2023

I can't reproduce this on my machine:

julia> versioninfo()
Julia Version 1.10.0-DEV.749
Commit 0f5f62cf5a* (2023-03-06 19:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.2.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

julia> @time for i in 1:25; GC.gc(); GC.gc(); GC.gc(); end
  2.201666 seconds (99.96% gc time)
julia> versioninfo()
Julia Version 1.10.0-DEV.751
Commit 4c75ad09ac (2023-03-08 18:09 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.2.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

julia> @time for i in 1:25; GC.gc(); GC.gc(); GC.gc(); end
  2.244798 seconds (99.95% gc time)

@chriselrod
Contributor

Timings from a gc heavy benchmark on Julia master:

 40.410133 seconds (50.76 M allocations: 7.422 GiB, 62.57% gc time, 88.83% compilation time)
 78.515218 seconds (379.61 M allocations: 95.621 GiB, 61.52% gc time, 97.92% compilation time)
 14.417356 seconds (31.38 M allocations: 6.467 GiB, 66.60% gc time, 61.01% compilation time)
  1.062358 seconds (18.91 M allocations: 5.552 GiB, 37.68% gc time)
 17.594696 seconds (334.57 M allocations: 93.050 GiB, 41.82% gc time)
 17.100889 seconds (334.57 M allocations: 93.050 GiB, 42.83% gc time)

On this PR:

 32.199591 seconds (47.61 M allocations: 7.376 GiB, 49.19% gc time, 92.05% compilation time)
 57.183759 seconds (374.03 M allocations: 95.429 GiB, 53.48% gc time, 534.88% compilation time)
 15.197605 seconds (39.56 M allocations: 6.859 GiB, 53.85% gc time, 68.14% compilation time)
  0.828262 seconds (18.92 M allocations: 5.553 GiB, 59.38% gc time)
 11.451439 seconds (333.75 M allocations: 92.927 GiB, 65.75% gc time)
 12.137801 seconds (333.75 M allocations: 92.927 GiB, 67.98% gc time)

This PR does seem to help a lot.
The runtimes (the last three timings) improved, even though the reported GC time is still bad.

@giordano
Contributor

 57.183759 seconds (374.03 M allocations: 95.429 GiB, 53.48% gc time, 534.88% compilation time)

🤔

@chriselrod
Contributor

chriselrod commented Mar 14, 2023

 57.183759 seconds (374.03 M allocations: 95.429 GiB, 53.48% gc time, 534.88% compilation time)

🤔

It's pretty common for @time to report >100% compile time.

I think this may be most common with multithreaded code?
Sometimes, I also have the impression that the threads spin while waiting for single threaded compilation, but I've not actually tested/investigated enough to confirm that.

This has been the case for a long time, and is unrelated to this PR.
It just happened to show up for that particular example on this PR and not on master.

@KristofferC
Member

Timings from a gc heavy benchmark on Julia master:

Can this also be run on 1.8?

@pchintalapudi
Member Author

Since we're migrating to running type inference on multiple threads, it's hard to define what compilation time percentage means (% of time per thread? % of total time where any thread was compiling anything?). When I did the timer rework, I defined compilation time as the total time spent among all threads within the compiler, and since multiple threads can be within the compiler at once, that can result in larger than 100% compilation times. It definitely shouldn't exceed threads * 100% though, so if that's being observed, that's an issue.
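
As an illustrative example with made-up numbers: if three threads each spend 30 seconds inside the compiler during a 45-second @time window, the report shows 90 / 45 = 200% compilation time, which is expected under this definition; only a figure above nthreads * 100% (here 300%) would indicate a bug.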

@vtjnash
Member

vtjnash commented Mar 14, 2023

So this PR makes % GC time significantly worse, but absolute runtime significantly better, working out to net zero absolute GC time change??

@chriselrod
Contributor

chriselrod commented Mar 14, 2023

Timings from a gc heavy benchmark on Julia master:

Can this also be run on 1.8?

Timings look quite different:

 21.184783 seconds (73.82 M allocations: 8.159 GiB, 9.07% gc time, 88.86% compilation time)
 77.685086 seconds (418.40 M allocations: 96.752 GiB, 45.92% gc time, 36.74% compilation time)
  9.935663 seconds (39.48 M allocations: 6.901 GiB, 14.77% gc time, 57.80% compilation time)
  1.467435 seconds (18.92 M allocations: 5.553 GiB, 51.89% gc time)
 40.684633 seconds (333.70 M allocations: 92.923 GiB, 78.39% gc time)
 39.284163 seconds (333.70 M allocations: 92.923 GiB, 77.99% gc time)

That first compile time improved, but the second matches master (and is worse than on this PR), while runtimes are much worse.

It definitely shouldn't exceed threads * 100% though, so if that's being observed that's an issue.

I have 36 threads, so that is not being observed.
The performance profiles look quite similar on 4 threads using my M1 mac mini.

EDIT: Reducing the number of allocations on the last two examples (but also optimizing the runtime performance)...

  4.896281 seconds (200.20 M allocations: 39.717 GiB, 61.95% gc time)
  4.844229 seconds (200.20 M allocations: 39.717 GiB, 62.22% gc time)

These correspond to

 11.451439 seconds (333.75 M allocations: 92.927 GiB, 65.75% gc time)
 12.137801 seconds (333.75 M allocations: 92.927 GiB, 67.98% gc time)

@pchintalapudi
Member Author

So this PR makes % GC time significantly worse, but absolute runtime time significantly better, working out to net zero absolute GC time change??

This seems highly suspect; I can't imagine a case where we are doing so much better in absolute time but GC time is unaffected, given that this is a fairly large algorithmic change that primarily affects the GC mark loop and shows up on Gabriel's profile.

@vtjnash
Member

vtjnash commented Mar 15, 2023

This seems fairly concerning, coming from a worker which had just loaded a bunch of these JLL package images:

Logging                                          (3) |        started at 2023-03-14T17:57:57.049
      From worker 3:	julia(70445,0x1f51bc140) malloc: Heap corruption detected, free list is damaged at 0x600008725c20
      From worker 3:	*** Incorrect guard value: 15
      From worker 3:	julia(70445,0x1f51bc140) malloc: *** set a breakpoint in malloc_error_break to debug
      From worker 3:
      From worker 3:	[70445] signal (6): Abort trap: 6
      From worker 3:	in expression starting at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-5.0/build/default-macmini-aarch64-5-0/julialang/julia-master/julia-6283124d89/share/julia/stdlib/v1.10/Logging/test/runtests.jl:9
      From worker 3:	__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
      From worker 3:	Allocations: 2096664393 (Pool: 2094941610; Big: 1722783); GC: 1217

@gbaraldi
Member

gbaraldi commented Mar 20, 2023

I can reproduce that failure just by trying to precompile CSV: once with the error above, and once with the following:

  Failed to precompile CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b] to "/var/folders/6f/7ndzy5jj18s0z4954jsb9kbr0000gn/T/jl_VOVn5a/compiled/v1.10/CSV/jl_5rkxii".
  Assertion failed: (((i % 2 == 1) || eytzinger_image_tree.items[eyt_idx] == (void*)eyt_val) && "Eytzinger tree failed to find object!"), function rebuild_image_blob_tree, file gc.c, line 275.

I will check whether ASAN catches something on Linux here.

EDIT: ASAN found a failure at the same place:

==351565==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x611002285e30 at pc 0x7f8390454ec9 bp 0x7fff82b6f900 sp 0x7fff82b6f8f8
READ of size 8 at 0x611002285e30 thread T0
    #0 0x7f8390454ec8 in rebuild_image_blob_tree /home/gabrielbaraldi/julia2/src/gc.c:276:9
    #1 0x7f83903de426 in jl_restore_system_image_from_stream_ /home/gabrielbaraldi/julia2/src/staticdata.c:3154:5
    #2 0x7f83903b9798 in jl_restore_package_image_from_stream /home/gabrielbaraldi/julia2/src/staticdata.c:3220:13
    #3 0x7f83903b8783 in jl_restore_incremental_from_buf /home/gabrielbaraldi/julia2/src/staticdata.c:3266:23
    #4 0x7f83903bb702 in ijl_restore_package_image_from_file /home/gabrielbaraldi/julia2/src/staticdata.c:3350:23
    #5 0x7f8378860154 in julia__include_from_serialized_35339 loading.jl:1011
    #6 0x7f83776a8a8c in julia__tryrequire_from_serialized_35509 loading.jl:1341
    #7 0x7f8377691066 in julia__require_search_from_serialized_35529 loading.jl:1434
    #8 0x7f8378f6115a in julia__require_35709 loading.jl:1774
    #9 0x7f8378f6f4ce in julia__require_prelocked_35609 loading.jl:1654
    #10 0x7f8378105f9c in macro expansion loading.jl:1640
    #11 0x7f8378105f9c in macro expansion lock.jl:267
    ....
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/gabrielbaraldi/julia2/src/gc.c:276:9 in rebuild_image_blob_tree
Shadow bytes around the buggy address:
  0x0c2280448b70: 00 00 00 00 00 00 fa fa fa fa fa fa fa fa fa fa
  0x0c2280448b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c2280448b90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa fa
  0x0c2280448ba0: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x0c2280448bb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c2280448bc0: 00 00 00 00 00 00[fa]fa fa fa fa fa fa fa fa fa
  0x0c2280448bd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2280448be0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2280448bf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2280448c00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2280448c10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00

@gbaraldi
Member

There's an off-by-one error here somewhere.

@KristofferC mentioned this pull request on Mar 24, 2023
@vtjnash added the merge me (PR is reviewed. Merge when all tests are passing) label on Mar 28, 2023
@vtjnash requested review from JeffBezanson and removed the request for vchuravy and topolarity on March 28, 2023
@vtjnash merged commit bc33c81 into master on Mar 29, 2023
@vtjnash deleted the pc/gc-bsearch3 branch on March 29, 2023
@KristofferC
Member

KristofferC commented Mar 30, 2023

I get

Assertion failed: ((uintptr_t) jl_linkage_blobs.items[i] % 4 == 0 && "Linkage blob not 4-byte aligned!"), function rebuild_image_blob_tree, file gc.c, line 263.

when I try to backport this to 1.9. Maybe I made some mistake... Could you try it, @pchintalapudi?

Edit: But not on my Linux machine. I guess I will have to try again on the Mac later.

KristofferC pushed a commit that referenced this pull request Mar 30, 2023
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
(cherry picked from commit bc33c81)
(cherry picked from commit 40692cca1619a06991bebabc162e531255d99ddd)
@KristofferC removed the backport 1.9 (Change should be backported to release-1.9) label on Mar 31, 2023
@pchintalapudi removed the merge me (PR is reviewed. Merge when all tests are passing) label on Apr 3, 2023
Xnartharax pushed a commit to Xnartharax/julia that referenced this pull request Apr 19, 2023
Co-authored-by: Jameson Nash <vtjnash@gmail.com>