Binary search for pkgimage gc #48940

Merged: 5 commits merged into master from pc/gc-bsearch3 on Mar 29, 2023
Conversation

pchintalapudi
Member

This addresses the binary search TODO above jl_object_in_image by creating an Eytzinger binary search tree to make that function much faster in GC marking. Might help with #48923.
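
As a refresher, here is a minimal sketch of the membership test in plain C (this uses an ordinary sorted-array binary search rather than the Eytzinger layout the PR introduces, and the names and sizes are illustrative, not the gc.c code):

#include <stddef.h>
#include <stdint.h>

/* Sorted, interleaved image boundaries: start0, end0, start1, end1, ... */
static uintptr_t image_bounds[1024];
static size_t n_bounds;

/* A pointer lies inside some [start, end) image range iff the number of
 * boundaries <= p is odd, so one binary search answers "is this object in
 * any image?" in O(log n) instead of a linear scan over every image. */
static int ptr_in_any_image(uintptr_t p)
{
    size_t lo = 0, hi = n_bounds;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (image_bounds[mid] <= p)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo & 1;
}

The Eytzinger layout stores the same boundaries in breadth-first order, which makes the memory accesses of the search more predictable; the query logic itself is unchanged.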

@d-netto
Member

d-netto commented Mar 8, 2023

Seems to improve GC times in the OmniPackage benchmark:

julia> @time include("src/OmniPackage.jl")
 69.209397 seconds (85.27 M allocations: 5.660 GiB, 43.88% gc time, 21.21% compilation time: 89% of which was recompilation)
Julia Version 1.10.0-DEV.749
Commit 0f5f62cf5a* (2023-03-06 19:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.2.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
julia> @time include("src/OmniPackage.jl")
 48.207285 seconds (85.58 M allocations: 5.680 GiB, 21.36% gc time, 23.49% compilation time: 85% of which was recompilation)
Julia Version 1.10.0-DEV.750
Commit 77d47d52ee (2023-03-08 00:42 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.2.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1

I didn't profile it yet to see what fraction of the GC time is spent on the lookup algorithm itself, though.

@pchintalapudi
Member Author

I get some slightly larger numbers for OmniPackage, with:

rm -rf ~/.julia; make -j80 -C src && ./julia -E 'import Pkg; Pkg.add(url="https://github.com/JuliaComputing/OmniPackage.jl"); Pkg.precompile(); @time using OmniPackage;'

jl_object_in_image:
119.805792 seconds (80.69 M allocations: 5.354 GiB, 48.71% gc time, 16.33% compilation time: 80% of which was recompilation)

eyt_obj_in_img:
81.350924 seconds (80.78 M allocations: 5.360 GiB, 25.09% gc time, 19.05% compilation time: 78% of which was recompilation)

@vchuravy
Member

vchuravy commented Mar 8, 2023

Thanks, this is great! Ideally we would have O(1) rejection, e.g. for quickly determining that an object is not in the sysimg.

@d-netto
Member

d-netto commented Mar 8, 2023

Ideally we would have O(1) rejection

An alternative would be to use an extra GC bit to encode whether the object is in the sysimg. Though it would be preferable to save GC bits in case we want to extend the number of generations in the future.
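
A minimal sketch of that alternative, with a purely hypothetical bit position and header layout (this is not Julia's actual object tag encoding):

#include <stdint.h>

/* HYPOTHETICAL: assume one spare low bit of the object header tag is set
 * when an object is serialized into the sysimg or a pkgimage. */
#define IN_IMAGE_BIT ((uintptr_t)0x4)

static inline int obj_in_image_bit(uintptr_t header_word)
{
    return (header_word & IN_IMAGE_BIT) != 0; /* one load + mask: O(1) */
}

The cost is permanently spending a tag bit that could otherwise go toward more GC generations, as noted above.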

@pchintalapudi
Member Author

We could special-case the sysimg in particular since we expect that to be large and have many objects, but I doubt that's extensible to all package images without using another GC bit.

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

The profile is quite massive, but the GC time seems to be mark-dominated, with some part of it spent calculating the tree.

@pchintalapudi
Member Author

How much of it is spent calculating the tree? Given that it's just a sort-and-iterate over all of the packages once per GC, I can't really imagine that being most of the time spent.
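
For reference, the rebuild really is just a sort plus one linear pass; a hedged sketch of turning a sorted boundary array into the Eytzinger (breadth-first) layout, with illustrative names rather than the actual gc.c helpers:

/* In-order walk of the implicit heap: the children of 1-based node k live at
 * 2k and 2k+1, so filling nodes in-order visits the sorted values in
 * ascending order. O(n) after the sort; eyt must have n + 1 slots (index 0
 * is unused). */
static size_t eytzinger_fill(const uintptr_t *sorted, size_t n,
                             uintptr_t *eyt, size_t i, size_t k)
{
    if (k <= n) {
        i = eytzinger_fill(sorted, n, eyt, i, 2 * k);     /* left subtree  */
        eyt[k] = sorted[i++];                             /* current node  */
        i = eytzinger_fill(sorted, n, eyt, i, 2 * k + 1); /* right subtree */
    }
    return i;
}

/* usage: eytzinger_fill(sorted, n, eyt, 0, 1); */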

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

Just a second, because I reduced the number of threads and now profiling deadlocks :).

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

profile.zip
This is a pprof profile of it, though it might be lacking source information; I'm not sure exactly how that will work.

[profile screenshot]

Most GCs look like this.

And this is how we spend the time in the new code:
[profile screenshot]

@pchintalapudi
Member Author

I don't think we'll be able to push the binary search approach much farther than it is right now: the inner assembly loop is roughly 7 instructions and runs for 10 iterations with 500 packages. The log(n) growth is helpful in reducing the actual runtime as package count rises, but it's also hard to supplement with a more complicated filtering setup, as the filter needs to eliminate ~50% of the search space with just 7 instructions.
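
To make the instruction count concrete, the hot loop is essentially the standard branchless Eytzinger descent, sketched here with illustrative names (not the literal gc.c code):

/* One load, one compare, and one index update per level, repeated about
 * ceil(log2(n)) times, which lines up with the roughly 10 iterations
 * mentioned above for ~500 packages. eyt is 1-based with n entries. */
static size_t eyt_upper_bound(const uintptr_t *eyt, size_t n, uintptr_t p)
{
    size_t k = 1;
    while (k <= n)
        k = 2 * k + (eyt[k] <= p);   /* go right iff this boundary <= p */
    /* undo the trailing right turns to recover the answer node */
    while (k & 1)
        k >>= 1;
    k >>= 1;
    return k; /* 1-based node of the smallest boundary > p, or 0 if none */
}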

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

Yeah, I imagine the loop is already as fast as it can be. It's already a nice improvement, though it would be nice to have some way of making it O(1).

@vtjnash
Member

vtjnash commented Mar 8, 2023

FWIW, we search through tens of thousands of pools currently to find the applicable metadata bits, and it does not show up in that profile at all. We should be able to search through hundreds of package images without much performance cost too.

@vchuravy added the performance (Must go faster), GC (Garbage collector), backport 1.9 (Change should be backported to release-1.9), and pkgimage labels on Mar 8, 2023
@vtjnash
Member

vtjnash commented Mar 8, 2023

Would it be possible to move this into staticdata.c, so that all object_in_image queries could be accelerated, not just those for GC?

@timholy
Member

timholy commented Mar 8, 2023

Seconding Jameson's point. The object_in_image queries are presumably a big cost in precompilation (I'm not yet a pprof user so this is harder to profile due to the separate process), so it might speed up precompilation of "late-in-the-dependency-chain" packages quite a bit.

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

You can use pprof directly from the @profile data. That's what I did.

@pchintalapudi
Member Author

Note that while jl_object_in_image is now faster, external_blob_index still retains the linear search behavior and will continue to be a cost. We'd need to store even more metadata in order to make that method fast (basically a second Eytzinger tree).
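
A hedged sketch of what that extra metadata could look like, continuing the illustrative sketches above (an assumption about one possible layout, not the actual staticdata.c structures): a parallel array that maps each tree node back to its blob, so the same descent that answers "in an image?" can also answer "which image?".

static uintptr_t eyt_bounds[1024]; /* Eytzinger-ordered start/end boundaries (1-based) */
static int32_t   eyt_blob[1024];   /* blob index if the node is an end boundary, else -1 */

static int32_t blob_index_for(uintptr_t p, size_t n)
{
    /* If p lies inside blob i, the smallest boundary above p is blob i's end. */
    size_t k = eyt_upper_bound(eyt_bounds, n, p); /* descent from the sketch above */
    return k ? eyt_blob[k] : -1;                  /* -1: not in any blob */
}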

@JeffBezanson
Member

It would be great if we could have all the image data in a certain address range.

@oscardssmith
Member

That should be very possible, right?

@timholy
Member

timholy commented Mar 8, 2023

You can use pprof directly from the @profile data. That's what I did.

But the precompilation happens in a separate process; can you capture its @profile data so easily?

Note that while jl_object_in_image is now faster, external_blob_index still retains the linear search behavior and will still continue to be a cost.

I'd hazard a guess that making jl_object_in_image fast is probably the biggest win.

It would be great if we could have all the image data in a certain address range.

Sounds feasible if we allocate a giant block at the beginning. But I'm guessing there are downsides to that? I don't know many memory tricks, maybe others know other ways.

@gbaraldi
Member

gbaraldi commented Mar 8, 2023

Oh, my pprof data was from package loading (using); for precompilation it's quite a bit more annoying.

@vchuravy
Member

vchuravy commented Mar 8, 2023

As a note, right now we permalloc pkgimages; there is a discussion (x-ref: #48215) about where we want to keep the data section. My original thinking was that we want to mmap it (through dlopen), but that means the allocation is out of our hands.

We have three options IIRC:

  • dlopen the so and copy+permalloc the data section
  • move the data section back to the .ji and permalloc it
  • dlopen the so and don't do permalloc.

If we permalloc, it might be feasible to reserve a large range of address space.
Alternatively, on 64-bit we could use an extra bit for a permanent generation.

It's not clear to me what the best solution is.
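
A minimal sketch of the "reserve a large range" option, assuming Linux-style mmap flags and made-up sizes (not something the runtime currently does): reserve address space up front with PROT_NONE, permalloc image data only inside that region, and the in-image test collapses to a single range check, i.e. the O(1) rejection asked for above.

#include <stdint.h>
#include <sys/mman.h>

#define IMAGE_RESERVE_SIZE (64ull << 30) /* hypothetical 64 GiB of address space */

static uintptr_t image_region_base;

static int reserve_image_region(void)
{
    /* PROT_NONE + MAP_NORESERVE reserves address space without committing
     * memory; pages would be committed (mprotect / mmap MAP_FIXED) as image
     * data is actually loaded into the region. */
    void *p = mmap(NULL, IMAGE_RESERVE_SIZE, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    image_region_base = (uintptr_t)p;
    return 0;
}

static inline int obj_in_image_range(uintptr_t p)
{
    return p - image_region_base < IMAGE_RESERVE_SIZE; /* one subtract + compare */
}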

@vtjnash
Member

vtjnash commented Mar 8, 2023

Theoretically, 32-bit has a lot more free GC bits (36) than 64-bit has (at least 4, and up to 20)

@chriselrod
Contributor

Looking forward to seeing how much this helps...

julia> @time GC.gc()
 17.412473 seconds (100.00% gc time)

julia> @time GC.gc()
 20.713867 seconds (99.92% gc time)

julia> @time GC.gc()
 17.479863 seconds (100.00% gc time)

@d-netto
Member

d-netto commented Mar 9, 2023

I can't reproduce this on my machine:

julia> versioninfo()
Julia Version 1.10.0-DEV.749
Commit 0f5f62cf5a* (2023-03-06 19:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.2.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

julia> @time for i in 1:25; GC.gc(); GC.gc(); GC.gc(); end
  2.201666 seconds (99.96% gc time)
julia> versioninfo()
Julia Version 1.10.0-DEV.751
Commit 4c75ad09ac (2023-03-08 18:09 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.2.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

julia> @time for i in 1:25; GC.gc(); GC.gc(); GC.gc(); end
  2.244798 seconds (99.95% gc time)

@chriselrod
Contributor

Timings from a gc heavy benchmark on Julia master:

 40.410133 seconds (50.76 M allocations: 7.422 GiB, 62.57% gc time, 88.83% compilation time)
 78.515218 seconds (379.61 M allocations: 95.621 GiB, 61.52% gc time, 97.92% compilation time)
 14.417356 seconds (31.38 M allocations: 6.467 GiB, 66.60% gc time, 61.01% compilation time)
  1.062358 seconds (18.91 M allocations: 5.552 GiB, 37.68% gc time)
 17.594696 seconds (334.57 M allocations: 93.050 GiB, 41.82% gc time)
 17.100889 seconds (334.57 M allocations: 93.050 GiB, 42.83% gc time)

On this PR:

 32.199591 seconds (47.61 M allocations: 7.376 GiB, 49.19% gc time, 92.05% compilation time)
 57.183759 seconds (374.03 M allocations: 95.429 GiB, 53.48% gc time, 534.88% compilation time)
 15.197605 seconds (39.56 M allocations: 6.859 GiB, 53.85% gc time, 68.14% compilation time)
  0.828262 seconds (18.92 M allocations: 5.553 GiB, 59.38% gc time)
 11.451439 seconds (333.75 M allocations: 92.927 GiB, 65.75% gc time)
 12.137801 seconds (333.75 M allocations: 92.927 GiB, 67.98% gc time)

This PR does seem to help a lot.
The runtimes (the last three timings) improved, even though the reported GC time is still bad.

@giordano
Contributor

 57.183759 seconds (374.03 M allocations: 95.429 GiB, 53.48% gc time, 534.88% compilation time)

🤔

@chriselrod
Contributor

chriselrod commented Mar 14, 2023

 57.183759 seconds (374.03 M allocations: 95.429 GiB, 53.48% gc time, 534.88% compilation time)

🤔

It's pretty common for @time to report >100% compile time.

I think this may be most common with multithreaded code?
Sometimes, I also have the impression that the threads spin while waiting for single threaded compilation, but I've not actually tested/investigated enough to confirm that.

This has been the case for a long time, and is unrelated to this PR.
It just happened to show up for that particular example on this PR and not on master.

@KristofferC
Member

Timings from a gc heavy benchmark on Julia master:

Can this also be run on 1.8?

@pchintalapudi
Member Author

Since we're migrating to running type inference on multiple threads, it's hard to define what compilation time percentage means (% of time per thread? % of total time where any thread was compiling anything?). When I did the timer rework, I defined compilation time as the total time spent among all threads within the compiler, and since multiple threads can be within the compiler at once, that can result in larger than 100% compilation times. It definitely shouldn't exceed threads * 100% though, so if that's being observed, that's an issue.
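
As an illustrative example with made-up numbers: if three threads each spend 30 seconds inside the compiler during a 45-second @time window, the report shows 90 / 45 = 200% compilation time, which is expected under this definition; only a figure above nthreads * 100% (here 300%) would indicate a bug.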

@vtjnash
Member

vtjnash commented Mar 14, 2023

So this PR makes % GC time significantly worse, but absolute runtime significantly better, working out to net zero absolute GC time change??

@chriselrod
Contributor

chriselrod commented Mar 14, 2023

Timings from a gc heavy benchmark on Julia master:

Can this also be run on 1.8?

Timings look quite different:

 21.184783 seconds (73.82 M allocations: 8.159 GiB, 9.07% gc time, 88.86% compilation time)
 77.685086 seconds (418.40 M allocations: 96.752 GiB, 45.92% gc time, 36.74% compilation time)
  9.935663 seconds (39.48 M allocations: 6.901 GiB, 14.77% gc time, 57.80% compilation time)
  1.467435 seconds (18.92 M allocations: 5.553 GiB, 51.89% gc time)
 40.684633 seconds (333.70 M allocations: 92.923 GiB, 78.39% gc time)
 39.284163 seconds (333.70 M allocations: 92.923 GiB, 77.99% gc time)

That first compile time improved, but the second matches master (and is worse than on this PR), while runtimes are much worse.

It definitely shouldn't exceed threads * 100% though, so if that's being observed that's an issue.

I have 36 threads, so that is not being observed.
The performance profiles look quite similar on 4 threads using my M1 mac mini.

EDIT: Reducing the number of allocations on the last two examples (but also optimizing the runtime performance)...

  4.896281 seconds (200.20 M allocations: 39.717 GiB, 61.95% gc time)
  4.844229 seconds (200.20 M allocations: 39.717 GiB, 62.22% gc time)

These correspond to

 11.451439 seconds (333.75 M allocations: 92.927 GiB, 65.75% gc time)
 12.137801 seconds (333.75 M allocations: 92.927 GiB, 67.98% gc time)

@pchintalapudi
Member Author

So this PR makes % GC time significantly worse, but absolute runtime time significantly better, working out to net zero absolute GC time change??

This seems highly suspect; I can't imagine a case where we are doing so much better in absolute time but GC time is unaffected, given that this is a fairly large algorithmic change that primarily affects the GC mark loop and shows up on Gabriel's profile.

@vtjnash
Member

vtjnash commented Mar 15, 2023

This seems fairly concerning, coming from a worker which had just loaded a bunch of these JLL package images:

Logging                                          (3) |        started at 2023-03-14T17:57:57.049
      From worker 3:	julia(70445,0x1f51bc140) malloc: Heap corruption detected, free list is damaged at 0x600008725c20
      From worker 3:	*** Incorrect guard value: 15
      From worker 3:	julia(70445,0x1f51bc140) malloc: *** set a breakpoint in malloc_error_break to debug
      From worker 3:
      From worker 3:	[70445] signal (6): Abort trap: 6
      From worker 3:	in expression starting at /Users/julia/.julia/scratchspaces/a66863c6-20e8-4ff4-8a62-49f30b1f605e/agent-cache/default-macmini-aarch64-5.0/build/default-macmini-aarch64-5-0/julialang/julia-master/julia-6283124d89/share/julia/stdlib/v1.10/Logging/test/runtests.jl:9
      From worker 3:	__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
      From worker 3:	Allocations: 2096664393 (Pool: 2094941610; Big: 1722783); GC: 1217

@gbaraldi
Member

gbaraldi commented Mar 20, 2023

I can reproduce that failure just by trying to precompile CSV: once with the error above, and once with the following:

  Failed to precompile CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b] to "/var/folders/6f/7ndzy5jj18s0z4954jsb9kbr0000gn/T/jl_VOVn5a/compiled/v1.10/CSV/jl_5rkxii".
  Assertion failed: (((i % 2 == 1) || eytzinger_image_tree.items[eyt_idx] == (void*)eyt_val) && "Eytzinger tree failed to find object!"), function rebuild_image_blob_tree, file gc.c, line 275.

I will check whether ASAN catches something on Linux here.

EDIT: ASAN found a failure at the same place:

==351565==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x611002285e30 at pc 0x7f8390454ec9 bp 0x7fff82b6f900 sp 0x7fff82b6f8f8
READ of size 8 at 0x611002285e30 thread T0
    #0 0x7f8390454ec8 in rebuild_image_blob_tree /home/gabrielbaraldi/julia2/src/gc.c:276:9
    #1 0x7f83903de426 in jl_restore_system_image_from_stream_ /home/gabrielbaraldi/julia2/src/staticdata.c:3154:5
    #2 0x7f83903b9798 in jl_restore_package_image_from_stream /home/gabrielbaraldi/julia2/src/staticdata.c:3220:13
    #3 0x7f83903b8783 in jl_restore_incremental_from_buf /home/gabrielbaraldi/julia2/src/staticdata.c:3266:23
    #4 0x7f83903bb702 in ijl_restore_package_image_from_file /home/gabrielbaraldi/julia2/src/staticdata.c:3350:23
    #5 0x7f8378860154 in julia__include_from_serialized_35339 loading.jl:1011
    #6 0x7f83776a8a8c in julia__tryrequire_from_serialized_35509 loading.jl:1341
    #7 0x7f8377691066 in julia__require_search_from_serialized_35529 loading.jl:1434
    #8 0x7f8378f6115a in julia__require_35709 loading.jl:1774
    #9 0x7f8378f6f4ce in julia__require_prelocked_35609 loading.jl:1654
    #10 0x7f8378105f9c in macro expansion loading.jl:1640
    #11 0x7f8378105f9c in macro expansion lock.jl:267
    ....
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/gabrielbaraldi/julia2/src/gc.c:276:9 in rebuild_image_blob_tree
Shadow bytes around the buggy address:
  0x0c2280448b70: 00 00 00 00 00 00 fa fa fa fa fa fa fa fa fa fa
  0x0c2280448b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c2280448b90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa fa
  0x0c2280448ba0: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x0c2280448bb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c2280448bc0: 00 00 00 00 00 00[fa]fa fa fa fa fa fa fa fa fa
  0x0c2280448bd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2280448be0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2280448bf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2280448c00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2280448c10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00

@gbaraldi
Member

There's an off-by-one error here somewhere.

@KristofferC mentioned this pull request on Mar 24, 2023
@vtjnash added the merge me (PR is reviewed. Merge when all tests are passing) label on Mar 28, 2023
@vtjnash requested review from JeffBezanson and removed the request for vchuravy and topolarity on March 28, 2023
@vtjnash merged commit bc33c81 into master on Mar 29, 2023
@vtjnash deleted the pc/gc-bsearch3 branch on March 29, 2023
@KristofferC
Member

KristofferC commented Mar 30, 2023

I get

Assertion failed: ((uintptr_t) jl_linkage_blobs.items[i] % 4 == 0 && "Linkage blob not 4-byte aligned!"), function rebuild_image_blob_tree, file gc.c, line 263.

when I try to backport this to 1.9. Maybe I made some mistake... Could you try it, @pchintalapudi?

Edit: But not on my Linux machine. I guess I will have to try again on the Mac later.

KristofferC pushed a commit that referenced this pull request Mar 30, 2023
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
(cherry picked from commit bc33c81)
(cherry picked from commit 40692cca1619a06991bebabc162e531255d99ddd)
@KristofferC removed the backport 1.9 (Change should be backported to release-1.9) label on Mar 31, 2023
@pchintalapudi removed the merge me (PR is reviewed. Merge when all tests are passing) label on Apr 3, 2023
Xnartharax pushed a commit to Xnartharax/julia that referenced this pull request Apr 19, 2023
Co-authored-by: Jameson Nash <vtjnash@gmail.com>