System image compression with zstd #59227
src/staticdata.c
Outdated
jl_dlsym(handle, "jl_image_pointers", (void**)&image->pointers, 1);

image->size = ZSTD_getFrameContentSize(data, *plen);
image->data = (char *)malloc(image->size);
We probably want to mmap this with huge pages/large pages
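One way to read the suggestion above, as a minimal Linux-only sketch rather than what the PR does: replace the `malloc` with an anonymous `mmap` whose length is rounded up to a huge-page multiple, then hint the kernel with `madvise(MADV_HUGEPAGE)`. The helper names `alloc_image_buffer` and `free_image_buffer` and the 2 MiB page size are assumptions for illustration; they are not part of Julia's source.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumed huge page size (2 MiB is typical on x86-64 Linux, but not universal). */
#define HUGE_PAGE_SIZE ((size_t)2 << 20)

/* Round n up to a multiple of align; align must be a power of two. */
static size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical helper: reserve a writable buffer for the decompressed image.
 * Returns NULL on failure; on success *mapped_size holds the length to pass
 * to free_image_buffer. */
static void *alloc_image_buffer(size_t size, size_t *mapped_size)
{
    size_t len = round_up(size, HUGE_PAGE_SIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_HUGEPAGE
    /* Only a hint: the kernel may ignore it, and the mapping itself is not
     * guaranteed to start on a huge-page boundary. */
    (void)madvise(p, len, MADV_HUGEPAGE);
#endif
    *mapped_size = len;
    return p;
}

/* Release a buffer obtained from alloc_image_buffer. */
static void free_image_buffer(void *p, size_t mapped_size)
{
    if (p != NULL)
        munmap(p, mapped_size);
}
```

The decompressor would then write into the returned buffer instead of a `malloc`'d one. Whether transparent huge pages actually back the mapping depends on system configuration, so any startup benefit would need to be measured.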
The savings are really nice here, but is there a way to claw back some of the startup time? Thinking out loud: is it worth checking whether Zstd supports AVX-512 (or other fancy instructions) for speedups that may not be enabled?
Currently I'm testing lz4 as an alternative compression algorithm that sacrifices some compression ratio for decompression speed, since it is intended to be about as fast as RAM on modern CPUs. @vtjnash also had some ideas about doing decompression and relocation in a single pass that I'd like to try: in this version we touch a whole bunch of pages while decompressing, and then force them all back into cache later when performing relocations.
I tried a simple test with the command-line zstd and lz4 (so it may not be representative), and they took basically the same amount of time, but zstd's compression was much better. So much better that I suspect the time was made up by reading less data. Relocating while decompressing sounds awesome if we can pull that off.
I believe we should use lz4hc, which is quite slow to compress but has similar ratios to zstd (while decompressing about 2.5x faster).
Can we use threads for compressing/decompressing?
Experimenting with compressing on
I think they technically go in ldata currently, but I'm not sure rodata helps the OS in any meaningful way except for write protection.
Co-authored-by: Gabriel Baraldi <baraldigabriel@gmail.com>
Even without compression, this gives about an 8% improvement in load times.
7a0de7a to 0ee175b
0, // task_metrics
-1, // timeout_for_safepoint_straggler_s
0, // gc_sweep_always_full
0, // compress_sysimage
We should revisit this default, but fine for now
I'm going to merge this as an MVP that will let us see how bad the startup costs are in practice. IMO we should revisit multithreaded (de)compression and lz4hc before enabling it by default.
LGTM, default aside.
Should we do a PkgEval run on this one?
I don't think a PkgEval run is really relevant for this PR.
…ls (#59470) > Inspired by a bug I hit on MinGW recently with weak symbol resolution. I'm not sure why we started using these in 70000ac, since I believe we should be able to lookup these symbols just fine in our `jl_exe_handle` as long as it was linked / compiled with `-rdynamic` > > Migrating away from weak symbols seems a good idea, given their uneven implementation across platforms. (cherry picked from commit 4a6ada6) @gbaraldi asked if this could be backported separately from #59227. Co-authored-by: Cody Tapscott <84105208+topolarity@users.noreply.github.com> Co-authored-by: Dilum Aluthge <dilum@aluthge.com>
After #59227 changed the way we load system images, the symbols in `null_sysimg.c` are no longer correct. Replace them with `jl_image_unpack` so linking a system image with libjulia is possible again.
Revived version of #48244, with a slightly different approach.

This version looks for a function pointer called `jl_image_unpack` inside compiled system images and invokes it to get the `jl_image_buf_t` struct. Two implementations, `jl_image_unpack_zstd` and `jl_image_unpack_uncomp`, are provided (for comparison). The zstd compression is applied only to the heap image, and not the compiled code, since that can be shared across Julia processes.

TODO: test a few different compression settings and enable by default.

Example data from un-trimmed juliac "hello world":
```
156M hello-uncomp
43M hello-zstd
48M hello-zstd-1
45M hello-zstd-5
43M hello-zstd-15
39M hello-zstd-22

$ hyperfine -w3 ./hello-uncomp
Benchmark 1: ./hello-uncomp
  Time (mean ± σ):      74.4 ms ±   0.8 ms    [User: 51.9 ms, System: 19.0 ms]
  Range (min … max):    73.0 ms …  76.6 ms    39 runs

$ hyperfine -w3 ./hello-zstd-1
Benchmark 1: ./hello-zstd-1
  Time (mean ± σ):     152.4 ms ±   0.5 ms    [User: 138.2 ms, System: 12.0 ms]
  Range (min … max):   151.4 ms … 153.2 ms    19 runs

$ hyperfine -w3 ./hello-zstd-5
Benchmark 1: ./hello-zstd-5
  Time (mean ± σ):     154.3 ms ±   0.5 ms    [User: 139.6 ms, System: 12.4 ms]
  Range (min … max):   153.5 ms … 155.2 ms    19 runs

$ hyperfine -w3 ./hello-zstd-15
Benchmark 1: ./hello-zstd-15
  Time (mean ± σ):     135.9 ms ±   0.5 ms    [User: 121.6 ms, System: 12.0 ms]
  Range (min … max):   135.1 ms … 136.5 ms    21 runs

$ hyperfine -w3 ./hello-zstd-22
Benchmark 1: ./hello-zstd-22
  Time (mean ± σ):     149.0 ms ±   0.6 ms    [User: 134.7 ms, System: 12.1 ms]
  Range (min … max):   147.7 ms … 150.4 ms    19 runs
```

---------

Co-authored-by: Gabriel Baraldi <baraldigabriel@gmail.com>
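The unpack-function indirection described in this commit message can be illustrated with a small sketch. Everything here is hypothetical: `image_buf_t` stands in for `jl_image_buf_t` (whose real fields are not shown in this thread), the unpacker signature is assumed, and a toy run-length format replaces zstd so the example needs no external library. In the real loader the pointer would presumably be resolved from the image by symbol name; here it is simply passed in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for jl_image_buf_t. */
typedef struct {
    char *data;   /* start of the usable heap image */
    size_t size;  /* length of the heap image in bytes */
    int owned;    /* nonzero if data was malloc'd by the unpacker */
} image_buf_t;

/* Assumed shape of the unpack entry point stored in each image. */
typedef image_buf_t (*image_unpack_fn)(const void *payload, size_t len);

/* Uncompressed variant: expose the embedded bytes directly, no copy. */
static image_buf_t unpack_uncomp(const void *payload, size_t len)
{
    image_buf_t buf = { (char *)payload, len, 0 };
    return buf;
}

/* Toy compressed variant (NOT zstd): payload is a list of (count, byte)
 * pairs, each expanding to `count` copies of `byte`. */
static image_buf_t unpack_rle(const void *payload, size_t len)
{
    const unsigned char *in = payload;
    image_buf_t buf = { NULL, 0, 1 };
    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2)
        buf.size += in[i];               /* first pass: total output size */
    buf.data = malloc(buf.size ? buf.size : 1);
    if (buf.data == NULL) {
        buf.size = 0;
        return buf;
    }
    char *out = buf.data;
    for (size_t i = 0; i < len; i += 2) { /* second pass: expand the runs */
        memset(out, in[i + 1], in[i]);
        out += in[i];
    }
    return buf;
}

/* Loader side: it only knows the function-pointer type, not the format. */
static image_buf_t load_image(image_unpack_fn unpack,
                              const void *payload, size_t len)
{
    return unpack(payload, len);
}
```

The point of the design is that the loader stays format-agnostic: an image built without compression and one built with it differ only in which unpacker they carry.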
AI tells me; in case you didn't know:
Can you try e.g. -7; and up to or some of such as 0? I'm not sure any of this is really faster, or optimized for the case of many runs of zeros as in the sysimage, but it would be interesting to see whether it is (Karpinski also has a compressor just for runs of zeros...), and how well it still compresses. Compression times are less interesting, but still interesting (and I suppose I could test myself).
Revived version of #48244, with a slightly different approach. This version looks for a function pointer called `jl_image_unpack` inside compiled system images and invokes it to get the `jl_image_buf_t` struct. Two implementations, `jl_image_unpack_zstd` and `jl_image_unpack_uncomp`, are provided (for comparison). The zstd compression is applied only to the heap image, and not the compiled code, since that can be shared across Julia processes.
TODO: test a few different compression settings and enable by default.
Example data from un-trimmed juliac "hello world":