
Conversation

@sjkelly (Contributor) commented Mar 1, 2024

This helps to minimize the number of failed precompiles by eliding the inlining cost check. It will likely lead to slightly larger sys/pkgimg sizes; in practice, however, it leads to more extensive and successful precompilation, which can minimize the number of JIT events that occur when using Julia.

@sjkelly (Contributor, Author) commented Mar 1, 2024

This does grow the sysimg a considerable amount:

```
Master:
-rwxrwxr-x 1 sjkelly sjkelly 139M Mar 1 19:51 sys.so
PR:
-rwxrwxr-x 1 sjkelly sjkelly 158M Mar 1 20:01 sys.so
```

However, I would argue, reliable precompilation is somewhat important. Is it possible this threshold could be adjusted instead?

@mbauman (Member) commented Mar 1, 2024

What are the cases where you need precompilation for an inlineable function but aren't calling it from a context where it's been inlined?

@sjkelly (Contributor, Author) commented Mar 1, 2024

> What are the cases where you need precompilation for an inlineable function but aren't calling it from a context where it's been inlined?

I don't know. The behavior I am seeing suggests that inlining is not strictly a function of the callee's complexity; it also depends on the caller.

For example, under the following contrived workload (run twice to allow downloads/resolve), I see:

```
$ ./julia --trace-compile=precomp_mast.txt -e 'using Pkg; Pkg.activate(); Pkg.add("Plots"); Pkg.rm("Plots")'
$ wc -l precomp_mast.txt
136 precomp_mast.txt

$ ./julia --trace-compile=precomp_pr.txt -e 'using Pkg; Pkg.activate(); Pkg.add("Plots"); Pkg.rm("Plots")'
$ wc -l precomp_pr.txt
96 precomp_pr.txt
```

Attachments: precomp_mast.txt, precomp_pr.txt

```
$ diff precomp_mast.txt precomp_pr.txt
10,11d9
< precompile(Tuple{typeof(Base.append!), Array{String, 1}, Array{String, 1}})
< precompile(Tuple{typeof(Base.join), Array{String, 1}, Char})
31d28
< precompile(Tuple{typeof(Pkg.API.add), String})
35d31
< precompile(Tuple{typeof(Base.copy), Base.EnvDict})
37,38d32
< precompile(Tuple{typeof(Base.get), Base.Dict{String, String}, String, String})
< precompile(Tuple{typeof(Base.getindex), Base.Dict{String, String}, String})
40,41d33
< precompile(Tuple{Type{Base.Cmd}, Array{String, 1}})
< precompile(Tuple{Type{NamedTuple{(:env,), T} where T<:Tuple}, Tuple{Base.Dict{String, String}}})
51,53d42
< precompile(Tuple{Type{NamedTuple{(:stale_age,), T} where T<:Tuple}, Tuple{Int64}})
< precompile(Tuple{typeof(Base.pairs), NamedTuple{(:stale_age,), Tuple{Int64}}})
< precompile(Tuple{typeof(Base.merge), NamedTuple{(), Tuple{}}, Base.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:stale_age,), Tuple{Int64}}}})
56d44
< precompile(Tuple{typeof(Base.join), Base.GenericIOBuffer{GenericMemory{:not_atomic, UInt8, Core.AddrSpace{Core}(0x00)}}, Array{Char, 1}})
58d45
< precompile(Tuple{typeof(Base.convert), Type{Any}, Any})
61,62d47
< precompile(Tuple{typeof(Base._str_sizehint), Base.SubString{String}})
< precompile(Tuple{typeof(Base.print), Base.GenericIOBuffer{GenericMemory{:not_atomic, UInt8, Core.AddrSpace{Core}(0x00)}}, Base.SHA1})
64,66d48
< precompile(Tuple{typeof(Base.similar), Array{String, 1}})
< precompile(Tuple{typeof(Base.Iterators.enumerate), Array{String, 1}})
< precompile(Tuple{typeof(Base.setindex!), Array{String, 1}, String, Int64})
68d49
< precompile(Tuple{typeof(Base.hash), Tuple{UInt64}, UInt64})
70d50
< precompile(Tuple{typeof(Base.close), FileWatching.FDWatcher})
73d52
< precompile(Tuple{typeof(Base.hash), Tuple{String}, UInt64})
78,79d56
< precompile(Tuple{typeof(Base.length), Array{Int64, 1}})
< precompile(Tuple{typeof(Base.length), Base.BitArray{1}})
83,84d59
< precompile(Tuple{typeof(Base.iterate), Base.Generator{Base.Iterators.Filter{Pkg.Resolve.var"#86#88"{Array{Int64, 1}, Base.BitArray{1}}, Base.UnitRange{Int64}}, Pkg.Resolve.var"#85#87"{Array{Base.BitArray{2}, 1}, Array{Int64, 1}, Array{Base.BitArray{1}, 1}, Int64}}})
< precompile(Tuple{typeof(Base.iterate), Base.Generator{Base.Iterators.Filter{Pkg.Resolve.var"#86#88"{Array{Int64, 1}, Base.BitArray{1}}, Base.UnitRange{Int64}}, Pkg.Resolve.var"#85#87"{Array{Base.BitArray{2}, 1}, Array{Int64, 1}, Array{Base.BitArray{1}, 1}, Int64}}, Int64})
92d66
< precompile(Tuple{typeof(Base.isempty), Base.Dict{String, Base.UUID}})
95d68
< precompile(Tuple{typeof(Base.push!), Array{Base.Dict{String, Any}, 1}, Base.Dict{String, Any}})
110d82
< precompile(Tuple{typeof(Base.isempty), Base.Dict{String, Union{Array{String, 1}, String}}})
116d87
< precompile(Tuple{typeof(Base.isempty), Base.Dict{String, String}})
120,122d90
< precompile(Tuple{typeof(Base.similar), Array{Any, 1}})
< precompile(Tuple{typeof(Base.Iterators.enumerate), Array{Any, 1}})
< precompile(Tuple{Type{GenericMemory{:not_atomic, String, Core.AddrSpace{Core}(0x00)}}, UndefInitializer, Int64})
124d91
< precompile(Tuple{typeof(Base.print), Base.IOContext{Base.GenericIOBuffer{GenericMemory{:not_atomic, UInt8, Core.AddrSpace{Core}(0x00)}}}, Char})
126,132d92
< precompile(Tuple{typeof(Base.issorted), Array{String, 1}, Base.Order.ReverseOrdering{Base.Order.ForwardOrdering}})
< precompile(Tuple{typeof(Pkg.MiniProgressBars.start_progress), Pkg.UnstableIO, Pkg.MiniProgressBars.MiniProgressBar})
< precompile(Tuple{typeof(Pkg.MiniProgressBars.end_progress), Pkg.UnstableIO, Pkg.MiniProgressBars.MiniProgressBar})
< precompile(Tuple{typeof(Base._str_sizehint), UInt64})
< precompile(Tuple{typeof(Base.print), Base.GenericIOBuffer{GenericMemory{:not_atomic, UInt8, Core.AddrSpace{Core}(0x00)}}, UInt64})
< precompile(Tuple{typeof(Base.haskey), Base.Dict{String, Any}, String})
< precompile(Tuple{typeof(Base.sizeof), Symbol})
```

@sjkelly (Contributor, Author) commented Mar 1, 2024

Looks like the tests would need to be changed with this as well: https://buildkite.com/julialang/julia-master/builds/34254#018dfb82-9e51-4c92-a655-9d43f9b2dd9b/910-1671

@sjkelly force-pushed the sjk/precompheuristic2 branch from 361e211 to 650868e on Mar 3, 2024
@gbaraldi (Member) commented Mar 4, 2024

The case where this matters is when you have a dynamic dispatch that lands on one of these functions that always gets inlined.
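A minimal sketch of the situation described above (the names are illustrative, not from the PR): `small` is cheap enough that the inliner folds it into every static call site, so no standalone compiled specialization needs to be cached for static callers. But when it is reached through a dynamic dispatch, that standalone specialization is exactly what the runtime needs, and it gets JIT-compiled on first use if it was not saved in the sys/pkg image.

```julia
small(x) = x + 1               # trivially inlineable; static callers inline it

function caller(xs::Vector{Any})
    s = 0
    for x in xs
        s += small(x)          # eltype is Any, so this call is a dynamic
    end                        # dispatch: it needs a standalone small(::Int)
    return s
end

caller(Any[1, 2, 3])           # first call reaches small(::Int) dynamically
```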

@mbauman (Member) commented Mar 4, 2024

Right — that was my assumption and at the root of my question. If there's dynamic dispatch, how does Julia know the argument types in order to do the precompile? I suppose there must be other — type stable — functions that happen to call (and then inline) these methods in other usages, so then the standalone versions happen to get precompiled on this branch?

@sjkelly (Contributor, Author) commented Mar 4, 2024

> The case where this matters is when you have a dynamic dispatch that lands on one of these functions that always gets inlined.

So if I am understanding correctly, this is indeed a bug in the current precompilation heuristic: there are cases where something is in theory always inlineable, but in practice is not inlined because it is reached via dynamic dispatch. This patch therefore fixes the issue, but inefficiently, since we exhaustively precompile every method instance; what we want is something that knows whether the calling context is a dynamic dispatch. This truth table may articulate it better:

| Inlineable? | In a dynamic dispatch context? | Gets precompiled? |
|---|---|---|
| Y | Y | N (should be Y) |
| Y | N | Implicit (inlined into the caller) |
| N | Y | Y |
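In pseudo-Julia (purely illustrative; `inlining_cost`, `COST_THRESHOLD`, and `reached_via_dynamic_dispatch` are made-up names, not the compiler's actual internals), the three caching policies in play are roughly:

```julia
# Current heuristic: only cache compiled code for methods too costly to inline.
should_cache(ci)       = inlining_cost(ci) > COST_THRESHOLD

# This PR: cache everything (fixes the Y/Y row, at the cost of image size).
should_cache_pr(ci)    = true

# Ideal, per the table: also cache cheap methods reached via dynamic dispatch.
should_cache_ideal(ci) = inlining_cost(ci) > COST_THRESHOLD ||
                         reached_via_dynamic_dispatch(ci)
```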

@JeffBezanson (Member) commented

> I suppose there must be other — type stable — functions that happen to call (and then inline) these methods in other usages, so then the standalone versions happen to get precompiled on this branch?

I think that's right, yes. It's basically using the fact that we had to infer and inline a function somewhere as a heuristic to find functions that are probably useful. Given that we had to analyze these functions anyway, it's nice to compile and save them. But I'm not sure it's really worth the extra space. Maybe we should enable this in `--compile=all` mode though?

sjkelly added a commit to sjkelly/julia that referenced this pull request Mar 20, 2024
This helps to minimize the number of missed precompiles by eliding the inlining cost check when Julia is run under `--compile=all`. This will lead to slightly larger sysimage sizes. However, in practice it leads to more extensive and successful precompilation, which can minimize the number of JIT events at call sites with dynamic dispatch.

This is an alternative to JuliaLang#53547, where removing the inlining cost check was made universal. However, that led to ~15% larger sysimage sizes by default. This implements Jeff's suggestion that this mode be enabled under `--compile=all`.
@sjkelly (Contributor, Author) commented Mar 20, 2024

Replacing with #53798

@sjkelly closed this Mar 20, 2024
vtjnash pushed a commit that referenced this pull request Apr 16, 2024
kpamnany pushed a commit to RelationalAI/julia that referenced this pull request May 15, 2025