Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT actual memory usage is higher than reported #52710

Closed
adnan-alhomssi opened this issue Jan 3, 2024 · 10 comments
Closed

JIT actual memory usage is higher than reported #52710

adnan-alhomssi opened this issue Jan 3, 2024 · 10 comments

Comments

@adnan-alhomssi
Copy link
Contributor

Using Jemalloc profiler, we found that LLVM in-use memory usage after JITing lots of code is an order of magnitude higher than what Base.jit_total_bytes() reports.

Here is a simple example:

julia> before_jit = Base.jit_total_bytes()
500285

julia> @time for i in 1:50000
           @eval f(x, ::Val{$i}) = x + $i
           @eval begin
               Base.Experimental.@force_compile
               f(1, Val($i))
           end
       end

382.571918 seconds (94.59 M allocations: 6.002 GiB, 0.65% gc time, 93.05% compilation time)

julia> after_jit = Base.jit_total_bytes()
10261653

julia> println("JIT MiB: ", (after_jit - before_jit) >> 20)
JIT MiB: 9

But jemalloc is reporting 357.2 MiB live heap (still allocated).
Here is the jemalloc profile output in --text format. Attached you can find a clearer SVG graph. Both are generated using jeprof

jit jeprof

julia> dump_and_analyze_profile()
│ Columns: 1_direct_usage_mib, 2_direct_usage_pct, 3_(cumulative sum of column 2), 4_usage_including_callees_mib, 5_(col 4 in pct), 6_function_name
│ Total: 448.5 MB
│      0.0   0.0%   0.0%    448.5 100.0% #invokelatest#2 (inline)
│      0.0   0.0%   0.0%    448.5 100.0% __libc_start_main
│      0.0   0.0%   0.0%    448.5 100.0% _start
│      0.0   0.0%   0.0%    448.5 100.0% eval (inline)
│      0.0   0.0%   0.0%    448.5 100.0% ijl_toplevel_eval_in
│      0.0   0.0%   0.0%    448.5 100.0% invokelatest (inline)
│      0.0   0.0%   0.0%    448.5 100.0% japi1_#start_repl_backend#46_60085
│      0.0   0.0%   0.0%    448.5 100.0% japi1_eval_user_input_59994.clone_1.clone_2
│      0.0   0.0%   0.0%    448.5 100.0% japi1_start_repl_backend_25229.clone_1
│      0.0   0.0%   0.0%    448.5 100.0% jfptr_YY.1017_48079.clone_1
│      0.0   0.0%   0.0%    448.5 100.0% jfptr__start_41245.clone_1
│      0.0   0.0%   0.0%    448.5 100.0% jfptr_run_repl_61515.clone_1
│      0.0   0.0%   0.0%    448.5 100.0% jl_apply (inline)
│      0.0   0.0%   0.0%    448.5 100.0% jl_f__call_latest
│      0.0   0.0%   0.0%    448.5 100.0% jl_repl_entrypoint
│      0.0   0.0%   0.0%    448.5 100.0% jl_toplevel_eval_flex
│      0.0   0.0%   0.0%    448.5 100.0% julia_#1017_48078
│      0.0   0.0%   0.0%    448.5 100.0% julia_#run_repl#59_61632
│      0.0   0.0%   0.0%    448.5 100.0% julia__start_41244.clone_1
│      0.0   0.0%   0.0%    448.5 100.0% julia_exec_options_37133.clone_1.clone_2
│      0.0   0.0%   0.0%    448.5 100.0% julia_repl_backend_loop_61447.clone_1.clone_2
│      0.0   0.0%   0.0%    448.5 100.0% julia_run_main_repl_55820.clone_1.clone_2
│      0.0   0.0%   0.0%    448.5 100.0% julia_run_repl_61514.clone_1
│      0.0   0.0%   0.0%    448.5 100.0% main
│      0.0   0.0%   0.0%    448.5 100.0% true_main
│      0.0   0.0%   0.0%    446.9  99.7% jfptr_eval_12402.clone_1
│      0.0   0.0%   0.0%    446.9  99.7% julia_eval_12401.clone_1
│      0.0   0.0%   0.0%    443.4  98.9% 0x00007fef3be86145
│      0.0   0.0%   0.0%    443.4  98.9% j_abstract_call_known_17651.clone_1
│      0.0   0.0%   0.0%    443.4  98.9% jfptr_abstract_call_known_20323.clone_1
│      0.0   0.0%   0.0%    443.4  98.9% jfptr_typeinf_ext_toplevel_18662.clone_1
│      0.0   0.0%   0.0%    443.4  98.9% jl_type_infer
│      0.0   0.0%   0.0%    443.4  98.9% julia__typeinf_14743.clone_1.clone_2
│      0.0   0.0%   0.0%    443.4  98.9% julia_abstract_call_17642.clone_1.clone_2
│      0.0   0.0%   0.0%    443.4  98.9% julia_abstract_call_17669.clone_1.clone_2
│      0.0   0.0%   0.0%    443.4  98.9% julia_abstract_call_gf_by_type_20424.clone_1.clone_2
│      0.0   0.0%   0.0%    443.4  98.9% julia_abstract_call_known_20322.clone_1.clone_2
│      0.0   0.0%   0.0%    443.4  98.9% julia_abstract_eval_basic_statement_18712.clone_1
│      0.0   0.0%   0.0%    443.4  98.9% julia_abstract_eval_statement_20202.clone_1
│      0.0   0.0%   0.0%    443.4  98.9% julia_abstract_eval_statement_expr_20881.clone_1.clone_2
│      0.0   0.0%   0.0%    443.4  98.9% julia_typeinf_13357.clone_1.clone_2
│      0.0   0.0%   0.0%    443.4  98.9% julia_typeinf_ext_toplevel_18661.clone_1
│      0.0   0.0%   0.0%    443.4  98.9% julia_typeinf_ext_toplevel_18665.clone_1
│      0.0   0.0%   0.0%    443.4  98.9% julia_typeinf_local_19608.clone_1.clone_2
│      0.0   0.0%   0.0%    443.4  98.9% julia_typeinf_nocycle_15000.clone_1.clone_2
│      0.0   0.0%   0.0%    443.4  98.9% tojlinvoke21359.clone_1
│      0.0   0.0%   0.0%    434.8  97.0% imalloc (inline)
│      0.0   0.0%   0.0%    434.8  97.0% imalloc_body (inline)
│      0.0   0.0%   0.0%    434.8  97.0% je_malloc_default
│      0.0   0.0%   0.0%    434.8  97.0% je_prof_backtrace
│      0.0   0.0%   0.0%    434.8  97.0% je_prof_tctx_create
│      0.0   0.0%   0.0%    434.8  97.0% prof_alloc_prep (inline)
│    434.8  97.0%  97.0%    434.8  97.0% prof_backtrace_impl
│      0.0   0.0%  97.0%    431.8  96.3% julia_abstract_call_method_with_const_args_12673.clone_1.clone_2
│      0.0   0.0%  97.0%    431.8  96.3% julia_abstract_call_method_with_const_args_12703.clone_1.clone_2
│      0.0   0.0%  97.0%    431.0  96.1% _jl_compile_codeinst [clone .isra.0]
│      0.0   0.0%  97.0%    431.0  96.1% _jl_invoke (inline)
│      0.0   0.0%  97.0%    431.0  96.1% jl_compile_method_internal
│      0.0   0.0%  97.0%    431.0  96.1% jl_compile_method_internal (inline)
│      0.0   0.0%  97.0%    430.5  96.0% ijl_apply_generic
│      0.0   0.0%  97.0%    430.5  96.0% jl_generate_fptr_impl
│      0.0   0.0%  97.0%    429.5  95.8% do_apply
│      0.0   0.0%  97.0%    429.5  95.8% jl_f__call_in_world_total
│      0.0   0.0%  97.0%    429.5  95.8% julia_concrete_eval_call_19755.clone_1.clone_2
│      0.0   0.0%  97.0%    357.2  79.7% JuliaOJIT::addModule
│      0.0   0.0%  97.0%    357.2  79.7% jl_add_to_ee
│      0.0   0.0%  97.0%    346.7  77.3% JuliaOJIT::OptSelLayerT::emit
│      0.0   0.0%  97.0%    346.7  77.3% llvm::detail::UniqueFunctionBase::CallImpl
│      0.0   0.0%  97.0%    346.7  77.3% llvm::orc::BasicIRLayerMaterializationUnit::materialize [clone .localalias]
│      0.0   0.0%  97.0%    346.7  77.3% llvm::orc::ExecutionSession::OL_applyQueryPhase1 [clone .localalias]
│      0.0   0.0%  97.0%    346.7  77.3% llvm::orc::ExecutionSession::OL_completeLookup [clone .localalias]
│      0.0   0.0%  97.0%    346.7  77.3% llvm::orc::ExecutionSession::dispatchOutstandingMUs [clone .localalias]
│      0.0   0.0%  97.0%    346.7  77.3% llvm::orc::ExecutionSession::lookup
│      0.0   0.0%  97.0%    346.7  77.3% llvm::orc::ExecutionSession::lookup [clone .localalias]
│      0.0   0.0%  97.0%    346.7  77.3% llvm::orc::IRCompileLayer::emit [clone .localalias]
│      0.0   0.0%  97.0%    346.7  77.3% llvm::orc::IRTransformLayer::emit [clone .localalias]
│      0.0   0.0%  97.0%    346.7  77.3% llvm::orc::InProgressFullLookupState::complete
│      0.0   0.0%  97.0%    346.7  77.3% llvm::orc::MaterializationTask::run [clone .localalias]
│      0.0   0.0%  97.0%    346.7  77.3% llvm::orc::RTDyldObjectLinkingLayer::emit [clone .localalias]
│      0.0   0.0%  97.0%    345.7  77.1% llvm::jitLinkForORC
│      0.0   0.0%  97.0%    324.4  72.3% ::LoadedELFObjectInfo::getObjectForDebug
│      0.0   0.0%  97.0%    324.4  72.3% ::createELFDebugObject
│      0.0   0.0%  97.0%    313.4  69.9% llvm::MemoryBuffer::getMemBufferCopy
│      0.0   0.0%  97.0%    313.4  69.9% llvm::WritableMemoryBuffer::getNewUninitMemBuffer [clone .localalias]
│      0.0   0.0%  97.0%    313.4  69.9% void* fallback_impl@bc770
│      0.0   0.0%  97.0%    173.5  38.7% llvm::orc::RTDyldObjectLinkingLayer::onObjLoad [clone .localalias]
│      0.0   0.0%  97.0%    173.5  38.7% registerRTDyldJITObject
│      0.0   0.0%  97.0%    172.2  38.4% ::JITDylibSearchOrderResolver::lookup
│      0.0   0.0%  97.0%    172.2  38.4% llvm::RuntimeDyldImpl::finalizeAsync [clone .localalias]
│      0.0   0.0%  97.0%    172.2  38.4% llvm::RuntimeDyldImpl::finalizeAsync::{lambda#1}::operator
│      0.0   0.0%  97.0%    172.2  38.4% llvm::orc::AsynchronousSymbolQuery::handleComplete [clone .localalias]
│      0.0   0.0%  97.0%    172.2  38.4% llvm::orc::AsynchronousSymbolQuery::handleComplete::RunQueryCompleteTask::run
│      0.0   0.0%  97.0%    168.2  37.5% llvm::orc::RTDyldObjectLinkingLayer::onObjEmit [clone .localalias]
│      0.0   0.0%  97.0%    167.5  37.3% ::GDBJITRegistrationListener::notifyObjectLoaded
│   total_mib = 448.5
└   duration = 0.0
julia> versioninfo()
Julia Version 1.9.2+RAI
Platform Info:
  OS: Linux (x86_64-unknown-linux-gnu)
  CPU: 12 × AMD Ryzen 5 PRO 7530U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 26 on 12 virtual cores
Environment:
  LD_PRELOAD = /nix/store/lpww4nafkw2z1qs3dllbqx28qc66ivpx-jemalloc-5.3.0/lib/libjemalloc.so
  JULIA_PROJECT = ...
  JULIA_NUM_THREADS = 12,2
@vtjnash
Copy link
Member

vtjnash commented Jan 3, 2024

I don't think any of the memory of Base.jit_total_bytes should be getting attributed from jemalloc, as those are separate allocation pools

@adnan-alhomssi
Copy link
Contributor Author

Do we know why is the memory being used by LLVM grows monotonically and where is it going to? In some workloads it reaches 1.6 GiB while Base.jit_total_bytes report just around 260 MiB. We first were relying on Base.jit_total_bytes and now that see how compilation is taking more memory, I wonder if there an opportunity to optimize this.

@JeffBezanson
Copy link
Member

One chunk is ::GDBJITRegistrationListener::notifyObjectLoaded, so maybe that is data structures used to register debug info for GDB? That will certainly not be counted by jit_total_bytes. Most of the rest looks like llvm MemoryBuffers for in-memory object files. I don't know why they/re so big. Maybe the buffers are over-allocated, or maybe there is a lot more in the object file than just the code and data sections we're counting?

@vtjnash
Copy link
Member

vtjnash commented Aug 16, 2024

DebugObjects are now accounted for, which are usually roughly 10x the size of what Base.jit_total_bytes previously accounted for (though also now compressed, so usually only 3x now)

@vtjnash vtjnash closed this as completed Aug 16, 2024
@NHDaly
Copy link
Member

NHDaly commented Aug 16, 2024

@vtjnash: Do you know which commit(s) added that accounting + compression? Is it something we could easily backport to 1.10? (Or if not, maybe it's something RAI could backport to our fork of 1.10)

@vtjnash
Copy link
Member

vtjnash commented Aug 16, 2024

#55180. I assume it shouldn't be too complicated to backport

@NHDaly
Copy link
Member

NHDaly commented Aug 16, 2024

Cool, thank you! I added the backport labels to that ticket, then.

Also, if we do end up decompressing the debug info to read it, will that decompressed data be attributed to Base.jit_total_bytes now? Will it stick around, or is it lazily decompressed on every access?

@vtjnash
Copy link
Member

vtjnash commented Aug 16, 2024

Yes, it will maintain an updated count

@adnan-alhomssi
Copy link
Contributor Author

@vtjnash When do we decompress the debug info?

@vtjnash
Copy link
Member

vtjnash commented Aug 19, 2024

It is decompressed when you access it (e.g. printing an error that occurred in that function)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants