"Rejecting cache file" on a heterogeneous cluster, leading to repeated precompilation #48579

Open
jishnub opened this issue Feb 7, 2023 · 6 comments
Labels: needs docs (Documentation for this change is required) · packages (Package management and loading) · pkgimage

Comments

@jishnub
Contributor

jishnub commented Feb 7, 2023

I am using a freshly downloaded nightly on a Slurm cluster and am running into repeated cache invalidation, which leads to repeated precompilation.
The login node has

julia> versioninfo()
Julia Version 1.10.0-DEV.524
Commit 2c619b77e04 (2023-02-07 12:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7742 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 1 on 64 virtual cores
Environment:
  LD_LIBRARY_PATH = /home/user/lib:/lib::/home/user/.local/lib
  JULIA_DEPOT_PATH = /scratch/user/.julia
  JULIA_REVISE_POLL = 1
  JULIA_NUM_PRECOMPILE_TASKS = 1
  JULIA_DEBUG = loading

and the compute node has

julia> versioninfo()
Julia Version 1.10.0-DEV.524
Commit 2c619b77e04 (2023-02-07 12:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 40 × Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 40 virtual cores
Environment:
  LD_LIBRARY_PATH = /home/user/lib:/lib:/home/user/lib:/lib::/home/user/.local/lib
  JULIA_DEPOT_PATH = /scratch/user/.julia
  JULIA_REVISE_POLL = 1
  JULIA_NUM_PRECOMPILE_TASKS = 1
  JULIA_DEBUG = loading

I start by deleting my .julia directory to avoid clashes:

rm -rf /scratch/user/.julia

After this, on the login node, I generate a simple package with FillArrays.jl as its only dependency. When I instantiate the package on the login node, I see:

(Testpkg) pkg> instantiate
  Installing known registries into `/scratch/user/.julia`
   Installed FillArrays ─ v0.13.7
Precompiling environment...
  7 dependencies successfully precompiled in 4 seconds
  2 dependencies had warnings during precompilation:
┌ FillArrays [1a297f60-69ca-5386-bcde-b61e274b549b]
│  ┌ Debug: Loading object cache file /scratch/user/.julia/compiled/v1.10/Statistics/ERcPL_ty4bU.so for Statistics [10745b16-79ce-11e8-11f9-7d13ad32a3b2]
│  └ @ Base loading.jl:1004
└  
┌ LibSSH2_jll [29816b5a-b9ab-546f-933c-edad1886dfa8]
│  ┌ Debug: Loading object cache file /scratch/user/.julia/compiled/v1.10/MbedTLS_jll/u5NEn_ty4bU.so for MbedTLS_jll [c8ffd9c3-330d-5841-b78e-0817d7145fa1]
│  └ @ Base loading.jl:1004
└  

(Testpkg) pkg> precompile

(Testpkg) pkg>

So far, so good: the package clearly doesn't precompile a second time. Now I move to the compute node and find that the package precompiles again:

(Testpkg) pkg> precompile
┌ Debug: Rejecting cache file /scratch/user/.julia/compiled/v1.10/Statistics/ERcPL_ty4bU.ji for  [top-level] since pkgimage can't be loaded on this target
└ @ Base loading.jl:2710
┌ Debug: Rejecting cache file /scratch/user/.julia/compiled/v1.10/Zlib_jll/xjq3Q_ty4bU.ji for  [top-level] since pkgimage can't be loaded on this target
└ @ Base loading.jl:2710
┌ Debug: Rejecting cache file /scratch/user/.julia/compiled/v1.10/SuiteSparse_jll/ME9At_ty4bU.ji for  [top-level] since pkgimage can't be loaded on this target
└ @ Base loading.jl:2710
┌ Debug: Rejecting cache file /scratch/user/.julia/compiled/v1.10/MbedTLS_jll/u5NEn_ty4bU.ji for  [top-level] since pkgimage can't be loaded on this target
└ @ Base loading.jl:2710
Precompiling environment...
  7 dependencies successfully precompiled in 5 seconds
  2 dependencies had warnings during precompilation:
┌ FillArrays [1a297f60-69ca-5386-bcde-b61e274b549b]
│  ┌ Debug: Loading object cache file /scratch/user/.julia/compiled/v1.10/Statistics/ERcPL_ty4bU.so for Statistics [10745b16-79ce-11e8-11f9-7d13ad32a3b2]
│  └ @ Base loading.jl:1004
└  
┌ LibSSH2_jll [29816b5a-b9ab-546f-933c-edad1886dfa8]
│  ┌ Debug: Loading object cache file /scratch/user/.julia/compiled/v1.10/MbedTLS_jll/u5NEn_ty4bU.so for MbedTLS_jll [c8ffd9c3-330d-5841-b78e-0817d7145fa1]
│  └ @ Base loading.jl:1004

Now, if I go back to the login node and try to precompile the package again, I get:

(Testpkg) pkg> precompile
┌ Debug: Rejecting cache file /scratch/user/.julia/compiled/v1.10/Statistics/ERcPL_ty4bU.ji for  [top-level] since pkgimage can't be loaded on this target
└ @ Base loading.jl:2710
┌ Debug: Rejecting cache file /scratch/user/.julia/compiled/v1.10/Zlib_jll/xjq3Q_ty4bU.ji for  [top-level] since pkgimage can't be loaded on this target
└ @ Base loading.jl:2710
┌ Debug: Rejecting cache file /scratch/user/.julia/compiled/v1.10/SuiteSparse_jll/ME9At_ty4bU.ji for  [top-level] since pkgimage can't be loaded on this target
└ @ Base loading.jl:2710
┌ Debug: Rejecting cache file /scratch/user/.julia/compiled/v1.10/MbedTLS_jll/u5NEn_ty4bU.ji for  [top-level] since pkgimage can't be loaded on this target
└ @ Base loading.jl:2710
Precompiling environment...
  7 dependencies successfully precompiled in 5 seconds
  2 dependencies had warnings during precompilation:
┌ FillArrays [1a297f60-69ca-5386-bcde-b61e274b549b]
│  ┌ Debug: Loading object cache file /scratch/user/.julia/compiled/v1.10/Statistics/ERcPL_ty4bU.so for Statistics [10745b16-79ce-11e8-11f9-7d13ad32a3b2]
│  └ @ Base loading.jl:1004
└  
┌ LibSSH2_jll [29816b5a-b9ab-546f-933c-edad1886dfa8]
│  ┌ Debug: Loading object cache file /scratch/user/.julia/compiled/v1.10/MbedTLS_jll/u5NEn_ty4bU.so for MbedTLS_jll [c8ffd9c3-330d-5841-b78e-0817d7145fa1]
│  └ @ Base loading.jl:1004

Every time I switch between the login and compute nodes, the package requires a fresh round of precompilation, which can be quite time-consuming. Would it be possible to keep two sets of cache files so that one doesn't invalidate the other?

@jishnub jishnub added the packages Package management and loading label Feb 7, 2023
@vchuravy
Member

vchuravy commented Feb 7, 2023

This could probably benefit from better docs, but package images support multiversioning, i.e. a single cache file can contain code compiled for several CPU targets:

https://docs.julialang.org/en/v1.10-dev/devdocs/pkgimg/#Package-images-optimized-for-multiple-microarchitectures

So there are two strategies.

  1. Set JULIA_CPU_TARGET to a value that covers the CPUs in your system (see https://docs.julialang.org/en/v1.10-dev/devdocs/sysimg/#Specifying-multiple-system-image-targets)
  2. Create a separate depot for each architecture, selected via JULIA_DEPOT_PATH

I think option 1 is probably the best.
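For strategy 2, a minimal sketch of a shell snippet (e.g. for ~/.bashrc or a Slurm job script) that picks a depot per microarchitecture. It assumes julia is on PATH and uses Sys.CPU_NAME, which is the CPU name shown in the "LLVM:" line of versioninfo() above (znver2 on the login node, skylake-avx512 on the compute node). The helper depot_for_cpu and the naming scheme /scratch/user/.julia-<cpu> are hypothetical. Keep in mind that fully separate depots also mean separate package installs, so each depot has to be instantiated on its own.

```shell
#!/bin/sh
# Hypothetical helper: map a CPU name (as reported by Julia's Sys.CPU_NAME)
# to an architecture-specific depot, so pkgimages built on one node type
# are never rejected by, and rebuilt over, the other.
DEPOT_BASE="/scratch/user/.julia"   # assumed base path, taken from the versioninfo output

depot_for_cpu() {
  # $1: CPU name such as "znver2" or "skylake-avx512"; empty falls back to "generic"
  cpu="${1:-generic}"
  printf '%s-%s\n' "$DEPOT_BASE" "$cpu"
}

# Ask Julia for the host CPU name, but only if julia is available on PATH.
if command -v julia >/dev/null 2>&1; then
  cpu_name=$(julia --startup-file=no -e 'print(Sys.CPU_NAME)')
else
  cpu_name=""
fi

JULIA_DEPOT_PATH=$(depot_for_cpu "$cpu_name")
export JULIA_DEPOT_PATH
```

With this in place, the login node would use /scratch/user/.julia-znver2 and the compute node /scratch/user/.julia-skylake-avx512, so neither overwrites the other's cache files.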

@jishnub
Contributor Author

jishnub commented Feb 7, 2023

Unfortunately, I don't know enough about setting CPU targets. Does the following option make sense?

JULIA_CPU_TARGET="generic;skylake-avx512,clone_all;znver2,clone_all"

Edit: Unfortunately, this didn't seem to work

@vchuravy
Member

vchuravy commented Feb 7, 2023

> Edit: Unfortunately, this didn't seem to work

Can you elaborate?

@jishnub
Contributor Author

jishnub commented Feb 7, 2023

My bad: this seems to be resolved by setting the environment variable, i.e. export JULIA_CPU_TARGET=generic. Creating multi-microarchitecture system images doesn't seem necessary; I had tried the latter without the former. Could this be made the default?

@KristofferC
Member

KristofferC commented Feb 7, 2023

Using only generic means the cached code will only use quite old instructions, so it's generally better to include some additional targets to multiversion for.

@vchuravy
Member

vchuravy commented Feb 7, 2023

As Kristoffer said, you are basically caching code for the oldest x86_64 CPUs; by default, Julia compiles for native, i.e. the CPU of the machine doing the precompilation.

For multi-CPU environments, I would recommend: export JULIA_CPU_TARGET="generic;skylake-avx512,clone_all;znver2,clone_all"
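The target string is a ";"-separated list of targets, each optionally followed by ","-separated options; clone_all asks for every function to be cloned for that target, with generic as the fallback (see the system-image devdocs on specifying multiple targets). A minimal sketch that assembles the recommended string from a list of CPU names, so it can be extended if more node types are added. The helper cpu_target_for is hypothetical, and the CPU names are assumed to be the ones versioninfo() reports on each node type.

```shell
#!/bin/sh
# Hypothetical helper: build a JULIA_CPU_TARGET string that always starts
# with a "generic" fallback and adds one fully cloned target per CPU name.
cpu_target_for() {
  target="generic"
  for cpu in "$@"; do
    # clone_all: emit a specialised copy of every function for this CPU
    target="${target};${cpu},clone_all"
  done
  printf '%s\n' "$target"
}

# CPU names as reported by versioninfo() on the compute and login nodes.
JULIA_CPU_TARGET=$(cpu_target_for skylake-avx512 znver2)
export JULIA_CPU_TARGET
echo "$JULIA_CPU_TARGET"   # generic;skylake-avx512,clone_all;znver2,clone_all
```

Listing more targets generally means longer precompilation and larger cache files, so it seems worth keeping the list to the CPUs actually present on the cluster and setting the variable identically on every node type (e.g. in a shared shell profile).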

@vchuravy vchuravy added needs docs Documentation for this change is required pkgimage labels Feb 7, 2023
@vchuravy vchuravy self-assigned this Feb 7, 2023
@giordano giordano changed the title "Rejecting cache file" on a Slurm cluster, leading to repeated precompilation "Rejecting cache file" on a heterogeneous cluster, leading to repeated precompilation Feb 8, 2023