
How precompile files are loaded needs to change if using multiple projects is going to be pleasant #27418

Closed
KristofferC opened this issue Jun 4, 2018 · 27 comments · Fixed by #32651
Labels
packages Package management and loading

Comments

@KristofferC
Member

Precompile files are currently stored only based on the UUID of the package.
So if you change your project, it is likely that you will have to recompile everything, and then again when you swap back, etc.
This will be very annoying for people trying to use multiple projects, and people will likely just go back to using one mega project like before.
#26165 also removed any possibility for users to change the precompile path, so there is no way to work around this right now.

We should be smarter about how we save precompile files to reduce the amount of recompilation needed. A very simple system would be to use one precompile directory per project, but that might be a bit wasteful since it is theoretically possible to share compilation files between projects.
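For concreteness, here is a minimal sketch of the "one precompile directory per project" idea (not the scheme that was eventually merged; project_cache_dir is a hypothetical helper, and Base._crc32c / Base.slug are the internal helpers used elsewhere in this thread):

# Hypothetical sketch: a cache directory keyed by the active project and sysimage,
# so switching projects switches directories instead of overwriting cache files.
function project_cache_dir(depot::String = first(DEPOT_PATH))
    crc = Base._crc32c(something(Base.active_project(), ""))            # which project is active
    crc = Base._crc32c(unsafe_string(Base.JLOptions().image_file), crc) # which system image
    return joinpath(depot, "compiled",
                    "v$(VERSION.major).$(VERSION.minor)",
                    Base.slug(crc, 5))
end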

@KristofferC KristofferC added the packages Package management and loading label Jun 4, 2018
@StefanKarpinski
Member

We'll need advice and input from @vtjnash on this one.

@tkf
Member

tkf commented Sep 1, 2018

Could you consider #28518 when fixing it?

Re: implementation, I suppose you can de-duplicate the precompile cache by using a hash tree? What I mean by that is to generate the path of the precompile file using a hash that depends on its own git-tree-sha1 (or version?) and the hashes of all of its dependencies. What I suggested in #28518 was to make it also depend on the package options (JuliaLang/Juleps#38).

ref: JuliaPy/pyjulia#173

@musm
Contributor

musm commented Oct 25, 2018

Chiming in that this would be pretty useful for me.

Argument:
I have a shared dev environment.
What is annoying is that I have to recompile my 'clean' environment whenever I work on the development packages and then switch back to my clean environment, even though none of the packages in the default environment have been touched.

At least an optional flag for new environments not to share the precompile cache would be awesome.

@tkf
Member

tkf commented Oct 30, 2018

As different system images may contain different versions of packages, I suppose it makes sense for the cache path to depend on (say) the path of the system image as well? I think it also helps to decouple stdlib more from Julia core.

@tkf
Member

tkf commented Nov 4, 2018

@StefanKarpinski I don't think implementing what I suggested above #27418 (comment) is difficult. Does this conceptually work?

function cache_path_slug(env::Pkg.Types.EnvCache, uuid::Base.UUID)
    info = Pkg.Types.manifest_info(env, uuid)
    crc = 0x00000000
    if haskey(info, "deps")
        # Fold in the slugs of all dependencies (recursively), in a stable order.
        for dep_uuid in sort(Base.UUID.(values(info["deps"])))
            slug = cache_path_slug(env, dep_uuid)
            crc = Base._crc32c(slug, crc)
        end
    end
    # Then mix in the package's own identity and source tree hash.
    crc = Base._crc32c(uuid, crc)
    if haskey(info, "git-tree-sha1")
        crc = Base._crc32c(info["git-tree-sha1"], crc)
    end
    # crc = _crc32c(unsafe_string(JLOptions().image_file), crc)  # optionally salt with the sysimage path
    return Base.slug(crc, 5)
end

cache_path_slug(Pkg.Types.EnvCache(), Base.identify_package("Compat").uuid)

(By "conceptually", I mean that I'm grossing over that probably Base shouldn't be using Pkg. Also, above function as-is without memoization may be bad for large dependency trees.)

Some possible flaws I noticed:

  • It requires GC. But I guess we can do that in Pkg.gc.
  • It is supposed to share the common sub-trees of the projects. However, unless the projects are updated at the same time, none of them will share a substantial sub-tree (e.g., if they have different Compat.jl versions, they probably do not share anything). I'm not sure how problematic that would be.
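Building on the memoization caveat above, here is one way it could look (a conceptual sketch with the same caveats as the original; cache_path_slug_memo and the memo dictionary are hypothetical, and the Pkg.Types API is the one used in the snippet above):

function cache_path_slug_memo(env::Pkg.Types.EnvCache, uuid::Base.UUID,
                              memo::Dict{Base.UUID,String} = Dict{Base.UUID,String}())
    get!(memo, uuid) do
        info = Pkg.Types.manifest_info(env, uuid)
        crc = 0x00000000
        if haskey(info, "deps")
            for dep_uuid in sort(Base.UUID.(values(info["deps"])))
                # Reuse already-computed slugs for shared dependencies.
                crc = Base._crc32c(cache_path_slug_memo(env, dep_uuid, memo), crc)
            end
        end
        crc = Base._crc32c(uuid, crc)
        if haskey(info, "git-tree-sha1")
            crc = Base._crc32c(info["git-tree-sha1"], crc)
        end
        Base.slug(crc, 5)
    end
end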

@timholy
Member

timholy commented Dec 3, 2018

Related: I was benchmarking julia master vs. a branch using two different directories & builds. The two compete against one another for the ownership of the compiled package files.

@cstjean
Contributor

cstjean commented Dec 3, 2018

FWIW, we found that using a different DEPOT_PATH for each frequently-used environment is a decent (if cumbersome) work-around until there's a fix.

@timholy
Member

timholy commented Dec 3, 2018

That's what I was doing too but recently I ran into a case where, surprisingly, that didn't work. I was rushing and didn't have time to document it, but I will see if I can remember what was involved.

@JeffreySarnoff
Contributor

tangentially adjacent or interwoven?
Every time I make a change to Julia source code in ArbNumerics, Pkg insists on regenerating all the C library files, oblivious to the fact that nothing at all has occurred that would benefit from it.

@timholy
Member

timholy commented Dec 30, 2018

This is also the cause of timholy/Revise.jl#205

@bjarthur
Contributor

yowza. could we please prioritize this with a milestone?

@jpsamaroo
Member

We should be smarter about how we save precompile files to reduce the amount of recompilation needed. A very simple system would be to use one precompile directory per project, but that might be a bit wasteful since it is theoretically possible to share compilation files between projects.

Under what conditions can it be guaranteed that one or more precompile files are shareable? If we can nail down the varying inputs to precompilation, it should at least be possible to put in a hack to stop truly unnecessary precompilations until a better mechanism is devised.
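Not an exhaustive answer, but one way to frame "the varying inputs to precompilation" is as a key that two sessions must agree on before a .ji file can be shared; a hypothetical sketch (precompile_key is not a real Base function):

# Hypothetical sketch of inputs that plausibly determine whether a .ji file is reusable.
precompile_key(pkg::Base.PkgId) = (
    uuid     = pkg.uuid,                                    # which package
    source   = Base.locate_package(pkg),                    # which source tree the environment stack resolves to
    sysimage = unsafe_string(Base.JLOptions().image_file),  # which system image
    julia    = Base.VERSION,                                # cache format follows the Julia version
    # ...plus, recursively, the same information for every dependency (omitted here)
)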

@montyvesselinov

montyvesselinov commented Mar 8, 2019

My home dir is typically shared by many different machines (OS/processor type). I would need a different build location for *.ji files for each machine.

DEPOT_PATH defines the location /Users/monty/.julia

So I will need /Users/monty/.julia-redhat, /Users/monty/.julia-linux, /Users/monty/.julia-ubuntu14, /Users/monty/.julia-ubuntu16, etc.

Is there a better way?
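One possible alternative to fully separate home directories (an untested sketch, assuming newly written compile caches go to DEPOT_PATH[1]): keep the shared ~/.julia, but prepend a machine-specific depot from ~/.julia/config/startup.jl, so each machine writes its *.ji files into its own directory while still finding the shared package sources:

# In ~/.julia/config/startup.jl (sketch): give each machine its own first depot
# for compile caches, keeping the shared ~/.julia later in the path for packages.
machine_depot = joinpath(homedir(), ".julia-machine", Sys.MACHINE)
pushfirst!(DEPOT_PATH, machine_depot)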

@tkf
Member

tkf commented Mar 26, 2019

I'm replying to @jpsamaroo's comment in this discourse thread here, since this discussion belongs here rather than there. Please read my comment (and the follow-up) and @jpsamaroo's comment for the full context.

Therefore, I think initially we should focus on just precompiling each project which is loaded in isolation, before any further activates occur.

I think it does not handle many common cases. For example, if you have using Revise in startup.jl, then you can't capture even the first activate in this scheme. Also, what do you do after the first activate? Switch to a --compiled-modules=no mode (I don't know if you can toggle this flag dynamically)? Since you would also need to address the chicken-and-egg problem in this approach by adding a TOML parser in Base or a persistent cache (or something else) to get dependencies before locating the cache path (those are hard problems on their own), and since we know that this cannot capture many use-cases, I think it makes sense to implement the fully dynamic solution ("in-memory dependency tree") from the get-go.

But I actually don't know if it is such a bad idea as a first implementation. As switching projects triggers precompilation anyway ATM, it is an improvement if Julia automatically turns off recompilation. Also, if people care about reproducibility, maybe they use --project/JULIA_PROJECT most of the time; in that case, full dynamism may not be required for precompilation. Also, a con for the fully dynamic solution is "GC" of *.ji files: it will create more precompilation files than the static solution, and it's hard to know which files are needed or not.

@jpsamaroo
Member

I'd be interested in elaboration on this "in-memory dependency tree" and how it can solve the issue of dynamic activations. I only consider my "solution" a temporary improvement for certain common cases anyway, but you're definitely right that it might make other common cases worse instead of better.

@oxinabox
Contributor

oxinabox commented Mar 26, 2019

I don't see why we can't just have one compile cache directory per exact stack of environments.
At least as a short-term solution.
I feel like this would generally lead to fewer than 3 compile caches per environment.
And sure, it might duplicate a bit of compile time, but it would be less than we have now.

And sure, it would use more hard drive space, but hard drive space is cheap.
Cheaper than my time that I spend waiting for compilation when I switch environments.
Probably would want some gc compilecache all to clear all compile caches,
and maybe gc compilecache dead to clear all compile caches that we can no longer locate all the Manifest.tomls for.
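Mechanically, "one compile cache directory per exact stack of environments" could look something like the following sketch (stack_slug and stack_cache_dir are hypothetical helpers, not what was eventually implemented): hash every entry of the expanded load path into a short slug and use it as a per-stack subdirectory.

# Hypothetical: one compiled/ subdirectory per exact environment stack.
function stack_slug()
    crc = 0x00000000
    for env in Base.load_path()   # fully expanded stack, e.g. active project, @v1.x, @stdlib
        crc = Base._crc32c(env, crc)
    end
    return Base.slug(crc, 5)
end

stack_cache_dir(depot::String = first(DEPOT_PATH)) =
    joinpath(depot, "compiled", "v$(VERSION.major).$(VERSION.minor)", stack_slug())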

@tkf
Member

tkf commented Mar 27, 2019

@oxinabox

I don't see why we can't just have one compile cache directory per exact stack of environments.
At least as a short-term solution.

I think it's not a crazy plan, provided that there is a mechanism to switch to a mode that acts like --compiled-modules=no when precompilation does not work.

To illustrate what I mean by "precompilation does not work", consider the following setup:

Default (named) project v1.2 with packages:

  • A
  • C@1.0 (package C of version 1.0)

custom_project with packages:

  • B
  • C@1.1 (package C of version 1.1)

Further assume that packages A and B both only require C >= 1.0. (custom_project gets C@1.1, e.g., due to the timing at which it was created.)

If you do

julia> using A  # loads C@1.0

pkg> activate custom_project

julia> using B

this Julia session (hereafter Session 1) loads C@1.0 while if you do

pkg> activate custom_project

julia> using A  # loads C@1.1

julia> using B

then this Julia session (hereafter Session 2) loads C@1.1. Notice that at the point of using B, both sessions have exactly the same environment stack. However, if you want to precompile package B, you need to compile it with C@1.0 in Session 1 and with C@1.1 in Session 2.

@jpsamaroo This is what I meant by "in-memory dependency tree." The information that C@1.0 must be used in Session 1 and that C@1.1 must be used in Session 2 is only in the memory of each session. This information has to be passed to the subprocess compiling package B. Actually, "in-memory dependency tree" is misleading and I should have called it "in-memory manifest", which includes the list of exact package versions (or maybe rather the file path to the source code directory of the given version, ~/.julia/packages/$package_name/$version_slug/).

@vtjnash
Member

vtjnash commented Mar 27, 2019

This is all great thinking. Unfortunately, the current issue is just so much more mundane than all that. We actually already have all of that great "in-memory dependency tree" logic and stacks of caches and more! So what's the problem, since that's clearly not working for the default user experience? Well, at the end of the precompile step, it goes and garbage collects the old files right away. So there's nary a chance for it to survive for even a brief moment to be found later and used. If it only could just stop doing that until some later explicit step (like the brand new Pkg.gc() operation), life would be much happier for everyone.

@jpsamaroo
Member

Right, that's a good point. But we do still need to ensure we know how to locate the previously-generated *.ji files deterministically in a manner that is guaranteed to load the correct ones. Currently it seems this issue is avoided by blowing everything away and starting from scratch the moment any little thing changes with respect to the conditions that generated the previous *.ji files.

@tkf
Member

tkf commented Mar 28, 2019

We actually already have all of that great "in-memory dependency tree" logic and stacks of caches and more!

@vtjnash Do you mind letting us know where it is implemented? The closest thing I could find was Base._concrete_dependencies, but it only records the pair of PkgId and build_id. IIUC, the actual dependencies are still recorded in the header of the cache file (together with their build_ids). That's great for integrity checks, but it looks to me like there are no dependencies (a list of upstream package UUIDs and versions for each package) stored in memory.

@staticfloat
Member

@vtjnash It would be great if you could elucidate a little more concretely what needs to change inside of Base; I don't quite follow. Clearly the naming of precompile files needs to change, and I think what you're saying is that we need a way to determine which precompile files are used and which are not, so that we don't just slowly fill up a disk with stale precompile caches?

@staticfloat
Member

Another perspective: there are situations where having user control over which precompile file gets loaded is desirable. Let us imagine a user wanting to distribute a docker container with Julia GPU packages pre-installed; the Julia GPU packages need to do some setup when they see a new generation of GPU hardware attached, and so right now in the docker container we are forced to set JULIA_DEPOT_PATH=~/.julia_for_hardware_x, precompile for all different configurations in a for loop (with different hardware attached each time), then ship the whole thing to the user. (This is to avoid needing to precompile every time you launch the docker container.)

It would be much preferable if there were some kind of mechanism that allowed packages to expose a user-defined function that gets called to add some salt into the hash; an extremely coarse-grained version could be an environment variable JULIA_CODELOAD_SALT=hardware-x, which would then shift ALL precompile files by the hash of that string (thereby saving the space that multiple full depots would use), but I could imagine finer-grained versions as well.

Of course, the problem of how to intelligently garbage collect these files remains.

@tkf
Member

tkf commented Jun 6, 2019

Yes, it would be nice to integrate this with package options JuliaLang/Juleps#38

Meanwhile, you can build a patched system image with which you can add arbitrary salt via an environment variable. This works because child processes (which precompile Julia packages) inherit environment variables. More precisely, here is the code snippet that does this (used in jlm; a similar trick is also used in PyJulia):

Base.eval(Base, quote
    # Override Base.package_slug so that the precompile cache path also depends
    # on the system image and on an arbitrary salt taken from the environment.
    function package_slug(uuid::UUID, p::Int=5)
        crc = _crc32c(uuid)
        crc = _crc32c(unsafe_string(JLOptions().image_file), crc)
        crc = _crc32c(get(ENV, "JLM_PRECOMPILE_KEY", ""), crc)
        return slug(crc, p)
    end
end)

(You can get this system image by running JuliaManager.compile_patched_sysimage("PATH/TO/NEW/sys.so").)

@AndersBlomdell

I would very much like functionality like this for our lab computers, since it would make it possible to have multiple precompiled versions of commonly used libraries. Attached is a simplistic patch that adds the same version slug that is used in
./packages/<name>/<slug>/

loading.jl.patch.txt

@StefanKarpinski
Member

This was already implemented in 2019.

@denizyuret
Contributor

denizyuret commented Sep 10, 2020 via email

@AndersBlomdell

AndersBlomdell commented Sep 10, 2020

This was already implemented in 2019.

Actually, no; loading.jl uses three different slugs:

  • function package_slug(uuid::UUID, p::Int=5) (PKG-SLUG), used to determine what goes in the cache_file_entry.
  • function version_slug(uuid::UUID, sha1::SHA1, p::Int=5) (VER-SLUG), based on the package UUID and directory hash, used for locating the requested package in explicit_manifest_uuid_path.
  • project_precompile_slug (PRJ-SLUG), as defined in function compilecache_path(pkg::PkgId)::String, used to determine where precompiled code lives:

        crc = _crc32c(something(Base.active_project(), ""))
        crc = _crc32c(unsafe_string(JLOptions().image_file), crc)
        crc = _crc32c(unsafe_string(JLOptions().julia_bin), crc)
        project_precompile_slug = slug(crc, 5)

These parts are then used to place package source code in packages/<name>/<VER-SLUG>/ and
precompiled code in compiled/v<MAJOR>.<MINOR>/<name>/<PKG-SLUG>_<PRJ-SLUG>.ji [the validity of the precompiled
code is checked in _require_from_serialized].

With this scheme the number of precompiled files is kept low, since new versions of a precompiled
package will overwrite the old one. There will also be sharing of compatible precompiled code between projects using
the same packages, since every precompiled file starting with <PKG-SLUG>_ is checked before a new precompilation
is done. It is not a good scheme for a shared environment, though; I would rather suggest:

  • introduce a new compiled_slug (CMP-SLUG) based on the data checked in _require_from_serialized (a rough sketch follows this list)
  • place package source code in packages/<name>/<VER-SLUG>/ [i.e. no change]
  • place precompiled code in compiled/v<MAJOR>.<MINOR>/<name>/<VER-SLUG>_<CMP-SLUG>.ji
  • maybe all this should be controlled by some flag, to keep the pressure on the filesystem low for
    systems used by a single individual?
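A rough sketch of what such a compiled_slug could look like (hypothetical; _require_from_serialized validates against the dependency list and build_ids recorded in the .ji header, which is roughly what this folds into a slug):

# Hypothetical: fold the sysimage and the exact dependency build_ids into one slug.
function compiled_slug(deps::Vector{Pair{Base.UUID,UInt64}})
    crc = Base._crc32c(unsafe_string(Base.JLOptions().image_file))
    for (uuid, build_id) in deps
        crc = Base._crc32c(uuid, crc)
        crc = Base._crc32c(string(build_id, base=16), crc)
    end
    return Base.slug(crc, 5)
end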

BTW: the previous loading.jl.patch contained some bugs, so here we go again
julia-loading.jl.patch.txt
