Sequester mutable state outside of package directories #796

Closed
staticfloat opened this issue Oct 5, 2018 · 16 comments

@staticfloat
Member

staticfloat commented Oct 5, 2018

I think it would be desirable to have packages use something like Pkg.package_state_dir(@__MODULE__) as the default location where e.g. binaries, datasets, etc. should be stored. This would be constructed as an overall per-environment state directory (overridable by an environment variable or environment config key perhaps) that then has hashed subdirectories similar to ~/.julia/packages, but explicitly including information to disambiguate Julia's OS, arch, calling ABI, GCC ABI and package options. This has multiple benefits:

  • Packages become more "immutable". It would be lovely to be certain that the entire tree hash of a package directory inside of ~/.julia/packages never changes.

  • Packages get automatically pushed toward greater relocatability. As recent experiments with PackageCompiler have shown, broad-spectrum usage of things like @__FILE__ and @__DIR__ should be discouraged anyway. A common use case is creating a scratch space for binaries (e.g. <pkg dir>/deps/usr), but others exist (downloading datasets, generating Julia code, etc.). Forcing a runtime lookup based on @__MODULE__ is already what we need to do, so this would dovetail nicely.

  • Pkg3 package resolution could technically be arch/OS agnostic. I'm imagining the nightmare scenario where Crazy Charlie has installed three copies of Julia 1.0, one with GCC 8 targeting x86_64, one with GCC 7 targeting x86_64 and one with GCC 6 targeting i686. Technically, we could share the actual Julia package directories, but when run from a Julia with a particular <arch>-<os>-<calling_abi>-<gcc_abi> (increasingly inaccurately named) triplet, the result of Pkg.package_state_dir(@__MODULE__) could mutate accordingly.

  • The storage directory for mutable state could be decoupled from the storage for Julia code. Imagine a heterogeneous cluster with various CPUs and a shared package depot that is provided to all users; not only would the .ji files generated differ on each machine, but the downloaded binaries could differ as well (taking advantage of different ISAs). This separation between Julia code and build/run-time content would solve both the "must provide binaries that work on every platform globally" problem and the "I don't have permissions to modify packages placed in this depot" problem.

  • This could make the "nuclear" state reset option a little easier for users; instead of nuking ~/.julia entirely (as some still do to try and fix stale state problems) they could instead nuke ~/.julia/package_state or whatever we default the location to. This would essentially cause a "rebuild" of every package, as if it were freshly installed, without doing more drastic things like losing the set of installed packages.

  • This would allow for (in my mind) a cleaner workflow for managing package state than the current Pkg.build() system. I would prefer that there is no Pkg.build() and that instead each package is responsible for checking the existence of files within __init__(); this can be done extremely quickly (e.g. isdir(joinpath(Pkg.package_state_dir(@__MODULE__), "usr"))) and should remove one more minor pain point in Pkg, the "This package was not properly installed, please run Pkg.build(<pkg name>)" error message.
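The existence check described in the last bullet could look roughly like the sketch below. Note the assumptions: Pkg.package_state_dir does not exist, so package_state_dir here is a hypothetical stand-in keyed on the module name and Julia version, ensure_installed! is an invented name for the work a package's __init__() would do, and the root directory is a temporary directory so the example is safe to run.

```julia
# Hypothetical stand-in for the proposed Pkg.package_state_dir(@__MODULE__);
# no such function exists in Pkg. `root` plays the role of the per-environment
# state directory.
function package_state_dir(mod::Module; root::AbstractString)
    key = string(hash((string(nameof(mod)), string(VERSION))), base = 16)
    return joinpath(root, key)
end

# What a package's __init__() would do: a cheap isdir() check, with the
# expensive install step running only on a miss.
# Returns true if it had to "install", false if the state was already there.
function ensure_installed!(mod::Module; root::AbstractString)
    usr = joinpath(package_state_dir(mod; root = root), "usr")
    isdir(usr) && return false
    mkpath(usr)   # stand-in for downloading and unpacking binaries
    return true
end

root = mktempdir()
first_run  = ensure_installed!(@__MODULE__; root = root)   # state missing, installs
second_run = ensure_installed!(@__MODULE__; root = root)   # state present, no work
```

Deleting `root` would force the install step to run again on the next load, which is the "rebuild everything" behaviour described in the previous bullet.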

@tkf
Member

tkf commented Oct 8, 2018

when run from a Julia with a particular <arch>-<os>-<calling_abi>-<gcc_abi> (increasingly inaccurately named) triplet, the result of Pkg.package_state_dir(@__MODULE__) could mutate accordingly.

It would be nice if Pkg.package_state_dir(@__MODULE__) depended on package options (JuliaLang/Juleps#38), or were at least designed in such a way that it can depend on arbitrary key-value pairs of strings.

I would prefer that there is no Pkg.build() and instead each package is responsible for checking the existence of files within __init__()

Would it work with Julia packages which require external packages at precompile time? Those packages may need to do some precompile-time metaprogramming to, e.g., define structs depending on the C ABI of the external package (PyCall does it).

@staticfloat
Member Author

It would be nice if Pkg.package_state_dir(@__MODULE__) depends on package options JuliaLang/Juleps#38 or at least designed in such a way that it can depend on arbitrary key-value pairs of strings.

That's a very good thought; I'm going to add it on to the top issue. Probably the whole dict of options gets hashed and mixed in with the other elements determining the storage location.

Would it work with Julia packages which requires external packages at precompile-time? Those packages may need do some precompile-time metaprogrammings to, e.g., define structs depending on the C ABI of the external package (PyCall does it).

Dependent packages should be loaded by the time you __init__() a package. __init__() is run after precompilation; it's a runtime function.

@tkf
Member

tkf commented Oct 8, 2018

the whole dict of options gets hashed

My thoughts exactly!

Dependent packages should be loaded by the time you __init__() a package.

I guess I mistakenly assumed that external libraries would somehow be loaded during __init__(), since you are talking about getting rid of Pkg.build().

@staticfloat
Member Author

Ah, I see what you mean. You want to be sure that e.g. libPython is loadable before PyCall.jl runs its __init__() method. Yes, this should be handled by the next step of BinaryBuilder work that I'm doing: essentially separating binary dependencies out into their own packages (we're calling them .jll packages) so they would be fully initialized by the time __init__() begins for any dependent packages (e.g. PyCall.jl depends on python.jll, so by the time __init__() gets called within PyCall.jl, the python.jll package has had the opportunity to download, install and dlopen() its libPython).

@tkf
Member

tkf commented Oct 11, 2018

I found https://github.com/JuliaPackaging/BinaryBuilder.jl/wiki/Roadmap after writing the post below. It looks like PyCall.jl can just do something like using LibPython.jll: libpython to get the handle. So I guess the following pattern is supported.


Can I access information about python.jll package at precompile time of PyCall.jl? PyCall.jl needs to define struct layout depending on Python version. For example:

struct PyDateTime_CAPI
    # type objects:
    DateType::PyPtr
    DateTimeType::PyPtr
    TimeType::PyPtr
    DeltaType::PyPtr
    TZInfoType::PyPtr

    @static if pyversion >= v"3.7"
        TimeZone_UTC::PyPtr
    end

    # ... and so on ...
end

--- https://github.com/JuliaPy/PyCall.jl/blob/fb88f4d0df66fd2ce1bc4dc862611c355be0e50d/src/pydates.jl#L12-L35

where pyversion is obtained by calling Python C API at precompile time:

const pyversion = vparse(split(Py_GetVersion(libpy_handle))[1])

--- https://github.com/JuliaPy/PyCall.jl/blob/fb88f4d0df66fd2ce1bc4dc862611c355be0e50d/src/startup.jl#L85

PyCall.jl also inspects libpython with hassym at precompile time.
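To make the precompile-time dependency concrete, here is a minimal sketch of the same pattern without PyCall. The names libversion and CapiTable are invented for illustration, and the version is hard-coded where PyCall would query libpython. The point is that @static evaluates its condition when the struct definition is expanded, so the version has to be known before __init__() ever runs.

```julia
# Hypothetical: stands in for a version obtained from the external library.
# It must be a global defined *before* the struct below is expanded.
const libversion = v"3.7.0"

struct CapiTable
    DateType::Ptr{Cvoid}

    # The condition is evaluated at macro-expansion time, so the field
    # layout is fixed once and baked into the precompiled code.
    @static if libversion >= v"3.7"
        TimeZone_UTC::Ptr{Cvoid}
    end
end

layout = fieldnames(CapiTable)
```

Changing the value of libversion after the struct has been defined has no effect on the layout, which is why a runtime-only lookup in __init__() is not enough for this use case.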

@staticfloat
Member Author

staticfloat commented Oct 11, 2018 via email

@stevengj
Member

stevengj commented Oct 12, 2018

Note that BinaryBuilder will probably never be a reasonable option for PyCall. You use Python for the ecosystem, not just libpython, and so we need to have access to a full-featured Python distro like Anaconda.

But we still need persistent per-package options. e.g. PyCall should be able to remember what python you configured it to use. (Currently, in Julia 1.0, it forgets your configuration every time you update PyCall because Pkg fetches a fresh directory.) Package options should go into the Project.toml, probably?

@staticfloat
Member Author

staticfloat commented Oct 12, 2018

I'm going to respond with #777 in mind here:

On the one hand, this issue explicitly does not want to share state between different versions of packages, because the intended use case is for different .jll package versions to be potentially completely different binary versions. On the other hand, it would be really nice to avoid needing to download and install stuff twice.

Oh if only we had some way of uniquely identifying the content we want to download/store, and we could use that unique identifier to key us into a directory! Oh wait, that's Stefan's content-addressable filesystem idea. So, new API idea that might satisfy everyone here: just pass a hash to Pkg.package_state_dir(), and what you use to build that hash determines the lifecycle/sharing of your data. Examples (I'm using hash() here as pseudo-code; not anything concrete):

  • Pkg.package_state_dir(hash(basename(pathof(@__MODULE__)))): Scratch space that is shared across all installations of this package.

  • Pkg.package_state_dir(hash(libfoo_tarball_hash, libbar_tarball_hash)): My .jll package requests space keyed off of the content hashes of the tarballs I'm going to extract into it.

  • Pkg.package_state_dir(hash(basename(pathof(@__MODULE__)), version.major, version.minor)): SemVer-aware grouping.

  • Pkg.package_state_dir(hash(basename(pathof(@__MODULE__)), options_dict)): Options-keyed directory.

I think it makes a lot of sense to "deduplicate" based on a hash that the user passes to package_state_dir().
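A small sketch of how the choice of key determines sharing, in the spirit of the examples above. Everything here is an assumption for illustration: state_dir is a hypothetical helper rather than a Pkg API, the root path is a placeholder that is never created on disk, and the package name and option values are made up.

```julia
# Hypothetical helper: fold arbitrary key parts into one UInt64 and use the
# hex string as a directory name. Nothing is created on disk.
function state_dir(root::AbstractString, parts...)
    h = zero(UInt64)
    for p in parts
        h = hash(p, h)
    end
    return joinpath(root, string(h, base = 16))
end

root = joinpath("depot", "package_state")   # placeholder path

# Package-wide scratch space: the key ignores the version entirely,
# so every installed version of Foo maps to the same directory.
shared_a = state_dir(root, "Foo")
shared_b = state_dir(root, "Foo")

# SemVer-aware grouping: only major and minor are in the key, so patch
# releases share a directory while minor releases do not.
v1_2_0 = state_dir(root, "Foo", 1, 2)
v1_2_1 = state_dir(root, "Foo", 1, 2)
v1_3_0 = state_dir(root, "Foo", 1, 3)

# Options-keyed: a different option value yields a different directory.
opt_a = state_dir(root, "Foo", ["python" => "/usr/bin/python3"])
opt_b = state_dir(root, "Foo", ["python" => "/opt/conda/bin/python"])
```

One caveat with this sketch: Base's hash() is not guaranteed to produce the same value across Julia versions, so a real implementation would probably want a stable content hash for anything that should survive a Julia upgrade.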

@tkf
Member

tkf commented Oct 12, 2018

You use Python for the ecosystem, not just libpython

@stevengj If you want only Python packages, I think BinaryBuilder could be a reasonable option to install the python command and libpython. Once the python command is installed, reproducible Python environments can be constructed using Pipenv (which can already be done if JuliaPy/PyCall.jl#578 is merged). Pipenv is much closer to Pkg3 in design. You don't need to manage mutable state yourself, and all the data needed to reproduce the Python environment is in two text files (actually JSON and TOML).

However, this is only for Python packages available from PyPI. For example, you can't install Node.js from PyPI (which is required for installing JupyterLab extensions). But this probably can be covered by BinaryBuilder directly?

Pkg.package_state_dir(hash(basename(pathof(@__MODULE__)))): Scratch space that is shared across all installations of this package.

@staticfloat Yeah, that's what I was thinking when connecting this to #777. Maybe it could be Pkg.package_state_dir(hash(Base.PkgId(@__MODULE__))) but the idea is essentially the same.

@staticfloat
Member Author

Base.PkgId(@__MODULE__)

Yes, that is clearly superior. :)

@staticfloat
Member Author

Ah, I forgot another benefit of this; right now we get shared package state when you dev Foo within the default environment from two different Julia versions. While the resolver will check to make sure that the Julia code is marked as satisfying all constraints, this can cause serious problems with two different versions of Julia built with two different versions of GCC.

So it's important to not only allow the user to specify what keys the package_state_dir(), but also make it easy to (and probably default to) key off of Julia ABI stuff.

Updated API proposal:

Pkg.package_state_dir(things_to_be_hashed...; include_version::Bool = true, include_ABI::Bool = true)

Where things_to_be_hashed gets intelligently combined through a hash function, and the flags signify inclusion of information about Julia's version and ABI. These would be true by default, but if set to false then a package could be shared across Julia versions (within the same environment). Fleshing this out a little bit more, hashing should be fine with the UInt64 based hashes we use with hash() in Base (to get a 1-in-a-million chance of a collision, you need to have 6 million packages installed), so we could define this as something similar to:

function package_state_dir(things_to_be_hashed...; include_version::Bool = true, include_ABI::Bool = true)
    h = UInt64(0)
    for t in things_to_be_hashed
        h = hash(t, h)
    end
    if include_version
        h = hash(Base.VERSION, h)
    end
    if include_ABI
        # We would perhaps want to integrate this logic into Pkg
        h = hash(BinaryProvider.triplet(BinaryProvider.platform_key_abi()), h)
    end

    return joinpath(Pkg.data_dir(), string(h, base=16))
end
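The "6 million packages" figure can be checked with the standard birthday-bound approximation, which says that for n random b-bit hashes the collision probability is roughly n^2 / 2^(b+1) when that probability is small. The function names below are invented for this check.

```julia
# Birthday-bound approximation: with n random `bits`-bit hashes,
# P(collision) ≈ n^2 / 2^(bits + 1) for small probabilities.
collision_probability(n, bits) = n^2 / 2.0^(bits + 1)

# Inverse of the above: how many items give collision probability p?
items_for_probability(p, bits) = sqrt(p * 2.0^(bits + 1))

# Number of 64-bit hashes needed for a 1-in-a-million collision chance.
n = items_for_probability(1e-6, 64)
```

Under this approximation n comes out at a little over 6 million, which agrees with the estimate quoted above.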

@tkf
Member

tkf commented Oct 13, 2018

@staticfloat Actually, using Pkg.package_state_dir for both BinaryProvider and (say) Conda would make it hard to:

instead of nuking ~/.julia entirely (as some still do to try and fix stale state problems) they could instead nuke ~/.julia/package_state

because re-installing conda takes more time than downloading some binaries. Also, current Conda.jl has no mechanism for re-creating the same environment (at the moment).

It's probably better to have two kinds of state directories, like XDG_DATA_HOME (default: ~/.local/share) and XDG_CACHE_HOME (default: ~/.cache) (and maybe also something similar to /var/ for e.g. *.log and *.jl.mem). For example, call them ~/.julia/data and ~/.julia/cache (where ~/.julia would be replaced by DEPOT_PATH[1] in practice). The distinction is that wiping out ~/.julia/cache is safe, in the sense that ]instantiate (or something) brings you back to an equivalent environment, while there is no such guarantee for ~/.julia/data. The directory ~/.julia/data is for highly mutable data, e.g. Conda.jl's environment and login authentication data for GitHub integration. The directory ~/.julia/cache would be useful for BinaryProvider and also something like InstantiateFromURL.
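As a rough illustration of that split, the sketch below keeps the two trees side by side and then wipes only the cache. All names are assumptions: data_dir and cache_dir are hypothetical helpers, depot stands in for DEPOT_PATH[1] (a temporary directory here, so nothing real is touched), and the file contents are dummies.

```julia
# `depot` stands in for DEPOT_PATH[1]; a temp dir keeps this side-effect free.
depot = mktempdir()

# Hypothetical helpers: each creates (if needed) and returns a per-package dir.
data_dir(pkg)  = mkpath(joinpath(depot, "data", pkg))    # must survive cleanup
cache_dir(pkg) = mkpath(joinpath(depot, "cache", pkg))   # safe to wipe and rebuild

# Something that cannot be regenerated automatically, e.g. an auth token.
write(joinpath(data_dir("GitHub"), "token"), "dummy-token")

# Something that can simply be downloaded again, e.g. a binary.
write(joinpath(cache_dir("Foo"), "libfoo.so"), "dummy-binary")

# "Nuclear" reset: remove only the cache tree and leave the data tree alone.
rm(joinpath(depot, "cache"); recursive = true)

token_survived = isfile(joinpath(depot, "data", "GitHub", "token"))
cache_gone     = !isdir(joinpath(depot, "cache"))
```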

I don't know if discussing "~/.julia/data" here is preferred, but it's probably better to keep in mind that there may be other kinds of data/state directories when deciding the name under ~/.julia.

On the other hand, the above specification may sound overcomplicated (especially considering XDG compliance was rejected before). That's why I suggested #777; Pkg.jl could be agnostic about what each package does and just provide a scratch space for it. Each package can then implement its own state/data strategy like package_state_dir.

But since BinaryProvider should be working with Pkg closely, it may not be optimal here. So maybe just forget about making this a public API and expose to JuliaPackaging as semi-public API?

@staticfloat
Member Author

because re-installing conda takes more time than downloading some binaries.

I don't think this is a good reason to make "clearing state" not clear the Conda installation data. It seems to me that Conda.jl installed packages should be treated exactly the same way as BinaryProvider-downloaded packages; I don't see a clear difference between them.

@tkf
Member

tkf commented Oct 20, 2018

Right, that was not appropriate reasoning. What I was trying to point out was that there is information/data more important than external libraries. In the case of Conda.jl, that would be the version numbers and package origins. Although there is no easy, direct way to do this in conda at the moment, a conda environment can have something like Project.toml/Manifest.toml. It would be a bad idea to store such information in the same directory where the binaries are stored. Another such example is authentication information: e.g., GitHub.jl can store an authentication token in some directory, but you wouldn't want to re-authenticate just because you cleaned the directory for the external libraries.

@fredrikekre
Member

fredrikekre commented Aug 20, 2019

Stefan's notes from triage:

Mutable state in packages

@staticfloat wants a way to generate artifacts outside of packages
Let packages generate/access a workspace
Workspace keyed by package UUID
Packages often want a scratch space
Examples:

  • big squashfs images need patching to match current user
    • so caching of expensive work
  • Conda.jl wants to persist across versions
    • might be better to share between versions

“Lifecycled caches”: ~/.julia/caches

@StefanKarpinski: what if the workspace is a project?

  • or maybe a project with an Artifacts.toml file
  • the actual data goes in there as artifacts

Do we want levels of caching:

  • mutable workspace needs to be per-user
  • does a more system-wide cache make sense?

staticfloat added a commit that referenced this issue May 20, 2020
This implements basic functionality and tests for a new `Caches`
subsystem in `Pkg`; analogous to the `Artifacts` added in 1.3, this
provides an abstraction for a mutable datastore that can be explicitly
lifecycled to an owning package.

Closes #796
staticfloat added a commit that referenced this issue May 21, 2020
This implements functionality and tests for a new `Spaces`
subsystem in `Pkg`; analogous to the `Artifacts` added in 1.3, this
provides an abstraction for a mutable datastore that can be explicitly
lifecycled to an owning package, or shared among multiple packages.

Closes #796
staticfloat added a commit that referenced this issue May 22, 2020
This implements functionality and tests for a new `Spaces`
subsystem in `Pkg`; analogous to the `Artifacts` added in 1.3, this
provides an abstraction for a mutable datastore that can be explicitly
lifecycled to an owning package, or shared among multiple packages.

Closes #796
staticfloat added a commit that referenced this issue May 22, 2020
This implements functionality and tests for a new `Scratch`
subsystem in `Pkg`; analogous to the `Artifacts` added in 1.3, this
provides an abstraction for a mutable datastore that can be explicitly
lifecycled to an owning package, or shared among multiple packages.

Closes #796
staticfloat added a commit that referenced this issue Jun 3, 2020
This implements functionality and tests for a new `Scratch`
subsystem in `Pkg`; analogous to the `Artifacts` added in 1.3, this
provides an abstraction for a mutable datastore that can be explicitly
lifecycled to an owning package, or shared among multiple packages.

Closes #796
@staticfloat
Member Author

And with the official announcement of Scratch.jl, I think this can be closed. :)


4 participants