Storing Chains Retrievably in Stable Data Formats #299

Closed
farr opened this issue Apr 29, 2021 · 13 comments

@farr

farr commented Apr 29, 2021

Currently the only ways to store a chain are serialization (which is not an archival solution, since deserialization depends on machine details and the Julia version, or else requires external translation libraries) or translation to a different data format entirely (e.g. a DataFrame, which loses information about internal sampling parameters or requires awkward conversions). I have written the following code for storing chains in HDF5, which I use all the time. It stores the parameters and internal parameters of the chain in a straightforward way and can handle arbitrary sections. The chain can be stored as the root element of an HDF5 file, or inside any group of a larger HDF5 file, as long as that group is dedicated to it (multiple chains can't share a single group).

Do people think code like this, or similar, could make it into the MCMCChains.jl library?

using HDF5
using MCMCChains
import Base: read, write

"""
    write(f::Union{HDF5.File, HDF5.Group}, c::Chains)

Write an MCMCChains `Chains` object to the HDF5 file or group.
"""
function Base.write(f::Union{HDF5.File, HDF5.Group}, c::Chains)
    # One HDF5 group per section, one compressed dataset per parameter.
    for s in sections(c)
        g = create_group(f, string(s))
        for n in names(c, s)
            g[string(n), shuffle=(), compress=3] = Array(c[n])
        end
    end
end

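"""
    read(f::Union{HDF5.File, HDF5.Group}, ::Type{Chains})

Read a `Chains` object from the given HDF5 file or group.
"""
function Base.read(f::Union{HDF5.File, HDF5.Group}, ::Type{Chains})
    param_names = []
    datas = []
    name_map = Dict()
    # Each group is a section; each dataset inside it is one parameter.
    for s in keys(f)
        section_names = keys(f[s])
        name_map[Symbol(s)] = section_names
        for n in section_names
            push!(param_names, n)
            push!(datas, read(f[s], n))
        end
    end

    # Each dataset was written as an (iterations, chains) matrix.
    niter, nchains = size(datas[1])
    nparams = length(datas)

    # Chains expects an (iterations, parameters, chains) array.
    a = zeros(niter, nparams, nchains)
    for i in 1:nparams
        a[:, i, :] = datas[i]
    end

    Chains(a, param_names, name_map)
end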
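
For anyone skimming, the on-disk layout this produces is one HDF5 group per section and one dataset per parameter. Here is a minimal sketch of that layout using plain HDF5.jl calls. The section name `parameters`, the parameter name `mu`, the file name, and the random data are all made up for illustration, and compression is left out for brevity:

```julia
using HDF5

# Stand-in for one parameter's draws: 100 iterations x 3 chains (synthetic data).
samples = randn(100, 3)
path = joinpath(mktempdir(), "layout.h5")  # hypothetical file name

h5open(path, "w") do f
    g = create_group(f, "parameters")  # one group per section
    g["mu"] = samples                  # one dataset per parameter
end

# Reading back: group names recover the sections, dataset names the parameters.
section_names, param_names, restored = h5open(path, "r") do f
    keys(f), keys(f["parameters"]), read(f["parameters"], "mu")
end
```

Because only plain numeric arrays and string names are stored, a file like this should be readable from any HDF5 client, not just Julia.
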
@devmotion
Member

I agree, the current state is a bit unsatisfying.

> serialization (which is not an archival solution, as deserialization is dependent on machine details and Julia version, or else requires external translation libraries)

The docs mention JLSO, which also saves additional metadata such as the Julia version and the package versions, so that the environment used for serialization can be reproduced. Maybe that could solve some of your issues?

Additionally, the Tables interface is implemented for Chains, so every output format that supports this interface (such as DataFrames) is supported automatically (although in this case only the samples are available, not the metadata).
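
As a rough illustration of the Tables route, assuming DataFrames.jl is installed (the chain here is synthetic and the parameter names `a` and `b` are placeholders):

```julia
using MCMCChains, DataFrames

# Synthetic chain: 100 iterations, 2 parameters, 3 chains.
chn = Chains(randn(100, 2, 3), [:a, :b])

# Chains implements the Tables.jl interface, so it can be passed to any sink.
df = DataFrame(chn)
```

The resulting table holds the draws only; section assignments and other chain metadata are not carried along.
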

Personally, I think this package is already quite heavy and hence I think one should not add a dependency on HDF5 (or other storage formats some other users might be interested in). However, I guess it could be helpful to define it in a separate package (or at least mention it in the docs, but I guess a package would be good).

@cpfiffer
Member

I would love to see a separate package that handles saving to HDF5. I absolutely hate that we serialize chains to save them.

How big is the HDF5 dep?

@farr
Author

farr commented Apr 29, 2021

> How big is the HDF5 dep?

I don't really know. It pulls in the HDF5 C library, which is mature but may not be small; HDF5 definitely offers a lot of infrastructure around parallel writes/reads. On my system,

(base) wfarr@C02WW0Q2HV2V O3aPISN % du -hs ~/.julia/packages/HDF5/cDXRT 
600K	/Users/wfarr/.julia/packages/HDF5/cDXRT
(base) wfarr@C02WW0Q2HV2V O3aPISN % du -hs ~/.julia/packages/HDF5_jll/BGk9m 
 44K	/Users/wfarr/.julia/packages/HDF5_jll/BGk9m

which doesn't seem like much.

@farr
Author

farr commented Apr 29, 2021

But it would be pretty easy for me to spin this off into a separate package if you would prefer that. (I think I'm still not totally in the "Julia frame of mind" with respect to lots of very lightweight packages.)

@devmotion
Member

I am particularly afraid of the binary dependency, which is apparently quite challenging to build: JuliaPackaging/Yggdrasil#567

@devmotion
Member

But a separate package would be really cool (it's the perfect example of a glue package, hopefully at some point this can be handled better: JuliaLang/Pkg.jl#1285).

@farr
Author

farr commented Apr 29, 2021

Oof---had no idea HDF5 was so hard to build. OK. I'll try to spin it out into a separate package that depends on MCMCChains and HDF5. Will let you all know if/when it's ready, so you can mention it in the docs / link it from here.

@farr farr closed this as completed Apr 29, 2021
@devmotion
Member

Great, I am looking forward to it!

@cpfiffer
Member

Me too, I'm very excited.

@farr
Author

farr commented May 3, 2021

OK, here it is (I'll re-open this issue when it's accepted in the general package repository); for now it's only installable from git:

https://github.com/farr/MCMCChainsStorage.jl
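
A minimal round-trip sketch, assuming the package exposes the same `write(f, chain)` and `read(f, Chains)` methods posted at the top of this issue (the chain is synthetic, and the file name and parameter names are placeholders):

```julia
using MCMCChains, MCMCChainsStorage, HDF5

# Synthetic chain: 100 iterations, 2 parameters, 3 chains.
chn = Chains(randn(100, 2, 3), [:a, :b])
path = joinpath(mktempdir(), "chain.h5")  # hypothetical file name

# Store the chain as the root element of the file.
h5open(path, "w") do f
    write(f, chn)
end

# Read it back as a Chains object.
restored = h5open(path, "r") do f
    read(f, Chains)
end
```
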

@farr
Author

farr commented May 13, 2021

MCMCChainsStorage is accepted into the general package repository. Could we add something like

The [MCMCChainsStorage.jl](https://github.com/farr/MCMCChainsStorage.jl) package also provides the ability to serialize/deserialize a chain to an HDF5 file across different versions of Julia and/or different system images.

As the final sentence of the first paragraph here?

@cpfiffer
Member

Sure, I opened a PR at #304 for this.

@farr
Author

farr commented May 13, 2021

Excellent---I see that the change in #304 has been merged into the docs, so I'm going to close this issue, too (again). Thanks so much!

@farr farr closed this as completed May 13, 2021