Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Write Julia types as HDF5 compound types #132

Merged
merged 30 commits into from
Aug 25, 2014
Merged

Conversation

simonster
Copy link
Member

This is a major change to the serialization format of JLD. Rather than writing Julia types as arrays of references, we translate Julia types to HDF5 compound types and write those. Following the discussion in #27, everything Julia would store inline is presently stored inline in the compound type, and everything Julia would store as a pointer to another object is stored as an HDF5 reference to another object. This PR also incorporates the changes from #102.

This presently passes all read and write tests, but there is plenty more work to be done:

  • We cannot yet read JLD files created with earlier versions of JLD or convert them to the new format.
  • There is no validation to ensure that HDF5 types match Julia types. Thus, if a Julia type changes, JLD will probably segfault on readout.
  • There is no ability to read data if the type is not available in the current workspace. The structure of the compound type is sufficient to reconstruct the Julia type hierarchy (except for the types of reference fields) but this is not yet implemented.
  • The rootmodule keyword argument to write is not yet supported. This is a bit painful to deal with if the user decides to write the same type with and without the rootmodule keyword, since we would need to create two distinct compound types. @timholy, can you explain the use case for this so I can think about whether there's an easier design that could accomplish the same ends?
  • This code has a lot more cases to handle than the old code and it is unlikely that the existing tests cover all of them.
  • Adapt getindex/setindex! methods. These are missing tests.
  • Reimplement dump.
  • Delete datasets when no more references to them remain. (Maybe by some means analogous to mark and sweep?)
  • In principle, for pointerfree Julia types, HDF5 compound types could be generated with the same padding so that they exactly match the Julia type. This would avoid a copy when reading/writing and also allow mmapping arrays directly from the file, at the cost of saving the padding to the file.
  • In some cases where Julia stores fields as pointers to objects, we may want to store the contents inline in JLD. Two such cases are fields that are non-pointerfree immutables and fields that are leaf type tuples. The main complication to storing such fields inline in JLD is that they could be undefined.
  • It's possible that we could register our conversion functions with libhdf5 instead of calling them ourselves. This might result in less memory usage, although I'm not sure how it would interact with garbage collection.
  • I have to fix some things for Julia 0.2 if we want to maintain compatibility.

Implementation notes:

  • Compound types are committed with numeric indices in the _types group, with a julia type attribute that indicates the corresponding Julia type. Because libhdf5 loses track of compound type hierarchy between closing and reopening a file, numeric suffixes indicate the corresponding compound type index for a field that is itself a compound type.
  • Code to convert between a given Julia type and the corresponding HDF5 compound type and vice versa is dynamically generated. This should be very efficient after warmup. The code that does this needs to be inspected very carefully, since it does some somewhat unclean things, especially for non-pointerfree immutables.
  • Empty types/immutables and arbitrary bitstypes are stored as HDF5 opaque types. A special case throws an error when a pointer is encountered. As a result, most write methods now throw an error when they encounter a pointer rather than displaying a warning. The variant of @save that saves all top-level variables catches the error and displays a warning instead.

@simonster simonster changed the title Write Julia types as HDF5 compound types WIP: Write Julia types as HDF5 compound types Aug 13, 2014
@timholy
Copy link
Member

timholy commented Aug 13, 2014

Oh, my!

We cannot yet read JLD files created with earlier versions of JLD or convert them to the new format.

If plain.jl is at least backwards-compatible, then could we just duplicate the whole old jld.jl file and call it JLDOld? (not exported, naturally)

There is no validation to ensure that HDF5 types match Julia types. Thus, if a Julia type changes, JLD will probably segfault on readout.

I could have sworn that once I had (or was working on) a facility to go through and check the types when the file was first opened. That way you'd only have to pay the price once, rather that upon reading each object. I've checked every branch I can find, and no trace of it. So maybe I imagined it.

I don't think issue should block merger of this feature, but certainly it's got to go on the TODO list.

There is no ability to read data if the type is not available in the current workspace. The structure of the compound type is sufficient to reconstruct the Julia type hierarchy (except for the types of reference fields) but this is not yet implemented.

I'm a bit leery of constructing the types because it seems that then you could get in a situation where you have no method(::MyType, ...) and yet you have such a method for a type that shares the same name, but is not really the same type object. But maybe this would be safe. Anyway, again I'm fine with it like this. That's basically what addrequire is about.

The rootmodule keyword argument to write is not yet supported. This is a bit painful to deal with if the user decides to write the same type with and without the rootmodule keyword, since we would need to create two distinct compound types. @timholy, can you explain the use case for this so I can think about whether there's an easier design that could accomplish the same ends?

If memory serves, its whole purpose is basically to allow me to write this test. If there's another good way to do that, so much the better. I assume you saw this documentation as well.

In principle, for pointerfree Julia types, HDF5 compound types could be generated with the same padding so that they exactly match the Julia type. This would avoid a copy when reading/writing and also allow mmapping arrays directly from the file, at the cost of saving the padding to the file.

That would be so awesome, I wouldn't know what to say. But would that be portable across machines? If we had to, I guess we could test it upon file opening? However, I don't know how to get the alignment of fields in a julia type.

In some cases where Julia stores fields as pointers to objects, we may want to store the contents inline in JLD. Two such cases are fields that are non-pointerfree immutables and fields that are leaf type tuples. The main complication to storing such fields inline in JLD is that they could be undefined.

That would just be showing off 😄.

I have to fix some things for Julia 0.2 if we want to maintain compatibility.

I'm fine with just "freezing" 0.2 at a particular commit. Since PkgEvaluator will shortly not even be testing 0.2, I think that's the safest procedure anyway.

The implementation sounds awesome. I'll look over the code next.

@simonster
Copy link
Member Author

We cannot yet read JLD files created with earlier versions of JLD or convert them to the new format.

If plain.jl is at least backwards-compatible, then could we just duplicate the whole old jld.jl file and call it JLDOld? (not exported, naturally)

I think this is a good idea. Maybe we can share the parts of the code not related to reading/writing, but everything else is basically a rewrite anyway.

There is no validation to ensure that HDF5 types match Julia types. Thus, if a Julia type changes, JLD will probably segfault on readout.

I could have sworn that once I had (or was working on) a facility to go through and check the types when the file was first opened. That way you'd only have to pay the price once, rather that upon reading each object. I've checked every branch I can find, and no trace of it. So maybe I imagined it.

We already only create the conversion functions once for each type, so this code can just go where that happens (in jldatatype). The simplest thing to do would just be to do the mapping from Julia to HDF5 type (without committing the HDF5 type) and then check that the two match.

I don't think issue should block merger of this feature, but certainly it's got to go on the TODO list.

I think we should have this before merging. Segfaulting is bad to begin with, but it's also possible that this can be exploited for code execution.

There is no ability to read data if the type is not available in the current workspace. The structure of the compound type is sufficient to reconstruct the Julia type hierarchy (except for the types of reference fields) but this is not yet implemented.

I'm a bit leery of constructing the types because it seems that then you could get in a situation where you have no method(::MyType, ...) and yet you have such a method for a type that shares the same name, but is not really the same type object. But maybe this would be safe. Anyway, again I'm fine with it like this. That's basically what addrequire is about.

I'd like to implement some way to read data without the types before merging, to avoid catastrophe in the case where you have a data file but not the code and so you can no longer get the data out. (I don't currently have anything like readsafely in this branch.) Dispatch isn't going to work no matter what we return, because the methods aren't there. It's probably comparably difficult to do either the type-based approach or the Dict approach. I don't have a strong opinion on this.

The rootmodule keyword argument to write is not yet supported. This is a bit painful to deal with if the user decides to write the same type with and without the rootmodule keyword, since we would need to create two distinct compound types. @timholy, can you explain the use case for this so I can think about whether there's an easier design that could accomplish the same ends?

If memory serves, its whole purpose is basically to allow me to write this test. If there's another good way to do that, so much the better. I assume you saw this documentation as well.

It would be reasonably easy to add a facility to register a specific type so that its module path (and the module paths of the types it contains) is truncated on write. Would that suffice?

In principle, for pointerfree Julia types, HDF5 compound types could be generated with the same padding so that they exactly match the Julia type. This would avoid a copy when reading/writing and also allow mmapping arrays directly from the file, at the cost of saving the padding to the file.

That would be so awesome, I wouldn't know what to say. But would that be portable across machines? If we had to, I guess we could test it upon file opening? However, I don't know how to get the alignment of fields in a julia type.

fieldoffsets (which I implemented in Base when I first attempted #27) does this. I think alignment is the same for all Julia-supported types on all 64-bit x86 systems, but maybe not for doubles on 32-bit. We could implement conversion or libhdf5 might be able to do it for us, although we can obviously only mmap if the layout is the same. Eventually we may also need endianness conversion, but that's probably not worth worrying about until Julia runs on a big endian architecture.

In some cases where Julia stores fields as pointers to objects, we may want to store the contents inline in JLD. Two such cases are fields that are non-pointerfree immutables and fields that are leaf type tuples. The main complication to storing such fields inline in JLD is that they could be undefined.

That would just be showing off 😄.

Thinking more about this, we may also have a problem if there are null string fields or undefs in a string array for a similar reason.

@timholy
Copy link
Member

timholy commented Aug 13, 2014

The simplest thing to do would just be to do the mapping from Julia to HDF5 type (without committing the HDF5 type) and then check that the two match.

Works for me.

I think we should have this before merging. Segfaulting is bad to begin with, but it's also possible that this can be exploited for code execution.

OK, definitely sounds important.

It's probably comparably difficult to do either the type-based approach or the Dict approach. I don't have a strong opinion on this.

If you prefer the type-based approach, that's fine. I agree that it won't work without user help no matter what we choose.

It would be reasonably easy to add a facility to register a specific type so that its module path (and the module paths of the types it contains) is truncated on write. Would that suffice?

Yes

fieldoffsets (which I implemented in Base when I first attempted #27) does this.

Oh, awesome! Since libhdf5 will implement the conversion for us when we can't mmap, this sounds ideal.

Also convert full_typename to use IOBuffer, to avoid some allocations
DataArrays should probably not be trying to perform a conversion if
the type is the same, but we can trivially avoid calling convert in the
first place.
@mbauman mbauman mentioned this pull request Aug 14, 2014
Avoid boxing due to pointer instability. This branch is now beating
Base.serialize by >2x on the test case from
JuliaData/DataFrames.jl#667
We will almost certainly need a different strategy here if we want to
perform within 1 OOM of serialize
@simonster
Copy link
Member Author

So, I did some benchmarking of this branch. The good news is that we appear to be faster than serialize/deserialize and allocate less memory for arrays of strings and numeric data, sometimes by a significant margin, and not much slower for immutables. (We can probably make up the difference if I implement saving with the struct layout.) The files we generate are also generally not enormous. The bad news is that we are >30x slower than serialize at saving arrays of small arrays, and a decent proportion of that time is spent in libhdf5. I may need to resurrect the array of arrays optimization from 3a95493. But first I'll work on items 2 and 3 on the list above.

cc @jiahao and @jakebolewski, in case you're interested. (Sometime I should get you to give me a julia.mit.edu account so I can test on your datasets.)

@timholy
Copy link
Member

timholy commented Aug 15, 2014

Just checking in to say I'm sorry this is taking me so long to get to. I am pretty swamped with coding that my lab needs done ASAP, and this PR is a big body of work. But I'll try to give this a serious review over the weekend.

@simonster
Copy link
Member Author

No worries. It's also fine with me if you wait until I finish items 2 and 3 before reviewing; I don't yet know how big those changes will be.

@simonster
Copy link
Member Author

And now for a stupid libhdf5 performance issue: reading out an array of objects is >200x slower after the file is closed and re-opened again. Before closing the file, I can read 1000 items in 10 ms. After closing the file, it takes 2 ms to read each item. It seems that getting the name of the datatype for each object takes immeasurably (at least thousands of times) longer. It may be time to take a trip to the libhdf5 source.

@timholy
Copy link
Member

timholy commented Aug 15, 2014

Oh, wow. That's frightening.

@jakebolewski
Copy link

Are you purging the linux buffer cache when benchmarking? Something has to
be caching the writes.

On Friday, August 15, 2014, Tim Holy notifications@github.com wrote:

Oh, wow. That's frightening.


Reply to this email directly or view it on GitHub
#132 (comment).

@jakebolewski
Copy link

This seems to be a huge improvement !

For one of the examples metioned in serialization thread, these are the times I'm getting:

julia> df = @time begin
       fh = open("labevents.jls")
       deserialize(fh)
       end;
elapsed time: 22.488775592 seconds (1326551288 bytes allocated, 7.86% gc time)

julia> @time begin
       fh = open("test.jls", "w")
       serialize(fh, df)
       end;
elapsed time: 28.523595344 seconds (748371396 bytes allocated, 35.53% gc time)

# This pull request
julia> @time jldopen("test.jld", "w") do io
       write(io, "test", df)
       end
elapsed time: 18.036020153 seconds (324413684 bytes allocated, 42.59% gc time)

julia> @time jldopen("test.jld", "r") do io
       read(io, "test")
       end;
elapsed time: 15.344303129 seconds (1521609372 bytes allocated, 48.61% gc time)

julia> map(n -> typeof(df[n]), names(df))
9-element Array{DataType,1}:
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{DateTime,1}   
 DataArray{UTF16String,1}
 DataArray{Float64,1}    
 DataArray{UTF16String,1}
 DataArray{UTF16String,1}

julia> nrow(df)
3740682

# Master 
# ...this takes so long it is not really worth comparing (10 + minutes)

another example

julia> using HDF5, JLD

julia> df = @time begin
       fh = open("ioevents.jls", "r")
       deserialize(fh)
       end;
elapsed time: 20.578164919 seconds (1214042696 bytes allocated, 6.45% gc time)

julia> @time begin
       fh = open("test.jls", "w")
       serialize(fh, df)
       end;
elapsed time: 15.59771594 seconds (662881924 bytes allocated)

julia> @time jldopen("test.jld", "w") do io
       write(io, "test", df)
       end
elapsed time: 11.546232497 seconds (308564552 bytes allocated)

julia> df = @time jldopen("test.jld", "r") do io
       read(io, "test")
       end;
elapsed time: 12.905426358 seconds (1470065908 bytes allocated, 43.48% gc time)

julia> map(n -> typeof(df[n]), names(df))
16-element Array{DataType,1}:
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{DateTime,1}   
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{DateTime,1}   
 DataArray{Int32,1}      
 DataArray{Int32,1}      
 DataArray{Float64,1}    
 DataArray{UTF16String,1}
 DataArray{Float64,1}    
 DataArray{UTF16String,1}
 DataArray{Float64,1}    
 DataArray{UTF16String,1}
 DataArray{UTF16String,1}

julia> nrow(df)
2471191

Use H5Oget_info to get type address instead of getting type name (which
apparently requires a search through all objects in the file). Also use
global buffers in some places instead of allocating on every call.
When we read from a file, we don't want to have to define h5convert!
for the type we're reading, since we won't use it, but it's possible
that we will read a type from a file and then write that type. In that
case, an entry would exist in the jlh5type dict for the type, but we
can't know to define h5convert!. So instead, we define h5convert! on
write.
I decided to do this rather than read the data as a Dict because 1) it
is a bit easier and 2) reading the data as a Dict could be extremely
slow if there is a lot of data to be read.
@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 58b3f55 on sjk/compound_types into * on master*.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling d30f700 on sjk/compound_types into * on master*.

@mbauman
Copy link
Member

mbauman commented Aug 21, 2014

Works in my testing now. Wonderful! 👍

(To be clear, I think treating ByteStrings like they're immutable is just fine. I was just trying a bunch of different code paths. So don't worry about that case.)

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 7dcf9fd on sjk/compound_types into * on master*.

- Use H5Pset_create_intermediate_group to create intermediate groups,
  instead of doing this ourselves.
- Store small datasets in compact format.
- Avoid looking up created datasets to create references to them.

Also implement a few more libhdf5 functions, most of which don't seem
to make a difference to performance.
@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 6714b2e on sjk/compound_types into * on master*.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 376ccd4 on sjk/compound_types into * on master*.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling cb3bb81 on sjk/compound_types into * on master*.

@simonster
Copy link
Member Author

Any last comments before I pull the trigger? 😄

@timholy
Copy link
Member

timholy commented Aug 25, 2014

Only one: go for it!

This is a really awesome advance. I feel like all the packages I use heavily need/are getting massive facelifts.

When we tag it, I think we should bump the minor version. I know I did that recently, but this is a pretty big change.

simonster added a commit that referenced this pull request Aug 25, 2014
WIP: Write Julia types as HDF5 compound types
@simonster simonster merged commit 4468fa4 into master Aug 25, 2014
@simonster simonster deleted the sjk/compound_types branch August 25, 2014 16:20
@timholy
Copy link
Member

timholy commented Aug 25, 2014

Yay! 🍰

@timholy
Copy link
Member

timholy commented Aug 25, 2014

Should we encourage folks to test on master for a little bit, then tag a new version? I'd be happy to send an email to julia-users advertising your work! (And of course you should feel free to do so yourself, it's just that it's easier to brag on someone else's behalf 😄.)

@simonster
Copy link
Member Author

Sounds good. I'm always happy to have someone else brag for me :bowtie:

@jakebolewski
Copy link

👍 awesome work!

@tkelman
Copy link
Contributor

tkelman commented Aug 25, 2014

You asked for testers - Pkg.test("HDF5") causes a segfault and crashes Julia on Win32 https://ci.appveyor.com/project/tkelman/hdf5-jl/build/job/06jf23fjv074321i
and Win64 https://ci.appveyor.com/project/tkelman/hdf5-jl/build/job/jj3u0vnnygn55n69

Note that there's an easy-to-fix 32 bit bug here: tkelman@c5ddb77

@timholy
Copy link
Member

timholy commented Aug 25, 2014

Thanks, @tkelman, this is exactly the kind of feedback we need.

First, do you know whether older versions passed?

Second, with AppVeyor, can you get a remote login? With Travis, this works well (but you have to ask via email). To me it looks like one of the ccalls is barfing, and that would be easier to debug interactively. If that's not possible, then perhaps inserting printlns in places like this would be the best approach.

How do you submit a job to AppVeyor, anyway?

@tkelman
Copy link
Contributor

tkelman commented Aug 25, 2014

Older versions had trouble with mmap'ing on Windows (#89), but IIRC would pass everything else.

I get crashes locally too, the AppVeyor logs are just easier to share than copying a big gist from my terminal. I haven't tried asking if we can get a remote login, but it might be possible. I can open a PR with the appveyor.yml configuration file and instructions for how to turn it on. It runs on a webhook, and you can manually trigger builds through their UI as well.

@timholy
Copy link
Member

timholy commented Aug 25, 2014

That would be great, look forward to the PR. I can probably contact them by Wednesday if the fix isn't relatively straightforward.

@simonster
Copy link
Member Author

The segfault is in gc and only happens on Windows, but I can reproduce it in a VM. My testing indicates that the segfault happens when Julia tries to free an object that was malloc'd by libhdf5. There is some discussion of this problem here that indicates that one solution is to make both Julia and libhdf5 use the same C runtime, and another is to free the memory with H5free_memory, which we could do but would require either copying or attaching a finalizer to every variable-length object coming out of libhdf5. So I guess my questions are 1) is the libhdf5 library we provide for Windows linked against a different runtime than Julia and 2) if so, can we link them against the same runtime? Also cc @ihnorton

@tkelman
Copy link
Contributor

tkelman commented Aug 25, 2014

@timholy PR is #135
@simonster Excellent questions. It looks like the HDF5 library we are downloading is a MSVC one (nota bene the lines Extracting usr\lib\x86\msvcp100.dll Extracting usr\lib\x86\msvcr100.dll), so yes we are using a different runtime than Julia. We can try finding or building our own HDF5 library using MinGW. I don't see one available on WinRPM, but this seems like a library that should be there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants