HDF5 support #64
Comments
I checked in preliminary support for a new AbstractDataFrame stored in an HDF5 file. It's in the hdf5 branch:
HDF5 is an interesting format. It has tons of features and potential. It's also huge, overwhelmingly so. It could be a good way to offer on-disk storage (issue #25) with chunking, compression, and indexing. In the process of adding this, I also tried to make more of the DataFrames into AbstractDataFrames in dataframe.jl. That increased the warning count quite a bit. That is an annoying problem. |
Hm, interesting. I don't have much to say about this, as I've never had a If it's a vector-of-structs format, as you mention in the code, it doesn't On Wed, Oct 3, 2012 at 9:53 PM, Tom Short notifications@github.com wrote:
|
Vector-of-structs is just one option. The main option I’m trying uses separate columns, and each column can be compressed and chunked. Indexing is not built in, but it looks like fast indexing is on the horizon. Pytables has some interesting features for memory-mapped storage, and it uses HDF5: How to support DataVecs and PooledDataVecs is another issue. It can be done, it's just a question of how hard it is to implement. |
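To make the column-wise layout concrete, here is a minimal sketch of columns stored separately, each one chunked and compressed. It is only an illustration in plain Python with `zlib`; HDF5 does its own chunking and filtering internally, and the names `write_column`, `read_row`, and `CHUNK_ROWS`, along with the sample values, are made up for this example.

```python
import struct
import zlib

CHUNK_ROWS = 4  # hypothetical chunk size, kept tiny so the example is easy to follow


def write_column(values, chunk_rows=CHUNK_ROWS):
    """Split a list of floats into fixed-size chunks and compress each chunk on its own."""
    chunks = []
    for start in range(0, len(values), chunk_rows):
        block = values[start:start + chunk_rows]
        raw = struct.pack(f"<{len(block)}d", *block)  # little-endian doubles
        chunks.append(zlib.compress(raw))
    return chunks


def read_row(chunks, row, chunk_rows=CHUNK_ROWS):
    """Fetch one value by decompressing only the chunk that contains it."""
    chunk_index, offset = divmod(row, chunk_rows)
    raw = zlib.decompress(chunks[chunk_index])
    return struct.unpack_from("<d", raw, offset * 8)[0]


# A "data frame" here is just a dict of independently stored columns.
frame = {
    "x": write_column([410.0, 425.5, 398.0, 440.0, 452.5, 431.0]),
    "y": write_column([402.0, 418.0, 389.5, 447.0, 460.0, 428.5]),
}

print(read_row(frame["x"], 4))
```

Reading row 4 of `x` touches only the second chunk of that one column, which is the property that makes this layout attractive for on-disk data frames.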
I think this is well worthwhile examining. The HDF5Compound type is very similar to an AbstractDataFrame. The "rhdf5" package for R provides the h5save() function which might be a good way to pass data frames back and forth. Right now the HDF5 package for Julia is a bit mystified by HDF5Compound types, I think (@timholy, is that correct?) but it is a natural structure in that it has names and offsets into a table. It should be possible to index into the gargantuan array of characters and convert to the desired types. The example of importance to me is an R data frame with a couple of million records.
$ h5ls -r -v wkce1.h5
Opened "wkce1.h5" with sec2 driver.
/ Group
Location: 1:96
Links: 1
/wkce1 Dataset {2606541/2606541}
Location: 1:800
Links: 1
Storage: 260654100 logical bytes, 260654100 allocated bytes, 100.00% utilization
Type: struct {
"stuid" +0 native int
"year" +4 native int
"gr" +8 native int
"age" +12 native int
"sex" +16 native int
"race" +20 native int
"econ" +24 native int
"disab" +28 native int
"ell" +32 native int
"school" +36 native int
"dist" +40 native int
"distFay" +44 native int
"schFay" +48 native int
"readRS" +52 native double
"readSS" +60 native double
"mathRS" +68 native double
"mathSS" +76 native double
"Rproflvl" +84 native int
"Mproflvl" +88 native int
"Rprof" +92 native int
"Mprof" +96 native int
} 100 bytes |
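As a rough sketch of "indexing into the array of characters", the listing above can be read as 13 ints, 4 doubles, and 4 ints packed into 100 bytes. The code below uses Python's `struct` module on a made-up two-record buffer. Little-endian byte order and the absence of padding are assumptions about the file, the helper names are hypothetical, and the record values are invented for illustration.

```python
import struct

# Field order and types follow the h5ls listing: 13 ints, 4 doubles, 4 ints.
FIELDS = ["stuid", "year", "gr", "age", "sex", "race", "econ", "disab", "ell",
          "school", "dist", "distFay", "schFay", "readRS", "readSS", "mathRS",
          "mathSS", "Rproflvl", "Mproflvl", "Rprof", "Mprof"]
TYPES = ["i"] * 13 + ["d"] * 4 + ["i"] * 4

# "<" means little-endian with no padding (an assumption about the file).
RECORD = struct.Struct("<13i4d4i")

# Byte offset of each field within a record, computed from the field sizes.
OFFSETS = {}
pos = 0
for name, code in zip(FIELDS, TYPES):
    OFFSETS[name] = (pos, code)
    pos += struct.calcsize("<" + code)


def records_to_columns(buf):
    """Convert a packed vector-of-structs buffer into one list per column."""
    columns = {name: [] for name in FIELDS}
    for row in RECORD.iter_unpack(buf):
        for name, value in zip(FIELDS, row):
            columns[name].append(value)
    return columns


def read_field(buf, row, name):
    """Read a single field of a single record straight from its byte offset."""
    offset, code = OFFSETS[name]
    return struct.unpack_from("<" + code, buf, row * RECORD.size + offset)[0]


# Two invented records, packed the same way the dataset is laid out.
rec0 = (1001, 2010, 4, 9, 1, 2, 0, 0, 0, 17, 3, 3, 17, 30.0, 410.5, 28.0, 402.0, 2, 3, 1, 1)
rec1 = (1002, 2010, 5, 10, 2, 1, 1, 0, 0, 17, 3, 3, 17, 35.0, 431.0, 33.0, 440.5, 3, 3, 1, 1)
buf = RECORD.pack(*rec0) + RECORD.pack(*rec1)

cols = records_to_columns(buf)
print(RECORD.size, OFFSETS["readRS"][0], cols["readSS"], read_field(buf, 1, "mathSS"))
```

The computed offsets line up with the `+52`, `+84`, and `+96` entries in the listing, so either access pattern (whole columns or single fields) needs only the names, types, and offsets that the compound type already records.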
Yes, Julia's HDF5 support for compound types is limited. HDF5 compound data types are basically the equivalent of C structs. My plan was to largely avoid them until Julia's support for C structs is better. I was forced to add a bit of it to get support for Matlab's complex numbers (something that clearly needed to happen sooner rather than later), but otherwise it's very very rough. JuliaLang/julia#1831 suggests it might not be long now. But if you want to store columns directly, then I think everything is in place. The array support is good, and should be completely general. Presumably you can just save a DataFrame directly with JLD? I can't test now because of the strpack breakage, but hopefully soon. |
On Wed, Jan 9, 2013 at 6:33 PM, Tim Holy notifications@github.com wrote:
Last I checked, you cannot save a DataFrame directly with JLD. The write I'm also interested in an on-disk HDF5 that is an AbstractDataFrame. Tim's |
Tom may have already looked at this but two of the issues for storing DataFrames and their components are allowing for NA's and storing the levels of a PooledDataVector. At the very least the hierarchical nature of HDF5 allows for a PooledDataVector to be an HDF5 group containing both the indices and the levels. If you want to include the NA BitVector you could use a hierarchical representation of DataVector's too. I am trying to remember why NA's aren't stored as a particular NaN value for Float64 and Float32 vectors, as in R (well, R doesn't have a native Float32 but it does use an NaN pattern for NA's in numeric vectors). R also uses a special pattern for integers and for logical vectors (and, I think, for character strings too) but those are implemented by having the low-level code check for them. Would it be reasonable to try to leverage the 32-bit float NaN's for NA's in integer formats? For PooledDataVector's an index outside the allowable range (e.g. 0) could be used to signal NA. I imagine suggestions like this have been considered and rejected so if you can just point me to a discussion I can find out why they won't work easily. :-) |
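The two ideas in this comment can be sketched in a few lines. This is an illustration in plain Python, not a proposal for the actual implementation. The NaN bit pattern is the one R uses for numeric NA as far as I know (a NaN whose low 32 bits are 1954); the function names and the dict standing in for an HDF5 group are hypothetical.

```python
import struct

# Bit pattern R uses for NA in numeric vectors: a NaN whose low 32 bits are 1954.
R_NA_BITS = 0x7FF00000000007A2


def float_bits(x):
    """Return the 64-bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def is_nan_bits(bits):
    """True for any IEEE 754 double NaN, given its 64-bit pattern."""
    exponent_all_ones = ((bits >> 52) & 0x7FF) == 0x7FF
    fraction_nonzero = (bits & ((1 << 52) - 1)) != 0
    return exponent_all_ones and fraction_nonzero


def is_na_bits(bits):
    """True only for NaNs carrying the NA payload in the low 32 bits."""
    return is_nan_bits(bits) and (bits & 0xFFFFFFFF) == 1954


def pool(values):
    """Encode values as 1-based indices into a level table; None becomes index 0."""
    levels = sorted({v for v in values if v is not None})
    lookup = {level: i + 1 for i, level in enumerate(levels)}
    refs = [0 if v is None else lookup[v] for v in values]
    # The dict mirrors a group that holds two datasets: the indices and the levels.
    return {"refs": refs, "levels": levels}


def unpool(group):
    """Decode a pooled column, mapping the out-of-range index 0 back to None."""
    return [None if r == 0 else group["levels"][r - 1] for r in group["refs"]]


print(is_na_bits(R_NA_BITS), is_na_bits(float_bits(float("nan"))))
print(pool(["b", "a", None, "b"]))
```

The check works on raw bit patterns rather than on floats because an ordinary NaN and the NA pattern compare the same way in floating-point arithmetic; only the payload tells them apart.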
Issues #22 and #45 have background and discussion on NA representation. I've been an advocate of allowing standard Arrays in DataFrames and also of using bit patterns to indicate NA's. It's less of an issue now that John has stepped up and made DataVectors usable, but I think this is still a viable alternative that would be useful in some instances. As far as an HDF5 AbstractDataFrame, NA's could be handled in several ways:
|
Once we have a more stable and feature complete implementation of DataArray's and DataFrames, I'd be open to reconsidering this issue. My gut feeling continues to be that it adds a lot of complexity to the system. After we have enough tests to keep us in check while we experiment with alternative backends, I'd be open to trying to see what is needed to make DataArray's faster. |
Is there a quick and dirty solution here right now for DataFrames that contain only simple data types such as strings and integers? Is it possible to convert a dataframe into |
Saving and loading of DataFrames should work with HDF5 as is. Tim added support a few months ago, but I haven't tried it for a while. Longer term, options for better support of HDF5 include:
|
Thanks @tshort. I will try this out, since I am already getting tired of doing |
I think this can be closed; it's been possible to save DataFrames using HDF5/JLD for quite some time. |
Consolidating the constructors minimized the number of places where auto promotion could take place. The new constructor recycles scalars such that if DataTable is created with a mix of scalars and vectors the scalars will be recycled to the same length as the vectors. Fixes an outstanding bug where scalar recycling only worked if the scalar assignments came after the vector assignments of the desired length, see #882. Tests that used to assume NullableArray promotion now explicitly use NullableArrays and new constructor tests have been added to test changes.
I have some code to read HDF5 tables into Julia efficiently (using HDF5's built-in memory layout conversion), and I'm willing to contribute some more code to turn that into a DataFrame. In which package should this go? |
To clarify -- The existing support for compound types in HDF5.jl is inefficient (type instability, redundant storage of field types and names with each row, and unnecessary data copying/conversion), making it unsuitable for large datasets. But I'm a bit wary of modifying HDF5.jl to return a DataFrame, because this would break existing code, and there may be some tricky cases with tables that contain arrays and strings (everything needs to be |
I think HDF5-DataFrames interop code should live either in HDF5.jl or in a special package. If it lives in HDF5.jl, it doesn't necessarily mean that a |
Thanks for the pointers. I've made a quick & dirty package here: https://github.com/damiendr/HDFTables.jl |
* drop Julia 0.7, add Julia 1.2 and 1.3 to CI * require DataFrames 0.19
This may be a nice way to exchange data with R and pandas.
See Tim Holy's work based on code by Konrad Hinsen:
https://github.com/timholy/julia_hdf5
An on-disk Hdf5DataFrame might be nice, too.