New format on DSC data #86

Open
gaow opened this issue Jan 12, 2018 · 4 comments

Comments

@gaow
Member

gaow commented Jan 12, 2018

@pcarbo and I have decided to try HDF5 as a replacement for the current default RDS storage format. We will start with R and Python. The basic data types we'd like to support are:

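| HDF5 | R          | Python                    |
|------|------------|---------------------------|
| ?    | character  | str                       |
| ?    | integer    | int, np.int*, np.uint*    |
| ?    | double     | float, np.float*          |
| ?    | vector     | list, np.array            |
| ?    | matrix     | np.matrix                 |
| ?    | array      | np.array, list of lists   |
| ?    | data.frame | pd.DataFrame              |
| ?    | NaN        | np.nan                    |
| ?    | NA         | None                      |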

np for numpy, pd for pandas. Here is a test on Python's end:

import numpy as np
import pandas as pd
data = {'character': 'pcarbo',
        'integer1': 1, 'integer2': np.uint8(1), 
        'double1': 1.0, 'double2': np.float16(1.0), 
        'vector1': [1,2,'gaow'], 'vector2': [1,2,3], 'vector3': np.array([1,2,3]),
        'matrix': np.matrix([[1,2],[3,4]]),
        'array1': np.array([[1,2],[3,4]]), 'array2': [[1,2],[3,4]],
        'dataframe': pd.DataFrame({'A': [1,2], 'B': [3,4]}, index=['row1', 'row2'])
       }
data['recursive'] = data

Here is the outcome in HDF5:

test.h5.zip

I used this API from UChicago:

https://github.com/uchicago-cs/deepdish/blob/master/deepdish/io/hdf5io.py

But it would not be difficult, I presume, to customize.

A particularly difficult case is NULL/NA/NaN in R. In Python there are only None and NaN, no NULL. #25
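To make the Python side of that concrete, here is a minimal sketch of how the two markers differ: None is an object tested by identity, while NaN is a float that does not compare equal to itself. The function name `missing_kind` and its labels are hypothetical, chosen only for illustration.

```python
import math

def missing_kind(x):
    """Label a scalar as 'None', 'NaN', or 'value' (illustrative labels only)."""
    if x is None:                      # None is a singleton, so test by identity
        return "None"
    if isinstance(x, float) and math.isnan(x):
        return "NaN"                   # NaN never equals itself, so use math.isnan
    return "value"

nan = float("nan")
print(missing_kind(None), missing_kind(nan), missing_kind(0.0))
print(nan == nan)  # NaN is not equal to itself
```

Any cross-language format would have to decide how R's three markers map onto these two.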

@pcarbo am I missing any? It would be interesting if you could check that HDF5 file in R. Hopefully that gives us some useful insights into how to make the format consistent across languages.

@pcarbo
Member

pcarbo commented Jan 17, 2018

@gaow For the initial version, I'm going to propose a stripped-down, bare-bones version. Hence the name "barebones data object (BDO)".

Barebones data object:

  1. Only one object is stored in an hdf5 file. The object is a list.
  2. All elements of the list are stored in separate nodes ("groups").
  3. Each list element may be one of: (a) array containing double-precision floating point numbers ("doubles"), (b) array containing character strings, or (c) a list.
  4. Lists within lists are stored hierarchically as subnodes in the hdf5 file.
  5. Each list element may have zero, one or more named attributes. Each of these attributes is an array storing characters or doubles (lists are not allowed).
  6. Missing values (NA in R) are not allowed.
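The rules above can be sketched as a small checker. This is only an illustration under stated assumptions, not part of the proposal: an R list is modelled as a Python dict, an array as a (possibly nested) Python list, NA as None, and the names `check_bdo`, `_flatten` and `_check_array` are hypothetical. Attributes (rule 5) are not covered.

```python
def _flatten(values):
    """Yield the scalar leaves of a possibly nested list."""
    for v in values:
        if isinstance(v, list):
            yield from _flatten(v)
        else:
            yield v

def _check_array(values, path):
    """An array must hold only doubles or only strings, with no missing values."""
    leaves = list(_flatten(values))
    if any(v is None for v in leaves):           # None stands in for R's NA here
        return [f"{path}: missing values are not allowed"]
    if all(isinstance(v, str) for v in leaves):
        return []
    if all(isinstance(v, float) for v in leaves):
        return []
    return [f"{path}: array must hold only doubles or only strings"]

def check_bdo(obj, path="/"):
    """Return a list of problems; an empty list means obj fits the rules."""
    if not isinstance(obj, dict):
        return [f"{path}: the object must be a list (a dict in this sketch)"]
    problems = []
    for name, value in obj.items():
        child = f"{path}{name}"
        if isinstance(value, dict):              # lists within lists become subnodes
            problems += check_bdo(value, child + "/")
        elif isinstance(value, list):
            problems += _check_array(value, child)
        else:
            problems.append(f"{child}: not an array or a list")
    return problems

good = {"x": [1.0, 2.0], "names": ["a", "b"], "sub": {"m": [[1.0, 2.0], [3.0, 4.0]]}}
print(check_bdo(good))
```

Note that Python ints are rejected by this sketch, which mirrors the choice to store numbers as doubles only.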

All the data types you proposed above can be represented in this format, although it will take some extra steps to convert to the desired representation; e.g., to convert from a list of vectors to a data frame in R.

Note that I avoided integers since most integers can be represented as doubles, and there are inconsistencies in the way that integers are implemented in R and Python, which will cause trouble.
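As a rough illustration of both halves of that point: R's integer type is 32-bit signed, Python's int is arbitrary precision, and a double represents every integer exactly up to 2**53 in magnitude. The helper names below are hypothetical.

```python
R_INT_MAX = 2**31 - 1      # R integers are 32-bit signed; the lowest value is reserved for NA
DOUBLE_EXACT_MAX = 2**53   # every integer with |n| <= 2**53 is exactly representable as a double

def fits_r_integer(n):
    """True if the Python int n is within R's integer range."""
    return -R_INT_MAX <= n <= R_INT_MAX

def is_exact_as_double(n):
    """True if converting the Python int n to a double loses nothing."""
    try:
        return float(n) == n   # Python compares float and int exactly
    except OverflowError:      # n is too large for any double
        return False

for n in (R_INT_MAX, 2**31, DOUBLE_EXACT_MAX, DOUBLE_EXACT_MAX + 1):
    print(n, fits_r_integer(n), is_exact_as_double(n))
```

So every value that fits in an R integer survives storage as a double, while very large Python ints may not.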

We will use h5py in Python and the hdf5r package in R to read/write BDOs to hdf5 files.
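A minimal sketch of what the Python side might look like with h5py, under several assumptions that are not from this thread: a list is a dict stored as an HDF5 group, an array is stored as a dataset inside that group, and the function names `write_bdo` and `read_bdo` are hypothetical. It needs h5py 2.10 or later for `h5py.string_dtype()`.

```python
import h5py
import numpy as np

def write_bdo(obj, group):
    """Write a nested dict of arrays into an open h5py group (or file)."""
    for name, value in obj.items():
        if isinstance(value, dict):
            # Lists within lists become subgroups.
            write_bdo(value, group.create_group(name))
            continue
        arr = np.asarray(value)
        if arr.dtype.kind in "US":
            # Character arrays: store as variable-length strings.
            group.create_dataset(name, data=arr.astype(object),
                                 dtype=h5py.string_dtype())
        else:
            # Everything numeric is stored as doubles.
            group.create_dataset(name, data=arr.astype("float64"))

def read_bdo(group):
    """Read a group back into a nested dict of NumPy arrays."""
    out = {}
    for name, item in group.items():
        if isinstance(item, h5py.Group):
            out[name] = read_bdo(item)
        else:
            # Strings may come back as bytes, depending on the h5py version.
            out[name] = item[()]
    return out

# In-memory file, so nothing is written to disk in this example.
with h5py.File("bdo_example.h5", "w", driver="core", backing_store=False) as f:
    write_bdo({"x": [1.0, 2.0], "names": ["a", "b"],
               "sub": {"m": [[1, 2], [3, 4]]}}, f)
    print(read_bdo(f))
```

Named attributes could be attached through the `.attrs` mapping of each dataset or group; that part is left out here.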

See here for reference on basic data types in R. See here for reference on the hdf5r package.

@gaow
Member Author

gaow commented Jan 17, 2018

Great thanks @pcarbo for the outline. I mostly agree with what you have suggested. Here are a few issues, though:

  1. Why are we leaving out matrix and data.frame? Or is that only for now?
  2. In R, is it important to distinguish between int and double?

Since R will potentially have more restrictions than Python, it may be a good idea to have R-based I/O functions and results first; then I'll try to make them Python-compatible.

Looking forward!

@pcarbo
Member

pcarbo commented Jan 17, 2018

> Why are we leaving out matrix and data.frame?

A matrix is a 2-d array.

A data frame is just a list of vectors (with some extra attributes like rownames).
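As a sketch of that second point, here is a small table stored as a list of double vectors plus a row-names attribute, and then rebuilt row by row. The layout (`"vectors"` and `"attrs"` keys) and the function names are assumptions made for illustration only.

```python
def frame_to_list(columns, rownames):
    """Turn column data into a list of double vectors plus a 'rownames' attribute."""
    n = len(rownames)
    for name, col in columns.items():
        if len(col) != n:
            raise ValueError(f"column {name!r} has {len(col)} values, expected {n}")
    return {
        "vectors": {name: [float(v) for v in col] for name, col in columns.items()},
        "attrs": {"rownames": list(rownames)},
    }

def list_to_rows(stored):
    """Rebuild {rowname: {column: value}} from the stored form."""
    names = stored["attrs"]["rownames"]
    return {
        row: {col: vec[i] for col, vec in stored["vectors"].items()}
        for i, row in enumerate(names)
    }

stored = frame_to_list({"A": [1, 2], "B": [3, 4]}, ["row1", "row2"])
rows = list_to_rows(stored)
print(stored)
print(rows)
```

A matrix needs no such step, since a nested list of equal-length rows is already a 2-d array.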

> In R, is it important to distinguish between int and double?

It is important, but not essential; integers are represented differently in R and Python, so it seemed like a major headache to deal with this data type. See for example here and here for some complexities.

@gaow
Member Author

gaow commented Apr 27, 2018

Related to this issue is support for multiple explicit file outputs per module. If we can get that to work, we'll be able to load files directly, although users will have to provide the means to load the data in each language.

@gaow gaow closed this as completed Apr 27, 2018
@gaow gaow mentioned this issue Apr 27, 2018
@gaow gaow reopened this Nov 21, 2019
@gaow gaow modified the milestones: 0.3.x, 2.x Nov 21, 2019