New format on DSC data #86

gaow · 2018-01-12T18:56:03Z

@pcarbo and I have decided to give HDF5 a try as a replacement for the current default RDS storage format. We will start with R and Python. The basic data types we'd like to support are:

HDF5	R	Python
?	character	str
?	integer	int, np.int, np.uint
?	double	float, np.float*
?	vector	list, np.array
?	matrix	np.matrix
?	array	np.array, list of lists
?	data.frame	pd.DataFrame
?	NaN	np.nan
?	NA	None

np for numpy, pd for pandas. Here is a test on Python's end:

import numpy as np
import pandas as pd
data = {'character': 'pcarbo',
        'integer1': 1, 'integer2': np.uint8(1), 
        'double1': 1.0, 'double2': np.float16(1.0), 
        'vector1': [1,2,'gaow'], 'vector2': [1,2,3], 'vector3': np.array([1,2,3]),
        'matrix': np.matrix([[1,2],[3,4]]),
        'array1': np.array([[1,2],[3,4]]), 'array2': [[1,2],[3,4]],
        'dataframe': pd.DataFrame({'A': [1,2], 'B': [3,4]}, index=['row1', 'row2'])
       }
data['recursive'] = data

Here is the outcome in HDF5:

test.h5.zip

I used this API from UChicago:

https://github.com/uchicago-cs/deepdish/blob/master/deepdish/io/hdf5io.py

But it would not be difficult, I presume, to customize.

A particularly difficult case is NULL/NA/NaN in R. Python has only None and NaN, with no equivalent of NULL. #25
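To make the Python side of that mismatch concrete, here is a minimal sketch. `classify_missing` is a hypothetical helper written for this comment, not part of DSC or deepdish. It only shows that Python offers two distinguishable "missing" markers, so R's three (NULL, NA, NaN) cannot all map one-to-one.

```python
import math

def classify_missing(value):
    """Label a scalar as 'None', 'NaN', or 'value' (hypothetical helper)."""
    if value is None:
        return "None"
    # NaN is a float that compares unequal to itself; math.isnan tests it safely.
    if isinstance(value, float) and math.isnan(value):
        return "NaN"
    return "value"

# Two kinds of "missing" are all Python can tell apart here.
print([classify_missing(v) for v in [None, float("nan"), 1.0, "gaow"]])
```

Whatever the HDF5 encoding ends up being, it would need an explicit rule for which R value each of these two Python markers corresponds to.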

@pcarbo am I missing any? It would be interesting if you could check that HDF5 file in R. Hopefully that gives us some useful insights into how to make the format consistent across languages.


pcarbo · 2018-01-17T20:22:51Z

@gaow For the initial version, I'm going to propose a stripped-down, bare-bones version. Hence the name "barebones data object (BDO)".

Barebones data object:

Only one object is stored in an hdf5 file. The object is a list.
All elements of the list are stored in separate nodes ("groups").
Each list element may be one of: (a) array containing double-precision floating point numbers ("doubles"), (b) array containing character strings, or (c) a list.
Lists within lists are stored hierarchically as subnodes in the hdf5 file.
Each list element may have zero, one or more named attributes. Each of these attributes is an array storing characters or doubles (lists are not allowed).
Missing values (NA in R) are not allowed.
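The rules above can be sketched as a small validator. This is only an illustration under stated assumptions: a BDO "list" is modelled as a Python dict (name to element), an "array" as a flat Python list, attributes are not checked, and NaN is allowed while None is treated as the missing value. `check_bdo` and `check_array` are hypothetical names, not an existing API.

```python
def check_array(values, where):
    """An array must hold only doubles or only strings, with no missing values."""
    if any(v is None for v in values):
        raise ValueError(f"{where}: missing values are not allowed")
    if all(isinstance(v, float) for v in values):
        return
    if all(isinstance(v, str) for v in values):
        return
    raise ValueError(f"{where}: array must hold only doubles or only strings")

def check_bdo(obj, path="/"):
    """Raise ValueError if obj breaks the rules; return None otherwise."""
    if not isinstance(obj, dict):
        raise ValueError(f"{path}: expected a list (modelled as a dict here)")
    for name, elem in obj.items():
        where = f"{path}{name}"
        if isinstance(elem, dict):
            # Lists within lists are checked hierarchically, like subnodes.
            check_bdo(elem, where + "/")
        elif isinstance(elem, list):
            check_array(elem, where)
        else:
            raise ValueError(f"{where}: element must be an array or a list")

# Passes silently: doubles, strings, and a nested list.
check_bdo({"x": [1.0, 2.0], "labels": ["a", "b"], "nested": {"y": [3.0]}})
```

Under these assumptions a mixed vector such as `[1.0, 2.0, 'gaow']` is rejected, and integers have to be converted to doubles before they are stored.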

All the data types you proposed above can be represented in this format, although it will take some extra steps to convert to the desired representation; e.g. to convert from a list of vectors to a data frame in R.
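As an example of those extra steps, here is a rough sketch of splitting a small table into column vectors plus a row-name attribute, and putting it back together. The layout (a `columns` dict and an `attrs` dict) and the function names are assumptions made for illustration, not a proposed file layout.

```python
def frame_to_bdo(rows, rownames):
    """Split row records (a list of dicts) into column vectors plus row names."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            # Store every number as a double, as the BDO rules require.
            columns.setdefault(key, []).append(float(value))
    return {"columns": columns, "attrs": {"rownames": list(rownames)}}

def bdo_to_rows(bdo):
    """Rebuild (rowname, record) pairs from the column vectors."""
    names = list(bdo["columns"])
    vectors = [bdo["columns"][n] for n in names]
    return [(rowname, dict(zip(names, values)))
            for rowname, values in zip(bdo["attrs"]["rownames"], zip(*vectors))]

# The same 2x2 table as the 'dataframe' entry in the test data above.
bdo = frame_to_bdo([{"A": 1, "B": 3}, {"A": 2, "B": 4}], ["row1", "row2"])
print(bdo["columns"])
print(bdo_to_rows(bdo))
```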

Note I avoided integers since most integers can be represented as doubles, and there are inconsistencies in the way that integers are implemented in R and Python which will cause trouble.
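A quick way to see what "most integers" means here: R's integer type is 32-bit (with the lowest value reserved for NA), Python's int has arbitrary precision, and a double holds every integer exactly only up to 2**53 in magnitude. The helper below is a hypothetical sketch that checks whether a given Python int survives the trip through a double.

```python
def exact_as_double(n):
    """True if integer n survives a round trip through a 64-bit float."""
    try:
        return int(float(n)) == n
    except OverflowError:
        # n is too large for any finite double.
        return False

R_INT_MAX = 2**31 - 1  # largest R integer; R reserves -2**31 for NA_integer_

print(exact_as_double(R_INT_MAX))   # True
print(exact_as_double(2**53))       # True
print(exact_as_double(2**53 + 1))   # False
print(exact_as_double(10**400))     # False
```

So every R integer fits in a double without loss, while large Python ints may not.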

We will use h5py in Python and the hdf5r package in R to read/write BDOs to hdf5 files.

See here for reference on basic data types in R. See here for reference on the hdf5r package.

gaow · 2018-01-17T20:54:25Z

Great thanks @pcarbo for the outline. I mostly agree with what you have suggested. Here are a few issues, though:

Why are we leaving out matrix and data.frame? Or is that only for now?
In R, is it important to distinguish between int and double?

Since R will potentially have more restrictions than Python, it may be a good idea to have the R-based I/O functions and results first; then I'll try to make them Python compatible.

Looking forward!

pcarbo · 2018-01-17T21:04:41Z

Why are we leaving out matrix and data.frame?

A matrix is a 2-d array.

A data frame is just a list of vectors (with some extra attributes like rownames).
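One detail that may matter for the matrix case (this is a point to check, not something settled in this thread): R stores matrices in column-major order, while NumPy arrays default to row-major. The sketch below uses hypothetical helper names and plain Python lists to show a matrix flattened in R's order together with its dimensions, and then rebuilt.

```python
def to_column_major(matrix):
    """Flatten a row-wise nested list into (values, dim) in column-major order."""
    nrow, ncol = len(matrix), len(matrix[0])
    values = [float(matrix[i][j]) for j in range(ncol) for i in range(nrow)]
    return values, (nrow, ncol)

def from_column_major(values, dim):
    """Rebuild the row-wise nested list from column-major values and dim."""
    nrow, ncol = dim
    return [[values[j * nrow + i] for j in range(ncol)] for i in range(nrow)]

# The same 2x2 matrix as in the test data at the top of this issue.
values, dim = to_column_major([[1, 2], [3, 4]])
print(values, dim)                     # [1.0, 3.0, 2.0, 4.0] (2, 2)
print(from_column_major(values, dim))  # [[1.0, 2.0], [3.0, 4.0]]
```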

In R, is it important to distinguish between int and double?

It is important, but not essential; integers are represented differently in R and Python, so it seemed like a major headache to deal with this data type. See for example here and here for some complexities.

gaow · 2018-04-27T03:28:44Z

Related to this issue is support for multiple explicit file outputs per module. If we can get that to work, we'll be able to load files directly, although users will have to provide the means to load data in different languages.

gaow mentioned this issue Jan 25, 2018

Companion R package #87

Closed

gaow added this to the 0.3.x milestone Feb 22, 2018

gaow added discussion enhancement help wanted later labels Mar 30, 2018

gaow closed this as completed Apr 27, 2018

gaow mentioned this issue Apr 27, 2018

MATLAB support #30

Closed

gaow reopened this Nov 21, 2019

gaow modified the milestones: 0.3.x, 2.x Nov 21, 2019

gaow added data-model and removed help wanted labels Nov 26, 2019
