New format on DSC data #86
@gaow For the initial version, I'm going to propose a stripped-down, bare-bones version. Hence the name "barebones data object (BDO)". Barebones data object:
All the data types you proposed above can be represented in this format, although it will take some extra steps to convert to the desired representation; e.g. to convert from a list of vectors to a data frame in R. Note I avoided integers since most integers can be represented as doubles, and there are inconsistencies in the way that integers are implemented in R and Python which will cause trouble. We will use See here for reference on basic data types in R. See here for reference on the
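To illustrate the "list of vectors to data frame" conversion mentioned above, here is a minimal Python sketch. It assumes the barebones object arrives as a dict of equal-length vectors of doubles; the function name `columns_to_rows` and the sample data are hypothetical and not part of DSC.

```python
from typing import Dict, List


def columns_to_rows(columns: Dict[str, List[float]]) -> List[Dict[str, float]]:
    """Turn a dict of equal-length vectors (columns) into a list of row records."""
    lengths = {len(vec) for vec in columns.values()}
    if len(lengths) > 1:
        # A table needs every column to have the same number of entries.
        raise ValueError("all vectors must have the same length to form a table")
    n_rows = lengths.pop() if lengths else 0
    return [{name: vec[i] for name, vec in columns.items()} for i in range(n_rows)]


# Hypothetical barebones object: named vectors of doubles only.
bdo = {"x": [1.0, 2.0, 3.0], "y": [0.5, 0.25, 0.125]}
rows = columns_to_rows(bdo)
```

The length check is the "extra step" in this sketch: the stored object is only a collection of vectors, so the table structure has to be validated when it is rebuilt.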
Great, thanks @pcarbo for the outline. I mostly agree with what you have suggested. Here are a few issues, though:
Since R will potentially have more restrictions than Python, it may be a good idea to have R-based I/O functions and results first; then I'll try to make them Python compatible. Looking forward!
A matrix is a 2-d array. A data frame is just a list of vectors (with some extra attributes like
It is important, but not essential; integers are represented differently in R and Python, so it seemed like a major headache to deal with this data type. See for example here and here for some complexities.
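As a rough illustration of why doubles can stand in for most integers: R's integer type is 32-bit, Python's int has arbitrary precision, and a 64-bit double represents every integer with magnitude up to 2^53 exactly. The sketch below uses only the Python standard library; the helper name `int_fits_in_double` is hypothetical.

```python
def int_fits_in_double(n: int) -> bool:
    """Return True if n survives a round trip through a 64-bit float unchanged."""
    try:
        return int(float(n)) == n
    except OverflowError:
        # Python ints are unbounded; very large ones cannot be converted to float at all.
        return False


# Every 32-bit integer (R's integer range) round-trips exactly.
r_int_max = 2 ** 31 - 1
small_ok = int_fits_in_double(r_int_max)

# Exactness is guaranteed up to 2**53; the next odd integer is not representable.
edge_ok = int_fits_in_double(2 ** 53)
edge_bad = int_fits_in_double(2 ** 53 + 1)
```

Under this assumption, storing integers as doubles loses nothing for values that fit in R's integer type; only very large Python integers would need separate handling.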
Related to this issue is support for multiple explicit file outputs per module. If we can get that to work, we'll be able to load files directly, although users will have to provide the means to load data for different languages.
@pcarbo and I have decided to give HDF5 a stab as a replacement for the current default RDS storage format. We start with R and Python. The basic data types we'd like to support are:
Here np stands for numpy and pd for pandas.
Here is a test on Python's end:
Here is the outcome in HDF5:
test.h5.zip
I used this API from UChicago:
https://github.com/uchicago-cs/deepdish/blob/master/deepdish/io/hdf5io.py
But it would not be difficult, I presume, to customize.
A particularly difficult case is NULL/NA/NaN in R. In Python there are only None and NaN, no NULL. #25
@pcarbo am I missing any? It would be interesting if you could check that HDF5 file in R. Hopefully that gives us some useful insights into how we make the cross-language format consistent.
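To make the None versus NaN distinction on the Python side concrete, here is a small standard-library sketch. The tag names ("null", "nan", "value") and the function `missing_kind` are hypothetical; how R's NA should be tagged is not settled in this thread, so it is deliberately left out.

```python
import math
from typing import Any


def missing_kind(value: Any) -> str:
    """Classify a Python value as 'null' (None), 'nan' (float NaN), or 'value'."""
    if value is None:
        return "null"
    if isinstance(value, float) and math.isnan(value):
        return "nan"
    return "value"


# NaN never compares equal to itself, so it must be detected with math.isnan.
nan = float("nan")
nan_equals_itself = (nan == nan)

kinds = [missing_kind(v) for v in (None, nan, 0.0, "NA")]
```

A cross-language format would probably need an explicit tag like this stored alongside the data, since a plain Python float cannot carry the difference between R's NA and NaN.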