Add catalog dataset for HDF formats not supported by pandas #240
Comments
Hi @tomaz-suller @Eric-OG, thanks for the issue - this is an interesting one! Very exciting that Kedro can be bestowed the honour of being your first open source contribution! We have contributing guidelines here and here. We can help you on this journey, and the tutorial for defining / extending your own dataset can be found here. As a rule we try not to alter the underlying API too much; that's why pandas' limitation here becomes our limitation. In this case I think I agree it makes sense to extend things so that your way of working is supported. Shall we use this issue to discuss a proposed implementation before writing too much code? It would help if you could post a pseudocode / outline version of how you would extend pandas.HDFDataSet, or simply define a new low-level dataset.
Hello @datajoely, our main reason for wanting an interface that doesn't use pandas is to avoid the pandas HDF5 interface entirely, since the data we are working with couldn't be imported with pandas' HDFStore. Therefore, we intend to implement an interface that reads an HDF5 file with h5py and returns it as an h5py File object. One instance of a dataset where we would apply this would be a simple .h5 file with a single dataset (a single key) containing a plain array as data. The outline of our idea is the following:
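A minimal sketch of that outline might look like this (the class name `H5pyDataSet` and the direct use of `kedro.io.AbstractDataSet` are illustrative assumptions, not a final implementation):

```python
from pathlib import PurePosixPath
from typing import Any, Dict

import h5py
from kedro.io import AbstractDataSet


class H5pyDataSet(AbstractDataSet):
    """Load an HDF5 file with h5py and hand the ``h5py.File`` object to the node."""

    def __init__(self, filepath: str):
        self._filepath = PurePosixPath(filepath)

    def _load(self) -> h5py.File:
        # Return the File object itself, so any group/dataset inside it stays accessible
        return h5py.File(self._filepath, mode="r")

    def _save(self, data: h5py.File) -> None:
        # Copy every top-level object of the in-memory file into the target file
        with h5py.File(self._filepath, mode="w") as out:
            for key in data.keys():
                data.copy(data[key], out)

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._filepath)}
```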
By using the File object, it would be possible to access data from an h5 file with any structure. Our main doubt is whether this can be done given the way Kedro uses fsspec: the files are read and written as binary, and we don't know whether that is compatible with h5py's interface. Is it possible to read and write directly with h5py (skipping the binary steps) in Kedro?
So, we figured out a way to read from binary to create an h5py.File object. The basic idea is to get the binary data from fsspec and open it as an in-memory HDF5 file image using h5py's low-level API:

```python
import tempfile

import h5py


def h5py_from_binary(binary):
    file_access_property_list = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
    file_access_property_list.set_fapl_core(backing_store=False)
    file_access_property_list.set_file_image(binary)
    file_id_args = {
        'fapl': file_access_property_list,
        'flags': h5py.h5f.ACC_RDONLY,
        'name': next(tempfile._get_candidate_names()).encode(),
    }
    h5_file_args = {
        'backing_store': False,
        'driver': 'core',
        'mode': 'r'
    }
    file_id = h5py.h5f.open(**file_id_args)
    return h5py.File(file_id, **h5_file_args)
```

We tested it on Google Colab and it worked, but there is one problem with it still: we don't quite understand what is going on there - seemingly a temp file is created with the contents of the binary and is passed to the low-level API.

Writing the file is less obscure: we write the h5py.File into an in-memory buffer and return its bytes:

```python
from io import BytesIO


def h5py_to_binary(h5f):
    bio = BytesIO()
    with h5py.File(bio, 'w') as f:
        for key in h5f.keys():
            f.create_dataset_like(key, h5f[key])
        f.close()
    return bio.getvalue()
```

These solutions cover both the reading and the writing side. You @datajoely mentioned it might be best not to write a ton of code before discussing what would be best, so tomorrow we'll try and get familiar with the Kedro build process and then implement the dataset.
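As a rough sketch of where these helpers might end up, assuming the dataset keeps an fsspec filesystem in `self._fs` and the resolved path in `self._filepath` (both names are illustrative), the load/save methods could look like:

```python
def _load(self):
    # Read the raw bytes through fsspec, then reopen them as an in-memory HDF5 file
    with self._fs.open(self._filepath, mode="rb") as fs_file:
        binary = fs_file.read()
    return h5py_from_binary(binary)


def _save(self, data):
    # Serialise the h5py.File to bytes and write them back through fsspec
    binary = h5py_to_binary(data)
    with self._fs.open(self._filepath, mode="wb") as fs_file:
        fs_file.write(binary)
```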
I've also just realised the name of the file in the obscure low-level loading code doesn't have to be a randomly generated temporary name - we can pass the real filename in instead:

```python
def h5py_from_binary(binary, filename):
    file_access_property_list = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
    file_access_property_list.set_fapl_core(backing_store=False)
    file_access_property_list.set_file_image(binary)
    file_id_args = {
        'fapl': file_access_property_list,
        'flags': h5py.h5f.ACC_RDONLY,
        'name': filename.encode('utf-8'),
    }
    h5_file_args = {
        'backing_store': False,
        'driver': 'core',
        'mode': 'r'
    }
    file_id = h5py.h5f.open(**file_id_args)
    return h5py.File(file_id, **h5_file_args)
```

I don't know if that's useful, but it makes the code a bit less weird and removes a dependency on tempfile.
Agreed with @datajoely that this is an interesting one - thanks very much for raising the issue and for choosing Kedro for your first open source contribution! Just shout if you need any help. Building on what Joel says regarding implementation, this is potentially a bit tricky to do correctly. The options are:

1. Extend pandas.HDFDataSet (or add a sibling dataset) so that simple HDF5 files can be loaded straight into a pandas DataFrame.
2. Define a new low-level dataset that opens the file with h5py and returns the h5py.File object itself.

As I see it, option 2 is preferred because it's the most general: this way a user can use Kedro with any kind of h5 file, not just things which are coercible to a pandas DataFrame. You could store proper hierarchical data this way, and in the node pick out the bits you want and convert them to a pandas DataFrame (or just a numpy array, or whatever). It also leaves open the possibility of adding more specialised behaviour on top later. However, it is potentially harder than option 1.

In your case it sounds like you're only interested in loading a single array into a pandas DataFrame? If this is what people generally do, then we can probably get the majority of the useful functionality (being able to convert simple h5 files into pandas DataFrames) for less work than doing the fully general solution.
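To make option 2 concrete, a node could take the loaded h5py.File and pull out whatever it needs; a minimal sketch (the "my_array" key is purely illustrative):

```python
import pandas as pd


def extract_table(h5f) -> pd.DataFrame:
    # `h5f` is the h5py.File returned by the dataset's load();
    # read one HDF5 Dataset into a numpy array and wrap it in a DataFrame
    array = h5f["my_array"][()]
    return pd.DataFrame(array)
```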
Oh sorry, I hadn't refreshed my page and so completely missed your posts, @Eric-OG and @tomaz-suller. Pretend that my comment above came directly after @datajoely's! 🤦 I'll take a proper read through what you wrote later, but just 3 random things for now:
Hi @AntonyMilneQB. So, I'll try and answer the points you raised.

- To be really honest with you, I didn't get what you mean... could you elaborate on it further? Do you mean we would have problems with the methods we suggested corrupting files, i.e. writing files to the filesystem which could only be read using our method? If so, we haven't tested that out yet, but it's something simple I'll do and post the results here.
- That is precisely what we want, meaning the approach you mentioned of only using a single array would work for us.
- Yeah, I do agree; just the fact that we didn't understand how the low-level API does its thing bugs me a bit.
- Again, that is precisely what we wish to implement, but isn't that what PartitionedDataSet already provides? We already have an entry like this in our catalog:

```yaml
sim_dataset:
  layer: raw
  type: PartitionedDataSet
  path: data/01_raw/sim
  dataset:
    type: pandas.CSVDataSet
    save_args:
      index: False
```

I'm adding below some additional information about our use case as context, but it's not an essential read.

Detailing our use case further: the dynamic simulator I mentioned initially outputs one .h5 file per simulation run, organised like this:

```
$ tree -An
├── 0001
│   └── pos.h5
├── 0002
│   └── pos.h5
...
├── 7002
│   └── pos.h5
└── 7003
    └── pos.h5
```

These ~7000 files represent just 1 set of simulations - we need to analyse at least 20 of them. Anyway, each file contains information on the simulated conditions and the actual data generated, which we wish to analyse; we use the h5py API to inspect them:

```python
>>> import h5py
>>> h5f = h5py.File('pos.h5')
>>> h5f.keys()
<KeysViewHDF5 ['combination_id', ..., 'time_series_data']>
# The file contains a single Group, and we are interested in particular in the time_series_data Dataset
>>> type(h5f['time_series_data'])
h5py._hl.dataset.Dataset
```
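With a dataset along the lines discussed above, this use case could presumably be declared in the catalog much like the CSV example, combined with PartitionedDataSet; a sketch, where the `type` path of the h5py-backed dataset is a hypothetical placeholder:

```yaml
sim_positions:
  layer: raw
  type: PartitionedDataSet
  path: data/01_raw/sim
  dataset:
    type: my_project.extras.datasets.h5py_dataset.H5pyDataSet  # hypothetical module path
  filename_suffix: ".h5"
```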
Hello @AntonyMilneQB, regarding using h5py Group objects: according to what we searched, it would still be necessary to use a File object to open the data in the first place. Because of that, and taking into consideration h5py's File implementation (a File is itself a Group), the interface for Groups would work just as well for us.
Two new things. Firstly, the method I sent here to turn a File object into binary was flawed - create_dataset_like only reproduces each dataset's structure, not its contents - so here is a version that actually copies the data:

```python
def h5py_to_binary(h5f):
    bio = BytesIO()
    with h5py.File(bio, 'w') as biof:
        for key, value in h5f.items():
            h5f.copy(value, biof,
                     expand_soft=True,
                     expand_external=True,
                     expand_refs=True)
        biof.close()
    return bio.getvalue()
```

Secondly, complementing what @Eric-OG mentioned, @AntonyMilneQB we could get a Group representing the whole file simply by asking the File object for its root:

```python
>>> import h5py
>>> h5f = h5py.File('pos.h5')
>>> type(h5f.get('/'))
h5py._hl.group.Group
```

So yes, our solution would work using Group objects as well.
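For reference, this works because h5py.File is itself a subclass of h5py.Group, so code written against the Group interface accepts the File object directly:

```python
>>> isinstance(h5f, h5py.Group)  # File is a subclass of Group
True
```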
Hi @tomaz-suller, I think we're at a good point in terms of thinking - it may be worth trying to scratch up an implementation. We have docs here on how to create a custom dataset. I would ask that you try and implement two Kedro-style pieces of functionality: fsspec-backed file handling (so remote storage keeps working) and dataset versioning.
Hey @datajoely. @Eric-OG and I have experimented a bit with the functions I sent over here and turned them into an initial dataset implementation in our fork. We'll test it in our own (private) Kedro project to iron out any clear bugs and then start adapting the documentation and writing the unit tests.

By the way, I was considering creating the PR once we believe the dataset works and then working on the unit tests from the PR rather than from our fork, since we don't have that much experience in writing Python unit tests. What do you think?

On a side note, I tried installing Kedro's development dependencies as explained in the documentation, but the build fails. I think it might have to do with the environment I'm using; I also tried building with a different setup.
For transparency's sake, I'll be upfront and say @Eric-OG and I won't work on this issue for some time because of other obligations. I think we can be reasonably sure that we'll have at least opened the PR before the end of the year. Meanwhile, if anyone's interested, the fork I linked in this thread has one initial implementation, though saving hasn't been manually tested thoroughly and we don't know how to integrate the versioning functionality yet.
Thanks @tomaz-suller, I've added a 'Help wanted' label in case anyone else wants to have a go. Likewise, the core team is unlikely to have time to take over this request. Regarding your point on versioning - it's something we could revisit in another PR if users are asking for it.
I decided to come back to this issue and finally adapt the tests we needed so we could submit a PR. All tests I've developed in our fork are currently passing, but I don't know how to get to 100% coverage; the uncovered lines are in the low-level loading code, e.g.:

```python
file_id_args = {
    "fapl": file_access_property_list,
    "flags": h5py.h5f.ACC_RDONLY,
    "name": next(tempfile._get_candidate_names()).encode(),
}
```

Another issue I'm facing is that I can't make linting pass: the lint run fails with errors referring to some configuration files.
I assume our fork being a bit over a year old might be to blame, but I couldn't find any issue in the GitHub repo related to changes to the configuration files mentioned in the error messages. cc @Eric-OG
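Returning to the coverage question: one way to exercise that low-level path might be a round-trip test along these lines (a sketch, assuming `h5py_from_binary` is importable from the dataset module in the fork):

```python
import h5py
import numpy as np


def test_h5py_from_binary_roundtrip(tmp_path):
    # Write a small HDF5 file to disk, then feed its raw bytes to the loader
    path = tmp_path / "sample.h5"
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=np.arange(10))

    h5f = h5py_from_binary(path.read_bytes())

    assert list(h5f.keys()) == ["data"]
    assert (h5f["data"][()] == np.arange(10)).all()
```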
Add dataset and tests for HDF5 files which may not be supported by pandas. Resolves #1026.
Signed-off-by: Eric Oliveira Gomes <ericog123456@gmail.com>
Signed-off-by: tomaz-suller <tomaz.suller@usp.br>
Closing this issue as our datasets now live in the kedro-plugins repository. Feel free to re-open this issue / migrate the PR contribution over 😄.
I was actually thinking about commenting on this just the other day. I'd still be open to contributing, though what struck me is that this sat for about two years and there seems to be no other request for it. Would it make sense to add this dataset from a maintenance perspective? My view is that it wouldn't, just to support this one niche use case, but I'd like to get the perspective of the Kedro developers as well.
Hi @tomaz-suller - thanks for raising this concern. Generally some datasets will always be more niche than others. Our rule of thumb is that as long as the dataset is introduced with a suitable test suite, we are more than happy to accept the contribution!
kedro-org/kedro#1964 was an attempt to close this issue; I transferred it here and I'm reopening it for visibility.
Description

HDF files which were not created by pandas.to_hdf sometimes cannot be read using pandas.read_hdf. Since Kedro's HDFDataSet depends on pandas, these files cannot be added to the catalog at all.

Context

Currently, the dynamic simulation software employed in my research group outputs exclusively .h5 files which contain information we wish to feed to ML models using Kedro. At the moment we use a Python script which converts these HDF files into CSVs so we can track them with Kedro, yet this is an inefficient process as we are required to rewrite thousands of files in order to process them.

Ideally, we would like to add our dataset to our data catalog just like we do with our CSV datasets, but in a way that is able to read any kind of HDF file, unlike kedro.extras.datasets.pandas.HDFDataSet. Given pandas cannot read HDF files which do not conform to its specs (apparently by design, according to this issue), this simple addition would benefit any user who stores information in HDF files, be it because it is their preferred storage method or because (as in our case) they use software which directly outputs HDF.

Possible Implementation

My research colleague @Eric-OG and I believe we can implement this on our own. I think it's worth noting it'd be our first contribution to an open source project, but we've read the guidelines and so forth.

It would basically involve copying one of the existing datasets (possibly even pandas.HDFDataSet) and adapting it to use another library. We planned on using h5py.

Possible Alternatives