
Check to see if our HDF5 files have checksums turned on #1525

Closed
duncan-brown opened this issue Mar 15, 2017 · 9 comments
@duncan-brown
Contributor

No description provided.

@duncan-brown duncan-brown self-assigned this Mar 15, 2017
@spxiwh
Contributor

spxiwh commented Mar 16, 2017

This came up at the "HDF formats" discussion.

HDF does not have a top-level checksum a la frame files, but it does have a per-dataset checksum feature (fletcher32):

http://docs.h5py.org/en/latest/high/dataset.html#fletcher32-filter

We should probably just turn this on.
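For reference, a minimal sketch of what turning this on looks like in h5py (the file and dataset names here are just illustrative). The filter needs chunked storage, so chunks are requested explicitly:

```python
import numpy as np
import h5py

data = np.random.normal(size=4096)

# fletcher32 stores a checksum per chunk; it requires chunked storage,
# so chunks=True is passed alongside it.
with h5py.File("example.hdf", "w") as f:
    f.create_dataset("strain", data=data, chunks=True, fletcher32=True)

# On read, HDF5 verifies each chunk's checksum and raises an error if
# the stored bytes have been corrupted.
with h5py.File("example.hdf", "r") as f:
    print(f["strain"].fletcher32)  # True
    _ = f["strain"][:]
```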

@ahnitz
Member

ahnitz commented Mar 16, 2017 via email

@stuartthebruce

Did anyone give this a try, and if so, did it "just work"?

@stuartthebruce

@ahnitz @spxiwh @titodalcanton did PyCBC adopt HDF5 checksums for its data files? If so, how well (or not) is that working? Thanks.

@spxiwh
Contributor

spxiwh commented Sep 18, 2022

We did not implement the "per dataset checking feature" within PyCBC.

However, Pegasus implemented checksum testing on all data files (similar to our own checks on frame files). After we added an option to stop it testing symlinked files, this has been working nicely. We often disable this for development runs, so must remember to re-enable it for production runs! This doesn't help if using the files outside of Pegasus though, unless one also extracts the checksums for all files (which is possible).

We might consider enabling this feature, but it would not be ideal to have to enable it in every HDF call, as there are a lot of these throughout PyCBC. Some sort of environment variable to make this the default would be much easier.

@GarethCabournDavies A point of discussion for the face-to-face tomorrow(?)

@stuartthebruce

Are the Pegasus checksums maintained in a database or calculated on the fly before/after each file transfer? Note, in either case I think it would be worthwhile to add internal checksums to the critical files as well. Thanks.

@GarethCabournDavies
Contributor

For enabling the h5py dataset checksum:

Could we insist on all calls to h5py.File going through HFile instead (this solves the 'every call' issue) and add the `__setitem__`/`__getitem__` changes as Alex suggests? I don't see how to do that quite yet though.

The environment variable check could then go into the HFile init function, so that it is applied but hidden from the user.
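A rough sketch of that idea, assuming all file opens already go through an HFile subclass of h5py.File; the environment variable name PYCBC_HDF_CHECKSUM is made up for illustration:

```python
import os
import h5py

class HFile(h5py.File):
    """h5py.File wrapper used in place of h5py.File (sketch)."""

    def create_dataset(self, name, *args, **kwargs):
        # Hypothetical opt-in switch: only add the filter when the
        # environment asks for it and the caller did not set it already.
        if os.environ.get("PYCBC_HDF_CHECKSUM", "0") == "1":
            kwargs.setdefault("fletcher32", True)
            # fletcher32 needs chunked storage (scalar datasets would
            # have to be skipped or handled separately).
            kwargs.setdefault("chunks", True)
        return super().create_dataset(name, *args, **kwargs)
```

One caveat: datasets created through sub-groups (e.g. f.create_group('x').create_dataset(...)) go through a plain h5py.Group, not this class, so they would not pick up the default.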

@GarethCabournDavies
Contributor

I would also need to check that e.g. file_object['new_dataset_name'] = data would also do this, as we don't always go through the create_dataset method
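Extending the sketch above, dict-style assignment could be routed through create_dataset so the same default applies; whether this covers all the ways PyCBC writes datasets is exactly what would need checking:

```python
import numpy as np
import h5py

class HFile(h5py.File):
    # ... create_dataset override from the sketch above ...

    def __setitem__(self, name, value):
        # Route plain array-like values through create_dataset so they
        # pick up the fletcher32 default; anything else (soft links,
        # references, existing datasets, ...) keeps h5py's behaviour.
        if isinstance(value, (np.ndarray, list, tuple)):
            self.create_dataset(name, data=np.asarray(value))
        else:
            super().__setitem__(name, value)
```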

@GarethCabournDavies
Contributor

Can be closed by #4831
