Files should be validated on write #1288

Open
rly opened this issue Aug 26, 2020 · 4 comments
Labels
category: enhancement improvements of code or code behavior

Comments

@rly
Contributor

rly commented Aug 26, 2020

Despite our best efforts, it is possible to use PyNWB to write a file that does not validate against the schema. Users should be warned when the file they just wrote does not comply with the schema and therefore may not work with certain tools.

See also this old discussion: #306

Thoughts? @oruebel @ajtritt @bendichter

@rly rly added the category: enhancement improvements of code or code behavior label Aug 26, 2020
@t-b
Collaborator

t-b commented Aug 27, 2020

@rly I'm all for it! And not just a warning: maybe add a flag like writeInvalidFiles that must be set to true before an invalid file can be written.
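
A minimal sketch of how such an opt-in flag could behave, under stated assumptions: `write_checked`, `write_fn`, `validate_fn`, `InvalidFileError`, and the snake_case spelling `write_invalid_files` are all hypothetical names for illustration, not part of the PyNWB API. The validator is assumed to return a list of error messages that is empty for a valid file.

```python
import os
import warnings


class InvalidFileError(Exception):
    """Raised when a freshly written file fails validation."""


def write_checked(path, write_fn, validate_fn, write_invalid_files=False):
    """Write a file, validate it, and refuse to keep an invalid result.

    write_fn(path) writes the file; validate_fn(path) returns a list of
    error messages (empty when the file is valid). Both are hypothetical
    stand-ins, not PyNWB functions.
    """
    write_fn(path)
    errors = validate_fn(path)
    if not errors:
        return []
    if write_invalid_files:
        # The caller opted in, so keep the file but make the problem visible.
        warnings.warn(f"{path} does not validate: {len(errors)} error(s)")
        return errors
    # Default: do not leave a non-compliant file behind.
    os.remove(path)
    raise InvalidFileError("; ".join(errors))
```

With the default `write_invalid_files=False`, an invalid file is deleted and an exception is raised; passing `True` keeps the file and downgrades the failure to a warning.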

@oruebel
Contributor

oruebel commented Aug 27, 2020

Is it possible to run the validator on the builders before write? If so, that would save the cost of writing and then reading the file again, and it would let us stop before any I/O happens, so that no bad file is created.

@rly
Contributor Author

rly commented Aug 27, 2020

@oruebel That should be possible, but I'm not sure how that would work with DataChunkIterators, whose data will not have been written yet. Also, before writing, the data would be in lists/tuples/numpy arrays instead of in H5Datasets. That should not matter much, but I think it would be better to have the validator work exactly as it would if it were called on the data file. Point taken, though, that there is extra overhead involved in validating after write.

We could also strongly encourage users to validate themselves after writing, but I think many users would assume that PyNWB cannot write a non-compliant file and not bother validating after writing.

@oruebel
Contributor

oruebel commented Aug 27, 2020

@rly it may make sense to do both, i.e., validate the builders before write to catch possible errors before the write even happens, and then validate again after write to make sure the file is actually correct. I think validation before write should catch the vast majority of problems and should also be fairly cheap (at least compared to validation after write). Even when DataChunkIterators are used, the only thing we may not know is the final total shape of the dataset. We should still know the data type and the initial shape, which should be sufficient for most validation needs, since the dimension you iterate over is rarely one with a fixed required length. In general, whether to validate before and/or after write should be configurable, as users may want to skip these steps for performance reasons.
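
To illustrate the point about iterated dimensions, here is a rough sketch of a pre-write check that compares only what is known before writing. The function `check_dataset` and its arguments are hypothetical and do not reflect the actual validator. It assumes `None` in the spec shape means "any length" and `None` in the actual shape marks a dimension whose final length is not known yet.

```python
def check_dataset(spec_dtype, spec_shape, dtype, shape):
    """Return a list of problems found before any data is written.

    spec_shape uses None for a dimension of any length. shape uses None for
    a dimension whose final length is not known yet (for example, the axis
    an iterator is still appending to).
    """
    problems = []
    if dtype != spec_dtype:
        problems.append(f"dtype {dtype!r} does not match required {spec_dtype!r}")
    if len(shape) != len(spec_shape):
        problems.append(f"expected {len(spec_shape)} dimensions, got {len(shape)}")
        return problems
    for axis, (required, actual) in enumerate(zip(spec_shape, shape)):
        if required is None or actual is None:
            # Either the spec allows any length or the length is not known
            # yet; an unknown length is left for the post-write check.
            continue
        if required != actual:
            problems.append(f"axis {axis}: expected length {required}, got {actual}")
    return problems
```

For example, `check_dataset("float64", (None, 3), "float64", (None, 3))` returns an empty list even though the first axis has no final length yet, while a wrong dtype or a mismatch on the fixed second axis is reported before any I/O happens.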


3 participants