Validate builders against both top level data type specs and inner specs #609
Conversation
* Fix hdmf-dev#585
* Ref hdmf-dev#542
* Builder can now be validated against more than one spec in order to validate against additional fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated
I'm currently exploring the errors in the pynwb CI. It looks like there are a number of pynwb specs that use inner spec extension (as described above), which are now failing validation for one or more reasons. I'm still exploring the errors, but at least some of them are caused by the validator not handling dataset reference vectors (shape is 1D). I am thinking of breaking this fix into two parts:
I'll change the PR to a draft until I get the pynwb tests resolved.
* Ref hdmf-dev#585
* This is just a workaround for checking the data_type of BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach
I think I determined the sources of all of the errors, and they fell into 4 categories:
I've added a workaround to the hdmf validation code to resolve (I think) 1. and 2. It would be good to get feedback from @rly and @ajtritt (and anyone else) about the implementation. In particular, is there another way you'd recommend to determine the type of these builder objects?

Item 3. is just a quick fix to a pynwb test; I will submit a PR.

Item 4., however, seems to be a concern. It seems like there is a core disagreement between nwb-schema and the pynwb implementation that should be resolved, but it isn't clear to me which of the two should be changed, or if there's another way to make them both work. Regardless, this makes the current PR a breaking change, and the disagreement between nwb-schema and pynwb should be resolved before this PR is released. Thoughts?

There's a lot involved here, and my interpretation of everything going on might not be right, so please correct me if anything above is incorrect!
* These unit tests will begin to fail after the update to the hdmf validator that increases validation coverage
* Ref hdmf-dev/hdmf#585
* Ref hdmf-dev/hdmf#609
* See discussion: hdmf-dev/hdmf#609 (comment)
@oruebel that seems like it would work. At first I thought that we might have to treat BuilderH5ReferenceDataset objects differently from BuilderH5TableDataset objects, but it looks like we can use the same check in either case.

From a conceptual standpoint, is that the appropriate abstraction? Does it make sense to consider a dataset with a composite dtype (which seems to be the case when there is a BuilderH5TableDataset, based on this comment) as a Reference?

Also, as far as I can tell, these BuilderH5... objects are only instantiated when reading an existing file from disk, not when writing a newly created nwbfile object to disk. So there is a difference in the builder graph when going from new container -> builder vs. going from file -> builder. It would be beneficial for the validator, and conceptually easier, if there were only one way to represent nwb data with a builder graph. Would it be possible and worthwhile to add another step while loading files to convert/resolve the BuilderH5... objects into the same builder objects you'd get when starting with a set of containers? Maybe this wouldn't be a priority, but if it is potentially beneficial, I can add an issue.
If I remember correctly, compound types are only wrapped if the compound dtype contains a reference. I believe compound types that contain only numerical data are treated as regular datasets. @ajtritt is that correct?
This is because for datasets that contain references, we need to resolve the references to the appropriate builder/container objects. To do this on read, we'd need to read all datasets that contain references into memory and resolve all the links. To support lazy reading of datasets with references, the datasets are therefore wrapped on read.
I'm not sure we can fully avoid this, because the source of the data is different on write and read. When writing, the data is organized in Container classes and usually lives in memory (or is retrieved iteratively with DataChunkIterators and/or wrapped with DataIO). On read, the data sits in an HDF5 file and we don't yet have Containers. Working with links in particular is tricky here.

One possible approach that we had discussed at some point would be to always wrap all data arrays. However, the effort involved in doing this is substantial and would likely end up breaking a lot of things, which is why we have so far avoided going this route. This has come up before, e.g., during #480, where we now also need to wrap string datasets since h5py 3 changed how strings are handled. I don't think this should cause issues for the validator.
Ah, ok, that makes sense, thanks for the clarification.
Gotcha, the need for lazy evaluation makes sense as to why it is like it is. Thanks.
Good to know you've been discussing all of this. I understand the motivation, and it seems like it'd be hard to get around having two different implementations. So for now, I'll go ahead with the solution we discussed and add unit tests. The one final thing that needs a course of action before I can finalize this PR is this item:
This is a bug in the NWB schema. This PR will fix it: NeurodataWithoutBorders/nwb-schema#478
I will also make a corresponding PR in PyNWB.
* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix hdmf-dev#585
Thanks for taking care of the electrode filtering issue @rly! I believe this PR is now ready to go. I retargeted for
@dsleiter thanks for all the hard work on these fixes to the validator! Looks good to me.
LGTM. Thanks for all the hard work on this!
Awesome, thanks for the support in sorting through all of the details here!
Validate builders against both top level data type specs and inner specs (#609)

* Validate builders against both top level data type specs and inner specs
* Fix #585
* Ref #542
* Builder can now be validated against more than one spec in order to validate against additional fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated
* Update changelog
* Ref #585
* Fix pynwb validation errors related to reference and compound data types
* Ref #585
* This is just a workaround for checking the data_type of BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach
* Remove validator reference to H5-specific classes and add unit tests
* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix #585
* Update tests/unit/validator_tests/test_validate.py

Co-authored-by: Ryan Ly <rly@lbl.gov>
In certain specs, there are datasets or groups nested inside a top-level spec (let's call them "inner" specs) which inherit from a data type via data_type_inc and also add or modify fields (let's call this an "extension" of the base data type). In most situations, a new type is defined via data_type_def, but sometimes no new type is defined (so let's call it an "anonymous" extension).
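To make the terminology concrete, here is a minimal sketch of such a spec, written as plain Python dicts rather than the real hdmf spec classes; the type names (`BaseData`, `Container`) and fields are hypothetical, chosen only to illustrate the shape of an anonymous extension:

```python
# A base data type that defines one attribute.
base_dataset_spec = {
    "data_type_def": "BaseData",  # hypothetical type name
    "doc": "A base dataset type.",
    "attributes": [
        {"name": "unit", "dtype": "text", "doc": "Unit of measure."},
    ],
}

# A top-level group spec whose inner dataset includes BaseData via
# data_type_inc WITHOUT defining a new type via data_type_def, while
# adding an extra attribute -- an "anonymous" extension in the
# terminology above.
outer_group_spec = {
    "data_type_def": "Container",  # hypothetical type name
    "doc": "A group holding an anonymous extension of BaseData.",
    "datasets": [
        {
            "data_type_inc": "BaseData",  # inherit, but no data_type_def
            "doc": "Inner dataset.",
            "attributes": [
                # field added on top of BaseData -> anonymous extension
                {"name": "rate", "dtype": "float", "doc": "Sampling rate."},
            ],
        },
    ],
}
```

A builder for the inner dataset must satisfy both the `BaseData` spec (the `unit` attribute) and the inner spec (the added `rate` attribute), which is what motivates the double validation described below.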
In order to validate the added/modified fields in the anonymous data type extension while still validating the fields of the base data type, the validator will now validate such an inner builder twice: once against the base data type spec and a second time against the inner spec. This resolves the validation issue when there is an anonymous extension.
The double validation also happens when the inner spec is a "pure" data type and doesn't make any additions or modifications, but in this case, the extra validation will not have any effect.
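The double-validation idea above can be sketched as follows. This is not the actual hdmf implementation; `resolve_base_spec` and `validate` are hypothetical callables standing in for the real namespace lookup and per-spec validation:

```python
def validate_inner_builder(builder, inner_spec, resolve_base_spec, validate):
    """Validate an inner builder against the base data type spec and,
    when the inner spec is distinct from it (an anonymous extension),
    against the inner spec as well. Returns the combined error list."""
    errors = []
    # Look up the spec registered for the inner spec's data_type_inc.
    base_spec = resolve_base_spec(inner_spec)
    # Pass 1: the base data type spec.
    errors.extend(validate(builder, base_spec))
    # Pass 2: the inner spec itself, when it is not the same object
    # (i.e., it adds or modifies fields, or is a "pure" include --
    # in the pure case this pass simply finds nothing new).
    if inner_spec is not base_spec:
        errors.extend(validate(builder, inner_spec))
    return errors
```

With a pure `data_type_inc` (no additions), the second pass is redundant but harmless, matching the behavior described above.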
There are situations where the same attribute or field is defined redundantly at both the top-level data type and the inner spec. If there is a problem with this attribute/field, two identical validation errors are now generated. However, before returning the full list of validation errors, any duplicates are removed. Duplicates are identified by new logic for Error equality, which depends on the error location, reason, and the name of the component.
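A minimal sketch of that equality-plus-deduplication logic (this is an illustrative class, not the real hdmf `Error`, but it follows the same idea: equality on name, reason, and location):

```python
class ValidationError:
    """Comparable validation error: two errors are equal when their
    component name, reason, and location all match."""

    def __init__(self, name, reason, location):
        self.name = name
        self.reason = reason
        self.location = location

    def __eq__(self, other):
        if not isinstance(other, ValidationError):
            return NotImplemented
        return (self.name, self.reason, self.location) == \
               (other.name, other.reason, other.location)

    def __hash__(self):
        # Required alongside __eq__ so errors can live in sets.
        return hash((self.name, self.reason, self.location))


def deduplicate(errors):
    """Drop duplicate errors while preserving first-seen order."""
    seen = set()
    unique = []
    for err in errors:
        if err not in seen:
            seen.add(err)
            unique.append(err)
    return unique
```

When the same attribute fails validation under both the base spec and the inner spec, the two resulting errors compare equal and only one survives deduplication.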
An alternative solution that I considered was to try to merge the specs together. However, there are a lot of complex rules on how the specs should work together and what is a valid extension, and I didn't want the validator to take on the responsibility of correctly merging specs (I don't think this can be done at spec-loading time unless we add additional restrictions to the schema language). Perhaps in the future we can make adjustments to the schema-language to either avoid the need for this or make it possible to merge specs at load time.
I did quite a bit of code reading while working on this to understand how specs are loaded, how namespaces are managed, and how spec inheritance works. I did my best to avoid any side effects from this change, but I can't say for certain that I've 100% avoided them, so please look critically at the changes.
Note that this change will make it more difficult to raise warnings on extra fields, since we want to raise a warning only for fields that aren't listed in either spec. I plan to approach that by creating a ValidationQueue where you can queue up specs to validate against a particular builder, run the validation only once all specs for that builder are queued, and then pass through only the missing-field warnings that are generated from both specs.
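The planned ValidationQueue could look roughly like this. Purely a hypothetical sketch of the idea described above (the class does not exist in hdmf at this point); the injected `validate` callable stands in for per-spec validation and is assumed to return a collection of warning identifiers:

```python
from collections import defaultdict


class ValidationQueue:
    """Collect every spec that applies to a builder, validate only once
    all specs are queued, and keep an extra/missing-field warning only
    when every applicable spec produced it."""

    def __init__(self, validate):
        self._validate = validate            # callable(builder, spec) -> warnings
        self._pending = defaultdict(list)    # builder id -> [(builder, spec), ...]

    def enqueue(self, builder, spec):
        """Register one more spec to validate this builder against."""
        self._pending[id(builder)].append((builder, spec))

    def run(self):
        """Validate each builder against all of its queued specs and
        return only the warnings common to every spec for that builder."""
        results = []
        for jobs in self._pending.values():
            per_spec = [set(self._validate(b, s)) for b, s in jobs]
            # A warning survives only if *all* specs agree on it; a field
            # listed in either spec suppresses the extra-field warning.
            results.extend(set.intersection(*per_spec))
        return results
```

Intersecting the per-spec warning sets is what implements "only warn when the field appears in neither spec."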
Motivation
See #585
How to test the behavior?
Execute unit tests and see that the new tests pass.
Checklist
Run `flake8` from the source directory.