
Validate builders against both top level data type specs and inner specs #609

Merged

Conversation

@dsleiter commented May 13, 2021

In certain specs, there are datasets or groups nested inside a top-level spec (let's call them "inner" specs) which inherit from a data type via data_type_inc and also add or modify fields (let's call this an "extension" of the base data type). In most situations, a new type is defined via data_type_def, but sometimes no new type is defined (so let's call it an "anonymous" extension).
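For illustration, here is a minimal sketch of what such a spec could look like using the hdmf.spec classes (the type and field names are hypothetical):

```python
from hdmf.spec import AttributeSpec, DatasetSpec, GroupSpec

# A base data type with one attribute.
base_spec = DatasetSpec(
    doc='A base dataset type',
    data_type_def='MyData',
    attributes=[AttributeSpec(name='unit', doc='unit of measure', dtype='text')],
)

# A group type whose inner dataset is an "anonymous" extension of MyData:
# it uses data_type_inc without data_type_def and adds an attribute.
group_spec = GroupSpec(
    doc='A group containing an anonymous extension',
    data_type_def='MyGroup',
    datasets=[
        DatasetSpec(
            doc='Extends MyData without defining a new type',
            name='data',
            data_type_inc='MyData',
            attributes=[AttributeSpec(name='rate', doc='sampling rate', dtype='float')],
        ),
    ],
)
```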

In order to validate the added/modified fields of an anonymous extension while still validating the fields of the base data type, the validator will now, in this situation, validate an inner builder twice: once against the base data type spec and a second time against the inner spec. This resolves the validation issue when there is an anonymous extension.

The double validation also happens when the inner spec is a "pure" data type that doesn't make any additions or modifications, but in that case the extra validation has no effect.

There are situations where the same attribute or field is defined redundantly in both the top-level data type and the inner spec. If there is a problem with such an attribute/field, two identical validation errors are now generated. However, before returning the full list of validation errors, any duplicates are removed. Duplicates are identified by new logic for Error equality, which depends on the error's location, reason, and the name of the component.
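As a rough illustration, the equality logic could look something like this (a sketch only, not the actual implementation; the attribute names are assumptions):

```python
class Error:
    """Simplified stand-in for the validator's Error class."""

    def __init__(self, name, reason, location=None):
        self.name = name          # name of the component that failed validation
        self.reason = reason      # human-readable description of the problem
        self.location = location  # path of the builder within the file

    def __eq__(self, other):
        # Two errors are duplicates if they point at the same component,
        # in the same location, for the same reason.
        return (isinstance(other, Error)
                and (self.location, self.reason, self.name)
                == (other.location, other.reason, other.name))

    def __hash__(self):
        return hash((self.location, self.reason, self.name))


# Deduplicating the combined results of the two validation passes:
# unique_errors = list(set(base_spec_errors + inner_spec_errors))
```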

An alternative solution that I considered was to merge the specs together. However, there are a lot of complex rules about how the specs should work together and what constitutes a valid extension, and I didn't want the validator to take on the responsibility of correctly merging specs (I don't think this can be done at spec-loading time unless we add additional restrictions to the schema language). Perhaps in the future we can adjust the schema language to either avoid the need for this or make it possible to merge specs at load time.

I did quite a bit of code reading while working on this to understand how specs are loaded, how namespaces are managed, and how spec inheritance works. I did my best to avoid any side effects from this change, but I can't say for certain that I've avoided them entirely, so please look critically at the changes.

Note that this change will make it more difficult to raise warnings on extra fields, since we want to raise a warning only if a field isn't listed in either spec. I plan to approach that by creating a ValidationQueue where you can queue up specs to validate against a particular builder, run the validation only once all specs for that builder are queued, and then pass through only the extra-field warnings that are generated by both specs.
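A rough sketch of that idea (purely illustrative; none of this exists yet, and validate_fn stands in for the real per-spec validation):

```python
class ValidationQueue:
    """Collect all specs that apply to one builder, then validate them together."""

    def __init__(self, builder):
        self.builder = builder
        self.specs = []

    def queue(self, spec):
        self.specs.append(spec)

    def run(self, validate_fn):
        # validate_fn(builder, spec) -> (errors, extra_field_warnings)
        all_errors = []
        warning_sets = []
        for spec in self.specs:
            errors, warnings = validate_fn(self.builder, spec)
            all_errors.extend(errors)
            warning_sets.append(set(warnings))
        # Keep an extra-field warning only if every queued spec produced it,
        # i.e. the field is not listed in any of the applicable specs.
        common = set.intersection(*warning_sets) if warning_sets else set()
        return list(set(all_errors)) + list(common)
```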

Motivation

See #585

How to test the behavior?

Execute unit tests and see that the new tests pass.

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

* Fix hdmf-dev#585
* Ref hdmf-dev#542
* Builder can now be validated against more than one spec in order to validate against additional
  fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated
@dsleiter mentioned this pull request May 13, 2021

@dsleiter commented Jun 7, 2021

I'm currently exploring the errors in the pynwb CI. It looks like there are a number of pynwb specs that use inner spec extensions (as described above), which are now failing validation for one or more reasons.

I'm still exploring the errors, but at least some of them are caused by the validator not handling dataset reference vectors (shape is 1D). In this case, the DatasetBuilder has a BuilderH5ReferenceDataset in the data field.

I am thinking of breaking this fix into two parts:

  1. find a way to at least get the pynwb tests to all pass in this PR, even if that means skipping validation of BuilderH5ReferenceDataset data (it looks like they were never evaluated before this PR anyway)
  2. create a separate issue & PR for validating references

I'll change the PR to a draft until I get the pynwb tests resolved.

@dsleiter marked this pull request as draft June 7, 2021 09:52
* Ref hdmf-dev#585
* This is just a workaround for checking the data_type of
  BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach
@dsleiter commented Jun 9, 2021

I think I've determined the sources of all of the errors, and they all fall into 4 categories:

  1. A BuilderH5ReferenceDataset is passed to hdmf.validate.validator.get_type, which raises an error. As far as I can tell, this happens when a file read from disk contains a dataset with a reference dtype and shape (N,).
  2. A BuilderH5TableDataset is passed to hdmf.validate.validator.get_type, which raises an error. As far as I can tell, this happens when a file read from disk contains a dataset with a compound dtype.
  3. pynwb integration tests in integration/hdf5/test_ecephys.py that use the function TestElectricalSeriesIO.make_electrode_table fail correctly because the value for x in table.add_row is provided as an int while the schema expects a float.
  4. the nwb schema expects the filtering column of an electrodes table to be of type float32 but many tests provide a string, and NWBFile.add_electrode requires a string for filtering

I've added a workaround to the hdmf validation code that (I think) resolves 1 and 2. It would be good to get feedback from @rly and @ajtritt (and anyone else) on the implementation. In particular, is there another way you'd recommend to determine the type of BuilderH5ReferenceDataset and BuilderH5TableDataset objects? It seems to be an abstraction leak for the validator to reference classes tied to the hdf5 backend, so I'd be interested to know if there's another type up the MRO that I should check for instead.

Item 3 just requires a quick fix to a pynwb test; I will submit a PR.

Item 4, however, seems to be a concern. There appears to be a core disagreement between the nwb schema and the pynwb implementation that should be resolved, but it isn't clear to me which of the two should be changed, or whether there's another way to make them both work. Regardless, this seems to make the current PR a breaking change, and the disagreement between nwb-schema and pynwb should be resolved before this PR is released. Thoughts?

There are a lot of things going on here, and my interpretation of it all might not be right, so please correct me if anything above is incorrect!

dsleiter added a commit to agencyenterprise/pynwb that referenced this pull request Jun 9, 2021
* These unit tests will begin to fail after the update to the hdmf
  validator that increases validation coverage
* Ref hdmf-dev/hdmf#585
* Ref hdmf-dev/hdmf#609
* See discussion: hdmf-dev/hdmf#609 (comment)
@oruebel commented Jun 9, 2021

I believe both `BuilderH5ReferenceDataset` and `BuilderH5TableDataset` inherit from `hdmf.query.ReferenceResolver`, so it may be possible to use that for checking in the validator. @ajtritt @rly

@dsleiter

@oruebel that seems like it would work.

At first I thought that we might have to treat BuilderH5ReferenceDataset objects differently from BuilderH5TableDataset objects, but it looks like in either case we can check for ReferenceResolver and return the object's dtype.
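Something along these lines, as a simplified sketch (assuming, per the above, that both wrapper classes expose a dtype attribute):

```python
from hdmf.query import ReferenceResolver

def get_type(data):
    """Simplified sketch of the validator's type lookup."""
    if isinstance(data, ReferenceResolver):
        # Covers both BuilderH5ReferenceDataset and BuilderH5TableDataset
        # without importing HDF5-backend classes into the validator.
        return data.dtype
    if isinstance(data, str):
        return 'utf'
    raise ValueError("unrecognized type: '%s'" % type(data).__name__)
```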

From a conceptual standpoint, is that the appropriate abstraction? Does it make sense to consider a dataset with a composite dtype (which seems to be the case when there is a BuilderH5TableDataset based on this comment) as a Reference?

Also, as far as I can tell, these BuilderH5... objects are only instantiated when reading an existing file from disk and not when writing a newly created nwbfile object to disk. So there is a difference in builder graph when going from new container -> builder vs going from file -> builder. It would be beneficial for the validator and just conceptually easier if there were only one way to represent nwb data with a builder graph. Would it be possible and worthwhile to add another step while loading files to convert/resolve the BuilderH5... objects to the same builder objects you'd get when starting with a set of containers? Maybe this wouldn't be a priority, but if it is potentially beneficial, I can add an issue.

@oruebel commented Jun 10, 2021

Does it make sense to consider a dataset with a composite dtype (which seems to be the case when there is a BuilderH5TableDataset based on this comment) as a Reference?

If I remember correctly, composite types are only wrapped if the compound datatype contains a reference. I believe compound types that only contain numerical data are treated as regular datasets. @ajtritt is that correct?

Also, as far as I can tell, these BuilderH5... objects are only instantiated when reading an existing file from disk and not when writing a newly created nwbfile object to disk

This is because, for datasets that contain references, we need to resolve the references to the appropriate builder/container objects. To do this on read, we'd need to read all datasets that contain references into memory and resolve all the links. To support lazy read of datasets with references, the datasets are therefore wrapped on read.

So there is a difference in builder graph when going from new container -> builder vs going from file -> builder.

I'm not sure we can fully avoid this, because the source of the data is different on write and read. When writing, the data is organized in Container classes and usually lives in memory (or is retrieved iteratively with DataChunkIterators and/or wrapped with DataIO). On read, the data sits in an HDF5 file and we don't yet have Containers. Working with links is particularly tricky here. One possible approach that we had discussed at some point would be to always wrap all data arrays. However, the effort involved in doing this is substantial and would likely end up breaking a lot of things, which is why we have so far avoided going this route. This also came up during #480, where we now need to wrap string datasets since h5py 3 changed how strings are handled. I don't think this should cause issues for the validator, since StrDataset extends h5py.Dataset.

@dsleiter

If I remember correctly, composite types are only wrapped if the compound datatype contains a reference. I believe compound types that only contain numerical data are treated as regular datasets. @ajtritt is that correct?

Ah, ok, that makes sense, thanks for the clarification.

This is because, for datasets that contain references, we need to resolve the references to the appropriate builder/container objects. To do this on read, we'd need to read all datasets that contain references into memory and resolve all the links. To support lazy read of datasets with references, the datasets are therefore wrapped on read.

Gotcha, the need for lazy evaluation makes sense as to why it is like it is. Thanks.

I'm not sure we can fully avoid this, because the source of the data is different on write and read. When writing, the data is organized in Container classes and usually lives in memory (or is retrieved iteratively with DataChunkIterators and/or wrapped with DataIO). On read, the data sits in an HDF5 file and we don't yet have Containers. Working with links is particularly tricky here. One possible approach that we had discussed at some point would be to always wrap all data arrays. However, the effort involved in doing this is substantial and would likely end up breaking a lot of things, which is why we have so far avoided going this route. This also came up during #480, where we now need to wrap string datasets since h5py 3 changed how strings are handled. I don't think this should cause issues for the validator, since StrDataset extends h5py.Dataset.

Good to know you've been discussing all of this. I understand the motivation, and it seems like it'd be hard to get around having two different implementations. So for now, I'll go ahead with the solution we discussed and add unit tests.

The one final thing that needs a course of action before I can finalize this PR is this item:

  4. the nwb schema expects the filtering column of an electrodes table to be of type float32 but many tests provide a string, and NWBFile.add_electrode requires a string for filtering

@rly commented Jun 14, 2021

The one final thing that needs a course of action before I can finalize this PR is this item:

  4. the nwb schema expects the filtering column of an electrodes table to be of type float32 but many tests provide a string, and NWBFile.add_electrode requires a string for filtering

This is a bug in the NWB schema. This PR will fix it: NeurodataWithoutBorders/nwb-schema#478

@rly commented Jun 14, 2021

I will also make a corresponding PR in PyNWB.

* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix hdmf-dev#585
@dsleiter changed the base branch from dev to rc/3.0.0 June 17, 2021 07:10
@dsleiter marked this pull request as ready for review June 17, 2021 07:11
@dsleiter

Thanks for taking care of the electrode filtering issue @rly!

I believe this PR is now ready to go. I retargeted it to rc/3.0.0 because merging into dev before the nwb schema update and the corresponding pynwb PR (NeurodataWithoutBorders/pynwb#1366) are merged would cause errors in pynwb, but let me know if you'd prefer it to merge into dev.

@oruebel left a comment

@dsleiter thanks for all the hard work on these fixes to the validator! Looks good to me.

@rly commented Jun 25, 2021

LGTM. Thanks for all the hard work on this!

@rly merged commit b93cde4 into hdmf-dev:rc/3.0.0 Jun 25, 2021
@dsleiter deleted the bug/validate_extended_inc_data_types branch June 25, 2021 01:49
@dsleiter

Awesome, thanks for the support in sorting through all of the details here!

rly added a commit that referenced this pull request Jul 6, 2021
Validate builders against both top level data type specs and inner specs (#609)

* Validate builders against both top level data type specs and inner specs

* Fix #585
* Ref #542
* Builder can now be validated against more than one spec in order to validate against additional
  fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated

* Update changelog

* Ref #585

* Fix pynwb validation errors related to reference and compound data types

* Ref #585
* This is just a workaround for checking the data_type of
  BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach

* Remove validator reference to H5-specific classes and add unit tests

* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix #585

* Update tests/unit/validator_tests/test_validate.py

* Update tests/unit/validator_tests/test_validate.py

Co-authored-by: Ryan Ly <rly@lbl.gov>
Successfully merging this pull request may close these issues:

Validator should allow extensions to data_types which only define data_type_inc