
Validate builders against both top level data type specs and inner specs #609

Merged

Conversation

@dsleiter commented May 13, 2021

In certain specs, there are datasets or groups nested inside a top-level spec (let's call them "inner" specs) which inherit from a data type via data_type_inc and also add or modify fields (let's call this an "extension" of the base data type). In most situations, a new type is defined via data_type_def, but sometimes no new type is defined (so let's call it an "anonymous" extension).
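For illustration, here is a minimal sketch of what such a spec could look like using the hdmf.spec classes (the type and field names are hypothetical):

```python
from hdmf.spec import AttributeSpec, DatasetSpec, GroupSpec

# A base data type with one attribute.
base_spec = DatasetSpec(
    doc='A base dataset type',
    data_type_def='MyData',
    attributes=[AttributeSpec(name='unit', doc='unit of measure', dtype='text')],
)

# A group type whose inner dataset is an "anonymous" extension of MyData:
# it uses data_type_inc without data_type_def and adds an attribute.
group_spec = GroupSpec(
    doc='A group containing an anonymous extension',
    data_type_def='MyGroup',
    datasets=[
        DatasetSpec(
            doc='Extends MyData without defining a new type',
            name='data',
            data_type_inc='MyData',
            attributes=[AttributeSpec(name='rate', doc='sampling rate', dtype='float')],
        ),
    ],
)
```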

In order to validate the added/modified fields of an anonymous extension while still validating the fields of the base data type, the validator will now, in this situation, validate an inner builder twice: once against the base data type spec and a second time against the inner spec. This resolves the validation issue when there is an anonymous extension.

The double validation also happens when the inner spec is a "pure" data type that doesn't make any additions or modifications, but in that case the extra validation has no effect.

There are situations where the same attribute or field is defined redundantly in both the top-level data type and the inner spec. If there is a problem with such an attribute/field, two identical validation errors are now generated. However, before returning the full list of validation errors, any duplicates are removed. Duplicates are identified by new logic for Error equality, which depends on the error's location, reason, and the name of the component.
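As a rough illustration, the equality logic could look something like this (a sketch only, not the actual implementation; the attribute names are assumptions):

```python
class Error:
    """Simplified stand-in for the validator's Error class."""

    def __init__(self, name, reason, location=None):
        self.name = name          # name of the component that failed validation
        self.reason = reason      # human-readable description of the problem
        self.location = location  # path of the builder within the file

    def __eq__(self, other):
        # Two errors are duplicates if they point at the same component,
        # in the same location, for the same reason.
        return (isinstance(other, Error)
                and (self.location, self.reason, self.name)
                == (other.location, other.reason, other.name))

    def __hash__(self):
        return hash((self.location, self.reason, self.name))


# Deduplicating the combined results of the two validation passes:
# unique_errors = list(set(base_spec_errors + inner_spec_errors))
```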

An alternative solution that I considered was to merge the specs together. However, there are a lot of complex rules about how the specs should work together and what constitutes a valid extension, and I didn't want the validator to take on the responsibility of correctly merging specs (I don't think this can be done at spec-loading time unless we add additional restrictions to the schema language). Perhaps in the future we can adjust the schema language to either avoid the need for this or make it possible to merge specs at load time.

I did quite a bit of code reading while working on this to understand how specs are loaded, how namespaces are managed, and how spec inheritance works. I did my best to avoid any side effects from this change, but I can't say for certain that I've avoided them entirely, so please look critically at the changes.

Note that this change will make it more difficult to raise warnings on extra fields, since we want to raise a warning only if a field isn't listed in either spec. I plan to approach that by creating a ValidationQueue where you can queue up specs to validate against a particular builder, run the validation only once all specs for that builder are queued, and then pass through only the extra-field warnings that are generated by both specs.
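A rough sketch of that idea (purely illustrative; none of this exists yet, and validate_fn stands in for the real per-spec validation):

```python
class ValidationQueue:
    """Collect all specs that apply to one builder, then validate them together."""

    def __init__(self, builder):
        self.builder = builder
        self.specs = []

    def queue(self, spec):
        self.specs.append(spec)

    def run(self, validate_fn):
        # validate_fn(builder, spec) -> (errors, extra_field_warnings)
        all_errors = []
        warning_sets = []
        for spec in self.specs:
            errors, warnings = validate_fn(self.builder, spec)
            all_errors.extend(errors)
            warning_sets.append(set(warnings))
        # Keep an extra-field warning only if every queued spec produced it,
        # i.e. the field is not listed in any of the applicable specs.
        common = set.intersection(*warning_sets) if warning_sets else set()
        return list(set(all_errors)) + list(common)
```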

Motivation

See #585

How to test the behavior?

Execute unit tests and see that the new tests pass.

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

* Fix hdmf-dev#585
* Ref hdmf-dev#542
* Builder can now be validated against more than one spec in order to validate against additional
  fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated
@dsleiter mentioned this pull request May 13, 2021

@dsleiter commented Jun 7, 2021

I'm currently exploring the errors in the pynwb CI. It looks like there are a number of pynwb specs that use inner spec extensions (as described above), which are now failing validation for one or more reasons.

I'm still exploring the errors, but at least some of them are caused by the validator not handling dataset reference vectors (shape is 1D). In this case, the DatasetBuilder has a BuilderH5ReferenceDataset in the data field.

I am thinking of breaking this fix into two parts:

  1. find a way to at least get the pynwb tests to all pass in this PR, even if that means skipping validation of BuilderH5ReferenceDataset data (it looks like they were never evaluated before this PR anyway)
  2. create a separate issue & PR for validating references

I'll change the PR to a draft until I get the pynwb tests resolved.

@dsleiter marked this pull request as draft June 7, 2021 09:52
* Ref hdmf-dev#585
* This is just a workaround for checking the data_type of
  BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach
@dsleiter commented Jun 9, 2021

I think I've determined the sources of all of the errors, and they all fall into 4 categories:

  1. A BuilderH5ReferenceDataset is passed to hdmf.validate.validator.get_type, which raises an error. As far as I can tell, this happens when a file read from disk contains a dataset with a reference dtype and shape (N,).
  2. A BuilderH5TableDataset is passed to hdmf.validate.validator.get_type, which raises an error. As far as I can tell, this happens when a file read from disk contains a dataset with a compound dtype.
  3. pynwb integration tests in integration/hdf5/test_ecephys.py that use the function TestElectricalSeriesIO.make_electrode_table fail correctly because the value for x in table.add_row is provided as an int while the schema expects a float.
  4. the nwb schema expects the filtering column of an electrodes table to be of type float32 but many tests provide a string, and NWBFile.add_electrode requires a string for filtering

I've added a workaround to the hdmf validation code that (I think) resolves 1 and 2. It would be good to get feedback from @rly and @ajtritt (and anyone else) on the implementation. In particular, is there another way you'd recommend to determine the type of BuilderH5ReferenceDataset and BuilderH5TableDataset objects? It seems to be an abstraction leak for the validator to reference classes tied to the hdf5 backend, so I'd be interested to know if there's another type up the MRO that I should check for instead.

Item 3 just requires a quick fix to a pynwb test; I will submit a PR.

Item 4, however, seems to be a concern. There appears to be a core disagreement between the nwb schema and the pynwb implementation that should be resolved, but it isn't clear to me which of the two should be changed, or whether there's another way to make them both work. Regardless, this seems to make the current PR a breaking change, and the disagreement between nwb-schema and pynwb should be resolved before this PR is released. Thoughts?

There are a lot of things going on here, and my interpretation of it all might not be right, so please correct me if anything above is incorrect!

dsleiter added a commit to agencyenterprise/pynwb that referenced this pull request Jun 9, 2021
* These unit tests will begin to fail after the update to the hdmf
  validator that increases validation coverage
* Ref hdmf-dev/hdmf#585
* Ref hdmf-dev/hdmf#609
* See discussion: hdmf-dev/hdmf#609 (comment)
@oruebel commented Jun 9, 2021

I believe both `BuilderH5ReferenceDataset` and `BuilderH5TableDataset` inherit from `hdmf.query.ReferenceResolver`, so it may be possible to use that for checking in the validator. @ajtritt @rly

@dsleiter

@oruebel that seems like it would work.

At first I thought that we might have to treat BuilderH5ReferenceDataset objects differently from BuilderH5TableDataset objects, but it looks like in either case we can check for ReferenceResolver and return the object's dtype.
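Something along these lines, as a simplified sketch (assuming, per the above, that both wrapper classes expose a dtype attribute):

```python
from hdmf.query import ReferenceResolver

def get_type(data):
    """Simplified sketch of the validator's type lookup."""
    if isinstance(data, ReferenceResolver):
        # Covers both BuilderH5ReferenceDataset and BuilderH5TableDataset
        # without importing HDF5-backend classes into the validator.
        return data.dtype
    if isinstance(data, str):
        return 'utf'
    raise ValueError("unrecognized type: '%s'" % type(data).__name__)
```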

From a conceptual standpoint, is that the appropriate abstraction? Does it make sense to consider a dataset with a composite dtype (which seems to be the case when there is a BuilderH5TableDataset based on this comment) as a Reference?

Also, as far as I can tell, these BuilderH5... objects are only instantiated when reading an existing file from disk and not when writing a newly created nwbfile object to disk. So there is a difference in builder graph when going from new container -> builder vs going from file -> builder. It would be beneficial for the validator and just conceptually easier if there were only one way to represent nwb data with a builder graph. Would it be possible and worthwhile to add another step while loading files to convert/resolve the BuilderH5... objects to the same builder objects you'd get when starting with a set of containers? Maybe this wouldn't be a priority, but if it is potentially beneficial, I can add an issue.

@oruebel commented Jun 10, 2021

Does it make sense to consider a dataset with a composite dtype (which seems to be the case when there is a BuilderH5TableDataset based on this comment) as a Reference?

If I remember correctly, composite types are only wrapped if the compound datatype contains a reference. I believe compound types that only contain numerical data are treated as regular datasets. @ajtritt is that correct?

Also, as far as I can tell, these BuilderH5... objects are only instantiated when reading an existing file from disk and not when writing a newly created nwbfile object to disk

This is because, for datasets that contain references, we need to resolve the references to the appropriate builder/container objects. To do this on read, we'd need to read all datasets that contain references into memory and resolve all the links. To support lazy read of datasets with references, the datasets are therefore wrapped on read.

So there is a difference in builder graph when going from new container -> builder vs going from file -> builder.

I'm not sure we can fully avoid this, because the source of the data is different on write and read. When writing, the data is organized in Container classes and usually lives in memory (or is retrieved iteratively with DataChunkIterators and/or wrapped with DataIO). On read, the data sits in an HDF5 file and we don't yet have Containers. Working with links is particularly tricky here. One possible approach that we had discussed at some point would be to always wrap all data arrays. However, the effort involved in doing this is substantial and would likely end up breaking a lot of things, which is why we have so far avoided going this route. This also came up during #480, where we now need to wrap string datasets since h5py 3 changed how strings are handled. I don't think this should cause issues for the validator, since StrDataset extends h5py.Dataset.

@dsleiter

If I remember correctly, composite types are only wrapped if the compound datatype contains a reference. I believe compound types that only contain numerical data are treated as regular datasets. @ajtritt is that correct?

Ah, ok, that makes sense, thanks for the clarification.

This is because, for datasets that contain references, we need to resolve the references to the appropriate builder/container objects. To do this on read, we'd need to read all datasets that contain references into memory and resolve all the links. To support lazy read of datasets with references, the datasets are therefore wrapped on read.

Gotcha, the need for lazy evaluation makes sense as to why it is like it is. Thanks.

I'm not sure we can fully avoid this, because the source of the data is different on write and read. When writing, the data is organized in Container classes and usually lives in memory (or is retrieved iteratively with DataChunkIterators and/or wrapped with DataIO). On read, the data sits in an HDF5 file and we don't yet have Containers. Working with links is particularly tricky here. One possible approach that we had discussed at some point would be to always wrap all data arrays. However, the effort involved in doing this is substantial and would likely end up breaking a lot of things, which is why we have so far avoided going this route. This also came up during #480, where we now need to wrap string datasets since h5py 3 changed how strings are handled. I don't think this should cause issues for the validator, since StrDataset extends h5py.Dataset.

Good to know you've been discussing all of this. I understand the motivation, and it seems like it'd be hard to get around having two different implementations. So for now, I'll go ahead with the solution we discussed and add unit tests.

The one final thing that needs a course of action before I can finalize this PR is this item:

  4. the nwb schema expects the filtering column of an electrodes table to be of type float32 but many tests provide a string, and NWBFile.add_electrode requires a string for filtering

@rly commented Jun 14, 2021

The one final thing that needs a course of action before I can finalize this PR is this item:

  4. the nwb schema expects the filtering column of an electrodes table to be of type float32 but many tests provide a string, and NWBFile.add_electrode requires a string for filtering

This is a bug in the NWB schema. This PR will fix it: NeurodataWithoutBorders/nwb-schema#478

@rly commented Jun 14, 2021

I will also make a corresponding PR in PyNWB.

* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix hdmf-dev#585
@dsleiter changed the base branch from dev to rc/3.0.0 June 17, 2021 07:10
@dsleiter marked this pull request as ready for review June 17, 2021 07:11
@dsleiter

Thanks for taking care of the electrode filtering issue @rly!

I believe this PR is now ready to go. I retargeted it to rc/3.0.0 because merging into dev before the nwb schema update and the corresponding pynwb PR (NeurodataWithoutBorders/pynwb#1366) are merged would cause errors in pynwb, but let me know if you'd prefer it to merge into dev.

@oruebel left a comment

@dsleiter thanks for all the hard work on these fixes to the validator! Looks good to me.

@rly commented Jun 25, 2021

LGTM. Thanks for all the hard work on this!

@rly merged commit b93cde4 into hdmf-dev:rc/3.0.0 Jun 25, 2021
@dsleiter deleted the bug/validate_extended_inc_data_types branch June 25, 2021 01:49
@dsleiter

Awesome, thanks for the support in sorting through all of the details here!

rly added a commit that referenced this pull request Jul 6, 2021
Validate builders against both top level data type specs and inner specs (#609)

* Validate builders against both top level data type specs and inner specs

* Fix #585
* Ref #542
* Builder can now be validated against more than one spec in order to validate against additional
  fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated

* Update changelog

* Ref #585

* Fix pynwb validation errors related to reference and compound data types

* Ref #585
* This is just a workaround for checking the data_type of
  BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach

* Remove validator reference to H5-specific classes and add unit tests

* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix #585

* Update tests/unit/validator_tests/test_validate.py

* Update tests/unit/validator_tests/test_validate.py

Co-authored-by: Ryan Ly <rly@lbl.gov>
Successfully merging this pull request may close these issues:

Validator should allow extensions to data_types which only define data_type_inc