
Validator should allow extensions to data_types which only define data_type_inc #585

Closed
5 tasks done
dsleiter opened this issue Apr 22, 2021 · 10 comments · Fixed by #609
Labels: category: bug (errors in the code or code behavior)

Comments

@dsleiter (Contributor)

Description

This comes out of the schema discussion here: hdmf-dev/hdmf-schema-language#13
in the context of how to handle extra fields #542

If a spec group/dataset is defined with only a data_type_inc, it should be allowed to extend the data_type without defining a new data type via data_type_def. For example, a group can contain a dataset that both includes (inherits from) VectorData and adds a new attribute, without a data_type_def:

- data_type_def: MyTable
  datasets:
  - name: new_column
    data_type_inc: VectorData
    attributes:
    - name: new_attr
      dtype: text
      doc: a new attribute

The validator should allow for these circumstances and validate against all changes/extensions to the original data_type.

Implementation notes:

The current validator creates a map of specs/validators (ValidatorMap) keyed by the defined data_type (via data_type_def). When validating a builder against the above spec, it validates new_column against the original spec for VectorData rather than the modified version above, and therefore does not return a MissingError when the attribute is missing. Furthermore, as of #542, an ExtraFieldWarning will be returned saying that new_attr is not part of the spec.

The validator will need to keep track of where it is in the spec tree rather than relying on specific data types. Perhaps the ValidatorMap will only be used for base-level data_types.
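To illustrate the problem, here is a minimal, hypothetical sketch of a type-keyed validator map (illustrative names only, not the actual hdmf implementation): because lookup happens by type name alone, an inline extension of an included type is invisible at validation time.

```python
# Hypothetical sketch of a validator map keyed by data_type_def.
# Names and structures are illustrative, not the real hdmf API.

# Registered specs, keyed by defined type name.
validator_map = {
    "VectorData": {"required_attributes": []},  # base type: no attributes
    "MyTable": {"required_attributes": []},
}

# The inline spec inside MyTable extends VectorData with new_attr, but a
# lookup by type name only ever sees the base spec above.
inline_spec = {"data_type_inc": "VectorData",
               "required_attributes": ["new_attr"]}

def validate(builder_type, builder_attrs):
    """Return names of required attributes missing from the builder."""
    spec = validator_map[builder_type]  # base spec only; inline_spec ignored
    return [a for a in spec["required_attributes"] if a not in builder_attrs]

# A VectorData builder without new_attr passes, even though the inline
# spec in MyTable requires it -- this is the bug described above.
print(validate("VectorData", {}))  # -> []
```

The sketch shows why keying exclusively on data_type_def loses information: the inline extension never participates in the lookup.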

Steps to Reproduce

The following should return a MissingError because the dataset builder does not have the new_attr attribute.

from hdmf.build import GroupBuilder, DatasetBuilder
from hdmf.spec import GroupSpec, SpecCatalog, SpecNamespace, DatasetSpec, AttributeSpec
from hdmf.validate import ValidatorMap

spec_catalog = SpecCatalog()
dataset_spec = DatasetSpec(doc='A test dataset', data_type_def='Bar')
group_spec = GroupSpec(
    doc='A test group',
    data_type_def='Foo',
    datasets=[
        DatasetSpec(
            'A modified Bar',
            data_type_inc='Bar',
            attributes=[
                AttributeSpec(name='new_attr', doc='An extra attribute', dtype='text')
            ]
        )
    ]
)
spec_catalog.register_spec(dataset_spec, 'test.yaml')
spec_catalog.register_spec(group_spec, 'test.yaml')
namespace = SpecNamespace('a test namespace', 'test_core', [{'source': 'test.yaml'}],
                          version='0.1.0', catalog=spec_catalog)
vmap = ValidatorMap(namespace)

builder = GroupBuilder('foo', attributes={'data_type': 'Foo'},
                       datasets=[DatasetBuilder('bar', attributes={'data_type': 'Bar'})])
result = vmap.validate(builder)
print(result)

Furthermore, once #542 is re-merged, if the new_attr attribute is present, the validator should not return an ExtraFieldWarning.

Environment

Python Executable: Conda
Python Version: Python 3.8
Operating System: Linux
HDMF Version: 2.4.0

Checklist

  • Have you ensured the bug was not already reported?
  • Have you included a brief and descriptive title?
  • Have you included a clear description of the problem you are trying to solve?
  • Have you included a minimal code snippet that reproduces the issue you are encountering?
  • Have you checked our Contributing document?
dsleiter added the category: bug label on Apr 22, 2021
@dsleiter (Contributor, Author)

@rly @oruebel is there any difference in what types of extensions should be allowed when a spec defines data_type_inc but not data_type_def vs when it defines both?

@oruebel (Contributor) commented Apr 22, 2021

I think additions (i.e., new attributes, datasets, etc.) and restrictions (i.e., defining a more specific datatype or shape for a dataset that are compliant with the parent spec) should be allowed. Changes to existing fields that violate the parent schema should not be allowed, however.

For the validator, I think it could be useful if this check were configurable behavior (e.g., --warn-on-type-modifications), i.e., let this pass by default and allow a user to enable stricter checking that warns if a data_type is being modified without creating a data_type_def. This may not be necessary for validating data files, but it would be useful for a schema validator rather than a data validator.

@dsleiter (Contributor, Author)

Thinking further about this, it seems like this would essentially allow for diamond inheritance.

Example: imagine groups A and B, where B inherits from A. Then you have another group Foo which has a child group of type A to which it adds an attribute. What validation should be done if a Foo builder has a child group of type B?

Example spec:

- data_type_def: A
  doc: group of type A
- data_type_def: B
  data_type_inc: A
  doc: group of type B
  attributes:
  - name: attr_b
    dtype: text
    doc: attribute of type B
- data_type_def: Foo
  doc: an example group
  groups:
  - data_type_inc: A
    doc: an extended A group
    attributes:
    - name: attr_a_ext
      dtype: text
      doc: attribute of type A

Now we have a GroupBuilder of type Foo which has a child-GroupBuilder of type B. What attributes should be expected on B? It seems to me that the most consistent expectation is that it should have both attr_b and attr_a_ext, do you agree @rly and @oruebel ?

What should happen now if B and the extended A define an attribute named attr_conflict but with a different dtype? The spec is valid because attr_conflict is an addition to A in both cases, but there doesn't seem to be any way to satisfy the spec when passing a subgroup of type B to Foo.
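The conflict can be made concrete with a small, purely illustrative satisfiability check (not hdmf code): if two independently-added requirements assign different dtypes to the same attribute name, no builder value can satisfy both.

```python
# Hypothetical sketch: B and the extended A each add an attribute named
# "attr_conflict", but with different dtypes. No single value can satisfy
# both requirements at once. Names are illustrative only.

requirements_from_B = {"attr_conflict": "text"}
requirements_from_extended_A = {"attr_conflict": "int"}

def satisfiable(*requirement_sets):
    """True iff every attribute name maps to one consistent dtype."""
    merged = {}
    for reqs in requirement_sets:
        for name, dtype in reqs.items():
            if name in merged and merged[name] != dtype:
                return False  # conflicting dtypes for the same attribute
            merged[name] = dtype
    return True

print(satisfiable(requirements_from_B, requirements_from_extended_A))  # -> False
```

Each addition is individually valid against A, yet their combination is unsatisfiable, which is exactly the edge case raised above.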

@dsleiter (Contributor, Author)

I think additions (i.e., new attributes, datasets, etc.) and restrictions (i.e., defining a more specific datatype or shape for a dataset that are compliant with the parent spec) should be allowed. Changes to existing fields that violate the parent schema should not be allowed, however.

Makes sense. Is this not the case when data_type_def is used? i.e. could a spec change the dtype of a dataset when using data_type_inc as long as it includes data_type_def?

For the validator, I think it could be useful if this check would be a configurable behavior (e.g., --warn-on-type-modifications). I.e., by default let this pass and allow a user to enable more strict checking to warn if a data_type is being modified without creating a data_type_def. Although for the validation of data files this may not be necessary, but is something that would be useful for schema validator not a data validator.

Ok, I'll keep that in mind as I look for an approach.

@oruebel (Contributor) commented Apr 22, 2021

i.e. could a spec change the dtype of a dataset when using data_type_inc as long as it includes data_type_def?

I believe the same restrictions apply with data_type_def, i.e., you can't make changes that violate the parent schema.

@oruebel (Contributor) commented Apr 22, 2021

Thinking further about this, it seems like this would essentially allow for diamond inheritance.

I'm not sure that this example shows diamond inheritance, but I agree that this is a tricky case in that you modify a type but still want to allow other subtypes with the same modifications. I don't think this case occurs in any current schema, so I think it should be fine not to allow subtypes of A to be used if A is being modified. I believe this behavior is identical to what would happen if you declared a new data_type_def C in Foo (instead of just modifying A): once you create a subtype C of A, you could no longer use B inside Foo because it does not inherit from the new type.

Ideally, we would not allow these kinds of modifications without declaring a new type, but unfortunately that would break a number of existing parts of NWB.

@dsleiter (Contributor, Author)

@oruebel thanks for the discussion. I'm diving into this deeper, and am uncovering even more questions. I'll lay them out below.


Schema vs builder validator

Do you agree that the builder validator should take whatever spec it is given to be the source of truth, that any problems with the spec should be caught by a schema validator, and that if there is a problem with the spec that the builder validator behavior is undefined?

e.g., if a dataset of type A has a dtype of int, and a dataset of type B which inherits from A has a dtype of text, this should be a schema validation error and we don't need to specify behavior for the builder validator.

Does a schema validator already exist?


Validation of inherited requirements

It looks like the current validator doesn't validate against inherited requirements.

e.g. I would expect that the following should return a validation error because attribute foo is not present on the builder of type G2.

from hdmf.build import GroupBuilder
from hdmf.spec import GroupSpec, SpecCatalog, SpecNamespace, AttributeSpec
from hdmf.validate import ValidatorMap

spec_catalog = SpecCatalog()
g1_spec = GroupSpec(doc='g1', data_type_def='G1',
                    attributes=[AttributeSpec(name='foo', doc='an attribute', dtype='text')])
g2_spec = GroupSpec(doc='g2', data_type_def='G2', data_type_inc='G1')

spec_catalog.register_spec(g1_spec, 'test.yaml')
spec_catalog.register_spec(g2_spec, 'test.yaml')
namespace = SpecNamespace('a test namespace', 'test_core', [{'source': 'test.yaml'}],
                          version='0.1.0', catalog=spec_catalog)
vmap = ValidatorMap(namespace)
builder = GroupBuilder('bar', attributes={'data_type': 'G2'})

result = vmap.validate(builder)
print(result)

But no error is returned. Is this expected behavior?
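For context, inherited requirements could in principle be resolved by walking the data_type_inc chain before validating. A minimal sketch of that idea (illustrative names only, not the hdmf internals):

```python
# Hypothetical sketch of resolving inherited requirements by walking the
# inheritance (data_type_inc) chain; not the actual hdmf resolution logic.

specs = {
    "G1": {"parent": None, "required_attributes": ["foo"]},
    "G2": {"parent": "G1", "required_attributes": []},
}

def resolved_attributes(type_name):
    """Collect required attributes from a type and all its ancestors."""
    attrs = []
    while type_name is not None:
        spec = specs[type_name]
        attrs.extend(spec["required_attributes"])
        type_name = spec["parent"]
    return attrs

def missing_attributes(type_name, builder_attrs):
    """Report inherited required attributes absent from the builder."""
    return [a for a in resolved_attributes(type_name)
            if a not in builder_attrs]

# A G2 builder without 'foo' would be reported as missing it:
print(missing_attributes("G2", {}))  # -> ['foo']
```

Under this scheme the G2 builder in the snippet above would produce the expected MissingError for foo.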


Two approaches

I've been thinking about two approaches to resolving this issue:

  1. treat extended data types without data_type_def as anonymous types in that they have a single well-defined spec (and data type inheritance tree) but without an explicit name. The validator will need to know how to combine the specs together into a single spec, and then validate any builders against that anonymous spec.
  2. treat the base data type spec and the extended spec as separate specs and validate a builder against both during validation

At first, I was thinking of (1) (and that's where I think the diamond inheritance question arises), but now I think (2) would be less complex.

(2) is nice because it:

  • avoids the complexity of trying to merge specs together
  • avoids logic needed to distinguish between providing a fixed name to a child (which doesn't extend a data type) and adding an attribute (which does extend a data type)

(2) could also be extended to resolve the issue with validating inherited requirements by simply validating against the entire inheritance tree, so in that case the builder would be validated against both G2 and G1. Furthermore, this could also handle the discussion above about providing a child data type in a location where the spec extends the parent (e.g., in the example above, the builder would be validated against the specs for A and B and the extension of A; it would be impossible to satisfy the spec with a builder of type B, but we wouldn't have to create a special rule about providing a child data type in place of a parent with extended rules).
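A minimal sketch of approach (2), with illustrative names only (this is not the hdmf implementation): validate one builder against every applicable spec and pool the errors.

```python
# Hypothetical sketch of approach (2): validate a builder against several
# applicable specs (base type, ancestors, inline extension) and pool errors.

def validate_against(spec, builder_attrs):
    """Errors from validating a builder against one spec."""
    return [f"missing attribute '{a}'"
            for a in spec["required_attributes"] if a not in builder_attrs]

def validate_all(specs, builder_attrs):
    """Concatenate errors from every applicable spec."""
    errors = []
    for spec in specs:
        errors.extend(validate_against(spec, builder_attrs))
    # Drop duplicates while preserving order (comparable errors help here).
    return list(dict.fromkeys(errors))

base_spec = {"required_attributes": []}            # e.g. a base type
inline_extension = {"required_attributes": ["new_attr"]}

print(validate_all([base_spec, inline_extension], {}))
# -> ["missing attribute 'new_attr'"]
```

The deduplication step reflects the point made later in the linked PR: making errors comparable allows removing duplicates generated when the same requirement appears in more than one spec.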

Downsides with (2) are:

  • it will be more difficult to clearly communicate to the user the source of the error (e.g., the user provided a builder of type B, but the error comes from validation against an extension of type A).
  • the validator will now need to manage the list of specs to validate against for each builder

(1) would require defining and implementing rules for merging specs, but would then allow each data type to be validated against just one spec.
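For comparison, approach (1) might look like the following sketch (again illustrative names only, under the assumption that merging reduces to unioning requirements): combine the base spec and the inline extension into one anonymous spec first, then validate once.

```python
# Hypothetical sketch of approach (1): merge a base spec and an inline
# extension into a single anonymous spec, then validate against the result.

def merge_specs(base, extension):
    """Union the required attributes of a base spec and an extension."""
    merged = dict(base)
    merged["required_attributes"] = (
        base["required_attributes"]
        + [a for a in extension["required_attributes"]
           if a not in base["required_attributes"]]
    )
    return merged

base_spec = {"required_attributes": ["description"]}
inline_extension = {"required_attributes": ["new_attr"]}

anonymous_spec = merge_specs(base_spec, inline_extension)
print(anonymous_spec["required_attributes"])  # -> ['description', 'new_attr']
```

Even this toy merge hints at the complexity: real specs would need merge rules for dtypes, shapes, quantities, and nested groups, not just attribute names.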

Do you have any additional thoughts on the approach and tradeoffs?

@dsleiter (Contributor, Author)

Ideally, we would not allow these kind of modifications without declaring a new type, but unfortunately that would break a number of existing parts of NWB.

Yeah, no worries, I understand why we need to make it work here. :-)

Is there any possibility of explicitly forbidding extension of data types without defining a new data type in the next major version of the hdmf-schema-language? I understand that would take extra work to update the existing specs to be in compliance (and new major release versions for those), but it also seems like untyped extensions create a larger surface area for bugs and for users to do unexpected things.

As a recent new spec creator, I can say that even after doing my best to read the documentation, it wasn't always clear what was the right way to go about defining a spec, and I often used the existing specs from hdmf-common-schema and nwb-schema as examples. So even if we need to support old spec versions in parts of hdmf, it might be good in the long-term to prevent future specs from being able to extend without defining a type.

@oruebel (Contributor) commented Apr 28, 2021

Does a schema validator already exist?

Only in the sense that there is a JSON schema that you can use to validate, and that the schema can be read by the spec reader. That, however, only checks compliance with the schema language itself and does not validate inheritance rules. Having a true spec validator that also checks compliance with a number of additional rules would be very useful, e.g., to check:

  • Are types always defined at the root (nesting specs can be problematic)
  • Are modifications in inherited types compliant with the base type
  • etc.

It looks like the current validator doesn't validate against inherited requirements.

This sounds like a bug. @rly thoughts?

I've been thinking about two approaches to resolving this issue:

I have to admit, this is a non-trivial issue. I'm not sure what the best solution for this is right now.

it might be good in the long-term to prevent future specs from being able to extend without defining a type.

I agree, that would be useful. One way to get at least partially there is to create a SpecValidator that we can set in the constructor of the spec classes (e.g., GroupSpec) to optionally validate the spec on init and each time it is modified.
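A rough sketch of what such a hook could look like (all class names, rules, and signatures here are hypothetical, not an existing hdmf API):

```python
# Hypothetical sketch of the SpecValidator idea: a validator object passed
# to a spec's constructor that re-checks the spec on every modification.

class SpecValidationError(ValueError):
    pass

class SpecValidator:
    def validate(self, spec):
        # Example rule: extending an included type requires an explicit def.
        if (spec.get("data_type_inc") and not spec.get("data_type_def")
                and spec.get("attributes")):
            raise SpecValidationError(
                "extension of an included type without data_type_def")

class GroupSpecSketch:
    def __init__(self, validator=None, **fields):
        self._validator = validator
        self._fields = fields
        self._check()  # validate on init

    def _check(self):
        if self._validator is not None:
            self._validator.validate(self._fields)

    def set(self, key, value):
        self._fields[key] = value
        self._check()  # re-validate on each modification

# With a validator set, an untyped extension is rejected at construction:
try:
    GroupSpecSketch(validator=SpecValidator(),
                    data_type_inc="A", attributes=[{"name": "x"}])
except SpecValidationError as e:
    print("rejected:", e)
```

Making the validator optional, as sketched, would let new schemas opt in to strict checking without breaking existing NWB specs.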

Is there any possibility of explicitly forbidding extension of data types without defining a new data type in the next major version of the hdmf-schema-language?

The problem is less with the next major version of the language as it is with breaking NWB. I'm not against the idea, but I'm not sure right now how to do this without causing chaos.

There are a number of complicated issues here, and I think it may be best to plan a video chat at some point to discuss options.

dsleiter added a commit to agencyenterprise/hdmf that referenced this issue May 13, 2021
* Fix hdmf-dev#585
* Ref hdmf-dev#542
* Builder can now be validated against more than one spec in order to validate against additional
  fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated
dsleiter added a commit to agencyenterprise/hdmf that referenced this issue May 13, 2021
dsleiter added a commit to agencyenterprise/hdmf that referenced this issue Jun 8, 2021
* Ref hdmf-dev#585
* This is just a workaround for checking the data_type of
  BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach
dsleiter added a commit to agencyenterprise/pynwb that referenced this issue Jun 9, 2021
* These unit tests will begin to fail after the update to the hdmf
  validator that increases validation coverage
* Ref hdmf-dev/hdmf#585
* Ref hdmf-dev/hdmf#609
* See discussion: hdmf-dev/hdmf#609 (comment)
dsleiter added a commit to agencyenterprise/hdmf that referenced this issue Jun 17, 2021
* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix hdmf-dev#585
rly added a commit that referenced this issue Jun 25, 2021
…ecs (#609)

* Validate builders against both top level data type specs and inner specs

* Fix #585
* Ref #542
* Builder can now be validated against more than one spec in order to validate against additional
  fields added to inner data types
* Also make validation Errors comparable as a way to remove duplicates that can sometimes be generated

* Update changelog

* Ref #585

* Fix pynwb validation errors related to reference and compound data types

* Ref #585
* This is just a workaround for checking the data_type of
  BuilderH5ReferenceDataset and BuilderH5TableDataset objects
* Plan to add unit tests after some discussion to validate the approach

* Remove validator reference to H5-specific classes and add unit tests

* use ReferenceResolver instead of referencing BuilderH5ReferenceDataset or BuilderH5TableDataset
* Fix #585

* Update tests/unit/validator_tests/test_validate.py

* Update tests/unit/validator_tests/test_validate.py

Co-authored-by: Ryan Ly <rly@lbl.gov>
@dsleiter (Contributor, Author)

Closing this via #609 (I think it wasn't automatically closed because the merge was to rc/3.0.0)

rly added a commit that referenced this issue Jul 6, 2021
…ecs (#609)