
Modify dset/attr builders based on sidecar JSON #677

Draft · wants to merge 27 commits into base: dev

Conversation

rly
Contributor

@rly rly commented Nov 11, 2021

Motivation

Fix #676.

This is DEMO code to demonstrate how HDMF might support reading file modifications from a sidecar JSON. I made up the formatting/schema of the JSON file; it would look like the example below.

In this setup, the JSON file has a list of versions, each of which has a label, description, and list of changes.
Each change specifies an object id, a relative path to the dataset/attribute being changed from the group/dataset with that object id, and the new value, which can be a scalar, list, or nested list.

The builder values are replaced after the file is fully built by HDF5IO.

Lots of details to be worked out (changing data types? changing shape? compound dtypes?). Let me know what you think of this approach. @oruebel @ajtritt @bendichter

{
    "versions": [
        {
            "label": "version 2",
            "description": "change attr1 from 'old' to 'my experiment' and my_data from [1, 2, 3] to [4, 5, 6, 7]",
            "changes": [
                {
                    "object_id": "3350e602-5073-4bd8-b835-2c771e8b11f9",
                    "relative_path": "attr1",
                    "new_value": "my experiment"
                },
                {
                    "object_id": "3350e602-5073-4bd8-b835-2c771e8b11f9",
                    "relative_path": "my_data",
                    "new_value": [
                        4,
                        5,
                        6,
                        7
                    ]
                }
            ]
        },
        {
            "label": "version 3",
            "description": "change sub_foo/my_data from [-1, -2, -3] to [[0]]",
            "changes": [
                {
                    "object_id": "a0eebdae-5c13-455c-bcf6-2e5b630a3079",
                    "relative_path": "my_data",
                    "new_value": [
                        [
                            0
                        ]
                    ]
                }
            ]
        }
    ]
}
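The replacement step described above (resolve each change's `object_id` and `relative_path` against the built file, then swap in the new value) could be sketched roughly as follows. This is illustrative only, not HDMF's actual API: builders are simplified to nested dicts, and `index_by_object_id`/`apply_changes` are hypothetical names.

```python
# Hypothetical sketch of the post-build replacement step. Builders are
# simplified to nested dicts; real code would walk GroupBuilder objects.

def index_by_object_id(builder, index=None):
    """Walk the builder tree and map each object_id to its node."""
    if index is None:
        index = {}
    if builder.get("object_id") is not None:
        index[builder["object_id"]] = builder
    for child in builder.get("children", {}).values():
        index_by_object_id(child, index)
    return index

def apply_changes(file_builder, sidecar):
    """Apply every version's changes in order, after the file is built."""
    index = index_by_object_id(file_builder)
    for version in sidecar["versions"]:
        for change in version["changes"]:
            node = index[change["object_id"]]
            # relative_path names a dataset/attribute under the node
            node.setdefault("values", {})[change["relative_path"]] = change["new_value"]
    return file_builder
```

Applying versions in order means later versions naturally overwrite earlier ones when they touch the same path.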

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

@codecov

codecov bot commented Nov 11, 2021

Codecov Report

Base: 87.65% // Head: 87.18% // Decreases project coverage by 0.47% ⚠️

Coverage data is based on head (fee5245) compared to base (61eec5c).
Patch coverage: 69.47% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #677      +/-   ##
==========================================
- Coverage   87.65%   87.18%   -0.48%     
==========================================
  Files          44       45       +1     
  Lines        8845     9082     +237     
  Branches     2051     2108      +57     
==========================================
+ Hits         7753     7918     +165     
- Misses        777      829      +52     
- Partials      315      335      +20     
Impacted Files Coverage Δ
src/hdmf/spec/spec.py 90.86% <ø> (ø)
src/hdmf/build/builders.py 94.33% <62.79%> (-5.67%) ⬇️
src/hdmf/backends/builderupdater.py 66.66% <66.66%> (ø)
src/hdmf/backends/__init__.py 100.00% <100.00%> (ø)
src/hdmf/backends/io.py 96.49% <100.00%> (+0.26%) ⬆️
src/hdmf/build/__init__.py 100.00% <100.00%> (ø)
src/hdmf/build/errors.py 100.00% <100.00%> (ø)
src/hdmf/build/objectmapper.py 93.12% <100.00%> (+0.39%) ⬆️
src/hdmf/backends/hdf5/h5tools.py 82.35% <0.00%> (+0.22%) ⬆️
... and 1 more


@oruebel
Contributor

oruebel commented Nov 11, 2021

Let me know what you think of this approach.

Generally, this approach makes sense to me for updates on read.

Sidecar file schema

  • The general content in your example looks good.
  • I think the label of the version should require semantic versioning. In the API, a user should then only be able to indicate whether the changes are a major, minor, or patch version bump, and the API would set the version automatically. Users can add their own comments in the description.
  • We should look at PROV to see whether following the PROV-JSON schema for describing these changes would work. This would make this part of the sidecar file idea more accessible to web standards and PROV tools, and potentially help with maintaining this functionality if we can use PROV tools. This library may be of interest: https://prov.readthedocs.io/en/latest/readme.html. It may very well be that this is not suitable here, but I think it is worth taking a look at.
  • We should require an ISO 8601 datetime stamp to be included with each version.
  • Even if we don't use PROV, it will be useful to have a key for an "agent" to record who made the changes related to a particular version.
  • new_value should be renamed to simply value
  • We should include a dtype field here. JSON doesn't know the difference between int16 and int32, so we'll need to cast the data to correct type after read. This would also help address your question of "changing dtype" as this would mean specifying the dtype in the change but not the value.
  • Similarly to dtype, you could also allow a key shape in the changes section (although I'm not sure it is necessary unless you have to reshape the dataset, and I'm not sure what the use case for that is right now).
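To make the dtype point concrete, a cast on read might look like this (a sketch; the change-record shape follows the example above and is not a settled schema):

```python
# Illustrative only: JSON numbers come back as plain Python int/float,
# so a "dtype" key in each change record would let the reader restore
# the intended width instead of the platform default. Not HDMF code.
import json
import numpy as np

change = json.loads('{"relative_path": "my_data", "value": [4, 5], "dtype": "int16"}')
data = np.asarray(change["value"], dtype=np.dtype(change["dtype"]))
print(data.dtype)  # int16
```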

Questions

  • Updating datasets with object_references and probably also updating compound_datasets is not supported yet in the current design, correct?
  • You mentioned "changing data types? changing shape?" I'm not sure what the use case for these operations is. The shape and dtype of the data are dictated by the schema, so why would we reshape or cast a dataset in a valid NWB file?

@rly
Contributor Author

rly commented Nov 11, 2021

Let me know what you think of this approach.

Generally, this approach makes sense to me for updates on read.

Sidecar file schema

  • The general content in your example looks good.
  • I think the label of the version should require semantic versioning. In the API, a user should then only be able to indicate whether the changes are a major, minor, or patch version bump, and the API would set the version automatically. Users can add their own comments in the description.

How is semantic versioning defined for data? What constitutes a major, minor, or patch version bump?

  • We should look at PROV to see whether following the PROV-JSON schema for describing these changes would work. This would make this part of the sidecar file idea more accessible to web standards and PROV tools, and potentially help with maintaining this functionality if we can use PROV tools. This library may be of interest: prov.readthedocs.io/en/latest/readme.html. It may very well be that this is not suitable here, but I think it is worth taking a look at.

It's worth noting that if we go with PROV, there will be a tradeoff between having the changes be accessible using PROV tools and human-readability/simplicity. Is "a user wants to edit this file by hand" a primary user story that we want to support?

  • We should require an ISO 8601 datetime stamp to be included with each version.

👍

  • Even if we don't use PROV, it will be useful to have a key for an "agent" to record who made the changes related to a particular version

👍

  • new_value should be renamed to simply value

👍

  • We should include a dtype field here. JSON doesn't know the difference between int16 and int32, so we'll need to cast the data to correct type after read. This would also help address your question of "changing dtype" as this would mean specifying the dtype in the change but not the value.

👍

  • Similarly to dtype, you could also allow a key shape in the changes section (although I'm not sure it is necessary unless you have to reshape the dataset, and I'm not sure what the use case for that is right now).

👍

Questions

  • Updating datasets with object_references and probably also updating compound_datasets is not supported yet in the current design, correct?
  • You mentioned "changing data types? changing shape?" I'm not sure what the use case for these operations is. The shape and dtype of the data are dictated by the schema, so why would we reshape or cast a dataset in a valid NWB file?

Some type specs allow a choice from multiple shapes and data types, e.g. TimeSeries.

@oruebel
Contributor

oruebel commented Nov 11, 2021

It's worth noting that if we go with PROV, there will be a tradeoff between having the changes be accessible using PROV tools and human-readability/simplicity.

I agree that this is a concern. Understanding PROV is not trivial. Another main concern is that I believe the PROV data model assumes that entities are immutable, i.e., all changes to an entity (e.g., a dataset or attribute in our case) would be assumed to be represented by the creation of a new entity. In this case, however, we want to update the entity rather than create a new one. The following paper may be relevant for this topic: https://link.springer.com/chapter/10.1007/978-3-319-98379-0_7

Ultimately, PROV may or may not be the right approach for this, but I think it's worth spending a little bit of time looking at it to see if it makes sense. If we decide against it, then at least we have a clear answer for folks as to why we chose not to use it.

Some type specs allow a choice from multiple shapes and data types, e.g. TimeSeries.

Correct, but I don't understand why one would need to change the shape of the data after the fact (unless you are replacing the values altogether).

How is semantic versioning defined for data?

I think semantic versioning can be applied fairly straightforwardly here:

Given a version number MAJOR.MINOR.PATCH, increment the:

  1. MAJOR version when you make incompatible data changes, i.e., when adding a new main data type (e.g., a new ElectricalSeries) or when changing the dtype or shape of an object.
  2. MINOR version when you add functionality in a backwards-compatible manner. Here this would mean, e.g., adding columns to a DynamicTable.
  3. PATCH version when you make backwards-compatible bug fixes, i.e., update the value of an existing dataset or attribute (without changing shape or dtype).

Since we have a data schema, I think it may be possible to codify this to automatically determine version numbers based on the type of changes made. Providing some strict versioning rules from the start I think will make things easier down the line.
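If codified, the bump logic might look something like this sketch (the change-type labels here are assumptions for illustration only; they are not part of any HDMF schema or API):

```python
# Rough codification of the bump rules above. Change-type labels are
# made up for illustration.

MAJOR_CHANGES = {"add_type", "change_dtype", "change_shape"}   # incompatible data changes
MINOR_CHANGES = {"add_column"}                                 # backwards-compatible additions

def bump_version(version, change_types):
    """Return the next MAJOR.MINOR.PATCH version for a set of changes."""
    major, minor, patch = (int(part) for part in version.split("."))
    if MAJOR_CHANGES & set(change_types):
        return f"{major + 1}.0.0"
    if MINOR_CHANGES & set(change_types):
        return f"{major}.{minor + 1}.0"
    # anything else is a value-only fix
    return f"{major}.{minor}.{patch + 1}"
```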

This being said, I think it makes sense to have both a user-defined label key and a separate version key for a version number. I.e., for now, I would suggest leaving the label as is and just adding a version key that requires a semantic versioning string (which initially could just be set by a user and then as we make progress can be codified further).

Question: Which changes do we allow via the sidecar file?
I think (at least for now) we should only allow replacing datasets and attributes, while changes to individual values or adding new containers should be done in the HDF5 file. This means the following cases would not be supported (at least in the first iteration of this) but are modifications that would only be possible via the APIs by modifying the HDF5 file:

- Adding rows to a dynamic table
- Adding a new recording or processing result (i.e., add a new neurodata_type)
- Adding a new group
- Updating datasets with object_references
- Updating datasets with compound data type
- ....

You could still record those changes as a version in the sidecar file (essentially by having a record with an empty changes key), so you would still have a record of the changes, but the changes would then not be reversible.

Question: How do we envision the sidecar file being created in the API?
I think it would be worth having a separate discussion about this via Zoom to sketch out in pseudocode how this would look. My current thinking is that we would have a separate class to interact with sidecar files, which would serve as a "low-level" interface for making modifications to a sidecar file directly. Container classes would then need to provide high-level options to make changes, e.g., when changing an attribute value, those changes would be routed to the sidecar file (either directly or when calling write, depending on the change).

Question: How to deal with larger changes?
I think for now we probably only need to worry about smaller datasets, but it may be worth considering how we want to deal with larger changes as well. Instead of recording new values directly in the sidecar file, I could imagine that we may want to record changes in separate HDF5 files and load the values from there. This would not just allow for more compact storage of binary data; it would also make dealing with compound data types and object references, and adding new containers (without modifying the main file), possible.

@bendichter
Contributor

What if you want to remove values? E.g., what if you have the age of a subject and then realize that this age is incorrect, but you don't know what the correct value is?

@rly
Contributor Author

rly commented Nov 30, 2021

What if you want to remove values? E.g., what if you have the age of a subject and then realize that this age is incorrect, but you don't know what the correct value is?

You would set value: null

@rly
Contributor Author

rly commented Dec 1, 2021

Latest example JSON:

{
    "versions": [
        {
            "label": "2.0.0",
            "description": "change attr1 from 'old' to 'my experiment' and my_data from [1, 2, 3] to [4, 5]",
            "datetime": "2020-10-29T19:15:15.789Z",
            "agent": "John Doe",
            "changes": [
                {
                    "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
                    "relative_path": "attr1",
                    "value": "my experiment"
                },
                {
                    "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
                    "relative_path": "my_data",
                    "value": [
                        4,
                        5
                    ],
                    "dtype": "int32"
                }
            ]
        },
        {
            "label": "3.0.0",
            "description": "change sub_foo/my_data from [-1, -2, -3] to [[0]], delete my_data/attr2, and change dtype of my_data",
            "datetime": "2021-11-30T20:16:16.790Z",
            "agent": "Jane Doe",
            "changes": [
                {
                    "object_id": "26b95d5a-b632-407d-921a-8e255752b0f7",
                    "relative_path": "my_data",
                    "value": [
                        [
                            0
                        ]
                    ]
                },
                {
                    "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
                    "relative_path": "my_data/attr2",
                    "value": null
                },
                {
                    "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
                    "relative_path": "my_data",
                    "value": [
                        6,
                        7
                    ],
                    "dtype": "int8"
                }
            ]
        },
        {
            "label": "3.0.1",
            "description": "change my_data from [4, 5] to [6, 7]",
            "datetime": "2021-11-30T20:17:16.790Z",
            "agent": "Jane Doe",
            "changes": [
                {
                    "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
                    "relative_path": "my_data",
                    "value": [
                        6,
                        7
                    ]
                }
            ]
        }
    ],
    "schema_version": "0.1.0"
}

@rly
Contributor Author

rly commented Dec 1, 2021

TODO:

  • create jsonschema for sidecar JSON
  • validate sidecar JSON before read
  • set new dtype correctly
  • create documentation
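For the validation TODO, a minimal pre-read check might look like the sketch below (stdlib only; a real implementation would presumably validate against `sidecar.schema.json` with the `jsonschema` package, and the required keys here are only inferred from the examples in this thread):

```python
# Stdlib-only sketch of "validate sidecar JSON before read".
# Required keys are guesses based on the example JSON in this PR.
import json

REQUIRED_VERSION_KEYS = {"label", "description", "datetime", "agent", "changes"}
REQUIRED_CHANGE_KEYS = {"object_id", "relative_path"}

def validate_sidecar(text):
    """Return a list of error strings; an empty list means the sidecar looks OK."""
    sidecar = json.loads(text)
    errors = []
    if "schema_version" not in sidecar:
        errors.append("missing top-level 'schema_version'")
    for i, version in enumerate(sidecar.get("versions", [])):
        missing = REQUIRED_VERSION_KEYS - version.keys()
        if missing:
            errors.append(f"versions[{i}] missing {sorted(missing)}")
        for j, change in enumerate(version.get("changes", [])):
            missing = REQUIRED_CHANGE_KEYS - change.keys()
            if missing:
                errors.append(f"versions[{i}].changes[{j}] missing {sorted(missing)}")
    return errors
```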

Nice to have:

  • allow loading different versions from HDMFIO
  • allow modification of object references
  • allow modification of links
  • allow modification of compound dtypes
  • allow modification of slices of datasets or attributes

@bendichter
Contributor

I feel like "label" should be "version"

@bendichter
Contributor

Can groups be added? Would they be added implicitly if we gave a relative path that does not exist yet, or should we create them explicitly? For instance, if "a" exists as a group but "b" does not, and you want to create a dataset "d" inside of "b" inside of "a", then you might try adding "a/b/d" with some value and have "b" implicitly created. Alternatively, we might want to require that "b" is created explicitly before we can add a dataset to it.

Can neurodata_type instances be added? (I'd lean towards no)

@rly
Contributor Author

rly commented Dec 2, 2021

Under the current setup, new HDF5 elements cannot be added. But I see your point that we may want that, especially for datasets and attributes. Let's say a user wants to add "related_publications" when the dataset doesn't exist.

For new neurodata_type instances, I also lean toward no. At that point, I would recommend modifying the file directly.

@rly
Contributor Author

rly commented Dec 6, 2021

TODO:

  • add a uuid to each version so that different sidecar JSONs can be more easily compared, e.g., two people each create a different "version": "2.0.0" and we want to tell the difference between them and maybe even make a version map.
    • con: the file is more complicated to create by hand
  • add an "operation" key that can take the value "replace", "delete", etc. This will make it easier to expand functionality without breaking existing functionality.

@rly
Contributor Author

rly commented Dec 6, 2021

Comment from @yarikoptic

Versioning within this sidecar file is messy, and the provenance is half-baked; use a dedicated versioning system. If version 2 changes A to B and version 3 changes B to C, do we need to store the record of B? Just store that C replaces A, i.e., just store an ordered list of changes.

Consider using "full_path" instead of "object_id" & "relative_path"

New example:

{
    "operations": [
        {
            "type": "replace",
            "description": "change attr1 to 'my experiment'",
            "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
            "relative_path": "attr1",
            "value": "my experiment"
        },
        {
            "type": "replace",
            "description": "change my_data to [4, 5] (int32)",
            "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
            "relative_path": "my_data",
            "value": [
                4,
                5
            ],
            "dtype": "int32"
        },
        {
            "type": "remove",
            "description": "delete my_data/attr2",
            "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
            "relative_path": "my_data/attr2"
        }
    ],
    "schema_version": "0.1.0"
}
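The operation-based layout lends itself to a simple dispatch on "type". Roughly (hypothetical names; targets are flattened to a dict keyed by (object_id, relative_path) instead of real builders, purely for illustration):

```python
# Sketch of dispatching on the proposed "type" key. Not HDMF code.

def apply_operations(targets, operations):
    """Apply an ordered list of sidecar operations to a flat target map."""
    for op in operations:
        key = (op["object_id"], op["relative_path"])
        if op["type"] == "replace":
            targets[key] = op["value"]
        elif op["type"] == "remove":
            targets.pop(key, None)
        else:
            # fail loudly so new operation types are never silently skipped
            raise ValueError(f"unsupported operation type: {op['type']}")
    return targets
```

Raising on unknown types is one way to let the schema grow (per the "operation" key TODO) without old readers silently ignoring operations they don't understand.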

We should also check that when a file is read and then exported to a new file, the new data is present. When the file is opened in append mode, do not replace the values (but note that if the file or a container is modified, then the new values would be written! This could result in partial overwrites, where change A is applied but change B is not, depending on which container is modified between read and write).

Advanced edge case: File A links to File B with its own sidecar JSON

@bendichter
Contributor

I have an idea that I don't think should be in this MVP, but I wanted to make a note of it and see what you all thought. Another "type" might be "add_alternate_value", which would be used as a value if the first value were somehow inaccessible. This could be a good solution for providing a link to a file that might be locally or remotely stored.

@rly
Contributor Author

rly commented Dec 8, 2021

I have an idea that I don't think should be in this MVP, but I wanted to make a note of it and see what you all thought. Another "type" might be "add_alternate_value", which would be used as a value if the first value were somehow inaccessible. This could be a good solution for providing a link to a file that might be locally or remotely stored.

This idea could work, but it feels hacky, and I would prefer to build this kind of functionality (alternate paths to look up a linked file) into core HDMF. I'm drafting an extension for that. If that takes too long to release, though, then we can test out this solution.

@oruebel
Contributor

oruebel commented Dec 8, 2021

I'm drafting an extension for that. If that takes too long to release, though, then we can test out this solution.

Just an idea. One option could also be to use ExternalResources to assign alternate online paths. In this scenario, the dataset would store the original local filepath, and via ExternalResources one could associate an arbitrary number of external online resources with it.

@oruebel
Contributor

oruebel commented Dec 8, 2021

New example:

I think it would be useful to add the following keys at the top level (i.e., not for each operation, but for the whole sidecar file):

  • description : Describe the purpose and summarize changes
  • label and/or version : Even when using version control systems to manage sidecar files, it will be useful to indicate the version of the sidecar file in the file itself. I.e., this would be to help with versioning the sidecar file, not to version individual changes as in the original proposal.
  • authors : List of authors of the sidecar file (or alternatively we could make it best practice to update the author field in NWB if the authors of the sidecar file are different from the authors of the NWB file).

@rly rly marked this pull request as ready for review April 12, 2022 08:19
@rly rly changed the title [WIP] Update builders with modifications from sidecar JSON Replace values of or delete dset/attr builders based on sidecar JSON Apr 12, 2022
@rly rly changed the title Replace values of or delete dset/attr builders based on sidecar JSON Modify dset/attr builders based on sidecar JSON Apr 12, 2022
@rly rly requested review from oruebel and bendichter April 12, 2022 08:21
@rly
Contributor Author

rly commented Apr 12, 2022

See https://hdmf--677.org.readthedocs.build/en/677/sidecar.html for documentation on what is currently supported in this PR. Feedback is appreciated.

I preserved some code that I wrote for other types of modifications in builderupdater.py but they are not currently used.

returns='The same input GroupBuilder, now modified.',
rtype='GroupBuilder'
)
def update_builder_from_sidecar(self, **kwargs):
Contributor

We could also add a post_read_builder function to HDMFIO itself to provide a standard place for I/O backends to update builders after read

Users may want to update part of an HDMF file without rewriting the entire file.
To do so, HDMF supports the use of a "sidecar" JSON file that lives adjacent to the HDMF file on disk and
specifies modifications to the HDMF file. Only a limited set of modifications are supported; for example, users can
delete a dataset or attribute but cannot create a new dataset or attribute.
Contributor

Suggested change
delete a dataset or attribute but cannot create a new dataset or attribute.
hide a dataset or attribute so that it will not be read by HDMF but cannot create a new dataset or attribute.

I think delete is misleading since we are not actually deleting any data from the file; the JSON file can only indicate that the dataset/attribute should be ignored on read (maybe hide or invalid would be more precise).

Does delete also apply to groups?

Contributor Author

Good point. I'll make the change. For now, I have not allowed hiding of groups because the use case is unclear. But it is technically not very different from hiding of datasets.

Contributor

I think a main use case for hiding groups would be instances of a data_type, e.g., to hide a TimeSeries that for some reason contains bad data. If it's trivial, then I think hiding groups is something we could allow, but if it adds a lot of complexity, then I would hold off until a specific need arises.


Comment on lines 35 to 36
The sidecar JSON file can be validated using the ``sidecar.schema.json`` JSON schema file
located at the root of the HDMF repository.
Contributor

Are sidecar files automatically validated by the validator as well?


docs/source/sidecar.rst (outdated review comment; resolved)
"""Update the file builder in-place with the values specified in the sidecar JSON."""
# the sidecar json must have the same name as the file but with suffix .json
f_builder, path = getargs('file_builder', 'file_path', kwargs)
sidecar_path = Path(path).with_suffix('.json')
Contributor

If I understand this correctly, then this assumes that the sidecar file has the same name but a different suffix than the main file. While this is a good default strategy, I'm wondering whether this will work in DANDI. I think external files there (e.g., videos) are placed in a folder with the same name as the file, and I'm wondering whether they would place the sidecar file there instead?


Modifying an HDMF File with a Sidecar JSON File
===============================================

Users may want to update part of an HDMF file without rewriting the entire file.
Contributor

Suggested change
Users may want to update part of an HDMF file without rewriting the entire file.
Users may want to update part of an HDMF file without rewriting the entire file.

I think it would be useful to elaborate a little bit on this to clarify the intent and scope of the sidecar file, i.e., this is for small updates and corrections only.


rly and others added 3 commits April 21, 2022 00:58
Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>
@yarikoptic
Contributor

reordered for TL;DR:

Note: The default (or required) NWB/HDMF sidecar JSON file need not be named [nwb_file_base_name].json. It could be [nwb_file_base_name].overwrite.json to be more clear about its function.

I think this is a good idea/alternative!

  • I only worry about .overwrite being "too long" to be considered by some tools to be part of the extension. E.g., FWIW, git-annex would not consider it to be an extension to be used in the annexed key; it is not critical, just wanted to mention:
$> ls -ld 123.overwrite.json
lrwxrwxrwx 1 yoh yoh 118 Apr 21 14:35 123.overwrite.json -> .git/annex/objects/fm/24/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.json/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.json

# you can see that it is just .json not .overwrite.json.  Let's try shorter:

$> touch 123.over.json 
$> git annex add 123.over.json
add 123.over.json 
ok
(recording state in git...)

$> ls -ld 123.over.json
lrwxrwxrwx 1 yoh yoh 128 Apr 21 14:35 123.over.json -> .git/annex/objects/6P/1w/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.over.json/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.over.json

But what about .nwb.json ???? kinda cute ;) If such overwrite files could be expressive enough to pretty much populate an .nwb file from scratch (i.e. start without any .nwb), that would make even more sense, e.g. to encapsulate all desired metadata (before running acquisition) without the actual data file, and then eventually to produce the .nwb with data to accompany it.

Individual short answers:

> @yarikoptic Sorry, I am a little confused. Are you saying that because BIDS allows sidecar JSON files and NWB/HDMF (will) allow sidecar JSON files, then it would not be clear which schema to use for which sidecar JSON file?

yes

> And therefore we should have one sidecar JSON file that marries the two?

that was my ugly idea (I now love yours more)

> I think the BIDS sidecar JSON file serves a different purpose than the NWB/HDMF one, which is to update/overwrite data from the bulk data file, but maybe I am mistaken. If it does serve a different purpose, I think two separate files would be best.

agree

@rly (Contributor, Author) commented Apr 21, 2022

> • I only worry about .overwrite being "too long" to be considered by some tools to be a part of the extension.
>
> e.g. FWIW git-annex would not consider it to be an extension to be used in annexed key but it is not critical, just wanted to mention

I like [nwb_file_base_name].overwrite.json. I'm a bit confused about the issue with the extension being too long. Why does the extension need to be fully preserved when the file is being renamed by git-annex? Renaming the file would break the cross-file linkages anyway. (Aside: it seems that git-annex can be configured to treat overwrite.json as an extension; annex.maxextensionlength has a default value of 4.)
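As a side note on the naming discussion: either convention makes the sidecar path trivial to derive from the data file path. A minimal sketch (the `sidecar_path` function and the `overwrite` infix are illustrative assumptions, since no convention was settled in this thread):

```python
from pathlib import Path

def sidecar_path(data_path, infix="overwrite"):
    """Derive a sidecar path, e.g. 'session1.nwb' -> 'session1.overwrite.json'.

    Pass infix=None for the plain '[nwb_file_base_name].json' variant.
    """
    p = Path(data_path)
    middle = f".{infix}" if infix else ""
    return p.parent / f"{p.stem}{middle}.json"
```

The reader would then check whether the derived path exists next to the data file before applying any sidecar modifications.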

@oruebel (Contributor) commented Apr 21, 2022

> @yarikoptic Sorry, I am a little confused. Are you saying that because BIDS allows sidecar JSON files and NWB/HDMF (will) allow sidecar JSON files, then it would not be clear which schema to use for which sidecar JSON file?

Shouldn't the reference to schema be typically part of the comment in the first line of the JSON file?

@yarikoptic (Contributor)

> Shouldn't the reference to schema be typically part of the comment in the first line of the JSON file?

Please clarify/give an example, since AFAIK JSON doesn't even support comments (unlike YAML, its superset).
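Since JSON indeed has no comment syntax, one common workaround is to declare the schema in a reserved top-level key rather than a comment, and have the reader check it on load. A sketch only (the `$schema` key and the URI below are illustrative assumptions, not anything HDMF adopted):

```python
import json

# Hypothetical schema identifier; HDMF has not defined one.
SUPPORTED_SCHEMA = "https://example.org/hdmf-sidecar-schema/0.1.0"

def load_sidecar(text):
    """Parse sidecar JSON and reject files without a supported schema tag."""
    sidecar = json.loads(text)
    declared = sidecar.get("$schema")
    if declared != SUPPORTED_SCHEMA:
        raise ValueError(f"unsupported or missing sidecar schema: {declared!r}")
    return sidecar
```

This also answers the BIDS-ambiguity concern above: a reader can tell an NWB/HDMF sidecar apart from a BIDS sidecar by the declared schema alone.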

@yarikoptic (Contributor)

> Why does the extension need to be fully preserved when the file is being renamed by git-annex?

it doesn't need to, but:

  • such a feature was requested in the childhood of git-annex (and before the inception of DataLad) to be able to use tools which first dereference the symlink but then need to know the extension of the file ;)
  • it shouldn't matter for pynwb as long as symlinks are not dereferenced before the extension is "assessed"
  • it is a choice the user makes by choosing the backend for annex'ed keys. If the backend name ends with E (e.g. MD5E, the default for DataLad datasets, vs plain MD5, which would come without the extension) -- it is the one which keeps the extension

> I like [nwb_file_base_name].overwrite.json.

eh, I wish you said

> I like [nwb_file].json.

;-)

@oruebel (Contributor) commented Aug 18, 2022

The latest HDF5 1.13.2 release adds the Onion virtual file driver (VFD). According to the release notes: “The onion VFD allows creating ‘versioned’ HDF5 files. File open/close operations after initial file creation will add changes to an external ‘onion’ file (.onion extension by default) instead of the original file. Each written revision can be opened independently.” (see here for the release notes and here for an in-depth description of the Onion VFD). While 1.13 is an experimental release, it may be interesting to try it and see how Onion compares with this JSON sidecar approach.

Successfully merging this pull request may close these issues.

Support modifications of a read file in an external overlay