-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modify dset/attr builders based on sidecar JSON #677
base: dev
Are you sure you want to change the base?
Conversation
Codecov ReportBase: 87.65% // Head: 87.18% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## dev #677 +/- ##
==========================================
- Coverage 87.65% 87.18% -0.48%
==========================================
Files 44 45 +1
Lines 8845 9082 +237
Branches 2051 2108 +57
==========================================
+ Hits 7753 7918 +165
- Misses 777 829 +52
- Partials 315 335 +20
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Generally, this approach makes sense to me for updates on read. Sidecar file schema
Questions
|
How is semantic versioning defined for data? What constitutes a major, minor, or patch version bump?
It's worth noting that if we go with PROV, there will be a tradeoff between having the changes be accessible using PROV tools and human-readability/simplicity. Is "a user wants to edit this file by hand" a primary user story that we want to support?
👍
👍
👍
👍
👍
Some type specs allow a choice from multiple shapes and data types, e.g. |
I agree that this is a concern. Understanding PROV is not trivial. Another main concern is that I believe the PROV data model assumes that entities are immutable , i.e., all changes to an entity (e.g., dataset or attribute in our case) would be assumed to be represented by the creation of a new entity. In this case, however, we want to update the entity, rather than creating a new one. The following paper may be relevant for this topic https://link.springer.com/chapter/10.1007/978-3-319-98379-0_7 Ultimately, PROV may or may not be the right approach for this, but I think its worth spending a little bit of time looking at it to see if it makes sense. If we decide against it, then at least we have a clear answer for folks why we chose to not use it.
Correct, but I don't understand why one would need to change the shape of the data after the fact (unless you are replacing the values all-together)
I think semantic versioning can be applied fairly straight-forward here: Given a version number MAJOR.MINOR.PATCH, increment the:
Since we have a data schema, I think it may be possible to codify this to automatically determine version numbers based on the type of changes made. Providing some strict versioning rules from the start I think will make things easier down the line. This being said, I think it makes sense to have both a user-defined Question: Which changes to do we allow via the sidecar file?
You could still record those changes as a version in the sidecar file (essentially by having a record with an empty Question: How do we envision the sidecar file to be created in the API? Question: How to deal with larger changes? |
What if you want to remove values? e.g. what if you have the age of a subject and then realize that this age is incorrect but you don't know what the correct values is. |
You would set |
Latest example JSON: {
"versions": [
{
"label": "2.0.0",
"description": "change attr1 from 'old' to 'my experiment' and my_data from [1, 2, 3] to [4, 5]",
"datetime": "2020-10-29T19:15:15.789Z",
"agent": "John Doe",
"changes": [
{
"object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
"relative_path": "attr1",
"value": "my experiment"
},
{
"object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
"relative_path": "my_data",
"value": [
4,
5
],
"dtype": "int32"
}
]
},
{
"label": "3.0.0",
"description": "change sub_foo/my_data from [-1, -2, -3] to [[0]], delete my_data/attr2, and change dtype of my_data",
"datetime": "2021-11-30T20:16:16.790Z",
"agent": "Jane Doe",
"changes": [
{
"object_id": "26b95d5a-b632-407d-921a-8e255752b0f7",
"relative_path": "my_data",
"value": [
[
0
]
]
},
{
"object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
"relative_path": "my_data/attr2",
"value": null
},
{
"object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
"relative_path": "my_data",
"value": [
6,
7
],
"dtype": "int8"
}
]
},
{
"label": "3.0.1",
"description": "change my_data from [4, 5] to [6, 7]",
"datetime": "2021-11-30T20:17:16.790Z",
"agent": "Jane Doe",
"changes": [
{
"object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
"relative_path": "my_data",
"value": [
6,
7
]
}
]
}
],
"schema_version": "0.1.0"
} |
TODO:
Nice to have:
|
I feel like "label" should be "version" |
Can groups be added? Would they be added implicitly if we gave a relative path that does not exist yet, or should be create them explicitly? For instance, if "a" exists as a group but "b" does not, and you want to create a dataset "d" inside of "b" inside of "a", then you might try adding "a/b/d" with some value, and have "b" implicitly created. Alternatively, we might want to require that "b" is created explicitly before we can add a dataset to it. Can neurodata_type instances be added? (I'd lean towards no) |
Under the current setup, new HDF5 elements cannot be added. But I see your point that we may want that, especially for datasets and attributes. Let's say a user wants to add "related_publications" when the dataset doesn't exist. For new neurodata_type instances, I also lean toward no. At that point, I would recommend modifying the file directly. |
TODO:
|
Comment from @yarikoptic Versioning within this sidecar file is messy. Provenance is half-baked. Use a dedicated version system. If Version 2 changes A to B and Version 3 changes B to C, do we need to store the record of B? Just store that C replaces A. Just store an ordered list of changes. Consider using "full_path" instead of "object_id" & "relative_path" New example:
Also should check that when a file is read and file is exported to a new file, new data is present. When it is opened in append mode, do not replace the values (but note that if the file or container is modified, then the new values would be written! this could result in partial overwrites, where changes A is applied but change B is not, depending on what container is modified between read and write) Advanced edge case: File A links to File B with its own sidecar JSON |
I have an idea that I don't think should be in this MVP, but I wanted to make a note of it and see what you all thought. Another "type" might be "add_alternate_value", which would be used as a value if the first value was somehow unaccessible. This could be a good solution for providing a link to a file that might be locally or remotely stored. |
This idea could work, but feels hacky and I would prefer to build this kind of functionality (alternate paths to look up a linked file) into core HDMF. I'm drafting an extension for that. If that takes too long though to release, then we can test out this solution. |
Just an idea. One option could also be to use ExternalResources to assign alternate online paths. In this scenario, the dataset would store the original local filepath and via ExternalResources on could associate and arbitrary number of external online resources with it. |
I think it may be useful to add the following keys at the top level (i.e., not for each operation, but for the whole sidecar file) could be useful:
|
See https://hdmf--677.org.readthedocs.build/en/677/sidecar.html for documentation on what is currently supported in this PR. Feedback is appreciated. I preserved some code that I wrote for other types of modifications in |
src/hdmf/backends/hdf5/h5tools.py
Outdated
returns='The same input GroupBuilder, now modified.', | ||
rtype='GroupBuilder' | ||
) | ||
def update_builder_from_sidecar(self, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also add a post_read_builder
function to HDMFIO
itself to provide a standard place for I/O backends to update builders after read
docs/source/sidecar.rst
Outdated
Users may want to update part of an HDMF file without rewriting the entire file. | ||
To do so, HDMF supports the use of a "sidecar" JSON file that lives adjacent to the HDMF file on disk and | ||
specifies modifications to the HDMF file. Only a limited set of modifications are supported; for example, users can | ||
delete a dataset or attribute but cannot create a new dataset or attribute. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete a dataset or attribute but cannot create a new dataset or attribute. | |
hide a dataset or attribute so that it will not be read by HDFM but cannot create a new dataset or attribute. |
I think delete
is misleading since we are not actually deleting any data from a file but the JSON file can only indicate that the dataset/attribute should be ignored on read (maybe hide
or invalid
would be more precise).
Does delete
also apply to groups?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'll make the change. For now, I have not allowed hiding of groups because the use case is unclear. But it is technically not very different from hiding of datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a main use-case for hiding groups would instances of a data_type, e.g., to hide a TimeSeries that for some reason contains bad data. If it's trivial, then I think allowing to hide groups is something we could allow, but if it adds a lot of complexity then I would hold off until a specific need arises.
docs/source/sidecar.rst
Outdated
Users may want to update part of an HDMF file without rewriting the entire file. | ||
To do so, HDMF supports the use of a "sidecar" JSON file that lives adjacent to the HDMF file on disk and | ||
specifies modifications to the HDMF file. Only a limited set of modifications are supported; for example, users can | ||
delete a dataset or attribute but cannot create a new dataset or attribute. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete a dataset or attribute but cannot create a new dataset or attribute. | |
hide a dataset or attribute so that it will not be read by HDFM but cannot create a new dataset or attribute. |
I think delete
is misleading since we are not actually deleting any data from a file but the JSON file can only indicate that the dataset/attribute should be ignored on read (maybe hide
or invalid
would be more precise).
Does delete
also apply to groups?
docs/source/sidecar.rst
Outdated
The sidecar JSON file can be validated using the ``sidecar.schema.json`` JSON schema file | ||
located at the root of the HDMF repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are sidecar files automatically validated by the validator as well?
docs/source/sidecar.rst
Outdated
The sidecar JSON file can be validated using the ``sidecar.schema.json`` JSON schema file | ||
located at the root of the HDMF repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are sidecar files automatically validated by the validator as well?
"""Update the file builder in-place with the values specified in the sidecar JSON.""" | ||
# the sidecar json must have the same name as the file but with suffix .json | ||
f_builder, path = getargs('file_builder', 'file_path', kwargs) | ||
sidecar_path = Path(path).with_suffix('.json') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this correctly, then this assumes that the sidecar file has the same name but different suffix than the main file. While this is a good default strategy, I'm wondering whether this will work in DANDI. I think there external files (e.g., videos) are placed in a folder with the same name as file, and I'm wondering whether they would place the sidecar file there instead?
"""Update the file builder in-place with the values specified in the sidecar JSON.""" | ||
# the sidecar json must have the same name as the file but with suffix .json | ||
f_builder, path = getargs('file_builder', 'file_path', kwargs) | ||
sidecar_path = Path(path).with_suffix('.json') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this correctly, then this assumes that the sidecar file has the same name but different suffix than the main file. While this is a good default strategy, I'm wondering whether this will work in DANDI. I think there external files (e.g., videos) are placed in a folder with the same name as file, and I'm wondering whether they would place the sidecar file there instead?
docs/source/sidecar.rst
Outdated
Modifying an HDMF File with a Sidecar JSON File | ||
=============================================== | ||
|
||
Users may want to update part of an HDMF file without rewriting the entire file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users may want to update part of an HDMF file without rewriting the entire file. | |
Users may want to update part of an HDMF file without rewriting the entire file. |
I think it would be useful to elaborate a little bit on this to clarify the intent and scope of the sidecar file, i.e., this is for small updates and corrections only.
docs/source/sidecar.rst
Outdated
Modifying an HDMF File with a Sidecar JSON File | ||
=============================================== | ||
|
||
Users may want to update part of an HDMF file without rewriting the entire file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users may want to update part of an HDMF file without rewriting the entire file. | |
Users may want to update part of an HDMF file without rewriting the entire file. |
I think it would be useful to elaborate a little bit on this to clarify the intent and scope of the sidecar file, i.e., this is for small updates and corrections only.
Co-authored-by: Oliver Ruebel <oruebel@users.noreply.github.com>
reordered for TL;DR:
I think this is a good idea/alternative!
e.g. FWIW git-annex would not consider it to be an extension to be used in annexed key but it is not critical, just wanted to mention$> ls -ld 123.overwrite.json
lrwxrwxrwx 1 yoh yoh 118 Apr 21 14:35 123.overwrite.json -> .git/annex/objects/fm/24/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.json/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.json
# you can see that it is just .json not .overwrite.json. Let's try shorter:
$> touch 123.over.json
$> git annex add 123.over.json
add 123.over.json
ok
(recording state in git...)
$> ls -ld 123.over.json
lrwxrwxrwx 1 yoh yoh 128 Apr 21 14:35 123.over.json -> .git/annex/objects/6P/1w/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.over.json/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.over.json
But what about
individual short answers
yes
that was my ugly idea (I now love yours more)
agree |
I like |
Shouldn't the reference to schema be typically part of the comment in the first line of the JSON file? |
please clarify/give example since AFAIK JSON doesn't even have comments supported (unlike its superset YAML). |
it doesn't need to, but:
eh, I wish you said
;-) |
The latest HDF5 1.13.2 releases adds the Onion virtual file driver (VFD). According to the release notes: “The onion VFD allows creating “versioned” HDF5 files. File open/close operations after initial file creation will add changes to an external “onion” file (.onion extension by default) instead of the original file. Each written revision can be opened independently.” (see here for the release notes and here for an in-depth description of the Onion VFD). While 1.13 is an experimental release, it may be interesting to try and see how onion compares with this JSON sidecar approach. |
Motivation
Fix #676.
This is DEMO code to demonstrate how HDMF might support the reading of file modifications from a sidecar JSON. I made up the formatting/schema of the JSON file. It would look like the below.
In this set up, the JSON file has a list of versions which have a label, description, and list of changes.
Each change specifies an object id, relative path to the dataset/attribute being changed from the group/dataset with the object id, and the new value, which can be a scalar, list, or nested list.
The builder values are replaced after the file is fully built by
HDF5IO
.Lots of details to be worked out (changing data types? changing shape? compound dtypes?). Let me know what you think of this approach. @oruebel @ajtritt @bendichter
Checklist
flake8
from the source directory.