
Patch datasets #814

Open
effigies opened this issue Jun 1, 2021 · 8 comments
Labels: derivatives, enhancement (New feature or request), opinions wanted (Please read and offer your opinion on this matter)

Comments

@effigies
Collaborator

effigies commented Jun 1, 2021

In dataset_description.json, we currently have two DatasetTypes: "raw" and "derivative". Here I propose an additional type "patch" (or similar). This is a particular kind of derivative that would have the interpretation that the data/metadata elements contained in the patch dataset should be added to or supersede those of the source dataset.
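To make this concrete, here is a sketch of what a patch dataset's `dataset_description.json` might look like. The `"DatasetType": "patch"` value is the proposal itself (not in the spec), and the names, versions, and URLs are placeholders; `SourceDatasets` is the existing field that derivatives already use to point at the dataset being described:

```python
import json

# Hypothetical dataset_description.json for a "patch" dataset.
# "DatasetType": "patch" is the proposed value (not yet in the spec);
# SourceDatasets is the existing field derivatives use to point at the
# dataset being patched. Names, versions, and URLs are placeholders.
patch_description = {
    "Name": "QC annotations for ds000001",
    "BIDSVersion": "1.6.0",
    "DatasetType": "patch",
    "GeneratedBy": [{"Name": "mriqc", "Version": "0.16.1"}],
    "SourceDatasets": [
        {"URL": "https://openneuro.org/datasets/ds000001", "Version": "1.0.0"}
    ],
}

with open("dataset_description.json", "w") as f:
    json.dump(patch_description, f, indent=2)
```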

Consider the following use cases:

  1. Quality control metrics can be used to exclude data from processing or select among control flow paths. These are additional metadata, but not standalone datasets.
  2. You acquire a fieldmap for every BOLD run. Some are unusable, but all runs are from the same session, so you find the best usable fieldmap for each run and provide a new IntendedFor (see the sketch just below this list).
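For example (hypothetical filenames; the patch dataset mirrors the raw layout and supplies only the field it changes), the patch could carry a fieldmap sidecar whose IntendedFor supersedes the one in the raw dataset:

```python
import json

# Hypothetical override sidecar placed at the same relative path as in the
# raw dataset (e.g. sub-01/fmap/sub-01_dir-AP_epi.json). Only the field
# being replaced is included; filenames are placeholders.
override = {
    "IntendedFor": [
        "func/sub-01_task-rest_run-01_bold.nii.gz",
        "func/sub-01_task-rest_run-02_bold.nii.gz",
    ]
}

with open("sub-01_dir-AP_epi.json", "w") as f:
    json.dump(override, f, indent=2)
```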

I'm sure there are others. Right now, this can be done in an ad hoc way, but a clear directive that this is how derived data is declared to augment/override raw data would help avoid inconsistencies among tools.

xref nipreps/mriqc#885 (comment)

@effigies added the enhancement, derivatives, and opinions wanted labels on Jun 1, 2021
@sappelhoff
Member

Would an alternative be to simply update the "raw" or "derivative" dataset and bump its version? If yes, then I can still see how your proposal would save time and perhaps be more transparent (and less error-prone?).

However, without dedicated tooling, this would be far too challenging for the majority of end users to handle, IMHO. Do you (or others you know) want to develop tools for this and need support for it in the spec first?

@effigies
Collaborator Author

The idea is transparency: the patch is a kind of derivative dataset, and you could choose another one.

It would not be difficult to write a tool to make a view of a dataset with a patch applied.
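As a rough illustration (not an existing tool; the merge rules and paths here are my own assumptions), something like this could materialize such a view by overlaying the patch's files and merging its JSON sidecars key-by-key onto the source:

```python
"""Minimal sketch of applying a "patch" dataset to a source dataset by
building a merged view in a new directory. JSON sidecars are merged
key-by-key (patch wins); other files are simply overlaid. TSV merging and
deletions are deliberately not handled here."""
import json
import shutil
from pathlib import Path


def apply_patch(source: Path, patch: Path, view: Path) -> None:
    # Start the view as a copy of the source dataset.
    shutil.copytree(source, view)
    for patched in patch.rglob("*"):
        # Keep the source's own dataset_description.json in the view.
        if patched.is_dir() or patched.name == "dataset_description.json":
            continue
        target = view / patched.relative_to(patch)
        target.parent.mkdir(parents=True, exist_ok=True)
        if patched.suffix == ".json" and target.exists():
            # Merge metadata: keys in the patch extend or override the source.
            merged = json.loads(target.read_text())
            merged.update(json.loads(patched.read_text()))
            target.write_text(json.dumps(merged, indent=2))
        else:
            # New or non-JSON files supersede (or add to) the source copy.
            shutil.copy2(patched, target)


# Placeholder paths for illustration only.
apply_patch(Path("ds000001"), Path("ds000001-qc-patch"), Path("ds000001-view"))
```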

@effigies
Collaborator Author

Briefly discussed this with @christinerogers in the context of third-party annotations, where permissions to modify the annotated dataset are limited.

If there's broader interest, this could be a discussion in Copenhagen. Not sure how much flexibility there is for adding topics. cc @melanieganz @CPernet

@christinerogers
Contributor

From a software point of view, I think this could best be represented as another derivative layer, but the concept/label for it is valuable.

@effigies
Collaborator Author

Could you clarify what you mean by "another derivative layer"? I feel like you understood me, but I don't understand the distinction between this and my post.

@CPernet
Collaborator

CPernet commented May 19, 2023

We can discuss what we want, but annotation is not high on the list of priorities.

@yarikoptic
Collaborator

Note: for complete "patch" semantics, it should also reserve a way to "delete" any file or entry (in .tsv, .json, etc.), which IMHO would not be trivial, if reasonably possible at all.
Overall, the task could probably be solved in a way specific to a data management solution, e.g. via git/git-annex on top of forks of https://github.com/OpenNeuroDatasets, so that there could be forks with desired fixes of any kind and no need for dedicated support of a "patch" dataset type.

@effigies
Collaborator Author

I agree that deleting entries is complicated. I don't think we need an ultimate patch spec to have something useful. I think there's value in having a lightweight way of adding annotations to datasets that you don't necessarily control, and as much as I love DataLad, making mastery of that the only way to publish them seems like an unnecessary barrier.

These datasets already exist; they are just poorly defined. MRIQC, for example, has no BIDS-valid files because it purely produces metadata, and the data being described exist elsewhere. Right now, people dump MRIQC results in a derivatives/ subdirectory, which means tooling still needs a way of associating the derivative metadata with the raw metadata; but we don't define the semantics, so people have to write their own.

Another case is NeuroScout. There, regressors are generated from the movies that subjects were watching and stored in a bundle, which then needs to be indexed alongside a dataset containing the preprocessed BOLD data and any other desired confounds. PyBIDS handles this fine, but that is largely thanks to the people who work on PyBIDS, not to there being a well-defined mechanism for combining related datasets.
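For context, this is roughly what PyBIDS already lets you do today (paths are placeholders): index the raw dataset and bolt an external bundle on as derivatives. It works, but only because PyBIDS implements the merge, not because the spec says how the two datasets relate.

```python
from bids import BIDSLayout

# Placeholder paths; the derivatives bundle lives outside the raw dataset.
layout = BIDSLayout("/data/ds000234")
layout.add_derivatives("/data/neuroscout-regressors")

# Queries now span both datasets, but the semantics of the association are
# defined by PyBIDS, not by the specification.
bold = layout.get(suffix="bold", extension=".nii.gz")
```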

If a third party wants to host BIDS annotations as a database, do they need to buy into the DataLad model just to publish an annotation?
