
Patch datasets #814

Open
effigies opened this issue Jun 1, 2021 · 8 comments
Labels: derivatives, enhancement (New feature or request), opinions wanted (Please read and offer your opinion on this matter)

Comments

@effigies
Collaborator

effigies commented Jun 1, 2021

In dataset_description.json, we currently have two DatasetTypes: "raw" and "derivative". Here I propose an additional type "patch" (or similar). This is a particular kind of derivative that would have the interpretation that the data/metadata elements contained in the patch dataset should be added to or supersede those of the source dataset.
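To make this concrete, here is a sketch of what a patch dataset's `dataset_description.json` might look like. The `"DatasetType": "patch"` value is the proposal itself (not in the spec), and the names, versions, and URLs are placeholders; `SourceDatasets` is the existing field that derivatives already use to point at the dataset being described:

```python
import json

# Hypothetical dataset_description.json for a "patch" dataset.
# "DatasetType": "patch" is the proposed value (not yet in the spec);
# SourceDatasets is the existing field derivatives use to point at the
# dataset being patched. Names, versions, and URLs are placeholders.
patch_description = {
    "Name": "QC annotations for ds000001",
    "BIDSVersion": "1.6.0",
    "DatasetType": "patch",
    "GeneratedBy": [{"Name": "mriqc", "Version": "0.16.1"}],
    "SourceDatasets": [
        {"URL": "https://openneuro.org/datasets/ds000001", "Version": "1.0.0"}
    ],
}

with open("dataset_description.json", "w") as f:
    json.dump(patch_description, f, indent=2)
```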

Consider the following use cases:

  1. Quality control metrics can be used to exclude data from processing or select among control flow paths. These are additional metadata, but not standalone datasets.
  2. You acquire a fieldmap for every BOLD run. Some are unusable, but all runs are from the same session, so you find the best usable fieldmap for each run and provide a new IntendedFor (see the sketch just below this list).
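For example (hypothetical filenames; the patch dataset mirrors the raw layout and supplies only the field it changes), the patch could carry a fieldmap sidecar whose IntendedFor supersedes the one in the raw dataset:

```python
import json

# Hypothetical override sidecar placed at the same relative path as in the
# raw dataset (e.g. sub-01/fmap/sub-01_dir-AP_epi.json). Only the field
# being replaced is included; filenames are placeholders.
override = {
    "IntendedFor": [
        "func/sub-01_task-rest_run-01_bold.nii.gz",
        "func/sub-01_task-rest_run-02_bold.nii.gz",
    ]
}

with open("sub-01_dir-AP_epi.json", "w") as f:
    json.dump(override, f, indent=2)
```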

I'm sure there are others. Right now, this can be done in an ad hoc way, but a clear directive that this is how derived data is declared to augment/override raw data would help avoid inconsistencies among tools.

xref nipreps/mriqc#885 (comment)

@effigies added the enhancement, derivatives, and opinions wanted labels on Jun 1, 2021
@sappelhoff
Member

Would an alternative be to simply update the "raw" or "derivative" dataset and bump its version? If yes, then I can still see how your proposal would save time and perhaps be more transparent (and less error-prone?).

However, without dedicated tooling, this would be far too challenging for the majority of end users to handle, IMHO. Do you (or others you know) want to develop tools for this and need support for it in the spec first?

@effigies
Collaborator Author

The idea is transparency: the patch is a kind of derivative dataset, and you could choose another one.

It would not be difficult to write a tool to make a view of a dataset with a patch applied.
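As a rough illustration (not an existing tool; the merge rules and paths here are my own assumptions), something like this could materialize such a view by overlaying the patch's files and merging its JSON sidecars key-by-key onto the source:

```python
"""Minimal sketch of applying a "patch" dataset to a source dataset by
building a merged view in a new directory. JSON sidecars are merged
key-by-key (patch wins); other files are simply overlaid. TSV merging and
deletions are deliberately not handled here."""
import json
import shutil
from pathlib import Path


def apply_patch(source: Path, patch: Path, view: Path) -> None:
    # Start the view as a copy of the source dataset.
    shutil.copytree(source, view)
    for patched in patch.rglob("*"):
        # Keep the source's own dataset_description.json in the view.
        if patched.is_dir() or patched.name == "dataset_description.json":
            continue
        target = view / patched.relative_to(patch)
        target.parent.mkdir(parents=True, exist_ok=True)
        if patched.suffix == ".json" and target.exists():
            # Merge metadata: keys in the patch extend or override the source.
            merged = json.loads(target.read_text())
            merged.update(json.loads(patched.read_text()))
            target.write_text(json.dumps(merged, indent=2))
        else:
            # New or non-JSON files supersede (or add to) the source copy.
            shutil.copy2(patched, target)


# Placeholder paths for illustration only.
apply_patch(Path("ds000001"), Path("ds000001-qc-patch"), Path("ds000001-view"))
```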

@effigies
Collaborator Author

Briefly discussed this with @christinerogers in the context of third-party annotations, where permissions to modify the annotated dataset are limited.

If there's broader interest, this could be a discussion in Copenhagen. Not sure how much flexibility there is for adding topics. cc @melanieganz @CPernet

@christinerogers
Contributor

From a software point of view, I think this could best be represented as another derivative layer, but the concept/label for it is valuable.

@effigies
Collaborator Author

Could you clarify what you mean by "another derivative layer"? I feel like you understood me, but I don't understand the distinction between this and my post.

@CPernet
Collaborator

CPernet commented May 19, 2023

We can discuss what we want, but annotation is not high on the list of priorities.

@yarikoptic
Collaborator

Note: for complete "patch" semantics, it should also reserve a way to "delete" any file or entry (in .tsv, .json, etc.), which IMHO would not be trivial, if reasonably possible at all.
Overall, the task could probably be solved in a way specific to a data management solution, e.g. via git/git-annex on top of forks of https://github.com/OpenNeuroDatasets, so that there could be forks with desired fixes of any kind and no need for dedicated support of a "patch" dataset type.

@effigies
Collaborator Author

I agree that deleting entries is complicated. I don't think we need an ultimate patch spec to have something useful. I think there's value in having a lightweight way of adding annotations to datasets that you don't necessarily control, and as much as I love DataLad, making mastery of that the only way to publish them seems like an unnecessary barrier.

These datasets already exist; they are just poorly defined. MRIQC, for example, has no BIDS-valid files because it purely produces metadata, and the data being described exist elsewhere. Right now, people dump MRIQC results in a derivatives/ subdirectory, which means tooling still needs a way of associating the derivative metadata with the raw metadata; but we don't define the semantics, so people have to write their own.

Another case is NeuroScout. There, regressors are generated from the movies that subjects were watching and stored in a bundle, which then needs to be indexed alongside a dataset containing the preprocessed BOLD data and any other desired confounds. PyBIDS handles this fine, but that is largely thanks to the people who work on PyBIDS, not to there being a well-defined mechanism for combining related datasets.
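For context, this is roughly what PyBIDS already lets you do today (paths are placeholders): index the raw dataset and bolt an external bundle on as derivatives. It works, but only because PyBIDS implements the merge, not because the spec says how the two datasets relate.

```python
from bids import BIDSLayout

# Placeholder paths; the derivatives bundle lives outside the raw dataset.
layout = BIDSLayout("/data/ds000234")
layout.add_derivatives("/data/neuroscout-regressors")

# Queries now span both datasets, but the semantics of the association are
# defined by PyBIDS, not by the specification.
bold = layout.get(suffix="bold", extension=".nii.gz")
```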

If a third party wants to host BIDS annotations as a database, do they need to buy into the DataLad model just to publish an annotation?
