Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

investigate --validation skip and --allow-any-path decision tree #1074

Open
satra opened this issue Jul 19, 2022 · 4 comments
Open

investigate --validation skip and --allow-any-path decision tree #1074

satra opened this issue Jul 19, 2022 · 4 comments

Comments

@satra
Copy link
Member

satra commented Jul 19, 2022

in #1069 it seemed that the CLI is trying to detect if something is a bids dataset (also see #1073).

this issue is about the behavior of the CLI when validation is skipped or any paths are allowed:

  • if any paths are allowed, then there is no guarantee that the path is a bids path. should any bids proximity be checked?
  • same for validation, if it is skipped what does it mean? i think we have two different meanings. for nwb, validation was about running the validator on the nwb file, for bids, validation is about structure of the dataset rather than contents of file.
  • what about metadata? knowing if something is bids allows us to think about metadata better, but if we cannot determine or are being explicitly asked, then technically we should ignore adding any bids specific metadata.
  • what if a bids dataset is detected, do we validate the entire tree or just the partial tree that's relevant to the directory or files being uploaded?

some of the considerations here really come from the standards (nwb is self-contained in a file), bids is a structure around a set of files.

@TheChymera and @yarikoptic - filed here. would be good to discuss the different options.

@yarikoptic
Copy link
Member

First of I would like to clarify that --allow-any-path is a DEVEL option, i.e. typically shouldn't be used by users, and we should strive to arrive to that. Also --validation skip should be discouraged since most likely it would result in dandiset with validation errors and thus not publishable. So, typically, we should advise that neither of those options should be used.

  • if any paths are allowed, then there is no guarantee that the path is a bids path.

Then we are to upload it, possibly without any extra (beyond the one of the GenericAsset) metadata

should any bids proximity be checked?

what do you mean by proximity here? that filename looks BIDS-like so we could potentially extract some entities?

  • same for validation, if it is skipped what does it mean?

hm, good question. I think for nwb it means that we can still extract/upload metadata even though it doesn't pass validation. How does it work for BIDS file paths @TheChymera -- do we extract metadata for e.g. subject/session and from entities (thus may be relating to question above) if validation of that path fails?

I think we have two different meanings. for nwb, validation was about running the validator on the nwb file, for bids, validation is about structure of the dataset rather than contents of file.

Yes/no. For our DANDI layout we should have also added validation of the path to confirm to our "DANDI layout". Altogether, I think that the situation is similar between BIDS (over nii or nwb or whatnot) and DANDI layout on .nwb files, and there should (if not already there) be validation of both structure and contents.

what if a bids dataset is detected, do we validate the entire tree or just the partial tree that's relevant to the directory or files being uploaded?

I think ATM it is just about the partial tree, @TheChymera can correct me if I am wrong. Later, we indeed probably should ensure that it works for "partial tree" but with consideration of "dataset component".
This would relate then to your #1075 issue about "partial BIDS dataset download/validation". For DANDI layout is could be done easily ATM since we have only dandiset.yaml and corresponding files without any dependencies (if we do not look within .nwb for external refs). For BIDS -- I am still not 100% certain beyond validating some (e.g. not content dependent) aspects, possibly as overlay on top of what is in the archive, possibly with some smart access to remote content where/if needed. But that is "bright future" level of importance ATM IMHO.

@TheChymera
Copy link
Contributor

TheChymera commented Jul 23, 2022

what do you mean by proximity here? that filename looks BIDS-like so we could potentially extract some entities?

This sounds like the sort of fuzzy feature that could cause us a lot of misadventures :(

it means that we can still extract/upload metadata even though it doesn't pass validation. How does it work for BIDS file paths [...] do we extract metadata for e.g. subject/session and from entities (thus may be relating to question above) if validation of that path fails?

Currently we only return the BIDS dataset roots which pass validation:

dandi-cli/dandi/upload.py

Lines 463 to 465 in 1c94736

return validated_datasets
else:
return bids_datasets

However, regardless of whether the dataset is evaluated as valid, the metadata of all matches is exposed:

validator_result = validate_bids(bd)

Invalid files do not have metadata. Partial validation is on my nice-to-have roadmap, but I would argue it's only sensible to use such a feature to improve hinting for the end-user (i.e. this is probably where the mistake is, probably). I really don't think we should annotate metadata from invalid files, it's simply unreliable and potentially broken annotation is always worse than missing annotation.

what if a bids dataset is detected, do we validate the entire tree or just the partial tree that's relevant to the directory or files being uploaded?

Validation is run on directories which contain dataset_description.json at the top level.

@satra
Copy link
Member Author

satra commented Jul 23, 2022

@TheChymera and @yarikoptic - my question still remains. in the case of NWB, i can upload a single file independent of the validity of any other NWB file.

in the case of BIDS, will i be able to upload a single file/directory if it finds invalid things in other directories outside of that specific tree. given the current bids datasets, it may be very onerous to fix every aspect before renewing upload. hence the check being limited to the sub tree relevant to that file and the bids_description.json.

please look at this through the lens of the current datasets and their complexities not any general bids case for now.

@satra
Copy link
Member Author

satra commented Jul 23, 2022

just to clarify for these datasets the following should work:

for 000108
cd some ses directory
dandi upload --existing overwrite --sync .

for 000026
cd some sub/ses directory or derivatives directory
dandi upload --existing overwrite --sync .

i include existing as we would want to overwrite the metadata with appropriate extraction from the filename

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants