Open access workshop data access #3020

Open
Lestropie opened this issue Oct 8, 2024 · 2 comments

@Lestropie
Member

Following some discussion with @stebo85 in Lestropie/neurocontainers#1, I want to evaluate the prospect of open access workshop configuration being primarily an MRtrix3 functionality rather than a Neurodesk one. The issue with integrating into the Neurodesk official image list, IIUC, is that the workshop content would then need to be downloaded by anybody installing the Neurodesk app, regardless of whether they have any intention of running an MRtrix3 container. For the size of data we're dealing with, I don't think that's reasonable.

What I'm contemplating instead is a script that would potentially be a part of the MRtrix3 main repository itself. Upon execution, that script would:

  • Download and unpack workshop content into a user-specified location
    • If it does not already exist there
    • In a containerised environment, this should be a mounted location on the host system so that the data are preserved if the container is closed
    • Could download slide content in addition to data
  • Set up an OverlayFS mount, with the downloaded data read-only but the ability to write files with the same names
    • Ideally check for OverlayFS capabilities and issue a useful error message if absent
  • Change directory to the OverlayFS mount so that workshop content can be copy-pasted to execute
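
As a rough illustration of the OverlayFS steps in the list above, here is a minimal Python sketch. It is not an existing MRtrix3 script; the function names and example paths are hypothetical. It assumes a Linux host where `/proc/filesystems` lists the registered filesystem types. Note that `overlay` can be missing from that file simply because the kernel module has not been loaded yet, and that performing the mount typically needs elevated privileges, so the sketch only builds the command rather than running it.

```python
from __future__ import annotations


def overlayfs_supported(proc_filesystems_text: str) -> bool:
    """Return True if 'overlay' appears as a filesystem type in the given text.

    The text is expected to have the layout of /proc/filesystems, where each
    line is either "<fstype>" or "nodev <fstype>".
    """
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "overlay":
            return True
    return False


def overlay_mount_command(lower: str, upper: str, work: str, merged: str) -> list[str]:
    """Build (but do not execute) an OverlayFS mount command.

    lower  : downloaded workshop data, left untouched by the overlay
    upper  : writable layer that receives new and modified files
    work   : scratch directory required by OverlayFS (same filesystem as upper)
    merged : mount point where the user would actually work
    """
    options = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", options, merged]


def missing_overlay_message() -> str:
    """Hypothetical wording for the 'useful error message' case."""
    return (
        "OverlayFS does not appear to be available on this system; "
        "workshop data would need to be copied rather than overlaid."
    )
```

In practice the first function would be fed `pathlib.Path("/proc/filesystems").read_text()`, and the returned command would be passed to something like `subprocess.run` with appropriate privileges before changing directory into the merged location.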

This would make the workshop environment accessible both within Neurodesktop (hopefully) and within any other environment where MRtrix3 has been installed and configured appropriately.

Open to alternative suggestions.

@Lestropie changed the title from "Open access workshop configuration script" to "Open access workshop data access" on Oct 15, 2024
@Lestropie
Member Author

As noted in Lestropie/neurocontainers#1, I've been looking a little into DataLad. It's something I hear about all the time, but I have never fully wrapped my head around it or gotten my hands dirty.

Essentially, it's a git wrapper that is more tailored to data than to source code.
Here's how I would currently envisage this working, based on preliminary reading:

  • Attendees / anybody accessing the workshop material would need to install DataLad; this is pretty easy through pip.
  • Sub-components of the workshop data (slide deck, single-subject, dwifslpreproc, FBA) would each be stored as a DataLad dataset on an appropriate remote server.
    • For in-person workshops there could hypothetically be a system with high network & disk performance that could act as a substitute remote server.
  • "The workshop repository" would contain each of the above as a nested dataset; a little bit like git submodules.
  • The initial clone of the dataset onto an attendee's system (either their own or in a computer lab) only clones the provenance structure and small text files; larger files appear in their location in the filesystem, but with a size of zero bytes.
  • Slide decks would be updated to include a datalad get call at the start of the session, which downloads the actual content of those files that need to be present for execution of the interactive content of that session.
  • If a user changes the content of a workshop data file, it can be reverted in a similar manner to git reset. So while this doesn't provide the benefit of OverlayFS in terms of not necessitating use of the -force option to overwrite the original dataset file, it does provide the ability to restore the original content.
    (Hypothetically, we could detect that a file is under DataLad version control, and use that to disable necessity of the -force option; but not sure if we want to jump in that deep)
  • We would all be able to asynchronously commit updates to workshop material.
  • Where derivative files are generated through a bash script (which is most if not all of it), the execution provenance can be tracked using datalad run, which commits both all file changes and provenance information about what was executed.
  • OSF supports git annex, which means DataLad datasets can be stored there (our workshop data is probably a bit too large to be putting on GitHub). There's even an extension to expedite this.
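
From an attendee's side, the workflow in the list above could look roughly like the following. This is a sketch only: the URL, directory names and sub-dataset names are placeholders rather than real locations, and the helpers just assemble `datalad clone`, `datalad get` and `datalad run` command lines without executing them, so nothing here depends on DataLad being installed.

```python
from __future__ import annotations

import shlex


def clone_command(source_url: str, target_dir: str) -> list[str]:
    """Initial lightweight clone of the workshop superdataset."""
    return ["datalad", "clone", source_url, target_dir]


def get_command(paths: list[str]) -> list[str]:
    """Fetch the actual file content for the given paths inside the clone."""
    return ["datalad", "get", *paths]


def run_command(message: str, shell_command: str) -> list[str]:
    """Execute a command under 'datalad run' so that its changes are committed
    together with a record of what was executed."""
    return ["datalad", "run", "-m", message, shell_command]


def session_plan(source_url: str, target_dir: str, session_paths: list[str]) -> list[str]:
    """Return printable command lines for one workshop session.

    The 'datalad get' line is intended to be run from inside target_dir.
    """
    commands = [clone_command(source_url, target_dir), get_command(session_paths)]
    return [shlex.join(command) for command in commands]


# Placeholder values, purely for illustration.
for line in session_plan("https://example.org/workshop", "workshop", ["single_subject"]):
    print(line)
```

A slide deck would then only need to show the `datalad get` line for the relevant session, while whoever maintains the material could wrap the generating bash scripts in `datalad run`.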

So from what I've read I think DataLad is worth pursuing.
Hypothetically, anyone could still use Neurodesk to access and interact with the workshop content if they wanted to. But given they would need to install this extra dependency, and also clone the dataset onto a location on the host system, it doesn't provide that much, at least currently. It may provide some benefit if it were to offer an installer that installs DataLad, clones the dataset into some default location on the host mount, and fully clones the slide contents, leaving it up to the user to manually clone further content as they encounter it.

@stebo85

stebo85 commented Oct 15, 2024

Agree that datalad would be a great solution for this :) Datalad is already integrated in Neurodesk, so it would be easy for users to download the workshop data.
